Snapstream Searcher
URL: http://johnny.sas.upenn.edu/~pemantle/
By Philip Pham
Snapstream Searcher is a search engine for closed-captioning television scripts. It features a domain-specific language for queries of arbitrary complexity and the ability to correlate a large number of terms by their proximity to each other.
The most interesting part of the search engine is perhaps the domain-specific language. It was developed because a researcher may be doing a search for the country Greece, but wants to avoid references to Greek yogurt. Hence, a search may look something like ({greece} + {greek}) !@300 ({yogurt} + {yoplait})
, which matches the terms greece
and greek
not within 300 characters of yogurt
or yoplait
. To parse such expressions, I convert the expression into Reverse Polish notation (RPN) with the shunting-yard algorithm, and then, evaluate the expression with the postfix algorithm, which basically pops the stack when an operator is encountered.
Also, when searching a large list of expressions, I tried to limit the number of repetitive reads with a priority queue. I find the first occurence of every term and put all the matches in a priority queue such that the matches earlier in the text have highest priority. Everytime a term is popped, I find the next occurence of the popped term and put it back in the priority queue, so terms are read in the order in which they appear.
Another neat algorithm that I made use of is Rolling String Hashes. The impetus for using this was problems that we were having with commericals. If a country appeared in a commercial, both its total matches and program matches would spike. To only count such an occurrence once, we invented the concept of contexts, which looks at the surrounding text. Rather than store the match and its surrounding text, we store the string hashes of the surrounding text. To compute the hash quickly anywhere in a body of text for arbitrarily long strings, I use the Rabin-Karp algorithm as described in my blog post.
The engine is written in C++, and the search is done by a CGI script. The size of closed-captioning text is on the scale of several gigabyes, and a file-based store is used. The program is not exactly paragon of engineering as I was very restricted by the Penn IT department with what I could do with the server. It's not particularly fast either. However, it accomplishes the stated goals well. Correctness of results are assured by test-driven development. Everything is unit-tested with Google Test as the test driver. The code is on GitHub.
There are also several data visualization components, written in D3.js. For the output of a multiple search, a time series is plotted. For an example, check out 2016 Republican Candidates on TV in 2015. The output of a matrix search can be visualized as graph. We can try to cluster this graph by how often countries co-occur. See Country Relationships and Spring Embedding.
I developed it as a research tool for Professors Robin Pemantle and Diana Mutz at the University of Pennsylvania. I believe that it is intented for graduate students to do research on media in the United States.
Try the tool at Snapstream Searcher.