Simple Pattern extraction from Google n-grams

02 Jul 2011

Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep.

Normally, people build a database around the n-gram datasets to be able to perform ad-hoc queries, such as Stefan Evert's Web1T5-Easy, or at least uncompress each file to be able to seek through it, and then do queries online. If you're looking for specific patterns anyway, however, it can make sense to just extract these patterns.

A couple years ago, when trying out pattern searches for BART, I would query Google's n-gram data like I would query a search engine API (back in the day when it seemed like this would give you reproducible results and the downside of "it takes horribly long" wasn't discovered yet): given two lemmas that were interesting to you (such as "falafel" and "food", you would construct the appropriate morphological variant (i.e., plural "foods") fit the whole thing into a pattern. Then, armed with such a query string (or possibly a gappy version of it such as "foods * falafel"), you would do a query for it and the query engine, in the case of BART, would do a bisection search in the uncompressed n-gram files. This actually works decently (sub-second query times), as long as you're interested in things individually, and you don't mind keeping around the 80GB (uncompressed text) or 220GB (Web1T5-Easy) of data.

If you want to extract features for clustering, you know beforehand which queries for which patterns you're interested in, and you need to find out about a large number of fillers for these patterns - hence, it is much easier to do bulk queries (for reasonably precise patterns), and keep these extracted instances around (usually only hundreds of thousands of lines, instead of gigabytes over gigabytes). My old method for achieving this first used bzgrep to extract everything that looked like a pattern instance, and then used a python script to read in the patterns, throw out the pattern part of the line, and lemmatize the words you're interested in.

With the new approach (which I will talk about now), I have only one program that I need to call, I don't have to call the lemmatization part over and over or make up my mind on how to cache this. You just call extract_ngrams @NNS '%and$' '%other$' @NNS and it will use a pre-determined list of nouns that were tagged as plural nouns in the corpus I'm interested in, and their lemmas. The result is actually a nice example for the transparent decompression of streams in the Boost C++ library, and the use of multithreading through OpenMP, which makes it excellent fodder for a blog post.

With Boost's IO library, transparently opening Bzip2 files has actually become (almost) as easy as it is in Python or Java: you just set up a pipeline of objects, as in

std::ifstream myfile(filename, std::ios::binary|std::ios::in);
BI::filtering_stream<BI::input> my_filter;

and then keep reading from the file with calls of


and that's all. You could do fancier things than reading in the lines, in here the line is just read in, then split with ''strtok_r'' (remember, this is multithreaded code, so no non-reentrant functions for you), and then the single tokens of the line are matched against the template supplied by the user.

Using OpenMP to search through multiple bzipped files at once is similarly trivial: you put an appropriate pragma (''pragma omp parallel for'') in the position, make sure that all non-shared variables are declared inside the loop, and that's it:

#pragma omp parallel for
for (int i=0; i<info.n_files; i++) {
  char buf[BUF_SIZE];
  sprintf(buf, info.path_template,i);

That's it? Well, the buffering of the output is done per thread (in a thread-local buffer) and as it turned out, you have to put a critical section around the ''write'' calls, because occasional bad things happen otherwise. (In the code snippet, writeout is the name of the critical section).

if ((out_ptr-out_buf)>BUF_SIZE) {
#pragma omp critical (writeout)
   write(1, out_buf, (out_ptr-out_buf));

If you think this can be useful to you, find the source on bitbucket. You will need the Boost IO libraries, PCRE (the regular expressions library), and you will need to fix the paths to the ngram data in the source code.

Blog posts

Neural Networks are Quite Neat (a rant)
After decades of Neural Network overhype, and a following time of disrespect, Neural Networks have become popular again - for a reason, as they can fit large amounts of data better than the feature-based models that came before them. Nonetheless, people who lived through the first overhyped episod are asking critical questions - the answers to which are (hopefully!) enlightening (more ...)

The brave new world of search engines
In an earlier post, I talked about current Google's search results in terms of personalization, and whether to like it or not. This post takes another aspect of 2011 Google search: what they do with complex queries. (more...)

Simple Pattern extraction from Google n-grams
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep. (more...)

Useful links

Fast dependency parsing
For doing syntactic preprocessing without spending too much time (CPU or engineering) on it, SpaCy and NLP4J should be among the first things to try. SpaCy covers English and German, whereas NLP4J covers only English, but is trained on biomedical treebanks (in addition to the WSJ news that everyone trains on), which makes it especially useful for that kind of texts. If you're looking towards parsing French, the Bonsai Model collection from the French Alpage group and the Mate Parser from Bernd Bohnet (now at Google) are good first guesses.

Neural Network Toolkits
My favorite toolkit for modeling natural language text using LSTMs and other gadgetry is DyNet, which uses dynamically constructed computation graphs and allows to model recursive neural networks and other gadgetry without much fuss. The static network structure of more standard neural network libraries such as TensorFlow trade off flexibility for the ability to join groups of examples in a minibatch (which DyNet allows, but does not enforce), which leads to greater training speed.

Conditional Random Fields.
Hanna Wallach has a very useful link collection on Conditional Random Fields. I'd recommend especially her tutorial on CRFs (which is also the introductory part of her MSc thesis) as well as Simon Lacoste-Juliens tutorial on SVMs, graphical models, and Max-Margin Markov Networks (also linked there).

Nice blogs

Language Log
Technologies du Langage
Earning my Turns
Leiter Reports