Simple Pattern extraction from Google n-grams

02 Jul 2011

Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep.

Normally, people build a database around the n-gram datasets to be able to perform ad-hoc queries, such as Stefan Evert's Web1T5-Easy, or at least uncompress each file to be able to seek through it, and then do queries online. If you're looking for specific patterns anyway, however, it can make sense to just extract these patterns.

A couple years ago, when trying out pattern searches for BART, I would query Google's n-gram data like I would query a search engine API (back in the day when it seemed like this would give you reproducible results and the downside of "it takes horribly long" wasn't discovered yet): given two lemmas that were interesting to you (such as "falafel" and "food", you would construct the appropriate morphological variant (i.e., plural "foods") fit the whole thing into a pattern. Then, armed with such a query string (or possibly a gappy version of it such as "foods * falafel"), you would do a query for it and the query engine, in the case of BART, would do a bisection search in the uncompressed n-gram files. This actually works decently (sub-second query times), as long as you're interested in things individually, and you don't mind keeping around the 80GB (uncompressed text) or 220GB (Web1T5-Easy) of data.

If you want to extract features for clustering, you know beforehand which queries for which patterns you're interested in, and you need to find out about a large number of fillers for these patterns - hence, it is much easier to do bulk queries (for reasonably precise patterns), and keep these extracted instances around (usually only hundreds of thousands of lines, instead of gigabytes over gigabytes). My old method for achieving this first used bzgrep to extract everything that looked like a pattern instance, and then used a python script to read in the patterns, throw out the pattern part of the line, and lemmatize the words you're interested in.

With the new approach (which I will talk about now), I have only one program that I need to call, I don't have to call the lemmatization part over and over or make up my mind on how to cache this. You just call extract_ngrams @NNS '%and$' '%other$' @NNS and it will use a pre-determined list of nouns that were tagged as plural nouns in the corpus I'm interested in, and their lemmas. The result is actually a nice example for the transparent decompression of streams in the Boost C++ library, and the use of multithreading through OpenMP, which makes it excellent fodder for a blog post.

With Boost's IO library, transparently opening Bzip2 files has actually become (almost) as easy as it is in Python or Java: you just set up a pipeline of objects, as in

std::ifstream myfile(filename, std::ios::binary|std::ios::in);
BI::filtering_stream<BI::input> my_filter;

and then keep reading from the file with calls of


and that's all. You could do fancier things than reading in the lines, in here the line is just read in, then split with ''strtok_r'' (remember, this is multithreaded code, so no non-reentrant functions for you), and then the single tokens of the line are matched against the template supplied by the user.

Using OpenMP to search through multiple bzipped files at once is similarly trivial: you put an appropriate pragma (''pragma omp parallel for'') in the position, make sure that all non-shared variables are declared inside the loop, and that's it:

#pragma omp parallel for
for (int i=0; i<info.n_files; i++) {
  char buf[BUF_SIZE];
  sprintf(buf, info.path_template,i);

That's it? Well, the buffering of the output is done per thread (in a thread-local buffer) and as it turned out, you have to put a critical section around the ''write'' calls, because occasional bad things happen otherwise. (In the code snippet, writeout is the name of the critical section).

if ((out_ptr-out_buf)>BUF_SIZE) {
#pragma omp critical (writeout)
   write(1, out_buf, (out_ptr-out_buf));

If you think this can be useful to you, find the source on bitbucket. You will need the Boost IO libraries, PCRE (the regular expressions library), and you will need to fix the paths to the ngram data in the source code.

Blog posts

The brave new world of search engines
In an earlier post, I talked about current Google's search results in terms of personalization, and whether to like it or not. This post takes another aspect of 2011 Google search: what they do with complex queries. (more...)

Simple Pattern extraction from Google n-grams
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep. (more...)

Where to buy Music
After searching around a disproportionate time to find nice music that I want to buy, I decided to compile this list of internet shops that sell music in MP3 format to German citizens. (And no, I can't/won't use iTunes unless they make a Linux client).

Useful links

WCDG parser.
The Weighted Constraint Dependency Grammar parser which is one of the best parsers for German that you can get. It's available under an open source license and there is an online demo.

BitPar and SFST.
Helmut Schmid has written several tools that may come in useful in your next NLP application, including the TreeTagger, a decision-tree based part of speech tagger, BitPar, a fast PCFG parsing engine, and SFST, a set of highly useful tools for finite-state morphology analysis.

Conditional Random Fields.
Hanna Wallach has a very useful link collection on Conditional Random Fields. I'd recommend especially her tutorial on CRFs (which is also the introductory part of her MSc thesis) as well as Simon Lacoste-Juliens tutorial on SVMs, graphical models, and Max-Margin Markov Networks (also linked there).

Nice blogs

Language Log
Technologies du Langage
Earning my Turns
Leiter Reports