Injecting a TCP server

17 Jun 2011

You can network-enable any program by dynamically injecting code that acts as a TCP server and redirects the program's standard input and output to the network connection.

Many programs that use machine learning follow a pattern similar to this:

  • Read a model
  • Read data from stdin
  • Write data to stdout

Adhering to this model means that you can easily construct a pipeline out of multiple programs that feed each other - for example, take raw text, run it through a tokenizer, then a part-of-speech tagger, then a parser. Things become more complicated when you want to transform small pieces of data at a time without paying, on every invocation, the startup cost of reading the model data.

For code that you do not control, or that is written in another language, you have to achieve this using a wrapper: you write a module, in your favorite programming language, that runs the other program, feeds it portions of the appropriate input (through a pair of pipes), and reads back the corresponding portions of its output.
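In C, such a wrapper boils down to two pipes, a fork, and an exec. The sketch below is only an illustration: "tagger" is a made-up stand-in for the wrapped program, and error handling is kept minimal.

/* Sketch of a pipe-based wrapper around an external program.
 * "tagger" is a placeholder for the real tool. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int to_child[2], from_child[2];
    if (pipe(to_child) < 0 || pipe(from_child) < 0) {
        perror("pipe");
        return 1;
    }
    pid_t pid = fork();
    if (pid == 0) {                       /* child: becomes the wrapped program */
        dup2(to_child[0], 0);             /* its stdin comes from the first pipe */
        dup2(from_child[1], 1);           /* its stdout goes into the second pipe */
        close(to_child[1]); close(from_child[0]);
        execlp("tagger", "tagger", (char *)NULL);
        perror("execlp");
        _exit(127);
    }
    close(to_child[0]); close(from_child[1]);

    /* parent: feed a portion of input, then read back the output */
    const char *line = "a sentence to process\n";
    write(to_child[1], line, strlen(line));
    close(to_child[1]);                   /* signal end of input */

    char buf[4096];
    ssize_t n;
    while ((n = read(from_child[0], buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    return 0;
}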

In some cases, though, a program has non-negligible startup time and still wants to read its input in one go (or in irregular pieces). For example, TreeTagger (a popular part-of-speech tagger) reads an arbitrary portion of its input before giving you the part-of-speech tag for the first input line.

A nicer solution -- if we can make the program cooperate somehow -- would be to attach to it via a TCP socket, so that the program, instead of reading its input only once, does its work anew for each incoming connection.

Instead of

load_model()
while things_to_do:
  read_data()
  write_processed_output()
exit()

it would be much nicer to have something along the lines of

load_model()
open_socket()
for each connection:
    fork subprocess:
        read_data()
        write_processed_output()
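As a standalone program, this corresponds to the classic fork-per-connection server pattern. A sketch in C (port 5678 is an arbitrary example, and the echo loop merely stands in for the real read-and-process work):

/* Sketch of a fork-per-connection TCP server; 5678 is an arbitrary example port. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* the expensive startup (loading the model) would happen here, once */

    signal(SIGCHLD, SIG_IGN);                 /* don't leave zombie children around */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5678);
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(srv, 16) < 0) {
        perror("bind/listen");
        return 1;
    }

    for (;;) {
        int conn = accept(srv, NULL, NULL);
        if (conn < 0)
            continue;
        if (fork() == 0) {                    /* child handles this connection */
            dup2(conn, 0);                    /* stdin  <- connection */
            dup2(conn, 1);                    /* stdout -> connection */
            close(conn);
            close(srv);
            char buf[4096];                   /* echo loop stands in for the real */
            ssize_t n;                        /* read/process/write of the program */
            while ((n = read(0, buf, sizeof(buf))) > 0)
                write(1, buf, (size_t)n);
            _exit(0);
        }
        close(conn);                          /* parent keeps accepting */
    }
}

The whole point of the injection trick described below is to get exactly this structure without being able to change the program's source.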

To make the program do this even though it was not written for it, we need to intercept libc's read function with a version that watches for reads from stdin and, just before the first such read, does the accept-connection-and-fork dance before returning control to the original program.

One possibility for intercepting the read function would be to write a replacement (which then calls the old read) and tell the dynamic linker to load the code via the LD_PRELOAD environment variable. Such an approach is sketched here, and is used in Debian's fakeroot command (which runs programs in a mode where they appear to be able to do things that only root can do, including installing things in /usr/bin).
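For a dynamically linked program, such a preloaded replacement might look roughly like the sketch below: dlsym(RTLD_NEXT, "read") fetches the original read, and the wrapper gets one chance to act before the first read from stdin (the actual accept-and-fork code is only hinted at here).

/* Sketch of an LD_PRELOAD interposer for read(2).
 * Build: gcc -shared -fPIC -o hook.so hook.c -ldl
 * Run:   LD_PRELOAD=./hook.so some_program */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

static ssize_t (*real_read)(int, void *, size_t) = NULL;
static int hooked = 0;

ssize_t read(int fd, void *buf, size_t count) {
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");

    if (fd == 0 && !hooked) {
        hooked = 1;
        /* this is where the accept-connection-and-fork code would go:
         * open a listening socket, accept(), fork(), and dup2() the
         * connection over file descriptors 0 and 1 */
    }
    return real_read(fd, buf, count);
}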

This does not work in our case because the program we are looking at is statically linked and ignores LD_PRELOAD. What a pity. We can, however, run the program in debugging mode using the ptrace system call, which lets us stop the program, write to its registers and memory, or (hear, hear) intercept system calls.

Hence the approach I'll describe here does the following:

  • start the program in debugging mode (with PTRACE_TRACEME)
  • wait until the first system call that reads from stdin
  • add ('inject') a bit of program code for the accept-connection-and-fork task
  • give control to the bit of program code we added, which will then run the TCP server code and fork off a new process that returns control to the original program (after replacing standard input and output with the file descriptor of the network connection) -- a sketch of the tracing side follows below.
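The tracing side, up to the point where the injection would happen, can be sketched as follows for x86-64 Linux (register names and syscall numbers differ on other architectures; the actual code injection via PTRACE_POKETEXT is only indicated by a comment):

/* Sketch: run a program under ptrace and stop at its first read(2) from stdin.
 * x86-64 specific: orig_rax holds the syscall number, rdi the first argument. */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);      /* ask to be traced */
        execvp(argv[1], argv + 1);
        _exit(127);
    }
    int status;
    waitpid(child, &status, 0);                     /* child stops at the exec */

    while (1) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* run to the next syscall stop */
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            return 0;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        if (regs.orig_rax == SYS_read && regs.rdi == 0) {
            fprintf(stderr, "first read from stdin reached\n");
            /* here the injected server code would be copied into the child
             * (PTRACE_POKETEXT) and execution redirected to it before resuming */
            break;
        }
    }
    ptrace(PTRACE_CONT, child, NULL, NULL);
    return 0;
}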

Sounds easy? There is a catch, though: in the injected code, we cannot use the standard library (libc or anything else), and because we are loading program code directly into the process, we have to write it in assembler and scrape together the raw bytes. (There is probably a more comfortable way if you write your own ELF loader and some standard library routines, but that would take even more time.)
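To give a flavour of what "no libc" means: every operation in the injected code has to be a raw system call. On x86-64 Linux that looks roughly like the snippet below (32-bit x86 uses int 0x80 and different syscall numbers):

/* A raw write(2) without libc on x86-64 Linux: syscall number in rax (1 = write),
 * arguments in rdi, rsi, rdx; the kernel clobbers rcx and r11. */
static long raw_write(int fd, const void *buf, unsigned long count) {
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(1L), "D"((long)fd), "S"(buf), "d"(count)
                      : "rcx", "r11", "memory");
    return ret;
}

The injected server needs the same kind of raw wrapper for socket, accept, fork, and dup2 -- which is the part that has to be scraped together as raw bytes.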

Find the source on Bitbucket. (Note: there is a GitHub project with the same name that does the simpler variant of always starting the program anew. I discovered it too late.)
