Injecting a TCP server

17 Jun 2011

You can network-enable any program by dynamically adding code that acts as a TCP server and redirects the program's standard input and output.

For many programs that use machine learning, you have something similar to the following behavior:

  • Read a model
  • Read data from stdin
  • Write data to stdout

Adhering to this model means that you can easily build a pipeline of programs that feed each other: for example, take raw text and run it through a tokenizer, then a part-of-speech tagger, then a parser. Things become more complicated when you want to transform small pieces of data without paying the startup cost of reading the model data each time.

For code that you do not control, or that is written in another language, you have to achieve this using a wrapper: you write a module, in your favorite programming language, that runs that other program, feeds it portions of the appropriate input (using a pair of pipes), and takes back corresponding portions of the appropriate output.
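A minimal sketch of such a wrapper in Python, assuming the wrapped program answers every input line with exactly one output line and flushes it right away. The names `LineWrapper` and `CHILD` are made up for illustration, and `CHILD` is only a stand-in that upper-cases its input, not a real tagger.

```python
import subprocess
import sys


class LineWrapper:
    """Keep one child process alive and exchange one line at a time with it."""

    def __init__(self, argv):
        # A pair of pipes: one feeds the child's stdin, one drains its stdout.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipes
        )

    def process(self, line):
        # Assumption: one input line yields exactly one flushed output line.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end of input; then wait for the child to exit.
        self.proc.stdin.close()
        return self.proc.wait()


# Hypothetical stand-in for the slow-to-start external tool: a child process
# that upper-cases each line and flushes immediately.
CHILD = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n",
]
```

With this, `w = LineWrapper(CHILD)` starts the child once, and each `w.process("hello")` call reuses it. The one-line-in, one-line-out assumption is the weak point: a program that reads ahead before answering would leave `process` blocked in `readline`.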

In some cases, though, a program has non-negligible startup time and still wants to read its input in one go (or in irregular pieces). For example, TreeTagger (a popular part-of-speech tagger) reads an arbitrary portion of its input before giving you the part-of-speech tag for the first input line.

A nicer solution, if we can somehow make the program cooperate, would be to connect to it via a TCP socket and have it do its work once for each incoming connection, instead of reading its input only once.

Instead of

load_model()
while things_to_do:
  read_data()
  write_processed_output()
exit()

it would be much nicer to have something along the lines of

load_model()
open_socket()
for each connection:
  fork subprocess:
    read_data()
    write_processed_output()
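
If the program were ours to change, the second variant might look roughly like the following Python sketch. Everything here is an assumption made for illustration: `load_model` and `process_stdio` are placeholders for the real program, and `max_connections` exists only so that the loop can terminate in a test. It relies on `os.fork`, so it is Unix-only.

```python
import os
import socket
import sys


def load_model():
    # Hypothetical stand-in for the expensive startup step.
    return str.upper


def process_stdio(model):
    # Stand-in for the original program: it only knows stdin and stdout.
    for line in sys.stdin:
        sys.stdout.write(model(line))
    sys.stdout.flush()


def serve(listener, model, max_connections=None):
    """Accept connections on `listener` and fork one child per connection."""
    children = []
    handled = 0
    while max_connections is None or handled < max_connections:
        conn, _addr = listener.accept()
        pid = os.fork()
        if pid == 0:
            # Child: the model is already in memory, inherited through fork.
            try:
                listener.close()
                os.dup2(conn.fileno(), 0)  # the socket becomes stdin ...
                os.dup2(conn.fileno(), 1)  # ... and stdout
                conn.close()
                sys.stdin = os.fdopen(0, "r", closefd=False)
                sys.stdout = os.fdopen(1, "w", closefd=False)
                process_stdio(model)
            finally:
                os._exit(0)  # never fall back into the parent's accept loop
        # Parent: drop our copy of the connection and reap finished children.
        conn.close()
        children.append(pid)
        children = [p for p in children if os.waitpid(p, os.WNOHANG)[0] == 0]
        handled += 1
```

A listener could be created with `socket.create_server(("127.0.0.1", 0))` and passed to `serve(listener, load_model())`. The point of the `dup2` calls is that `process_stdio` stays untouched: it still reads stdin and writes stdout, which now happen to be the network connection.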

To make the program do this even though it was not written for it, we would need to intercept libc's read function. The replacement watches for reads and, just before the first read from stdin, does the accept-connection-and-fork step and then returns control to the original program.

One possibility for intercepting the read function would be to write a replacement (which then calls the old read) and tell the dynamic linker to load the code via the LD_PRELOAD environment variable. Such an approach is sketched here, and is used in Debian's fakeroot command (which runs programs in a mode where they appear to be able to do things that only root can do, including installing things in /usr/bin).

This does not work here, because the program we are looking at is statically linked and does not care about LD_PRELOAD. What a pity. We can, however, run the program in debugging mode using the ptrace system call, which lets us stop the program, write to its registers and memory, and (hear, hear) intercept its system calls.

Hence the approach I'll describe here does the following:

  • start the program in debugging mode (with PTRACE_TRACEME)
  • wait until the first system call that reads from stdin
  • add ('inject') a bit of program code for the accept-connection-and-fork task
  • give control to the bit of program code we added, which will then run the TCP server code and fork off a new process that returns control to the original program (after replacing standard input and output with the file descriptor of the network connection).

Sounds easy? There's a catch, though: in the added code, we cannot use the standard library (libc or anything else), and because we're loading program code directly, we have to write it in assembler and scrape together the raw bytes. (There's probably a more comfortable way if you write your own ELF loader and some standard library routines, but that would take even more time.)

Find the source on Bitbucket. (Note: there is a GitHub project with the same name, which does the simpler variant of always starting the program anew. I discovered it too late.)
