Injecting a TCP server

You can network-enable any program by dynamically injecting code that acts as a TCP server and redirects the program's standard input and output to the network connection.

For many programs that use machine learning, you have something similar to the following behavior:

Read a model
Read data from stdin
Write data to stdout

Adhering to this model means that you can easily construct a pipeline out of multiple programs that feed each other: for example, take raw text, run it through a tokenizer, then a part-of-speech tagger, then a parser. Things become more complicated when you want to transform small pieces of data without paying, every time, the startup cost of reading the model data.
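As a rough illustration of the three-step behavior above, here is a minimal sketch of such a filter. The toy dictionary "model" and the names load_model, run_filter and demo are made up for this example; a real tagger would spend most of its startup time inside load_model.

```python
import io


def load_model():
    # Hypothetical stand-in: a real tool would read large parameter files
    # here, which is where the startup time goes.
    return {"the": "DET", "dog": "NOUN", "barks": "VERB"}


def run_filter(model, infile, outfile):
    # One token per input line, one "token<TAB>tag" line per output line.
    for line in infile:
        token = line.strip()
        if token:
            outfile.write(f"{token}\t{model.get(token, 'UNK')}\n")


def demo():
    # In a real pipeline infile/outfile would be sys.stdin and sys.stdout.
    out = io.StringIO()
    run_filter(load_model(), io.StringIO("the\ndog\nbarks\n"), out)
    return out.getvalue()
```

Passing sys.stdin and sys.stdout to run_filter turns this into a program that can sit in a shell pipeline, but every invocation then pays for load_model again.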

For code that you do not control, or that is written in another language, you have to achieve this using a wrapper: you write a module, in your favorite programming language, that runs that other program, feeds it portions of the appropriate input (using a pair of pipes), and takes back corresponding portions of the appropriate output.
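A wrapper of this kind might look roughly like the following sketch. It assumes a line-oriented tool that answers every input line right away; the child program here is a hypothetical stand-in (a small Python one-liner that upper-cases its input), and LineWrapper is a name invented for this example.

```python
import subprocess
import sys

# Hypothetical stand-in for the external tool: upper-cases each input
# line and flushes its output after every line.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


class LineWrapper:
    """Keep one child process alive and exchange one line at a time."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipes
        )

    def process(self, line):
        # Send one line in, then block until the matching line comes back.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin gives the child EOF, so it can exit cleanly.
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()


def demo():
    wrapper = LineWrapper([sys.executable, "-c", CHILD_CODE])
    try:
        return [wrapper.process(word) for word in ["hello", "world"]]
    finally:
        wrapper.close()
```

The blocking readline in process is the weak spot: if the wrapped program reads ahead or buffers its output, the wrapper waits forever for an answer that the child has not produced yet.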

In some cases, though, a program has non-negligible startup time and still wants to read its input in one go (or in irregular pieces). For example, TreeTagger (a popular part-of-speech tagger) reads an arbitrary portion of its input before giving you the part-of-speech tag for the first input line.

A nicer solution, if we can make the program cooperate somehow, would be to connect via a TCP socket and have the program, instead of reading its input only once, do its work for each new incoming connection.

Instead of

load_model()
while things_to_do:
  read_data()
  write_processed_output()
exit()

it would be much nicer to have something along the lines of

load_model()
open_socket()
for each connection:
  fork subprocess:
    read_data()
    write_processed_output()
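For a program we do control, the second outline could be sketched as follows. This is only an illustration under assumptions, not the injected code described in this post: load_model, handle_stdio, serve and demo are hypothetical names, the "model" is just str.upper, and the sketch relies on fork and so needs a Unix-like system.

```python
import os
import signal
import socket
import sys


def load_model():
    # Hypothetical stand-in for the expensive startup step.
    return str.upper


def handle_stdio(model):
    # The "original program": it only knows about stdin and stdout.
    for line in sys.stdin:
        sys.stdout.write(model(line))
    sys.stdout.flush()


def serve(listener, model, max_connections=None):
    # Ignoring SIGCHLD lets the kernel reap finished children (Linux).
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    served = 0
    while max_connections is None or served < max_connections:
        conn, _ = listener.accept()
        if os.fork() == 0:
            try:
                # Child: the connection becomes stdin and stdout, then the
                # stdin/stdout loop runs with the already-loaded model.
                listener.close()
                os.dup2(conn.fileno(), 0)
                os.dup2(conn.fileno(), 1)
                conn.close()
                # Fresh file objects, so no stale buffered data is reused.
                sys.stdin = open(0, "r", closefd=False)
                sys.stdout = open(1, "w", closefd=False)
                handle_stdio(model)
            finally:
                os._exit(0)
        conn.close()  # the parent keeps only the listening socket
        served += 1


def demo():
    # Start a one-connection server in a child process and talk to it.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    pid = os.fork()
    if pid == 0:
        try:
            serve(listener, load_model(), max_connections=1)
        finally:
            os._exit(0)
    listener.close()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"hello\n")
        client.shutdown(socket.SHUT_WR)  # signal end of input
        reply = b"".join(iter(lambda: client.recv(1024), b""))
    os.waitpid(pid, 0)
    return reply
```

The model is loaded once in the parent; each forked child inherits it and only has to swap file descriptors 0 and 1 before running the unchanged read/write loop.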

To make the program do this despite not being written for it, we would need to intercept libc's read function: the replacement watches for reads and, just before the first read from stdin, does the accept-connection-and-fork thing and then returns control to the original program.

One possibility for intercepting the read function would be to write a replacement (which then calls the old read) and tell the dynamic linker to load the code via the LD_PRELOAD environment variable. Such an approach is sketched here, and is used in Debian's fakeroot command (which runs programs in a mode where they appear to be able to do things that only root can do, including installing things in /usr/bin).

This does not work in our case, because the program we are looking at is statically linked and won't care about LD_PRELOAD. What a pity. We can, however, run the program in debugging mode using the ptrace system call, which lets us stop the program, write to its registers and memory, or (hear, hear) intercept its system calls.

Hence the approach I'll describe here does the following:

start the program in debugging mode (with PTRACE_TRACEME)
wait until the first system call that reads from stdin
add ('inject') a bit of program code for the accept-connection-and-fork task
give control to the bit of program code we added, which will then run the TCP server code and fork off a new process that returns control to the original program (after replacing standard input and output with the file descriptor of the network connection).

Sounds easy? There's a catch, though: in the injected code, we cannot use the standard library (libc or anything else), and because we're loading program code directly, we have to write it in assembler and scrape together the raw bytes. (There's probably a more comfortable way if you write your own ELF loader and some standard library routines, but that would take even more time.)

Find the source on Bitbucket. (Note: there is a GitHub project with the same name, which does the simpler variant of always starting the program anew. I discovered it too late.)

Yannick Versley
