Practical and Portable X86 Recompilation

Logo

Binary recompilation is a subject of intense research, but for mere mortals, recompiling binary code or executables can seem completely off-limits.

In my case, I was working on an open server implementation for Cube World, a game with procedurally generated terrain. Reverse-engineering the network protocol turned out to be a simple task, but to actually manipulate or do anything in the game world, we needed data about the terrain.

Unfortunately, Cube World does not save anything about the terrain to disk. Based on a seed, it generates the terrain on the fly, and does not relay anything about it to the client.

After some preliminary reverse-engineering of the terrain generator, it was clear that extracting the logic would take months. Initially, this was a big setback, but then the idea of somehow using the original x86 machine code presented itself.

However, for our open server, we need to support x86-64 as well, and in that case, we absolutely need emulation or recompilation.

Using something like QEMU or Bochs to emulate x86 seemed like a good idea, but they are both very big projects, and having them as a dependency would probably be a showstopper for anyone wanting to use our project. The memory overhead of a JIT would also not be very attractive, and it would be hard to make any guarantees about performance.

Static recompilation to assembler seemed like a much better option, but to keep it portable, we would need to write backends for x86, x86-64, and possibly ARM/PowerPC.

An idea

What about C++ as a target language? The x86 instruction set is a bit bloated, but most instructions map to simple C++ code (except for the condition flags, but we will get to that).

(LLVM as a target could also have been used, but it’s a big dependency, and would most likely not perform better than MSVC/GCC/Clang)

To make things easier for us, we use Python to transform the code from x86 to C++. For reference, you can take a look at the converter source.

The PE executable format: a primer

In our case, our source executable is an x86 win32 PE file. Using pefile.py, we can quickly get the data we need.

For anyone not familiar with PE files, they consist of

Header
- Image base (usually 0x400000 for executables)
- Entry point
Sections
- .text (x86 code)
- .data (read/write data)
- .rdata (read-only data)
- .reloc (relocation table)
Import tables
Other irrelevant junk

The code for reading in the PE file could look as follows:

import pefile

pe = pefile.PE('file.exe')
header = pe.OPTIONAL_HEADER

image_base = header.ImageBase
entry_point = header.ImageBase + header.AddressOfEntryPoint

section_map = {}
for section in pe.sections:
    name = section.Name.strip('\x00')[1:]
    if name not in ('text', 'data', 'rdata'):
        continue
    section_map[name] = section

import_map = {}
for entry in pe.DIRECTORY_ENTRY_IMPORT:
    for imp in entry.imports:
        # this is the virtual address where the function pointer is set
        address = imp.address
        import_map[address] = (imp, entry)

reloc_map = {}
for data in pe.DIRECTORY_ENTRY_BASERELOC:
    for entry in data.entries:
        t = entry.type
        # only support IMAGE_REL_BASED_HIGHLOW for now
        if t != 3:
            continue
        address = entry.rva + image_base
        src = pe.get_dword_at_rva(entry.rva)
        reloc_map[address] = src

Without relocation, the executable depends on its data to start at the image base. The code in .text references absolute addresses in virtual memory, which is obviously a problem when recompiling. However, using the relocation table, we can detect when an address is being referenced, and deal with it automatically.

The entry point is not the main function, but rather a wrapper called _start. This sets up the MSVCR library and calls static class constructors, then fires main. For simplicity, we will start conversion at the entry point.

Since the original executable depends on both the .data and .rdata sections, we will convert them to header files using char* arrays, i.e.

char rdata_section[...] = {
    ...
};

char data_section[...] = {
    ...
};

Doing this conversion by hand would be very tiresome, so doing it in code is more ideal.

The import tables in PE are used to import functions from DLLs, but in our case, we want to rewrite the calls to any imports to our own wrapper functions. In our converter, we can detect when an import is being called using the import map.

Disassembling x86

The x86 instruction set is huge, but luckily, something like pydasm can parse x86 neatly, giving us decoded instructions and operands. The decoding loop could look something like this:

import pydasm

class Subroutine(object):
    def __init__(self, address):
        self.instructions = []
        i = 0
        while True:
            # any x86 instruction can fit in 16 bytes
            size = 16
            data = self.get_memory(address, size)
            inst = pydasm.get_instruction(data, pydasm.MODE_32)
            mnemonic = pydasm.get_mnemonic_string(inst, pydasm.FORMAT_INTEL)
            mnemonic = mnemonic.strip()
            self.instructions.append(inst)
            if self.is_end(inst, mnemonic, i):
                break
            address += inst.length
            i += 1

    def is_end(self, inst, mnemonic, index):
        if mnemonic in ('retn', 'ret', 'int3'):
            return True
        if mnemonic == 'jmp' and inst.opcode == 0xFF:
            # far jump
            return True
        if mnemonic == 'jmp' and index == 0:
            # wrapper function (i.e. immediate 'jmp')
            return True
        return False

Detecting the end of a subroutine is quite tricky. Instructions like retn, ret, int3 and the far version of jmp usually denote the end of a subroutine, but YMMV. Subroutines that raise exceptions (like std::length_error) also cause the subroutine to end, but special checks need to be added in that case.

Usually, it would be a good idea to disassemble into basic blocks, but it turns out naively running through instructions worked in this case.

What if a subroutine makes a relative jump (using e.g. jnz) into the subroutine, then returns with ret? For example:

    cmp     eax, 1
    jge     short some_label ; short jump into subroutine
    ; 'ret' here, conversion will stop
    retn
some_label:
    ;... but subroutine continues here
    retn

We need to add some code to handle this case:

label_makers = set([
    'jnz',
    'jc',
    'jnz'
    # ... other jump mnemonics
])

class Subroutine(object):
    def __init__(self, address)
        self.instructions = []
        i = 0
        while True:
            # any x86 instruction can fit in 16 bytes
            size = 16
            data = self.get_memory(address, size)
            inst = pydasm.get_instruction(data, pydasm.MODE_32)
            mnemonic = pydasm.get_mnemonic_string(inst, pydasm.FORMAT_INTEL)
            mnemonic = mnemonic.strip()
            self.instructions.append(inst)
            if self.is_end(inst, mnemonic, i):
                # do we still have labels to process?
                if address > max(self.labels):
                    break
            if mnemonic in label_makers:
                label = inst.op1.immediate + inst.length + address
                self.labels.add(label)
            address += inst.length
            i += 1

When an instruction makes a relative jump, we add a ‘label’ at the destination address (this will later correspond to a C++ label). If we still haven’t reached the last label in the current subroutine, we keep disassembling.

This turns out to work for the majority of cases.

Handling condition flags

x86 has a lot of instructions that change the condition flags, also called EFLAGS. The ones relevant for our executable are

Overflow flag (OF)
Carry flag (CF)
Zero flag (ZF)
Sign flag (SF)
Direction flag (DF)

The rest are never referenced or used, so for performance, we ignore them completely. Also, DF is never changed, so we set its value to false.

If we take the add instruction, the OF, CF, ZF and SF flags are changed as follows (presented as C++):

static bool cf, zf, sf, of;

uint32_t add(uint32_t a, uint32_t b)
{
    uint32_t res = a + b;
    cf = res < a;
    zf = res == 0;
    sf = (int32_t)res < 0;
    of = (int32_t)((a ^ b ^ -1) & (a ^ res)) < 0;
    return res;
}

Even though we can hope the optimizer will eliminate unnecessary code, it is much better to calculate the EFLAGS only when necessary. Both Bochs and QEMU use a technique called ‘lazy EFLAGS’, where they save the operands and calculate the flags when necessary. However, if we search and find the EFLAGS dependencies for each instruction, we can do even better.

Take this piece of code, for example:

    add     eax, ecx
    mov     ecx, eax
    jae     short label_1 ; jump if above or equal
    retn
label_1:
    ...
    retn

It is simple to recognize that the jae instruction depends on the add instruction, so we add the jae instruction as a dependee to the add instruction. When the converter hits the add instruction, it will then calculate condition values for all dependees and save them as bools. As an example, we can transform the above x86 to

    uint32_t a = cpu.reg[EAX];   // add eax, ecx
    uint32_t b = cpu.reg[ECX];   // add eax, ecx
    uint32_t res = a + b;        // add eax, ecx
    bool cond_1 = res >= a;      // add eax, ecx
    cpu.reg[EAX] = res;          // add eax, ecx
    cpu.reg[ECX] = cpu.reg[EAX]; // mov ecx, eax
    if (cond_1)                  // jae label_1
        goto label_1;            // jae label_1
    return;                      // retn
label_1:
    ...
    return;                      // retn

This will most certainly be optimized away in the compiler, and should produce the original add and jae sequence.

The runtime library

Before we can have the converter convert our x86 into C++, we need a runtime library as our foundation.

In this case, we use two classes, CPU and Memory, and instantiate a global object of each.

Our CPU class could look as follows:

enum Register32
{
    EAX = 0,
    ECX,
    EDX,
    EBX,
    ESP,
    EBP,
    ESI,
    EDI
};


enum Register16
{
    AX = 0,
    CX,
    DX,
    BX,
    SP,
    BP,
    SI,
    DI
};

enum Register8
{
    AL = 0,
    CL,
    DL,
    BL,
    AH,
    CH,
    DH,
    BH
};

class CPU
{
public:
    static uint32_t reg[8];

    CPU();
    static uint8_t & get_reg8(Register8 i);
    static uint16_t & get_reg16(Register16 i);
    static void push_dword(uint32_t arg);
    static void pop_dword(uint32_t * arg);
};

extern CPU cpu; // global

This provides us with the means of accessing 32bit, 16bit and 8bit registers, and to push/pop stack arguments on/off the stack.

We are leaving out SSE and FPU registers here, but if you want to see a full implementation, look here.

Our Memory class could look as follows:

#define STACK_SIZE 0x100000
#define HEAP_SIZE (128 * 1024 * 1024) // 128mb

class Memory
{
public:
    static char stack[STACK_SIZE];
    static char heap[HEAP_SIZE];

    Memory();
    static void write(uint32_t address, const void * s, size_t len);
    static void read(uint32_t address, void * s, size_t len);
    static char * translate(uint32_t addr);
    static uint32_t translate(char * address);

    static uint32_t read_dword(uint32_t address);
    static void write_dword(uint32_t address, uint32_t arg);
};

extern Memory mem;

Again, the full implementation can be found here.

Heap allocation would need a longer write-up, but using something like dlmalloc, we can create a simple interface to heap allocation in a static array.

Because our addresses are limited to 32 bits, in case we are on a 64bit system, we need to have a way of translating 32 bit addresses to 64 bit (and the other way around). This is what the translate functions are for.

A good way of doing the translation is to use e.g. the stack as a base, and have everything offset from that. In that case, the translate would look as follows:

#define MEMORY_BASE_POINTER (&cpu.stack[0])

char * Memory::translate(uint32_t val)
{
#ifdef IS_64_BIT
    return MEMORY_BASE_POINTER + int32_t(val);
#else
    union {
        uint32_t in;
        char * out;
    };
    in = val;
    return out;
#endif
}

uint32_t Memory::translate(char * address)
{
#ifdef IS_64_BIT
    int64_t v = (int64_t)(address - MEMORY_BASE_POINTER);
    return uint32_t(int32_t(v));
#else
    union {
        char * in;
        uint32_t out;
    };
    in = address;
    return out;
#endif
}

This turns out to be very efficient on x86-64 with RIP-relative addressing, as some of offsets can be inlined when dealing with the data/rdata sections.

The converter

Now that we have a runtime library and can disassemble subroutines, how do we convert our executable to C++?

Our main subroutine loop could look as follows:

cpu = CPU()

cpu.function_stack = [entry_point]
while cpu.function_stack:
    address = cpu.function_stack.pop()
    sub = Subroutine(address)
    cpu.run(sub)

Our ´CPU´ object will then add new functions to the function stack as new functions are called and discovered.

The CPU class could look as follows:

class CPU(object):
    function_stack = None

    def run(self, sub):
        # open a .cpp file for the current subroutine
        writer = self.open_code(self.get_source_name())

        # add a function for the subroutine, e.g. void sub_4D1234()
        writer.putmeth('void sub_%s' % hex(sub.address)[2:])

        for instruction in sub.instructions:
            self.eip = instruction.address

            # add labels on the current address, if any
            if self.eip in sub.labels:
                writer.putlabel(get_label_name(self.eip))

            # call handler for current instruction mnemonic (e.g. MOV)
            name = 'on_%s' % instruction.mnemonic
            handler = getattr(self, name)
            handler(instruction)

        writer.end_brace()
        writer.close()

    def on_mov(self, i):
        # ...

    def on_call(self, i):
        # ...
        # also, add called subroutine to function_stack
        self.function_stack.append(called_address)

    def on_push(self, i):
        # ...

The class has been abbreviated for clarity, but with everything implemented, you would end up with something like the following:

void sub_4C8100()
{
    cpu.push_dword(cpu.reg[EBP]); // [4C8100] push ebp
    cpu.reg[EBP] = cpu.reg[ESP]; // [4C8101] mov ebp,esp
    cpu.push_dword(cpu.reg[ESI]); // [4C8103] push esi
    cpu.reg[ESI] = mem.read_dword(cpu.reg[EBP]+0x8); // [4C8104] mov esi,[ebp+0x8]
    cpu.reg[EAX] = mem.read_dword(cpu.reg[ESI]+0x48); // [4C8107] mov eax,[esi+0x48]
    bool cond_4254; // [4C810A] test eax,eax
    cond_4254 = cpu.reg[EAX] == 0; // [4C810A] test eax,eax
    if (cond_4254) { // [4C810C] jz 0x32
        goto loc_4C813E; // [4C810C] jz 0x32
    } // [4C810C] jz 0x32
    cpu.push_dword(mem.read_dword(cpu.reg[EBP]+0x8)); // [4C810E] push dword [ebp+0x8]
    cpu.reg[ECX] = cpu.reg[EBP]+0xb; // [4C8111] lea ecx,[ebp+0xb]
    cpu.push_dword(cpu.reg[ECX]); // [4C8114] push ecx
    cpu.push_dword(mem.read_dword(cpu.reg[ESI]+0x4c)); // [4C8115] push dword [esi+0x4c]
    cpu.push_dword(cpu.reg[EAX]); // [4C8118] push eax
    cpu.reg[ESP] = cpu.reg[ESP]-4; // [4C8119] call 0xfff3dd97
    sub_405EB0(); // [4C8119] call 0xfff3dd97
    cpu.push_dword(mem.read_dword(cpu.reg[ESI]+0x48)); // [4C811E] push dword [esi+0x48]
    cpu.reg[ESP] = cpu.reg[ESP]-4; // [4C8121] call 0x8251b
    sub_54A63C(); // [4C8121] call 0x8251b
    cpu.reg[ESP] = cpu.reg[ESP] + 20U; // [4C8126] add esp,0x14
    mem.write_dword(cpu.reg[ESI]+0x48, 0U); // [4C8129] mov dword [esi+0x48],0x0
    mem.write_dword(cpu.reg[ESI]+0x4c, 0U); // [4C8130] mov dword [esi+0x4c],0x0
    mem.write_dword(cpu.reg[ESI]+0x50, 0U); // [4C8137] mov dword [esi+0x50],0x0
    loc_4C813E: ;
    cpu.pop_dword(&cpu.reg[ESI]); // [4C813E] pop esi
    cpu.pop_dword(&cpu.reg[EBP]); // [4C813F] pop ebp
    cpu.reg[ESP] = cpu.reg[ESP]+8; // [4C8140] retn 0x4
    return; // [4C8140] retn 0x4
}

Using the relocation table

At the moment, if our code tries to access the rdata or data sections, we will get a segfault since the code will expect the data to be located according to the image base. However, since we do have a relocation table, we can detect when a data/rdata access is being made, and point it directly to our rdata_section and data_section tables.

This would transform something like this:

cpu.push_dword(0x558880); // [5303B5] push dword 0x558880

to this:

cpu.push_dword(mem.translate(rdata_section+2176)); // [5303B5] push dword 0x558880

When we detect a relocatable address, we traverse the section list, and determine which section is being accessed and the offset into it, and use that instead of the original offset.

Handling dynamic calls

Even though we can resolve calls to direct addresses, sometimes, we will meet instructions like call [eax]. There is no good way to resolve this statically, so we will have to maintain a list of dynamic subroutine addresses in the converter. This is one of the unfortunate consequences of doing static recompilation, but since our target executable is not bound to change much between releases, it is bearable to update our dynamic subroutine list each time. For our target, there are only 7 dynamic addresses, so it’s not a major problem anyway.

In the runtime library, we set up a std::unordered_map with address/function pointer pairs, so when we get a dynamic call, we just look up the function in the map.

Handling imports and replacing subroutines

There would be two ways to handle imports:

Follow import calls, and also convert subroutines in external DLLs
Write the imported subroutines manually

The MSVCR runtime uses a lot of exotic instructions, so chances are it’s a lot easier to just manually rewrite any imported subroutines from scratch. This way, we can also escape the emulated layer as much as possible, by using the native equivalent of whatever the MSVCR library is doing.

This does add some extra work when MSVCR C++ routines are called (for e.g. std::cout, std::stringstream, etc.), where the structure of the classes are implementation-specific. However, if we are clever, we can ignore any initialization-routines and just write a pointer to our native equivalent in the start of the guest structure, and so on.

With some work, we can also detect the usage of statically-linked libraries and insert our native equivalent. In this case, the SQLite library was used, so instead of letting the converter run through the SQLite subroutines, we added SQLite to the runtime library and forced the converter to use our handwritten subroutines.

For hot subroutines, it is also ideal to properly reverse-engineer them to C++ so the effect of the emulation layer isn’t as prevalent. In some cases, we can even optimize the subroutine further by adding optimizations the original developer didn’t (by e.g. caching cos values and so on).

Putting it all together

To actually make use of our generated C++ code, we create a simple interface which calls the subroutines related to the terrain generator and returns the terrain data we want. We then compile the generated code and interface into a dynamic library.

Here are some screenshots that show off the terrain data, generated by the converted x86 code:

Screenshot 1 Screenshot 2 Screenshot 3 Screenshot 4

Future research

Some things to look into:

Benchmarking

Since we don’t have control over the original code, it is hard to create a benchmark to test the performance of our C++ solution against the target executable. We could try and attach a debugger to the executable, and log the function arguments and execution times, but the debugger may add some overhead to the execution times.

Abstract/physical stacks

Our converter currently doesn’t care how the original code handles the stack. It lets it handle the EBP/ESP registers however it wants. Ideally, we should analyze the code, and figure out how big the stack frame for the current subroutine is. This would allow us to change all EBP/ESP-offset references to stack references. This gets rid of the EBP/ESP registers, but it would also potentially allow the optimizer to make better decisions about stack variables.

Pass arguments directly

Sort of related to the above, we could try and derive the stack arguments to a routine and pass them directly as arguments to the C++ function.

Remove global register variables

Since we use global variables for our registers, the optimizer will be forced to write all the changed registers back to the register variables at the end of the function. If we instead pass the registers as arguments to the function, at least on x86-64, the arguments would be passed as registers.

Direct x86/x86-64 backends

Ultimately, the fastest solution would be converting to x86/x86-64 assembler directly. However, we would lose some of the optimizations that come with MSVC/GCC, but it’s questionable whether the optimizations are enough to outweigh direct translation. It would be much harder to debug and maintain these backends though, so perhaps it’s better to keep optimizing the hot routines and stick with C++.

that mat blog

musings in gamedev, reverse-engineering and technology