Reunited with my good friend Olly

Reunited with my good friend Olly

by ferris

This week I took a break from the visual tooling I've been working on to work on something else that has fascinated me for years - an executable packer.

The main idea is basically to take the contents of an input executable, compress it, and produce an output executable that contains the compressed data as well as a stub that decompresses the data and transfers execution to it in such a way that the executable has no idea there was any additional trickery involved. This can be broken down into two parts, more or less. The first part is the PE (portable executable) processing, where we have to "unpack" the important data and metadata from the input executable and spit out an output executable with the right format/metadata. The second part is the compression/decompression.

Now, this isn't exactly new territory for me; my first attempt at anything like this was around 2009, and I did a second and third iteration in 2011. However, I was never able to make anything that actually produced working executables. I did some really small PE headers with code injected into them, some DLL import by hash stuff, and even some data compression, but never was able to get the project stable/organized enough to make all the pieces work together or produce a working .exe from proper input.

Somehow I decided I wanted to give it another go; this time at the 64k level. Since the most unknowns were in the PE processing side of things (not the data compression ironically, although doing that with modern techniques will be quite new to me) I started with that first. Since my impression from the previous attempts was that this problem has lots of details involved and is very tedious, I decided the best strategy was to start with very simple executables and work my way up to more advanced ones, while being able to process and output working .exe's at every step of the way.

Now, right off the bat, we can make some assumptions given our 64k target:

  • We'll only deal with x86, win32-specific binaries
  • We don't have to support certain features like thread-local storage
  • The input/output executables will be relatively small (always under 1mb at least)

These assumptions can really help to simplify things; especially the first one. Primarily, this reduces the complexity of what we have to support at each stage.

So, the first step of the way was loading an executable and simply writing out a bitwise copy. Easy peasy.

The second step was to do the same thing, only instead of just copying the whole thing over, we will first parse the exe headers (at a very basic level, mind you), and copy each bit of the file to the output file, guided be the info in these headers. This is also fairly straightforward because we don't have to understand anything about the different kinds of sections and how they're related to one another - we're really just copying data from one file to another, and treating the PE file as just some headers and some sections is sufficient enough for this stage to work.

Next up was shifting some data around. To shift around the actual sections of the PE file, which contain code/data/etc, we need to understand what the different parts consist of so we can move them properly. This could take some work, so before I did that I decided a better intermediate step was to try and shift the headers around a bit instead. Without getting into too much detail, a PE file's headers consist of a simple DOS executable that basically prints "don't run this in DOS", a PE header, and an object file header. If we treat this as two parts (the DOS header and the PE/obj header), this is very easy to manage.

Note: explaining the PE headers is a bit of a large task, so I'll just refer you to some nice documentation that I used instead: Peering Inside the PE, PE format.aspx), and The Portable Executable File from Top to Bottom are all fantastic resources!

So, the first change we'll make is to get rid of the DOS header. Now, we can't just zero out all the memory - there are two things we need to preserve. The first is the first two bytes of the DOS file, which are the characters MZ. These are required for the Windows loader to recognize that this is a valid executable. So we'll keep those. The next thing we need to preserve is the e_lfanew field in the DOS file header, which exists at offset 0x3c in the file. This is a 4-byte, little-endian (basically all values in the PE are little-endian) number that represents an offset to where the PE header is in the file. Since our file is basically useless to the Windows loader if it can't find the PE header, we'd like to keep this around as well.

Only keeping the MZ at offset 0x00 and the PE offset at 0x3c, and copying all the headers and sections, works perfectly fine (given you won't actually run the thing in DOS, which is a fair assumption). And we're still producing working .exe's, even though they haven't changed much :)

The next step is to collapse these headers on top of each other. Since the DOS program is basically empty now, it would be a shame to simply leave this space unused. And since we have to store the PE header as well, why don't we just put that there instead? Turns out this is fairly straightforward to do; we'll just write the PE header at some offset shortly after the MZ bytes and make e_lfanew at 0x3c point to that. Of course there's one small problem - the PE header is too big to fit from 0x02 - 0x3c; the header will actually overlap the e_lfanew field.

Now, the Windows PE loader is quite helpful in that it only really cares about certain fields in the executable header. We can take advantage of this and place our PE header in such a location that e_lfanew will overlap with another field that isn't used. And, as it turns out, offset 0x0c is perfect for this - just 10 bytes after MZ, and e_lfanew will overlap with the PE header's BaseOfData field, which isn't really used. This is a great location for our PE header :)

All we have to do is place the header at 0x0c and overwrite BaseOfData with the proper e_lfanew value, which is of course 0x0c. And with that, we've freed up some more space and still produced working executables :)

The rest of the process is fairly similar to how we've been working so far. Make some simple change, ensure it works, move on. For example, another important change we'll make is to merge the executable's sections. Basically what this does is take all of the sections, lay them out as they will be laid out in memory after loading the executable, and place that new chunk into the output executable as-is. The primary reason for doing this is we can reduce the amount of space the headers take up by quite a bit, as we only have to store a single section header instead of many. This will also simplify our decompressor later on, as it can just treat the data as one large, contiguous chunk instead of many smaller ones at various offsets. This is pretty straightforward, but actually makes our executable larger, as typically the sections are aligned on small boundaries on disk (typically 512kb) but are aligned to larger ones (typically 4kb due to paging) in memory. However, this won't be a problem, as this extra space will basically be filled with 0x00's, and will compress very well. What's also interesting about this stage is that we basically don't have to change any of the headers, since pretty much all of them point to data in the executable by referring to addresses in memory as if the image was already loaded, not in the file. So, moving stuff around in the file is perfectly fine as long as we retain the position of the data when loaded. Of course, one small thing we have to do is ensure that this new merged section has all of the right characteristics in its header to ensure its memory will be readable, writable, and executable once the exe is loaded (otherwise we'll run into some .. "fun" problems).

Well now my entry is dragging on a bit, but the idea was to keep making these incremental changes. Over the course of the week I made tons of progress; to show this here's the TODO list I made along with some interesting notes if you're curious about more of the process:

- [x] Read in executable and parse headers
- [x] Output working executable
 - Don't need any real transformations yet, just need it to produce something that works as a stepping stone
- [x] Adjust header position
 - Should start at 0x0c so that `e_lfanew` overlaps with the optional header's `BaseOfData` field (which is basically unused)
 - Note that all spec's claim PE header has to be on an 8-byte boundary, but many other packers (crinkler, kkrunchy) also use 0x0c without issues
- [x] Merge sections
 - Initially this will only work on exe's without imports as the merged section will no longer have this metadata accessible from the headers
- [x] Adjust file alignment
 - ~~Should be 0x1000 (by default it's 0x200 atm)~~
 - The goal here is to get the loader to load the file as-is into memory which should reduce a lot of pain later on. ~~This means the file alignment should be the same is the section alignment, and for paging reasons, the section alignment has to be a multiple of 0x1000, so we'll go with that :)~~
 - After thinking about this more, I think adjusting the whole file alignment to 0x1000 is actually a bad idea. I think all we need to do is make sure that in the file, our first section starts at its specified virtual address. This should get us to our goal of getting the binary to be loaded into memory as-is without requiring as many additional changes, and also allows us to ensure the file alignment is 0x200, so the file on disk can always be as small as possible (minus the space between headers and first section, but we can put stuff in there) while still playing nice with the spec.
- [x] Clear out unused sections
- [x] Adjust base address
 - This is to make space for the compressed stuff and depacker that won't conflict with where the compressed code expects to be loaded
 - Ended up having a few extra adjustments that the final thing won't have, but it's good to tweak this stuff as proof of concept later. Didn't make a huge mess at least :)
- [x] Initial stub
 - Likely this will just consist of a copy routine and a jump, but I want to do something super simple initially
- [x] "Compression" test
 - Just bitmask all the data or something to test that we're go on this
- [x] Resolve imports
 - This means a more complicated stub, but no more transforms than that should be necessary
 - I'm considering a couple things for this, such as using proper imports in the exe to get ahold of `LoadLibraryA` and `GetProcAddress` and loading that way and import by hash. I think a nice middle ground will be to do something similar to what Crinkler does to get ahold of kernel32 through the PEB and process the import modules by hand, but instead of importing by hash we do it by function name, as my gut feeling is these names will compress well enough and we won't have to worry about potential collisions. Later on, this will be a good thing to a/b test (on a full intro with many imports especially).
 - After getting this to work in most cases by doing what Crinkler does except with function names instead of hashes, I've run into the problem that some imported functions end up referring to forwarded exports. Later versions of Crinkler appear to handle this by following the forwarding chain at link-time and referring to the target instead, but that's a bit too far into the unreliability range for me, especially when a hand-rolled function import routine is ~100 bytes uncompressed already. I think I'll use the standard mechanism to get ahold of `LoadLibraryA` and `GetProcAddress` and load the rest of the proc's with those, since `GetProcAddress` can handle ordinal imports transparently (which is another thing I had to work around in the first attempt at solving this problem).
- [x] Resources
 - I wanted to wait with this, but seeing as many of my test executables have menu resources, I think it's smart to nail this to make sure everything up to this point is solid before moving on
- [ ] Prototype compressor/decompressor
 - I'm thinking LZ77 + ari, similar to LZMA and initial versions of kkrunchy
 - Might consider just ari + context mixing and doing a bit of digging there. I have a feeling it won't really be any harder to implement, just new things to learn
- [ ] Improve compressor
 - Basically get it "production-ready" both in terms of ratio and simplicity
- [ ] Write size-optimized depacker

Notice how the unchecked stuff at the end is basically all about the compressor, which is something I ended up researching a lot about during the week (both refreshing my memory as well as learning new things), but wasn't able to get around to during the first week. But I've made tons of progress, and I can tell that even if I don't end up with the best compressor for a bit, at least I'll finally be able to get something that works in a relatively short period of time. And considering this is a project I basically failed at 3 times before, that feels pretty damn good.

This next week I'll be continuing on that a bit, and also probably working on some more visual tooling again. Lots to do, but we've got loads of time and I'm having SO much fun these days :D

P.S. I realized I never really tied in the title/pic with the entry, haha. Anyways, Olly refers to OllyDbg, which is a fantastic debugger/disassembler for this kind of thing. It's a bit dated by now, but it's what I used the first times I was trying to do this kind of thing, and something about it is super nostalgic; doubt I'll switch to anything else anytime soon :)