by ferris

So after engage and Revision and all that, I've been procrastinating a bit on gettin back into writing these damn things. This ends TODAY!

In the last few weeks (besides getting sick) I've been riding the post-party high straight back into packerland! If you recall from some of my writings from last summer you'll know I started work on a new 64k intro packer. Not that we really need the extra space ourselves (elysian clocked in at 56kb, and the engage final at 63kb), but I've always thought packers are really rad and wanted to do one of my own. They combine a mix of things that still often feel like magic to me - compression, self-modifying code (well sortof, in the form of the stub decompressing the original data), neural networks, etc.

After Revision, I messaged a bunch of 64k folks about getting some uncompressed intros for testing. I figured this would be a good time to do so, since many of the active groups these days had entered the 64k compo at Revision, so I figured they'd have these at hand already. So far, Conspiracy, Ctrl-Alt-Test, Approximate, Peisik, and several others were very happy to oblige - and it's been very helpful!

After getting more test executables I went to work improving compression. I started with the solution I had last summer, and by reading up more on PAQ and the PAQ author's fantastic Data Compression Explained book, I started looking into SSE/APM, using more bits for the model weights in the mixer, understanding the sparse, match, and x86 models better, indirect models, and a bunch of other stuff. It's a lot of into to chug through, so I'm moving a bit slow with actually incorporating bits into my own framework, but making progress!

One of the things that became really evident early on is the fact that kkrunchy is *really* hard to beat. Like, *really* hard. This makes sense though - if you've read any of ryg's info on the subject, you'll know that kkrunchy does some heavy preprocessing of the code section in order to improve its compression by a pretty sizeable margin. Try as I might to implement a simpler x86 model similar to what PAQ was doing, I couldn't put a very significant dent in the code section it seemed. So, I started looking into more disassembly-based modelling. At this point, the structure isn't really in place yet, and there's still a lot left there to implement and play with, so I'll save that info for a later entry... I've got some other cool stuff to share for this one :) .

While working on general packer improvements and thinking about the disasm-based models that will come into play soon, it became increasingly clear that I would need to have to pick good models for the compressor, and hopefully not do all of this by hand. So I thought about ways to automate this a bit.

Basically, a model in my scheme is currently an 8-bit mask that selects which of the previous 8 bytes to include in a predictor's context hash. In crinkler, a similar representation is used, although they use fixed predictor weights, so those have to be incorporated as well. Crinkler also will compress the data multiple times to find the best model/weight combo for a given intro; in our case, we've got a bit of a heavier compressor with dynamic weights, so the compression itself is a bit too expensive to run very many times, and it makes a bit more sense to select a bunch of models statically and let their dynamic weights dictate their usefulness for a given intro. But I digress a bit :) . When the disassembly-based stuff is in, we'll also need some way to indicate which parts of the decoded x86 to pull history from as well. In both cases, we have some simple (static) data that selects where in the history to look for context, and we need to find a combination of these models that will yield good compression consistently.

Given a set of candidate models, we can compress a bunch of executables and sum up all of the compressed sizes - this gives us a way to measure how performant a set of models is, and a simple way to compare two models and figure out which is more fit for our purpose. It's also the case that many different combinations of models will work pretty well; additional gains between two pretty good model sets diminish fast. Also, iteration time isn't great for this, so we'd like not to do it by hand.

This looks like a job for - A GENETIC ALGORITHM! *applause*

After reading up a bit to get a refresher on the concept (I had never actually implemented this before, even though I felt like I understood it to some degree), I got to work isolating some decent tests and writing the top-level code to make this work. As it turns out, this was pretty straightforward, actually.

I chose a population size of 16, as this seemed like a sufficiently high number to get a good representation of models each generation and also introduce sufficient randomness. Also DigitalOcean provides a 16-core tier, which seemed perfect for this kind of task if I want to shell out a small amount of cash and fire up a box to chug on these for a day or two.

Initially, the population is made up of completely random solutions. Since the current compressor only needs a set of model masks, I chose a set size of 20 models, and represented chromosomes as 20 bytes directly. Each generation, I throw the current population on a thread pool and compress 3 different executables with each one. I chose 3 executables out of my test set that I figured would represent a decent diversity among the entire set (using the whole set is *very* slow, and I figured favoring getting more generations in would be worth the risk of overfitting a subset of the training data, which I didn't think was very likely anyways).

I let each solution process, then sort them from most fit to least. The two most fit solutions are then "bred" to produce 15 new offspring. The breeding process is trivial - produce a new sequence of 20 masks where each element is either the ith element of parent A (the fittest from the previous generation) or parent B (second fittest). The probability of which parent the element comes from is skewed slightly to favor parent A (70/30 split). Then, random mutation is applied to each of these new offspring by simply selecting 10 random elements in the sequence and replacing them with random masks.

The 16th member of the new population is simply the fittest member of the previous population. I chose to use elitism like this as I wanted to just dump all the results into a file, and quickly be able to identify the best solution by looking at the latest data.

The whole thing is quite small actually, ~60 lines of code:

```
let num_workers = 4; // Adjust for execution environment
let pool = ThreadPool::new(num_workers);
let population_size = 16;
let mut population =
(0..population_size)
.map(|_| (0..20).map(|_| rand::random::<u8>()).collect::<Vec<_>>())
.collect::<Vec<_>>();
loop {
let (tx, rx) = channel();
for i in 0..population_size {
let tx = tx.clone();
let models = population[i].clone();
pool.execute(move || {
let total_size = run(&models);
tx.send((total_size, models)).unwrap();
});
}
let mut sorted_results = rx.iter().take(population_size).collect::<BTreeMap<_, _>>();
let fittest = sorted_results.iter().next().unwrap();
// Print out fittest size and model from this generation
let parent_a = fittest.1;
let parent_b = sorted_results.iter().nth(1).unwrap().1;
population =
vec![fittest.1.clone()]
.into_iter()
.chain(
(0..population_size - 1)
.map(|_| {
let mut rng = rand::thread_rng();
let mut models = (0..20).map(|i| {
let crossover_prob = rng.next_f32();
let crossover_threshold = 0.7;
if crossover_prob < crossover_threshold {
parent_a[i]
} else {
parent_b[i]
}
}).collect::<Vec<_>>();
for _ in 0..10 {
let mutation_index = rng.gen::<usize>() % 20;
models[mutation_index] = rng.gen();
}
models
}))
.collect::<Vec<_>>();
}
```

After a couple initial tests and bugfixes I let this run overnight on a couple of my machines. They both came up with very nice solutions (sorted and converted to binary representation here for clarity):

```
00000000 00000001 00000010 00000011 00000110 00000111 00001101 00011101 00011111 00101010 00110001 00111000 00111000 01110111 10001000 10011110 11000101 11011101 11101011 11111011
```

```
00000000 00000001 00000010 00000011 00000100 00000101 00000110 00000111 00011101 00101001 00110000 00110101 01011010 01100111 10011110 10110011 11000011 11001000 11100001 11111111
```

Both of these sets of masks, when applied to all test executables, average ~800 bytes improvement, with maximum improvement for a single executable exceeding 1.5kb, and only a single test executable showed any expansion (<70 bytes in both cases). These solutions were found after ~200 generations, with returns diminishing very quickly (in fact they were diminishing already before these solutions, somewhere ~10-20 generations).

Now, while I'm no expert in this field and probably made some silly mistakes/assumptions (for example, keeping the fittest solution each iteration and skewing breeding to favor the more fit individual might be causing the algorithm to settle into local maxima too often), but there were some interesting observations here:

To reduce iteration time, I only used 3 of my test executables, hand-picked to (hopefully) represent a decent amount of diversity, and

*this actually appeared to work just fine*. I was slightly worried that a solution might overfit the sub-sample a bit, but I think the fact that the compressor has so many dynamic bits very much works in our favor here, and the variance between intros can be more or less mitigated.Looking at the chosen models, there's a pretty good representation of small values, which is something I expected to see, meaning these aren't coming up with completely bogus solutions. Basically all good solutions found by these tests contained most (if not all) of the values 0-7, which represents masks that look only at combinations of the last 3 bytes. There's also a fair amount of representation from other masks, and we can clearly see some masks that favor the oldest byte and some of the newer ones, or the oldest 2 bytes and some of the newer ones. These all appear to be sensible values, given we're compressing a stream of (mostly) binary data.

(and most importantly) Running these kinds of simulations actually produces useful weights that are more useful than what I had done by hand,

*and*with less models. Recall that I had previously used 20 models as described above, along with 10 additional models that tried to find patterns in x86 in a very naive way. These simulations found combinations of 20 models*without*the help of the additional 10 x86-based models that outperformed my previous hand-picked selections, and are faster due to using less models.

Now, an important thing to note is that this was really just an experiment. Since my model design isn't final, the masks produced by these tests aren't final either (although they help in the current configuration, which is always motivating). The point was to entertain the notion that when the compressor architecture design is more or less final, that running tests in this fashion could actually help select model sets, and it looks like it actually can :) .

It's also worth noting that due to the robustness of the compression algorithm and its ability to adjust itself to the input data, there's quite a bit of leeway in terms of selecting a set of models that produces consistently good results. The search space is rich with solutions, so selecting one with what is essentially a somewhat guided random search can actually make sense. But perhaps the *most* important part of all of this was that I finally tried out genetic algorithms, got some sensible outcomes, and had a lot of fun with *yet another* interesting part of building an executable compressor. :)

Last Edited on Sun May 21 2017 10:19:10 GMT-0400 (EDT)