Coming off of the slightly disappointing results from last week, this week I actually had a bit of a breakthrough!
Specifically, it turns out I had made a bit of a mistake (literally). My idea was to have a separate set of contexts/weights for each of the different byte categories in the x86 stream. While this was pretty easy to implement (and to select these contexts via GA), I had slipped up in a very simple way.
The way I use the categorization is not only to pick different models/weights, but also to allow each model to pick from a set of distinct histories - history 0 is the previous 8 bytes, history 1 is the previous 8 prefix/opcode bytes, history 2 is the previous 8 modr/m bytes, etc. After each byte, the byte categorizer's current state is used to update a particular history (or set of histories, as some categories can also be mixed, e.g. prefix/opcode/modr/m tends to have correlations). Then the categorizer is updated, transitions to the next state, etc.
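To make the scheme above concrete, here's a minimal sketch of per-category histories driven by a byte categorizer. All of the names, the category list, and the toy state machine are illustrative assumptions, not the real code - a real categorizer would walk an actual x86 decode state machine:

```python
from collections import deque

# Hypothetical category names for illustration only.
CATEGORIES = ["prefix_opcode", "modrm", "sib", "disp", "imm"]

class ByteCategorizer:
    """Toy stand-in: tags each byte with an x86 stream category.
    The real thing would decode prefixes/opcodes/modr/m properly."""
    def __init__(self):
        self.state = "prefix_opcode"

    def categorize(self, byte):
        # Category of the byte we're about to process.
        return self.state

    def update(self, byte):
        # Transition to the next state; here just a dummy toggle.
        self.state = "modrm" if self.state == "prefix_opcode" else "prefix_opcode"

class Histories:
    def __init__(self, depth=8):
        # History 0: the previous `depth` bytes of everything.
        self.all_bytes = deque(maxlen=depth)
        # One additional history per byte category.
        self.per_cat = {c: deque(maxlen=depth) for c in CATEGORIES}

    def push(self, byte, category):
        # Each byte lands in the global history and in its category's history.
        self.all_bytes.append(byte)
        self.per_cat[category].append(byte)
```

Usage is then: categorize the byte, model/code it, push it into the matching histories, and advance the categorizer.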
However, when implementing the distinct histories, due to how my code was laid out before, I had implemented them specifically as bit buffers, so they were only updated on each bit. This caused a subtle bug: whenever a model used a history that wasn't currently being pushed to (because it didn't match the current byte's categorization), the history wouldn't contain any context for the current byte - meaning all 8 bit positions of the current byte would get smashed together into the same context, effectively the same as having a huge number of hash collisions!
The fix for this was really easy: keep all of the current-byte context in the compressor, have the histories track at a byte level only, and always hash using both the history bytes and the current-byte context. With this fixed, my (disappointing) ~3kb-per-intro gain from hooking it up initially became a ~10kb-per-intro gain, which was at least closer to what I had expected, which felt pretty good :) .
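The fix can be sketched as a single context hash that always folds in the partial-byte bit context alongside whichever byte history is selected. This is an assumed shape (FNV-1a as a stand-in hash, illustrative table size), not the actual implementation:

```python
def context_hash(history_bytes, cur_byte_ctx, table_bits=22):
    """Hash the last few whole bytes of a (possibly not-currently-updated)
    history together with the current-byte bit context: the bits of the
    byte decoded so far, prefixed with a leading 1 bit.

    Before the fix, cur_byte_ctx was effectively missing for histories
    that didn't match the current byte's category, so all 8 bit positions
    of the byte hashed to the same slot."""
    h = 0x811C9DC5  # FNV-1a offset basis (stand-in for the real hash)
    for b in history_bytes:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    # The crucial part: always mix in the current-byte bit context.
    h = ((h ^ cur_byte_ctx) * 0x01000193) & 0xFFFFFFFF
    return h & ((1 << table_bits) - 1)
```

With this, coding the first, second, ... eighth bit of a byte each lands on a distinct slot even when the history itself hasn't advanced.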
Interestingly enough, I was also able to re-introduce the indirect models that I had tried before, and they helped quite a bit, gaining another ~3kb per intro! And what's particularly interesting is how well this matches current research on the technique - specifically, that indirect models are much more sensitive to hash collisions than their direct counterparts, which is effectively exactly what happened when I had things hooked up wrong before.
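For readers unfamiliar with the technique: an indirect model doesn't map context to a prediction directly; the context selects a small bit history, and that history (shared across all contexts that behave alike) selects the prediction. A minimal sketch, with assumed table sizes and a simple shift-based adaptation rate:

```python
class IndirectModel:
    """Toy indirect context model: context -> bit history -> probability.
    Sizes and update rate are illustrative assumptions."""
    def __init__(self, ctx_bits=16, hist_bits=8):
        self.hist_table = [0] * (1 << ctx_bits)       # context -> bit history
        self.prob_table = [2048] * (1 << hist_bits)   # history -> p(bit=1), 12-bit
        self.ctx_mask = (1 << ctx_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1
        self.ctx = 0

    def predict(self, ctx):
        self.ctx = ctx & self.ctx_mask
        # The prediction comes from the history state, not the context itself.
        return self.prob_table[self.hist_table[self.ctx]]

    def update(self, bit):
        h = self.hist_table[self.ctx]
        # Adapt the probability attached to this bit-history state.
        p = self.prob_table[h]
        self.prob_table[h] = p + (((bit << 12) - p) >> 5)
        # Shift the observed bit into this context's history.
        self.hist_table[self.ctx] = ((h << 1) | bit) & self.hist_mask
```

This also makes the collision sensitivity intuitive: a collision on the history table corrupts the *state* a context is in, not just one prediction, so the damage compounds.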
At this point these gains were sizeable, but not quite what I had hoped, so I think I'll use the categorizer code/tables to implement a filter instead, like kkrunchy does, as well as hack on it a bit to extract the exact data it compresses. I'm guessing that a simplified version of my compressor/model, now without the x86 stuff, applied to a packed section with filtered x86, should outperform kkrunchy at the cost of some extra cpu cycles and memory (which we can afford). It will take a bit of time to try and verify, but I have high hopes, and wouldn't mind taking such a simpler/derivative approach now after trying my own thing :) . There's also some more stuff I can tweak like predictor precision etc, so I'm not out of this race yet!
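For context, the best-known x86 filter of the kkrunchy variety is the E8/E9 transform: rewrite the relative offsets of near call/jmp instructions into absolute targets, so repeated calls to the same function become identical byte patterns the model can exploit. Here's a hedged sketch of the basic idea (real packers add range checks and smarter heuristics, which this deliberately omits):

```python
import struct

def e8e9_filter(data):
    """Replace relative E8/E9 branch offsets with absolute targets
    (relative to the start of the buffer). Simplified sketch."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            rel = struct.unpack_from("<i", out, i + 1)[0]
            # Target = offset + address of the next instruction.
            absolute = (rel + i + 5) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, absolute)
            i += 5  # skip the rewritten offset bytes
        else:
            i += 1
    return bytes(out)

def e8e9_unfilter(data):
    """Inverse transform, applied after decompression."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            absolute = struct.unpack_from("<I", out, i + 1)[0]
            rel = (absolute - (i + 5)) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, rel)
            i += 5
        else:
            i += 1
    return bytes(out)
```

Because both directions skip the 4 offset bytes after each marker, the scan positions match and the transform round-trips exactly.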
P.S. I should also dig a bit into cmix to see if there's some interesting stuff in there..
Last Edited on Mon Oct 09 2017 07:48:25 GMT-0400 (EDT)