 ### ANS Frenzy

by ferris

So I guess last week I wasn't kidding when I wrote "Would also be nice to get back on some exe packing stuff, which is where my brain seems to be going these days anyways... :)" :P

It started by reading a bunch of ANS stuff in the weekday evenings, particularly cbloom's blog posts, cyan's blog posts, ryg's blog posts and ofc the original ANS paper. Admittedly, it was a bit much to wrap my head around (even during the weekend), and I ended up going into a bit of a rabbit hole on Saturday looking into some familiar territory with more arithmetic coding refreshers, and also some more unfamiliar shannon papers..

Long story short, I did actually end up getting some tANS stuff working! Now there are some parts in there that I don't quite get, particularly some of the table construction bits, which I more or less lifted from cbloom. But I must say, conceptually I really think I'm starting to get the big picture here - and even had a nice little revelation this morning.

So I kinda want to explain some of this stuff, but I don't really have the time/energy to make some nice pictures.. so I'll just rabble on here and let's see where this goes (and maybe lift a screenshot or two from other sources, we'll see, hehe).

So let's cut to the chase. ANS, short for Asymmetric Numeral Systems, is a family of entropy coding techniques that, similar to arithmetic coding, encodes an entire stream into one big number. At least conceptually. Let's start there.

I like to draw a parallel to arithmetic coding, where we have a number range like [0, 1) which we subdivide into separate, smaller intervals, that all sum up to the larger [0, 1) interval. Each of these subintervals corresponds to one of our possible symbols we could encode at each stage, and the size of the subintervals are proportional to that symbol's probability. For each symbol we progressively reduce our number range by recursively shrinking our interval into the corresponding symbol's subinterval. This image from wikipedia sums it up fairly nicely: At the end of all this symbol reduction, we end up transmitting any number within the final range. We get compression due to the fact that encoding highly probable symbols shrinks the range less, and the larger range we have the less specific number we need to transmit within that range, and the less specific number we need to transmit, the less bits we would need to transmit. Phew!

Now of course this isn't the whole story - all this needs to happen with fixed width integers. The short version of how to do this is to keep the low, high values of our range in machine words, and as the range reduces, we start to lose precision. To fix this, we output the top bit(s) from our range values, and double them (yes, I'm hand waving vigorously at this point, but bear with me). This way we still end up outputting a big, long binary number, with which the decoder can keep a pair of range values of its own, and progressively reduce/renormalize until the original message is reconstructed.

So now that that's explained, let's take a look at ANS at a super high level:

Similarly to how AC splits the available range of numbers into subintervals each scaled proportionally to a symbol's probability, rANS actually does this same thing (and so does tANS but in a rather indirect way, we'll ignore that case for now). Only in ANS this range is conceptually the entire set of natural numbers (`N`). With ANS, we have a single state `x` that starts at some value, let's say `1`. At each symbol, we effectively grow `x` by dividing it by our symbol's probability, giving us a larger result `x'`. So if our probability was 1/2, `x` would double. If it were `1/3`, it would grow to 3x its size. After that we do a little bit of bit magic (which I'll also hand-wave for now) and that puts `x'` at some higher value in `N` and, due to the way that we've partitioned `N`, we would know that `x'`'s value corresponds to our previous symbol. This other diagram from wikipedia kinda shows the partitioning, but I recommend you check out the source article for more info, and I also recommend you kinda ignore that for now and just keep it in mind that we do that: Because of this, we can work backwards from `x'` to `x` by identifying the symbol that `x'` represents and multipling `x'` by that symbol's probability (and some bit magic that we're ignoring), effectively shrinking our state back down to `x`. In this way, our state `x` actually forms a stack, and we encode LIFO.

So now we get to that bit magic. At each symbol after we grow the state to `x'`, we need to add in some small bits from `x` back into `x'` that we lost when we divided the range. This mostly comes from the fact that we're doing arithmetic with limited precision integers, and without it our newly formed `x'` would be at the slightly wrong value in `N`. Basically though, we can think of coding something like this:

• Scale `x` up by dividing it by our current symbol (`s`)'s probability `Fs`, yielding `x'` (almost).
• Add in some bits to `x'` from `x`, which effectively is like stuffing both our symbol `s` into the "newly allocated" space in `x'`, along with some lower bits of `x` for precision. Again I'm waving my hands like I'm trying to fly, but thinking of it like this is pretty useful I think.

From that we can basically reverse the process to decode (in reverse direction) and we can see how this construction is packing values. And similarly to AC where we get compression because more probable symbols shrink the range less, in ANS we get compression because more probable symbols grow the state less. Rad!

Only now we need to normalize things so our state `x` doesn't grow to infinity and we can actually do this in our machines. Instead of the range being normalized to [0, 1) like in AC, this range is normalized to [k, 2k - 1) for some `k` (which I'll gloss over). The fact that we have an upper bound is obvious since we're working with fixed-width integers. The fact that we need a lower bound is a bit odd, but we'll get to that.

For each symbol, ANS' state value (`x`, the thing we're encoding) will be in this normalization range (which we'll call `I`). Now this range `I` is partitioned similar to our range in AC into subintervals corresponding to symbols whose scales are proportional to the symbols' probabilities. So, to encode a symbol, the encoder could grow the state (`x`) like we expect. But then this might mean that the new value (`x'`) grows outside of `I`. At that point we could simply shift out low bits of `x'` to the output stream until `x'` is back in `I`. But this is not what we do (and side note, I'm not actually sure why, as this should be a reversible process just like what we'll actually do.. I'll have a bit of a think about this and we'll see where that gets me.. but for now we'll just talk about how we actually do it).

So what we end up doing is first scale down `x` (by shifting out its low bits to the output stream) so that when we scale it up based on the symbol prob, we get not only a new value `x'` in `I`, but one that corresponds to the exact symbol we just encoded. And since the amount we scale `x` down and the amount we grow the range by are both determined by the symbol we're coding, and we have that at this encoder stage, this is possible. Then the decoder reverses this process by first identifying the symbol (given by where `x'` is in `I`), shrinking the range by that symbol's probability, and then shifting in bits into the new `x` (`x' -> x` because decoding) until `x` is within `I` again. This will mirror the exact bits we shifted out of `x` in the encoder (since the range is set up so that the first shift takes it out of `I` always), so our stream is uniquely decodable.

The thing I think is most important to think about mechanically is how our state `x` enters and leaves `I` as we shift bits in/out and scale with the symbol probs, and how analagous this is to arithmetic coding, which I find to be a bit more intuitive. It really is kindof wonky and backwards but it's also quite similar and intuitive once the correct mental picture is built. Now granted I ended up glossing over a ton of stuff, like how the encoder knows how many bits to shift out to land `x'` in the right spot, how to partition `I` exactly (which differs between ANS flavors because reasons), why it's called ANS (which I kindof intentionally didn't write about because I think it's not a very helpful way to look at the whole thing), and a bunch of other stuff. But I think I've actually covered a lot here conceptually. And the key insight for me that "clicked" the whole thing together really was visualizing `x` and `x'` bouncing in and out of `I` as we move around.

Something I find unfortunate btw is that none of the existing diagrams really show `x` growing and shrinking like that, just the results of how `x'` moves around as we encode, which for me was a bit harder to swallow at least since it didn't really align with the intuitive growing/shrinking concept of how we move between states. This diagram in particular is a good example of this: It just shows which states/numbers we transition to, but not how we got between them, which for me at least who learns things visually and mechanically, was a bit of a drawback. But I want to include the picture here anyways to perhaps help bridge the gap in your head or something..

The rest of these details are covered in the other blog posts, and for now I'll just send you that way if you're curious. Granted there's a lot to get through and admittedly I haven't really found a great way to explain it I think, so this has really been more of a brain dump, which is what I needed after getting through all of this. And at this point it might be awhile until I put this into practice beyond just playing around with stuff. But I've got some cool applications in mind, so stay tuned.

Whew, that was a lot - but fun :) see you next time!

Last Edited on Mon Sep 25 2017 18:35:56 GMT-0400 (EDT)