nigeltao.github.io

Zstandard Worked Example Part 4: Huffman Codes

This blog post is one of a seven part series.

From Bitstrings to Symbols

Tables convert bitstrings to symbols and for Zstandard’s Literal data, there are up to 256 symbols. A symbol value of 0x40 naturally corresponds to the ASCII ‘@’ character, 0x41 corresponds to ‘A’, etc. If some of those 256 symbol values aren’t used, they don’t need to be explicitly listed in the mapping.

For Huffman codes, the length of each bitstring is inversely related to how frequently its symbol occurs. When compressing English textual input, common characters like ‘ ‘ and ‘e’ are typically assigned the shortest bitstrings.

Here’s the “bitstring to symbol table” (sorted lexicographically by bitstring) of the Huffman code used for the romeo.txt.zst Literals. In this case, the bitstring lengths range in 3 ..= 8. The “LOW ..= HIGH” range notation instead of “LOW .. HIGH” means that both the low and high bounds are inclusive.

Bitstring  Symbol (as Hex and ASCII)
00000000   0x21   '!'   24 8-bit codes.
00000001   0x32   '2'
00000010   0x3a   ':'
00000011   0x41   'A'
00000100   0x42   'B'
00000101   0x43   'C'
00000110   0x44   'D'
00000111   0x45   'E'
00001000   0x48   'H'
00001001   0x4a   'J'
00001010   0x4c   'L'
00001011   0x4d   'M'
00001100   0x4e   'N'
00001101   0x52   'R'
00001110   0x53   'S'
00001111   0x54   'T'
00010000   0x55   'U'
00010001   0x57   'W'
00010010   0x5b   '['
00010011   0x5d   ']'
00010100   0x6b   'k'
00010101   0x76   'v'
00010110   0x78   'x'
00010111   0x7a   'z'
0001100    0x27   '''   8 7-bit codes.
0001101    0x2e   '.'
0001110    0x3b   ';'
0001111    0x3f   '?'
0010000    0x49   'I'
0010001    0x4f   'O'
0010010    0x67   'g'
0010011    0x70   'p'
001010     0x2c   ','   8 6-bit codes.
001011     0x62   'b'
001100     0x63   'c'
001101     0x64   'd'
001110     0x66   'f'
001111     0x75   'u'
010000     0x77   'w'
010001     0x79   'y'
01001      0x0a   '\n'  7 5-bit codes.
01010      0x68   'h'
01011      0x69   'i'
01100      0x6c   'l'
01101      0x6d   'm'
01110      0x72   'r'
01111      0x73   's'
1000       0x61   'a'   4 4-bit codes.
1001       0x6e   'n'
1010       0x6f   'o'
1011       0x74   't'
110        0x20   ' '   2 3-bit codes.
111        0x65   'e'

This Hufffman code is complete: the first bitstring is all 0s and the last bitstring is all 1s. Mathematically, this is synonymous with the “24 8-bit codes, 8 7-bit codes, 8 6-bit codes, 7 5-bit codes, 4 4-bit codes, 2 3-bit codes” distribution satisfying the “sum of fractions” (24/256 + 8/128 + 8/64 + 7/32 + 4/16 + 2/8) equalling one.

Huffman Application

Recall the LSTREAM 1 BITSTREAM derived in Part 3: Bitstreams.

000011011010011011111010110100010010011011100 etc 0101000111001100

Running the Huffman code on the bitstream, we first match the “00001101” bitstring, producing ASCII 0x52 or ‘R’, then the “1010” bitstring, producing ASCII 0x6f or ‘o’, continuing up until the final “01100” bitstring, producing ASCII 0x61 or ‘l’:

000011011010011011111010110100010010011011100 etc 0101000111001100
RRRRRRRRoooommmmmeeeoooo___aaaannnndddddd___J etc mmyyyyyy___lllll

The output of the Huffman code on the LSTREAM 1 BITSTREAM produces the LSTRIP 1 part of the Literals string (as described in Part 1: Concepts):

|Romeo and Juliet@Excerpt from Act 2, Scene 2@@JULIET@O ,! wherefore a|
|rt thou?@Deny thy fatherrefusename;@Or, ifwilt not, be but sworn my l|

Applying the same Huffman code on three other bitstreams (LSTREAM N BITSTREAM) produces the other three quarters (LSTRIP N) of the Literals string, for N being 2, 3 and 4.

Once the Huffman decoder table has been prepared, each strip can be decoded independently, since we know each strip’s decoded and encoded size in bytes. On modern CPUs, decoding the four strips concurrently (just taking advantage of CPU pipelining and a relatively large number of registers, without needing separate threads) can be faster than decoding them serially.

Canonical Huffman Codes

The “bitstring to symbol table” form of the Huffman code, above, is fairly verbose. There is a much more efficient representation, given that it is a canonical Huffman code although, unusually, with longer bitstrings first. Specifically, longer bitstrings are listed (in lexicographic bitstring order) before shorter ones (e.g. 8-bit codes before 7-bit codes) and, for a given bitstring length, smaller symbols before larger ones (e.g. out of the 7-bit codes, the one for the symbol 0x4f ‘O’ before for 0x67 ‘g’). We can therefore represent the Huffman code unambiguously just by the bitstring length of each of the 256 symbols (with 0 meaning that the symbol isn’t present):

Bitstring Length                   Symbol Range (as Hex and ASCII)
0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0    0x00 ..= 0x0f    NUL, SOH, STX ..= SI
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0x10 ..= 0x1f    DLE, DC1, DC2 ..= US
3 8 0 0 0 0 0 7 0 0 0 0 6 0 7 0    0x20 ..= 0x2f    ' ', '!', '"' ..= '.'
0 0 8 0 0 0 0 0 0 0 8 7 0 0 0 7    0x30 ..= 0x3f    '0', '1', '2' ..= '?'
0 8 8 8 8 8 0 0 8 7 8 0 8 8 8 7    0x40 ..= 0x4f    '@', 'A', 'B' ..= 'O'
0 0 8 8 8 8 0 8 0 0 0 8 0 8 0 0    0x50 ..= 0x5f    'P', 'Q', 'R' ..= '_'
0 4 6 6 6 3 6 7 5 5 0 8 5 5 4 4    0x60 ..= 0x6f    '`', 'a', 'b' ..= 'o'
7 0 5 5 4 6 8 6 8 6 8 0 0 0 0 0    0x70 ..= 0x7f    'p', 'q', 'r' ..= DEL
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0x80 ..= 0x8f    Non-ASCII High Bytes
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0x90 ..= 0x9f    Non-ASCII High Bytes
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0xa0 ..= 0xaf    Non-ASCII High Bytes
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0xb0 ..= 0xbf    Non-ASCII High Bytes
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0xc0 ..= 0xcf    Non-ASCII High Bytes
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0xd0 ..= 0xdf    Non-ASCII High Bytes
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0xe0 ..= 0xef    Non-ASCII High Bytes
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0xf0 ..= 0xff    Non-ASCII High Bytes

Due to completeness (see the “sum of fractions” must equal one, above), we don’t have to explicitly state the last non-zero element (in this case, for ‘z’) and all the zeroes that follow. We can represent this Huffman table in only 122 numbers:

Bitstring Length                   Symbol Range (as Hex and ASCII)
0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0    0x00 ..= 0x0f    NUL, SOH, STX ..= SI
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    0x10 ..= 0x1f    DLE, DC1, DC2 ..= US
3 8 0 0 0 0 0 7 0 0 0 0 6 0 7 0    0x20 ..= 0x2f    ' ', '!', '"' ..= '.'
0 0 8 0 0 0 0 0 0 0 8 7 0 0 0 7    0x30 ..= 0x3f    '0', '1', '2' ..= '?'
0 8 8 8 8 8 0 0 8 7 8 0 8 8 8 7    0x40 ..= 0x4f    '@', 'A', 'B' ..= 'O'
0 0 8 8 8 8 0 8 0 0 0 8 0 8 0 0    0x50 ..= 0x5f    'P', 'Q', 'R' ..= '_'
0 4 6 6 6 3 6 7 5 5 0 8 5 5 4 4    0x60 ..= 0x6f    '`', 'a', 'b' ..= 'o'
7 0 5 5 4 6 8 6 8 6                0x70 ..= 0x79    'p', 'q', 'r' ..= 'y'

Huffman Weights Representation

One last trick is to squash the range of these numbers from 0 ..= 8 (a bitstring length) to 0 ..= 6 (weights). Non-zero weights W correspond to a bitstring length of (MBL + 1 - W). The MBL (Maximum Bitstring Length, in this case 8) can be derived solely from the explicit weights (and completeness). The “sum of fractions”, without the now-implicit last non-zero element, must be between one half (inclusive) and one (exclusive).

Thus, romeo.txt.zst needs to somehow encode these 122 numbers (ranging in 0 ..= 6) to represent the Huffman code for producing the Literals:

Weight
0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 2 0 0 0 0 3 0 2 0
0 0 1 0 0 0 0 0 0 0 1 2 0 0 0 2
0 1 1 1 1 1 0 0 1 2 1 0 1 1 1 2
0 0 1 1 1 1 0 1 0 0 0 1 0 1 0 0
0 5 3 3 3 6 3 2 4 4 0 1 4 4 5 5
2 0 4 4 5 3 1 3 1 3

It turns out that the Huffman code weights are themselves encoded by FSE.


Next: Part 5: Finite State Entropy Codes.


Published: 2022-05-14