Nigel Tao

Zstandard Worked Example Part 7: Dictionaries

This blog post is one of a seven part series.

Dictionary File Structure

Dictionaries are supplied out-of-band to a Zstandard file. Each frame in a Zstandard file can refer to its own dictionary, identified by a uint32 number.

Dictionaries are optional and the romeo.txt.zst example doesn’t use them. Instead, here’s a 4 KiB dictionary built from 64 KiB chunks of Shakespeare’s complete works:

$ wget --quiet https://www.gutenberg.org/files/100/100-0.txt
$ zstd --train --maxdict=4096 -B65536 -o 100-0.dict 100-0.txt 2> /dev/null
$ hd 100-0.dict | a_hypothetical_program_to_annotate_zstd_dict_header_bytes
00000000  37 a4 30 ec 3e 7b 25 59  09 10 10 df 30 33 33 b3  |[MN][ID]H[T][HD |
00000010  77 0a 33 f1 78 3c 1e 8f  c7 e3 f1 78 3c cf f3 bc  |-][--- CMOT ----|
00000020  f7 d4 42 41 41 41 41 41  41 41 41 41 41 41 41 41  |][--------------|
00000030  41 41 41 41 41 41 41 41  41 41 41 41 a1 50 28 14  |----- MLT ------|
00000040  0a 85 42 a1 50 28 14 0a  85 a2 28 8a a2 28 4a 29  |----------------|
00000050  7d 74 e1 e1 e1 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |][--------------|
00000060  e1 e1 e1 e1 e1 f1 78 3c  1e 8f c7 e3 f1 78 9e e7  |----- LLT ------|
00000070  79 ef 01 01 00 00 00 04  00 00 00 08 00 00 00 20  |--][ REP OFFS ] |
00000080  74 68 65 20 62 65 74 74  e2 80 99 72 69 6e 67 20  |the bett...ring |
00000090  6f 66 20 74 68 65 20 74  69 6d 65 2c 0d 0a 41 6e  |of the time,..An|
000000a0  64 20 74 68 6f 75 67 68  20 74 68 65 79 74 61 72  |d though theytar|
000000b0  73 2c 0d 0a 41 6e 64 20  68 65 20 77 69 6c 6c 20  |s,..And he will |
000000c0  6d 61 6b 65 20 74 68 65  20 66 61 63 65 20 6f 66  |make the face of|
000000d0  20 68 65 61 76 65 6e 20  73 6f 20 66 69 6e 65 2e  | heaven so fine.|
000000e0  5f 5d 0d 0a 0d 0a 42 45  4e 56 4f 4c 49 4f 2e 0d  |_]....BENVOLIO..|
000000f0  0a 47 6f 6f 64 20 6d 6f  72 72 6f 77 2c 20 63 6f  |.Good morrow, co|
00000100  75 73 69 6e 2e 0d 0a 0d  0a 52 4f 4d 45 4f 2e 0d  |usin.....ROMEO..|
00000110  0a 20 61 74 74 65 6e 64  2e 20 20 20 20 20 20 20  |. attend.       |
00000120  20 45 78 69 74 0d 0a 20  20 51 55 45 45 4e 20 45  | Exit..  QUEEN E|
00000130  4c 49 5a 41 42 45 54 48  2e 20 54 68 6f 75 67 68  |LIZABETH. Though|
etc
00000fc0  61 6c 6c 20 6d 79 20 68  65 61 72 74 2c 20 6d 79  |all my heart, my|
00000fd0  20 6c 6f 72 64 2e 0d 0a  0d 0a 20 5b 5f 45 78 65  | lord..... [_Exe|
00000fe0  75 6e 74 2e 5f 5d 0d 0a  0d 61 6e 64 20 77 69 6c  |unt._]...and wil|
00000ff0  6c 20 6e 6f 74 20 6c 65  61 76 65 20 6d 65 2e 0d  |l not leave me..|
00001000

Other than a uint32 dictionary ID, most sections of a Zstandard dictionary are similar to the sections of a Zstandard file as discussed in Part 2: Structure. MN is a magic number (0xec30a437 for dictionaries instead of 0xfd2fb528), followed by the ID.

Four tables are next. H, T and HD define a Huffman table (and T is an FSE table). CMOT, MLT and LLT define Cooked Match Offset, Match Length and Literal Length FSE tables, the same as the previous discussion for Zstandard files. As before, the byte length of these FSE tables aren’t explicitly recorded. Processing the dictionary reads input bytes until the FSE tables are complete.

These tables are not fed any bitstreams per se. Instead, blocks in a Zstandard file’s frames can re-use the tables of previous blocks in that frame or, if there is no previous block, the tables from the dictionary. Those tables are then applied to bitstreams within the block.

REP OFFS contain the Repeat Offsets, three uint32 values to use instead of the default values. In this relatively small example, the Repeat Offsets are just the default values (1, 4 and 8) but are still explicitly written.

The remainder of the file is arbitrary bytes of content or virtual history, similar to a Zlib dictionary. These bytes are not copied directly to the output (when decompressing a Zstandard file that references this dictionary), but sufficiently large Raw Match Offsets will copy from there. The historical content effectively has negative byte offsets (compared to the zero byte offset for the start of the decompressed content).

For example, if the first Sequence of a block had a Literal Length of 100000 and then a Raw Match Offset of 103799, the net offset of -3799 would be invalid without a dictionary. With the dictionary above (of length 4096), the copy would start at 4096 - 3799 = 297 = 0x129 from the start of the dictionary: the “QUEEN ELIZABETH. etc” bytes.

Conclusion

To recap, other than a short header and footer, a Zstandard file consists of a number of frames and each frame consists of a number of blocks. In the common case where blocks are compressed, each block has one Huffman table (and its bitstream) to reproduce Literals and three FSE tables (and their interleaved bitstream) to reproduce Sequences (as each Sequence has three explicit fields). In terms of the wire format, the Huffman table is itself FSE compressed. Combining the Literals with the Sequences produces a series of alternating literal and match ops. Concatenating the ops’ emissions recover the block’s decompressed bytes.

Further Reading

If you want to read more about compression, try these blogs:

If you want to study compression implementations, I find Go or Wuffs source code easier to follow than e.g. C. I’m sure that there are very readable Java, Python or Rust implementations too, but I’m not as familiar with that space. Anyway, try:

If you’re interested specifically in the theory of Asymmetric Numeral Systems (Finite State Entropy codes are also called tANS or tabled Asymmetric Numeral Systems) and aren’t afraid of some math, try:

If you’d like a similar worked example for other compression formats, I’ve previously written ones for:

Update on 2022-05-24: This blog post series is discussed on Hacker News.


Published: 2022-05-17