Nigel Tao

Zstandard Worked Example Part 2: Structure

This blog post is one of a seven part series.

Overall Structure

Here’s romeo.txt.zst:

$ hd romeo.txt.zst
00000000  28 b5 2f fd 64 ae 02 0d  11 00 76 62 5e 23 30 6f  |(./.d.....vb^#0o|
00000010  9b 03 7d c7 16 0b be c8  f2 d0 22 4b 6b bc 54 5d  |..}......."Kk.T]|
00000020  a9 d4 93 ef c4 54 96 b2  e2 a8 a8 24 1c 54 40 29  |.....T.....$.T@)|
00000030  01 55 00 57 00 51 00 cc  51 73 3a 85 9e f7 59 fc  |.U.W.Q..Qs:...Y.|
00000040  c5 ca 6a 7a d9 82 9c 65  c5 45 92 e3 0d f3 ef 71  |..jz...e.E.....q|
00000050  ee dc d5 a2 e3 48 ad a3  bc 41 7a 3c aa d6 eb d0  |.....H...Az<....|
00000060  77 ea dc 5d 41 06 50 1c  49 0f 07 10 05 88 84 94  |w..]A.P.I.......|
00000070  02 fc 3c e3 60 25 c0 cb  0c b8 a9 73 bc 13 77 c6  |..<.`%.....s..w.|
00000080  e2 20 ed 17 7b 12 dc 24  5a df b4 21 9a cb 8f c7  |. ..{..$Z..!....|
00000090  58 54 11 a9 f1 47 82 9b  ba 60 b4 92 28 0e fb 8b  |XT...G...`..(...|
000000a0  1e 92 23 6a cf bf e5 45  b5 7e eb 81 f1 78 4b ad  |..#j...E.~...xK.|
000000b0  17 4d 81 9f bc 67 a7 56  ee b4 d9 e1 95 21 66 0c  |.M...g.V.....!f.|
000000c0  95 83 27 de ac 37 20 91  22 07 0b 91 86 94 1a 7b  |..'..7 ."......{|
000000d0  f6 4c b0 c0 e8 2e 49 65  d6 34 63 0c 88 9b 1c 48  |.L....Ie.4c....H|
000000e0  ca 2b 34 a9 6b 99 3b ee  13 3b 7c 93 0b f7 0d 49  |.+4.k.;..;|....I|
000000f0  69 18 57 be 3b 64 45 1d  92 63 7f e8 f9 a1 19 7b  |i.W.;dE..c.....{|
00000100  7b 6e d8 a3 90 23 82 f4  a7 ce c8 f8 90 15 b3 14  |{n...#..........|
00000110  f4 40 e7 02 78 d3 17 71  23 b1 19 ad 6b 49 ae 13  |.@..x..q#...kI..|
00000120  a4 75 38 51 47 89 67 b0  39 b4 53 86 a4 ac aa a3  |.u8QG.g.9.S.....|
00000130  34 89 ca 2e e9 c1 fe f2  51 c6 51 73 aa f7 9d 2d  |4.......Q.Qs...-|
00000140  ed d9 b7 4a b2 b2 61 e4  ef 98 f7 c5 ef 51 9b d8  |...J..a......Q..|
00000150  dc 60 6c 41 76 af 78 1a  62 b5 4c 1e 21 39 9a 5f  |.`lAv.x.b.L.!9._|
00000160  ac 9d e0 62 e8 e9 2f 2f  48 02 8d 53 c8 91 f2 1a  |...b..//H..S....|
00000170  d2 7c 0a 7c 48 bf da a9  e3 38 da 34 ce 76 a9 da  |.|.|H....8.4.v..|
00000180  15 91 de 21 f5 55 46 a8  21 9d 51 cc 18 42 44 81  |...!.UF.!.Q..BD.|
00000190  8c 94 b4 50 1e 20 42 82  98 c2 3b 10 48 ec a6 39  |...P. B...;.H..9|
000001a0  63 13 a7 01 94 40 ff 88  0f 98 07 4a 46 38 05 a9  |c....@.....JF8..|
000001b0  cb f6 c8 21 59 aa 38 45  bf 5c f8 55 9e 9f 04 ed  |...!Y.8E.\.U....|
000001c0  c8 03 42 2a 4b f6 78 7e  23 67 15 a2 79 29 f4 9b  |..B*K.x~#g..y)..|
000001d0  7e 00 bc 2f 46 96 99 ea  f1 ee 1c 6e 06 9c db e4  |~../F......n....|
000001e0  8c c2 05 f7 54 51 84 c0  33 02 01 b1 8c 80 dc 99  |....TQ..3.......|
000001f0  8f cb 46 ff d1 25 b5 b6  3a f3 25 be 85 50 84 f5  |..F..%..:.%..P..|
00000200  86 5a 71 f7 bd a1 4c 52  4f 20 a3 61 23 77 12 d3  |.Zq...LRO .a#w..|
00000210  b1 58 75 22 01 12 70 ec  14 91 f9 85 61 d5 7e 98  |.Xu"..p.....a.~.|
00000220  84 c9 76 84 bc b8 fe 4e  53 a5 06 82 14 95 51     |..v....NS.....Q|
0000022f

In general, a Zstandard file can contain multiple frames and frames can contain multiple blocks, but this relatively short romeo.txt.zst file has only one of each. The hierarchy has two tiers because frames are always independent, but within any one frame, later blocks can refer to previous blocks (or to the frame’s dictionary). These can be references to previously emitted bytes (or virtual history, for dictionaries) but also to previous tables.

Here’s the same 0x22f = 559 bytes, partitioned and labeled.

00000000  28 b5 2f fd 64 ae 02 0d  11 00 76 62 5e 23 30 6f  |[MN][F][B][-----|
00000010  9b 03 7d c7 16 0b be c8  f2 d0 22 4b 6b bc 54 5d  |----------------|
etc                                                         |-- BLOCK DATA --|
00000210  b1 58 75 22 01 12 70 ec  14 91 f9 85 61 d5 7e 98  |----------------|
00000220  84 c9 76 84 bc b8 fe 4e  53 a5 06 82 14 95 51     |----------][CS]|

We’ve read 10 bytes so far, followed by 545 bytes of BLOCK DATA, which takes us up to byte offset 0x22b = 555. After that, because the ContentChecksumFlag was set, the frame ends in a 4 byte xxHash64 checksum, labeled above as CS, at the end of the file.

Block Data

The BLOCK DATA can be decomposed further:

00000000  ++ ++ ++ ++ ++ ++ ++ ++  ++ ++ 76 62 5e 23 30 6f  |++++++++++[L]H[-|
00000010  9b 03 7d c7 16 0b be c8  f2 d0 22 4b 6b bc 54 5d  |- HUFFMAN CODE -|
00000020  a9 d4 93 ef c4 54 96 b2  e2 a8 a8 24 1c 54 40 29  |----------------|
00000030  01 55 00 57 00 51 00 cc  51 73 3a 85 9e f7 59 fc  |][JUMP][--------|
00000040  c5 ca 6a 7a d9 82 9c 65  c5 45 92 e3 0d f3 ef 71  |----------------|
00000050  ee dc d5 a2 e3 48 ad a3  bc 41 7a 3c aa d6 eb d0  | LSTREAM 1 DATA |
00000060  77 ea dc 5d 41 06 50 1c  49 0f 07 10 05 88 84 94  |----------------|
00000070  02 fc 3c e3 60 25 c0 cb  0c b8 a9 73 bc 13 77 c6  |----------------|
00000080  e2 20 ed 17 7b 12 dc 24  5a df b4 21 9a cb 8f c7  |-----------][---|
00000090  58 54 11 a9 f1 47 82 9b  ba 60 b4 92 28 0e fb 8b  |----------------|
000000a0  1e 92 23 6a cf bf e5 45  b5 7e eb 81 f1 78 4b ad  |----------------|
000000b0  17 4d 81 9f bc 67 a7 56  ee b4 d9 e1 95 21 66 0c  | LSTREAM 2 DATA |
000000c0  95 83 27 de ac 37 20 91  22 07 0b 91 86 94 1a 7b  |----------------|
000000d0  f6 4c b0 c0 e8 2e 49 65  d6 34 63 0c 88 9b 1c 48  |----------------|
000000e0  ca 2b 34 a9 6b 99 3b ee  13 3b 7c 93 0b f7 0d 49  |--][------------|
000000f0  69 18 57 be 3b 64 45 1d  92 63 7f e8 f9 a1 19 7b  |----------------|
00000100  7b 6e d8 a3 90 23 82 f4  a7 ce c8 f8 90 15 b3 14  | LSTREAM 3 DATA |
00000110  f4 40 e7 02 78 d3 17 71  23 b1 19 ad 6b 49 ae 13  |----------------|
00000120  a4 75 38 51 47 89 67 b0  39 b4 53 86 a4 ac aa a3  |----------------|
00000130  34 89 ca 2e e9 c1 fe f2  51 c6 51 73 aa f7 9d 2d  |---][-----------|
00000140  ed d9 b7 4a b2 b2 61 e4  ef 98 f7 c5 ef 51 9b d8  |----------------|
00000150  dc 60 6c 41 76 af 78 1a  62 b5 4c 1e 21 39 9a 5f  | LSTREAM 4 DATA |
00000160  ac 9d e0 62 e8 e9 2f 2f  48 02 8d 53 c8 91 f2 1a  |----------------|
00000170  d2 7c 0a 7c 48 bf da a9  e3 38 da 34 ce 76 a9 da  |----------------|
00000180  15 91 de 21 f5 55 46 a8  21 9d 51 cc 18 42 44 81  |-----]SS[-- LLT |
00000190  8c 94 b4 50 1e 20 42 82  98 c2 3b 10 48 ec a6 39  |----][CMOT][MLT]|
000001a0  63 13 a7 01 94 40 ff 88  0f 98 07 4a 46 38 05 a9  |[---------------|
000001b0  cb f6 c8 21 59 aa 38 45  bf 5c f8 55 9e 9f 04 ed  |----------------|
000001c0  c8 03 42 2a 4b f6 78 7e  23 67 15 a2 79 29 f4 9b  |----------------|
000001d0  7e 00 bc 2f 46 96 99 ea  f1 ee 1c 6e 06 9c db e4  |----------------|
000001e0  8c c2 05 f7 54 51 84 c0  33 02 01 b1 8c 80 dc 99  | SEQUENCES DATA |
000001f0  8f cb 46 ff d1 25 b5 b6  3a f3 25 be 85 50 84 f5  |----------------|
00000200  86 5a 71 f7 bd a1 4c 52  4f 20 a3 61 23 77 12 d3  |----------------|
00000210  b1 58 75 22 01 12 70 ec  14 91 f9 85 61 d5 7e 98  |----------------|
00000220  84 c9 76 84 bc b8 fe 4e  53 a5 06 ++ ++ ++ ++     |----------]++++|

Huffman Code

Since the LiteralsBlockType was CompressedLiterals, the HUFFMAN CODE bytes follow a similar structure to that after the SequencesSectionHeader, above, except that it’s only one FSE table instead of three:

We’ll go into more detail in Part 4: Huffman Codes.

00000000  ++ ++ ++ ++ ++ ++ ++ ++  ++ ++ ++ ++ ++ ++ 30 6f  |++++++++++++++[T|
00000010  9b 03 7d c7 16 0b be c8  f2 d0 22 4b 6b bc 54 5d  | ][-------------|
00000020  a9 d4 93 ef c4 54 96 b2  e2 a8 a8 24 1c 54 40 29  |- HUFFMAN DATA -|
00000030  01 ++ ++ ++ ++ ++ ++ ++  ++ ++ ++ ++ ++ ++ ++ ++  |]+++++++++++++++|

Next: Part 3: Bitstreams.


Published: 2022-05-12