Nigel Tao

XZ/LZMA Worked Example Part 5: XZ

This blog post is one of a five part series.

LZMA

After digesting all of the previous posts, we’re now ready to do a full “worked example” of a real LZMA file, compressing romeo.txt.

$ lzma --keep --compress romeo.txt

$ file romeo.txt.lzma
romeo.txt.lzma: LZMA compressed data, streamed

$ hd romeo.txt.lzma
00000000  5d 00 00 80 00 ff ff ff  ff ff ff ff ff 00 29 1b  |].............).|
00000010  c9 a6 6a 3f 39 3c 50 94  51 0f 22 ad 44 59 e8 14  |..j?9<P.Q.".DY..|
00000020  fe c8 f9 9b 35 c3 10 4a  dd 3b ae 3a b0 5d a7 92  |....5..J.;.:.]..|
00000030  11 18 4c 21 d6 9f bb 93  12 e2 09 eb cf e9 9e a9  |..L!............|
00000040  30 b9 6d f1 9e fa d2 ad  33 dd e3 c4 2e ee fb 74  |0.m.....3......t|
etc.
00000210  6f 86 0a 93 9d 3b 7e b1  0a de f6 27 4d b5 9e 9e  |o....;~....'M...|
00000220  7c c2 aa 2b 40 60 ae 82  82 a1 c4 4e 3a dd c8 c1  ||..+@`.....N:...|
00000230  ed 56 da 05 13 4a 0f 68  06 7c 01 16 01 b1 42 dd  |.V...J.h.|....B.|
00000240  43 8e 8e 0a 89 94 8f 98  f7 d5 63 f4 ea bd 04 63  |C.........c....c|
00000250  33 28 fb a1 17 9a                                 |3(....|
00000256

The first byte encodes the (lc=3, lp=0, pb=2) triple: 0x5D is 93 is ((((2*5)+0)*9)+3). The next four bytes is the dictionary size (the maximum distance), little-endian: 0x0080_0000. The next eight bytes are the decoded length, the size in bytes of the uncompressed data. All 0xFF means unknown (at this time), so that the EOS (End Of Stream) marker is mandatory (and the “file” command-line tool reports “streamed”). If the decoded length was not all 0xFF then the EOS is optional. It’s valid to have both an explicit decoded length and an EOS (and, if both present, the two should agree).

After that, the remaining 585 bytes, from offset 0x00D to 0x256 is… the “treasure map”. A very precise range, whose lower bound is a number between zero and one. Here, it’s the base-256 range «00_29_1B_C9_A6_6A_..._D5_63_F4_EA_BD_04_63_33_28_FB_A1_17_9A». I could show you the bym stream that’s derived from this treasure map, similar to how I previously patched litonlylzma.go, but that’s not actually very interesting or educational.

That’s it! That’s the entire LZMA file format (also known as LZMA1, we’ll get to LZMA2 later, below). There’s no trailing bytes after the “treasure map”. There’s no “magic signature bytes” at the start, either, but the "5D eleven_bytes 00|FF 00" pattern from LZMA’s default configuration is often good enough, e.g. for the “file” command-line tool.

XZ

The XZ file format is a little more complicated. One structural difference is that XZ can break its source data into multiple independently-compressed chunks. There’s a trailer at the end of an XZ file that indexes those chunks. Independence leads to a slightly worse compression ratio but the chunks can be decoded in parallel, for significantly faster decompression (in terms of wall clock time).

The index enables faster random access. Similar to I-frames versus P-frames in video (and scrubbing around during video playback), getting just the millionth decompressed byte of a multi-chunk, indexed XZ file doesn’t require decompressing all million prior bytes. You can just start from a nearby I-frame-equivalent.

In theory, XZ is also a general-purpose container, combining one of many base compression algorithms with zero or more post-processing filters (or pre-processing, if you’re encoding instead of decoding). In practice, though, “one of many base compression algorithms” is just “you can have any algorithm you like, as long as it’s LZMA2”.

The filters try to improve how repetitive the encoder-input data is, since repetition compresses very well. There’s a Delta(N) encoder, which modifies every input byte by subtracting its from-N-bytes-ago byte.

The other filters are all BCJ(arch) filters: Branch / Call / Jump filters for specific CPU architectures like x86, SPARC and RISC-V. When a compiler (C, C++, Go, Rust, etc.) sees a while loop with multiple break statements, they all break to the same line of code but, at the machine code level, BCJ opcodes usually take a relative address. A BCJ(arch) filter just detects BCJ ops in that arch’s machine code and re-writes those (different) relative addresses as (repeated) absolute addresses. This is fiddly minutia but, again, presumably the gain in compression ratio for certain workloads was worth the extra complexity.

Some tangential trivia: the BCJ(RISC-V) filter was only added to xz very recently (January 2024: code, test files), by the now-infamous Jia Tan. I don’t think those tests/files/good-1-riscv-*.xz test files are malicious, but they were still rolled back, out of precaution.

XZ File

Without further ado, here’s a complete XZ file:

$ xz --keep --compress romeo.txt

$ file romeo.txt.xz
romeo.txt.xz: XZ compressed data, checksum CRC64

$ hd romeo.txt.xz
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
00000010  16 00 00 00 74 2f e5 a3  e0 03 ad 02 43 5d 00 29  |....t/......C].)|
00000020  1b c9 a6 6a 3f 39 3c 50  94 51 0f 22 ad 44 59 e8  |...j?9<P.Q.".DY.|
00000030  14 fe c8 f9 9b 35 c3 10  4a dd 3b ae 3a b0 5d a7  |.....5..J.;.:.].|
00000040  92 11 18 4c 21 d6 9f bb  93 12 e2 09 eb cf e9 9e  |...L!...........|
etc.
00000240  c1 ed 56 da 05 13 4a 0f  68 06 7c 01 16 01 b1 42  |..V...J.h.|....B|
00000250  dd 43 8e 8e 0a 89 94 8f  98 f7 d5 63 f4 ea b3 33  |.C.........c...3|
00000260  51 57 00 00 88 6a 00 2d  c4 61 fd 37 00 01 df 04  |QW...j.-.a.7....|
00000270  ae 07 00 00 54 a4 46 7d  b1 c4 67 fb 02 00 00 00  |....T.F}..g.....|
00000280  00 04 59 5a                                       |..YZ|
00000284

The first six bytes, "FD 37 7A 58 5A 00", is XZ’s “magic signature”. The next two bytes are flags, the "00 04" means to use the 8-byte CRC-64/ECMA checksum (instead of 4-byte CRC-32/IEEE, 32-byte SHA-256 or no checksum at all). The next four bytes are a CRC-32/IEEE (yes, 32) checksum of the previous two bytes (the flags).

The next eight bytes are a block header "02 00 21 01 16 00 00 00", which is padded and aligned to 4-byte boundaries (double-words). The "02" is the number of double-words. The "00" is block flags, meaning no post-processing filters (and only one “base compression” filter) and no overall compressed size or overall decompressed size is recorded. "21 01" means that that base compression filter is LZMA2 and its properties occupy one byte. "16" is that byte, meaning a dictionary size of 0x80_0000 (see section “5.3.1. LZMA2” of the XZ spec for the formula). Three NUL bytes pad to a double-word boundary. After that comes a 4-byte CRC-32/IEEE checksum of that block header.

We’re now decoding some LZMA2 data, which generally consists of multiple chunks. In our romeo.txt.xz specific case, there are only two chunks, since the original romeo.txt input was small. An interesting chunk starts at byte offset 0x018 and a trivial chunk starts at byte offset 0x262. The trivial chunk is only one byte long, a single NUL byte, meaning “no more chunks”.

Our interesting chunk at byte offset 0x018 starts with a six byte chunk header: "E0 03 AD 02 43 5D". The "E0" byte basically means that this is an ‘I-frame’ chunk. Combining its low five bits with the next two bytes means that the chunk’s decompressed length is (1 + 0x00_03AD), which is 942, which matches the byte size of the original romeo.txt file. The next two bytes means that the chunk’s compressed length (excluding the chunk header) is (1 + 0x0243). Adding 0x018 + 6 to that is how we know that the next (trivial) chunk starts at byte offset 0x262. The 0x5D byte encodes the (lc=3, lp=0, pb=2) triple just like the opening 0x5D byte of romeo.txt.lzma, above.

After that LZMA2 chunk header comes 0x244 = 580 bytes of treasure map data: «00_29_1B_C9_A6_6A_..._D5_63_F4_EA_B3_33_51_57». These 580 bytes that make up the bulk of romeo.txt.xz is almost the same as the 585 bytes that make up the bulk of romeo.txt.lzma: «00_29_1B_C9_A6_6A_..._D5_63_F4_EA_BD_04_63_33_28_FB_A1_17_9A». They’re slightly different (and the xz one is 5 bytes shorter) because the LZMA2 chunk header contains an explicit decompressed length and so its treasure map data does not need (or have) an EOS marker.

After that treasure map comes that trivial “no more chunks” chunk, and then another NUL byte of padding to get to double-word alignment. Then comes 8 bytes of a CRC-64/ECMA (yes, 64) checksum; a checksum of the chunk’s decompressed data (in contrast, the CRC-32/IEEE checksums run over the compressed file’s XZ metadata).

XZ Trailer

What remains after that is the XZ trailer, including the index, which we’ll tackle back-to-front.

00000260  ++ ++ ++ ++ ++ ++ ++ ++  ++ ++ ++ ++ 00 01 df 04  |++++++++++++....|
00000270  ae 07 00 00 54 a4 46 7d  b1 c4 67 fb 02 00 00 00  |....T.F}..g.....|
00000280  00 04 59 5a                                       |..YZ|

It ends with "59 5A", which is another XZ magic signature (but at the end of the file, not the beginning). Prior to that is "00 04", flags which must match the "00 04" flags near the start, at byte offset 0x006. Prior to that is the little-endian uint32_t value 0x0000_0002, which is one less than the size of the index (measured in double-words). Prior to that is four bytes of the CRC-32/IEEE checksum of those "02 00 00 00 00 04" bytes. Masking out those final 12 bytes leaves us with the index (in this case, also coincidentally 12 bytes, 3 double-words):

00000260  ++ ++ ++ ++ ++ ++ ++ ++  ++ ++ ++ ++ 00 01 df 04  |++++++++++++....|
00000270  ae 07 00 00 54 a4 46 7d  ++ ++ ++ ++ ++ ++ ++ ++  |....T.F}++++++++|

The opening NUL byte means that this is the index (as opposed to a block header’s “size of the block header in double-words” opening byte, which must be positive). An "01" byte is next. Our original romeo.txt file was short enough that we only need one block (one index record).

The "DF 04 AE 07" bytes hold two varints (variable width integers) for the block’s compressed size (excluding padding) and uncompressed size: ((0x04 << 7) | (0xDF & 0x7F)) is 0x25F is 607, ((0x07 << 7) | (0xAE & 0x7F)) is 0x3AE is 942. The 942 matches the length of the original romeo.txt. The 607 matches the length of the block from offset 0x00C to 0x26C, minus the one byte of padding at offset 0x263. No idea why we subtract that NUL padding byte (in the middle of the block) out of the compressed length, but it’s part of the XZ file format, now and forever.

After the "DF 04 AE 07" bytes comes some more NUL padding (to double-word alignment) and then a four byte CRC-32/IEEE checksum over the entire index (including the NUL padding but excluding that final checksum).

That’s it (again)! A breakdown of a complete XZ file, the vast bulk of which is a very precise range (expressed in base-256 digits).

Studying Code

That wraps up deconstructing an XZ or LZMA file. If you want specifications, here’s the XZ spec and the LZMA spec.

If you want to look at some real code, the lzma-sdk repo has a reference LZMA implementation. It reads from and writes to a C/C++ FILE * as a “one-shot” API and does not support resumable streaming I/O.

If you want the richer XZ container format, not just LZMA, and you want to study a C implementation, try the Linux kernel’s lib/xz/xz_dec_lzma2.c, also known as xz-embedded’s linux/lib/xz/xz_dec_lzma2.c.

There’s also the xz repo itself, and lzma-sdk, but both implementations are a bit macro heavy:

xz’s src/liblzma/lzma/lzma_decoder.c pulls in src/liblzma/rangecoder/range_decoder.h which has 47 #define lines.
lzma-sdk’s C/LzmaDec.c has 84 #define lines.

Memory-Safe XZ/LZMA Implementations

While the xz backdoor wasn’t about a memory-safety bug per se, the xz code still has comments saying that it knowingly violates the C standard. It also uses x86 assembly by default, in macros, which can be daunting to audit for memory-safety. LZMA-SDK has its own x86 assembly.

There’s also Wuffs’ std/lzma implementation. It’s more repetitive than the C implementations linked to above, since the Wuffs programming language doesn’t have macros (or an __always_inline attribute). One of Wuffs’ goals is proof of memory-safety (e.g. no buffer overflows) at compile time, and static analysis is easier the simpler the programming language is. With less power comes easier proof of safety. That’s not free, of course. The trade-off for not having macros is writing longer programs, manually inlining at the inner loops’ “call sites”.

Wuffs’ compiler generates C code (which is also checked into the repo), not object code. You can just fling that C code at gcc. Or existing C/C++ projects can use Wuffs’s XZ/LZMA implementation like any other C library (without needing any new toolchains in their build systems). It’s just not hand-written C. And it doesn’t use autotools.

$ wget --quiet https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.8.2.tar.xz

$ git clone --quiet --depth=1 https://github.com/google/wuffs.git

$ gcc -O3 wuffs/example/mzcat/mzcat.c -o my-mzcat

$ # my-mzcat and /usr/bin/xz agree on the decoding.
$ ./my-mzcat     < linux-6.8.2.tar.xz | sha256sum
d53c712611ea6cb5acaf6627a84d5226692ae90ce41ee599fcc3203e7f8aa359  -
$ /usr/bin/xz -d < linux-6.8.2.tar.xz | sha256sum
d53c712611ea6cb5acaf6627a84d5226692ae90ce41ee599fcc3203e7f8aa359  -

$ # Performance is roughly similar.
$ time ./my-mzcat     < linux-6.8.2.tar.xz > /dev/null
real	0m7.682s
etc.
$ time /usr/bin/xz -d < linux-6.8.2.tar.xz > /dev/null
real	0m7.845s
etc.

Wuffs’ example/mzcat program is like bzcat, xzcat or zcat, but speaks multiple compression formats. For additional defence in depth, on Linux, the very first thing that its main function does is to self-impose a SECCOMP_MODE_STRICT sandbox.

For other memory-safe languages, there’s github.com/ulikunitz/xz in Go. There’s undoubtedly XZ-the-file-format implementations in Java, Rust, etc. too. I’m just not as familiar with them.

Other Compression Formats

If you liked this breakdown of an actual XZ/LZMA file, I’ve also written deconstructions of bzip2, deflate, lzw and zstd.

Published: 2024-04-18