Processing & substreams

Real binary formats rarely hand you a field you can read straight off the wire. Bytes may be compressed, lightly obfuscated, or stored in a region whose length you only learn at runtime. Kaitai Struct handles all three declaratively:

  • process runs a byte-to-byte transformation (decompression, de-obfuscation) on a field's raw bytes before they are parsed.
  • A substream is a smaller, bounded stream carved out of the current one, created automatically when you combine size with a type.
  • _io is the stream object attached to every parsed object, which lets you read its position, detect end-of-stream, and (with io:) jump to another stream entirely.

All .ksy snippets below are written in YAML and compiled with ksc (kaitai-struct-compiler) into a parser for your target language.

The process key

process plugs in a predefined algorithm that "takes a stream of bytes and returns another stream of bytes". The compiler reads the field's raw bytes, runs them through the named transformation, and then parses the result. As the user guide puts it, the incoming data "should be pre-processed before actual parsing takes place, or we'll just end up with garbage getting parsed".

process is available only on raw byte arrays or user types — i.e. fields that have a size (or size-eos), not on plain integer/float primitives.

The runtime libraries ship a standard set of transformations:

process value | Purpose                              | Argument(s)
zlib          | Decompress zlib-compressed data      | none
xor(key)      | XOR every byte with a key            | a single byte, or a byte array
rol(n)        | Circular bit rotate left by n bits   | bit count n
ror(n)        | Circular bit rotate right by n bits  | bit count n

note

process describes a byte-level transformation that happens before parsing. It is not a hook for arbitrary code in the .ksy itself — the algorithm names above are the built-in set. The expression language is allowed only inside the argument (for example, to reference a key you read earlier), not to invent a new algorithm.

XOR de-obfuscation

XOR is the simplest case: each byte of the raw data is XORed with the key. The key can be a literal, or an expression referencing another field.

seq:
  - id: body_len
    type: u4
  - id: body
    size: body_len
    process: xor(0xaa)
    type: some_body_type

Because the argument is an expression, you can XOR against a key that was read earlier in the same structure:

seq:
  - id: key
    type: u1
  - id: payload
    size: 16
    process: xor(key)

Obfuscation, not encryption

XOR with a fixed (or trivially derivable) key is obfuscation. It hides nothing from an attacker and provides no confidentiality — it only stops naive tools from reading the bytes directly. Treat process: xor(...) as a format quirk to be undone, never as a security measure.
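To make the transformation concrete, here is a minimal Python sketch of what an XOR step does to the raw bytes. It is not the Kaitai runtime's code: the function name xor_process is hypothetical, and repeating a multi-byte key cyclically over the data is an assumption made for this illustration.

```python
def xor_process(raw: bytes, key) -> bytes:
    """XOR every byte of raw with key (an int 0..255 or a non-empty bytes key)."""
    if isinstance(key, int):
        return bytes(b ^ key for b in raw)
    # Assumption for this sketch: a multi-byte key repeats cyclically.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(raw))


obfuscated = xor_process(b"hello", 0xAA)
restored = xor_process(obfuscated, 0xAA)  # XOR is its own inverse
print(obfuscated.hex(), restored)
```

Applying the same key twice returns the original bytes, which is why a single xor(...) step is enough to undo the obfuscation.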

zlib decompression

process: zlib runs the raw bytes through a zlib (DEFLATE) decompressor and parses the decompressed output.

seq:
  - id: len_compressed
    type: u4
  - id: data
    size: len_compressed
    process: zlib
    type: payload

Compression specifics

zlib here means the zlib container format (RFC 1950) wrapping DEFLATE (RFC 1951) — the same envelope used inside PNG IDAT chunks and many archive formats. It is not raw DEFLATE, gzip (RFC 1952), or any other codec. If a format stores gzip or a different compression scheme, model that structure explicitly rather than reaching for process: zlib. Decompression is provided by each runtime's underlying zlib binding, so behaviour follows the standard library, not Kaitai.
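The difference between the zlib container and gzip can be checked with Python's standard library. This is only an illustration of the framing, using Python's own zlib and gzip modules rather than anything Kaitai-specific; the sample payload is made up.

```python
import gzip
import zlib

original = b"kaitai " * 20  # made-up sample payload
compressed = zlib.compress(original)

# In the RFC 1950 header, the low nibble of the first byte is the
# compression method; 8 means DEFLATE.
method = compressed[0] & 0x0F
assert zlib.decompress(compressed) == original

# gzip wraps DEFLATE in different framing (magic bytes 1f 8b), so a plain
# zlib decompressor rejects it.
gz = gzip.compress(original)
try:
    zlib.decompress(gz)
    gzip_accepted = True
except zlib.error:
    gzip_accepted = False

print(method, gz[:2].hex(), gzip_accepted)
```

If a file's compressed region starts with 1f 8b, that is a sign the data is gzip rather than a bare zlib stream.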

rol / ror bit rotation

rol(n) and ror(n) apply a circular bit rotation to each byte — rol rotates left, ror rotates right, by n bits. They are exact inverses of each other: rol(n) undoes ror(n) for the same byte width.

seq:
  - id: scrambled
    size: 32
    process: rol(3)
    type: some_body_type

tip

Like XOR, rol/ror is light obfuscation, not cryptography. Reach for it when a format author has rotated bits to make data look opaque, then let Kaitai rotate them back.
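A per-byte circular rotation can be sketched in a few lines of Python. The helper names rol8 and ror8 are hypothetical, and the sketch assumes each byte is rotated independently within 8 bits, as described above.

```python
def rol8(raw: bytes, n: int) -> bytes:
    """Rotate every byte left by n bits, wrapping within 8 bits."""
    n %= 8
    return bytes(((b << n) | (b >> (8 - n))) & 0xFF for b in raw)


def ror8(raw: bytes, n: int) -> bytes:
    """Rotate every byte right by n bits (a left rotation by 8 - n)."""
    return rol8(raw, 8 - (n % 8))


data = bytes([0x81, 0x0F, 0xFF, 0x00])
scrambled = ror8(data, 3)        # what a format author might have stored
recovered = rol8(scrambled, 3)   # what a rol(3) step would undo
print(scrambled.hex(), recovered.hex())
```

Because no bits are discarded, the round trip is lossless for any n.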

Substreams via size + type

When you give a field both a size and a user type, Kaitai does not parse the type against the whole remaining stream. Instead it:

  1. reads exactly size bytes from the current stream into a raw byte array, and
  2. wraps those bytes in a brand-new, bounded stream (a substream) and parses the type inside it.

seq:
  - id: joe
    type: person
    size: 20

Conceptually, the generated parser does something like this (Java shown for illustration; the same pattern appears in every target language):

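// Read exactly 20 bytes from the current stream and keep them as _raw_joe.
this._raw_joe = this._io.readBytes(20);
// Wrap those bytes in a new, bounded stream (the substream).
KaitaiStream _io__raw_joe = new ByteBufferKaitaiStream(this._raw_joe);
// Parse Person against the substream only.
this.joe = new Person(_io__raw_joe, this, _root);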

Two consequences fall out of this:

  • The raw bytes are also kept on an auto-generated _raw_<id> field (here _raw_joe), which is exactly what process operates on before the substream is built.
  • The substream is a hard boundary. If person tries to read past 20 bytes, it hits end-of-stream and the parse fails. If it reads fewer, the leftover bytes inside the substream are simply ignored — the outer stream still advances by the full size.

This is what makes length-prefixed records safe to parse: a malformed inner record can never run off into the bytes of the next record.
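The boundary behaviour can be mimicked with plain io.BytesIO objects. This is a sketch of the idea, not generated Kaitai code; the 5-byte name and the trailing NEXT marker are made-up sample data.

```python
import io

# 20 bytes belonging to "joe", followed by the start of the next record.
outer = io.BytesIO(b"ABCDE" + b"\x00" * 15 + b"NEXT")

raw_joe = outer.read(20)        # step 1: read exactly `size` bytes
joe_io = io.BytesIO(raw_joe)    # step 2: wrap them in a bounded substream

name = joe_io.read(5)           # the inner type consumes only 5 bytes
leftover = len(raw_joe) - joe_io.tell()

outer_pos = outer.tell()        # outer stream advanced by the full size
next_record = outer.read(4)
print(name, leftover, outer_pos, next_record)
```

One difference to keep in mind: BytesIO silently returns fewer bytes when asked to read past its end, whereas the text above describes a Kaitai substream failing the parse in that situation.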

size-eos: true

Use size-eos: true instead of a numeric size to consume everything left in the current stream. It is most useful inside a substream, where "the end of the stream" means the end of that bounded region, not the end of the file.

seq:
  - id: len_body
    type: u4
  - id: body
    type: record_body
    size: len_body
types:
  record_body:
    seq:
      - id: tag
        type: u1
      - id: comment
        type: str
        encoding: UTF-8
        size-eos: true

Here comment grabs all remaining bytes of the len_body-sized substream — exactly the trailing region, with no separate length field needed.
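A hand-written equivalent of that snippet might look like the following Python sketch. The snippet does not state a byte order, so little-endian is assumed here, and the function name parse_record and the sample bytes are hypothetical.

```python
import io
import struct


def parse_record(stream):
    # Assumption: len_body is a little-endian u4.
    (len_body,) = struct.unpack("<I", stream.read(4))
    body_io = io.BytesIO(stream.read(len_body))   # bounded substream
    tag = body_io.read(1)[0]
    # size-eos: true -> everything left in the substream, not in the file.
    comment = body_io.read().decode("utf-8")
    return tag, comment


data = struct.pack("<I", 6) + b"\x07hello" + b"TRAILING"
stream = io.BytesIO(data)
tag, comment = parse_record(stream)
rest = stream.read()   # bytes after the record are untouched
print(tag, comment, rest)
```

The comment stops at the substream boundary even though more bytes follow in the outer stream.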

The _io object

Every parsed object carries an _io attribute: the stream it was read from. You can reference it in expressions to ask about position and end-of-stream.

_io member | Returns | Meaning
_io.pos    | integer | current byte offset within this stream
_io.eof    | boolean | true once the stream is fully consumed
_io.size   | integer | total size of this stream in bytes

Because a substream is its own stream, _io.pos and pos: are relative to the substream's start, not to the whole file:

types:
  block:
    instances:
      some_bytes_in_the_middle:
        pos: 30
        size: 16

Inside a block substream this reads bytes 30–45 of the block, regardless of where the block sits in the file.
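The following Python sketch illustrates why the offset is relative. The block's placement at file offset 100 and its 64-byte size are made-up values for the example.

```python
import io

file_bytes = bytes(range(256))
block_start, block_size = 100, 64   # hypothetical placement of the block

# The substream only knows about its own bytes, so offsets start at 0.
block_io = io.BytesIO(file_bytes[block_start:block_start + block_size])
block_io.seek(30)                   # pos: 30
some_bytes = block_io.read(16)      # size: 16

stream_size = len(block_io.getvalue())   # comparable to _io.size
stream_pos = block_io.tell()             # comparable to _io.pos
print(some_bytes[0], stream_pos, stream_size)
```

The first byte read is the one at file offset 130 (100 + 30), even though the seek target was 30.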

Escaping the substream with io:

Sometimes a record stores an absolute offset that points outside its own substream — for example, a directory entry that points at a body elsewhere in the file. The io: key chooses which stream an instance reads from, letting you address the root stream absolutely.

types:
  file_entry:
    seq:
      - id: file_name
        type: strz
      - id: ofs_body
        type: u4
      - id: len_body
        type: u4
    instances:
      body:
        io: _root._io
        pos: ofs_body
        size: len_body

info

The user guide is explicit about why io: matters here: without io: _root._io, body "would have been parsed inside a [...] substream (and most likely that would result in an exception)", because ofs_body is an offset into the whole file, not into the small entry substream. _root._io is the stream of the top-level object; you can also reach intermediate streams via _parent._io.
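The effect of choosing the stream can be shown with two BytesIO objects, one for the whole file and one for the entry. This is an illustrative sketch: the file layout, the little-endian byte order, and the helper read_strz are assumptions made for the example.

```python
import io
import struct


def read_strz(stream) -> str:
    """Read bytes up to (and consuming) a zero terminator."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if b in (b"", b"\x00"):
            return out.decode("ascii")
        out += b


body = b"BODYDATA"
entry = b"a.txt\x00" + struct.pack("<II", 32, len(body))  # name, ofs, len
file_bytes = entry.ljust(32, b"\x00") + body              # body at offset 32

root_io = io.BytesIO(file_bytes)
entry_io = io.BytesIO(root_io.read(len(entry)))           # entry substream

file_name = read_strz(entry_io)
ofs_body, len_body = struct.unpack("<II", entry_io.read(8))

# Seeking inside the entry substream finds nothing at offset 32 ...
entry_io.seek(ofs_body)
body_from_entry = entry_io.read(len_body)

# ... while the root stream holds the real body at that absolute offset.
root_io.seek(ofs_body)
body_from_root = root_io.read(len_body)
print(file_name, body_from_entry, body_from_root)
```

Here the substream read returns an empty result because BytesIO tolerates seeking past the end; a stricter stream implementation would report an error instead, which matches the exception the user guide warns about.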

Putting it together (illustrative example)

Illustrative only. The format below is invented to show process, substreams, and _io cooperating in one place. It is not a real-world specification — see the formats gallery for those.

meta:
  id: blob_container
  endian: le
seq:
  - id: xor_key
    type: u1
  - id: len_packed
    type: u4
  - id: packed
    # raw bytes are XOR-deobfuscated first; the result is then
    # parsed as `record` inside a substream
    size: len_packed
    process: xor(xor_key)
    type: record
types:
  record:
    seq:
      - id: count
        type: u2
      - id: comment
        type: str
        encoding: UTF-8
        size-eos: true # rest of THIS substream

The order of operations is: read len_packed raw bytes → apply process: xor(xor_key) → wrap the transformed bytes in a substream → parse record inside it. size-eos: true then bounds comment to whatever remains of that substream.
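The same order of operations can be written out by hand in Python. This is a sketch of the invented blob_container format only, not code generated by ksc; the function name parse_blob_container and the sample values are hypothetical.

```python
import io
import struct


def parse_blob_container(data: bytes):
    root = io.BytesIO(data)
    xor_key = root.read(1)[0]                           # xor_key: u1
    (len_packed,) = struct.unpack("<I", root.read(4))   # len_packed: u4 (le)
    raw_packed = root.read(len_packed)                  # 1. read raw bytes
    decoded = bytes(b ^ xor_key for b in raw_packed)    # 2. process: xor
    sub = io.BytesIO(decoded)                           # 3. build substream
    (count,) = struct.unpack("<H", sub.read(2))         # 4. parse record
    comment = sub.read().decode("utf-8")                #    size-eos: true
    return count, comment


# Build a sample blob, then parse it back.
key = 0x5A
record = struct.pack("<H", 3) + b"hello"
blob = bytes([key]) + struct.pack("<I", len(record)) + bytes(b ^ key for b in record)
count, comment = parse_blob_container(blob + b"IGNORED")
print(count, comment)
```

The trailing IGNORED bytes sit outside the len_packed region, so they never reach the record parser.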

note

To chain XOR and zlib you would model the layers explicitly (for example, an XORed outer field whose decoded bytes are a zlib-compressed inner field), since a single field carries one process. Keeping each transformation on its own field also makes the .ksy easier to read and debug.

Sources