Processing & substreams
Real binary formats rarely hand you a field you can read straight off the wire. Bytes may be compressed, lightly obfuscated, or stored in a region whose length you only learn at runtime. Kaitai Struct handles all three declaratively:
- process runs a byte-to-byte transformation (decompression, de-obfuscation) on a field's raw bytes before they are parsed.
- A substream is a smaller, bounded stream carved out of the current one, created automatically when you combine size with a type.
- _io is the stream object attached to every parsed object, which lets you read its position, detect end-of-stream, and (with io:) jump to another stream entirely.
All .ksy snippets below are written in YAML and compiled with ksc
(kaitai-struct-compiler) into a parser for your target language.
The process key
process plugs in a predefined algorithm that "takes a stream of bytes and
returns another stream of bytes". The compiler reads the field's raw bytes, runs
them through the named transformation, and then parses the result. As the user
guide puts it, the incoming data "should be pre-processed before actual parsing
takes place, or we'll just end up with garbage getting parsed".
process is available only on raw byte arrays or user types — i.e. fields
that have a size (or size-eos), not on plain integer/float primitives.
The runtime libraries ship a standard set of transformations:
| process value | Purpose | Argument(s) |
|---|---|---|
| zlib | Decompress zlib-compressed data | none |
| xor(key) | XOR every byte with a key | a single byte, or a byte array |
| rol(n) | Circular bit rotate left by n bits | bit count n |
| ror(n) | Circular bit rotate right by n bits | bit count n |
process describes a byte-level transformation that happens before parsing. It
is not a hook for arbitrary code in the .ksy itself — the algorithm names above
are the built-in set. The expression language is allowed only inside the
argument (for example, to reference a key you read earlier), not to invent a new
algorithm.
XOR de-obfuscation
XOR is the simplest case: each byte of the raw data is XORed with the key. The key can be a literal, or an expression referencing another field.
seq:
  - id: body_len
    type: u4
  - id: body
    size: body_len
    process: xor(0xaa)
    type: some_body_type
Because the argument is an expression, you can XOR against a key that was read earlier in the same structure:
seq:
  - id: key
    type: u1
  - id: payload
    size: 16
    process: xor(key)
XOR with a fixed (or trivially derivable) key is obfuscation. It hides nothing
from an attacker and provides no confidentiality — it only stops naive tools from
reading the bytes directly. Treat process: xor(...) as a format quirk to be
undone, never as a security measure.
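As a rough illustration of what process: xor(key) undoes, the sketch below XORs a byte string with either a single-byte or a multi-byte key in plain Python. The name xor_bytes is hypothetical and this is not the Kaitai runtime's code; the multi-byte branch assumes the key repeats cyclically over the data.

```python
def xor_bytes(data: bytes, key) -> bytes:
    """XOR every byte of data with key (an int 0..255 or a non-empty bytes key)."""
    if isinstance(key, int):
        key = bytes([key])
    # Assumption: a multi-byte key repeats cyclically over the data.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


obfuscated = xor_bytes(b"hello", 0xAA)
# XOR is its own inverse: applying the same key again restores the input.
restored = xor_bytes(obfuscated, 0xAA)
```

Because the same call both scrambles and unscrambles, anyone who can guess or read the key recovers the data, which is why this counts as obfuscation only.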
zlib decompression
process: zlib runs the raw bytes through a zlib (DEFLATE) decompressor and
parses the decompressed output.
seq:
  - id: len_compressed
    type: u4
  - id: data
    size: len_compressed
    process: zlib
    type: payload
zlib here means the zlib container format (RFC 1950) wrapping DEFLATE (RFC
1951) — the same envelope used inside PNG IDAT chunks and many archive formats.
It is not raw DEFLATE, gzip (RFC 1952), or any other codec. If a format stores
gzip or a different compression scheme, model that structure explicitly rather
than reaching for process: zlib. Decompression is provided by each runtime's
underlying zlib binding, so behaviour follows the standard library, not Kaitai.
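The difference between the zlib container and gzip can be checked with Python's standard zlib module. The sketch below builds a length-prefixed field shaped like the one above, decodes it by hand, and then shows that a gzip member is rejected by the default zlib decoder. Little-endian byte order and the sample payload are assumptions made for the example.

```python
import gzip
import struct
import zlib

payload = b"kaitai " * 8

# Build a length-prefixed field: u4 length, then zlib-compressed data.
# Assumption: little-endian; the snippet above does not state an endianness.
compressed = zlib.compress(payload)
wire = struct.pack("<I", len(compressed)) + compressed

# Parse side: read the length, slice exactly that many bytes, decompress.
(len_compressed,) = struct.unpack_from("<I", wire, 0)
raw = wire[4:4 + len_compressed]
data = zlib.decompress(raw)

# A gzip member uses a different header, so the default decoder refuses it.
try:
    zlib.decompress(gzip.compress(payload))
    gzip_accepted = True
except zlib.error:
    gzip_accepted = False
```

If a quick experiment like this fails on real data with a header error, that is a hint the format stores something other than an RFC 1950 stream.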
rol / ror bit rotation
rol(n) and ror(n) apply a circular bit rotation to each byte — rol rotates
left, ror rotates right, by n bits. They are exact inverses of each other:
rol(n) undoes ror(n) for the same byte width.
seq:
  - id: scrambled
    size: 32
    process: rol(3)
    type: some_body_type
Like XOR, rol/ror is light obfuscation, not cryptography. Reach for it when a
format author has rotated bits to make data look opaque, then let Kaitai rotate
them back.
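A per-byte circular rotation is small enough to write out by hand. The following Python sketch assumes an 8-bit rotation width, matching the per-byte description above; the helper names rol8, ror8, rotate_left and rotate_right are hypothetical.

```python
def rol8(b: int, n: int) -> int:
    """Rotate one byte left by n bits, wrapping the high bits around."""
    n %= 8
    return ((b << n) | (b >> (8 - n))) & 0xFF


def ror8(b: int, n: int) -> int:
    """Rotate one byte right by n bits, wrapping the low bits around."""
    n %= 8
    return ((b >> n) | (b << (8 - n))) & 0xFF


def rotate_left(data: bytes, n: int) -> bytes:
    return bytes(rol8(b, n) for b in data)


def rotate_right(data: bytes, n: int) -> bytes:
    return bytes(ror8(b, n) for b in data)


scrambled = rotate_right(b"plain text", 3)   # what a format author might store
recovered = rotate_left(scrambled, 3)        # what rol(3) would undo
```

No bits are lost in either direction, which is why the two operations cancel exactly.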
Substreams via size + type
When you give a field both a size and a user type, Kaitai does not parse
the type against the whole remaining stream. Instead it:
- reads exactly size bytes from the current stream into a raw byte array, and
- wraps those bytes in a brand-new, bounded stream (a substream) and parses the type inside it.
seq:
  - id: joe
    type: person
    size: 20
Conceptually, the generated parser does something like this (Java shown for illustration; the same pattern appears in every target language):
this._raw_joe = this._io.readBytes(20);
KaitaiStream _io__raw_joe = new KaitaiStream(_raw_joe);
this.joe = new Person(_io__raw_joe, this, _root);
Two consequences fall out of this:
- The bytes are also kept on an auto-generated _raw_<id> field (here _raw_joe). process works at this raw-byte stage, before the substream is built.
- The substream is a hard boundary. If person tries to read past 20 bytes, it hits end-of-stream and the parse fails. If it reads fewer, the leftover bytes inside the substream are simply ignored; the outer stream still advances by the full size.
This is what makes length-prefixed records safe to parse: a malformed inner record can never run off into the bytes of the next record.
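The boundary behaviour can be imitated with io.BytesIO from the Python standard library. This is only a sketch: the person layout (a 1-byte age followed by an 8-byte NUL-padded name) is invented for the example, and read_exact and parse_person are hypothetical helpers, not generated Kaitai code.

```python
import io


def read_exact(stream: io.BytesIO, n: int) -> bytes:
    """Read exactly n bytes or fail, mimicking an end-of-stream error."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"requested {n} bytes, only {len(data)} available")
    return data


def parse_person(sub: io.BytesIO) -> dict:
    # Hypothetical layout: 1-byte age, then an 8-byte NUL-padded ASCII name.
    age = read_exact(sub, 1)[0]
    name = read_exact(sub, 8).rstrip(b"\x00").decode("ascii")
    return {"age": age, "name": name}


# A 20-byte record (9 bytes used, 11 bytes of padding) followed by other data.
outer = io.BytesIO(bytes([42]) + b"Joe".ljust(8, b"\x00") + b"\x00" * 11 + b"NEXT")

raw_joe = read_exact(outer, 20)           # plays the role of _raw_joe
joe = parse_person(io.BytesIO(raw_joe))   # parsed inside a bounded substream
next_bytes = outer.read(4)                # outer stream advanced by the full 20
```

The parser consumed only 9 of the 20 bytes, yet the outer stream resumes right after the record. Handing parse_person a truncated slice raises EOFError instead of reading into the neighbouring data.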
size-eos: true
Use size-eos: true instead of a numeric size to consume everything left
in the current stream. It is most useful inside a substream, where "the end of
the stream" means the end of that bounded region, not the end of the file.
seq:
  - id: len_body
    type: u4
  - id: body
    type: record_body
    size: len_body
types:
  record_body:
    seq:
      - id: tag
        type: u1
      - id: comment
        type: str
        encoding: UTF-8
        size-eos: true
Here comment grabs all remaining bytes of the len_body-sized substream —
exactly the trailing region, with no separate length field needed.
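In plain Python terms, size-eos behaves much like calling read() with no argument on the bounded substream. The sketch below mirrors the record_body layout; little-endian byte order and the sample bytes are assumptions, and parse_record_body is a hypothetical helper.

```python
import io
import struct


def parse_record_body(sub: io.BytesIO) -> dict:
    tag = sub.read(1)[0]
    # read() with no argument returns everything left in THIS stream only.
    comment = sub.read().decode("utf-8")
    return {"tag": tag, "comment": comment}


body_bytes = bytes([7]) + b"hello, world"
# Assumption: len_body is little-endian; trailing data follows the record.
wire = struct.pack("<I", len(body_bytes)) + body_bytes + b"TRAILER"

outer = io.BytesIO(wire)
(len_body,) = struct.unpack("<I", outer.read(4))
body = parse_record_body(io.BytesIO(outer.read(len_body)))
rest = outer.read()
```

The comment stops at the substream's end, so the trailing bytes stay available to whatever is parsed next from the outer stream.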
The _io object
Every parsed object carries an _io attribute: the stream it was read from. You
can reference it in expressions to ask about position and end-of-stream.
| _io member | Returns | Meaning |
|---|---|---|
| _io.pos | integer | current byte offset within this stream |
| _io.eof | boolean | true once the stream is fully consumed |
| _io.size | integer | total size of this stream in bytes |
Because a substream is its own stream, _io.pos and pos: are relative to the
substream's start, not to the whole file:
types:
  block:
    instances:
      some_bytes_in_the_middle:
        pos: 30
        size: 16
Inside a block substream this reads bytes 30–45 of the block, regardless of
where the block sits in the file.
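The relative-offset rule can be seen with two io.BytesIO objects standing in for the root stream and the block substream. The block's placement (file offset 100, length 64) is an assumption chosen for the example; tell() and the buffer length play the roles of _io.pos and _io.size.

```python
import io

# Each byte's value equals its file offset, which makes positions easy to see.
file_bytes = bytes(range(256))

# Assumption: the block occupies file offsets 100..163 (a 64-byte substream).
block_start, block_size = 100, 64
root = io.BytesIO(file_bytes)
root.seek(block_start)
block_io = io.BytesIO(root.read(block_size))

# pos: 30, size: 16 -> offsets are counted from the start of the block.
block_io.seek(30)
some_bytes = block_io.read(16)

pos_after = block_io.tell()                        # analogous to _io.pos
at_eof = pos_after == len(block_io.getvalue())     # analogous to _io.eof
```

The bytes read carry the values 130 to 145, that is, file offset 100 plus block offsets 30 to 45, while the substream's own position counter never sees the file offset at all.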
Escaping the substream with io:
Sometimes a record stores an absolute offset that points outside its own
substream — for example, a directory entry that points at a body elsewhere in the
file. The io: key chooses which stream an instance reads from, letting you
address the root stream absolutely.
types:
  file_entry:
    seq:
      - id: file_name
        type: strz
      - id: ofs_body
        type: u4
      - id: len_body
        type: u4
    instances:
      body:
        io: _root._io
        pos: ofs_body
        size: len_body
The user guide is explicit about why io: matters here: without io: _root._io,
body "would have been parsed inside a [...] substream (and most likely that
would result in an exception)", because ofs_body is an offset into the whole
file, not into the small entry substream. _root._io is the stream of the
top-level object; you can also reach intermediate streams via _parent._io.
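A hand-written equivalent makes the two streams visible. In this Python sketch the entry is parsed from its own 16-byte substream while the body is fetched from the root stream; the file layout, the 16-byte entry size, little-endian byte order, and the helpers read_strz and parse_entry are all assumptions made for illustration.

```python
import io
import struct


def read_strz(stream: io.BytesIO) -> str:
    """Read a NUL-terminated ASCII string."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("unterminated string")
        if b == b"\x00":
            return out.decode("ascii")
        out += b


def parse_entry(entry_io: io.BytesIO, root_io: io.BytesIO):
    file_name = read_strz(entry_io)
    ofs_body, len_body = struct.unpack("<II", entry_io.read(8))
    # Equivalent of io: _root._io: seek in the ROOT stream, then restore it.
    saved = root_io.tell()
    root_io.seek(ofs_body)
    body = root_io.read(len_body)
    root_io.seek(saved)
    return file_name, ofs_body, body


# Hypothetical file: one 16-byte entry, then the body at absolute offset 16.
entry = (b"a.txt\x00" + struct.pack("<II", 16, 5)).ljust(16, b"\x00")
root_io = io.BytesIO(entry + b"HELLO")

entry_io = io.BytesIO(root_io.read(16))
name, ofs_body, body = parse_entry(entry_io, root_io)

# The same offset inside the entry substream points past its end.
entry_io.seek(ofs_body)
inside_substream = entry_io.read(5)
```

Reading at ofs_body inside the entry substream finds nothing, since the substream is only 16 bytes long; BytesIO returns an empty result there, whereas a strict reader would raise an error.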
Putting it together (illustrative example)
Illustrative only. The format below is invented to show process, substreams, and _io cooperating in one place. It is not a real-world specification; see the formats gallery for those.
meta:
  id: blob_container
  endian: le
seq:
  - id: xor_key
    type: u1
  - id: len_packed
    type: u4
  - id: packed
    # raw bytes are XOR-deobfuscated first, then the result
    # is parsed as `record` inside a substream
    size: len_packed
    process: xor(xor_key)
    type: record
types:
  record:
    seq:
      - id: count
        type: u2
      - id: comment
        type: str
        encoding: UTF-8
        size-eos: true # rest of THIS substream
The order of operations is: read len_packed raw bytes → apply
process: xor(xor_key) → wrap the transformed bytes in a substream → parse
record inside it. size-eos: true then bounds comment to whatever remains of
that substream.
To chain XOR and zlib you would model the layers explicitly (for example, an
XORed outer field whose decoded bytes are a zlib-compressed inner field), since a
single field carries one process. Keeping each transformation on its own field
also makes the .ksy easier to read and debug.
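The layered decode can be prototyped in a few lines of Python before committing to a .ksy layout. The sketch below assumes a hypothetical variant of blob_container in which the XORed bytes are themselves zlib-compressed; the sample data and the single-byte xor_bytes helper are invented for the example.

```python
import struct
import zlib


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key."""
    return bytes(b ^ key for b in data)


# Build a sample blob: record -> zlib-compress -> XOR -> length-prefix.
record = struct.pack("<H", 3) + b"three items"
key = 0x5A
packed = xor_bytes(zlib.compress(record), key)
wire = bytes([key]) + struct.pack("<I", len(packed)) + packed

# Decode one layer at a time, in the reverse order of encoding.
xor_key = wire[0]
(len_packed,) = struct.unpack_from("<I", wire, 1)
raw = wire[5:5 + len_packed]
unxored = xor_bytes(raw, xor_key)      # layer 1: undo the XOR
inflated = zlib.decompress(unxored)    # layer 2: undo the compression

# Parse the innermost bytes as the record: u2 count, then the rest as text.
(count,) = struct.unpack_from("<H", inflated, 0)
comment = inflated[2:].decode("utf-8")
```

Each intermediate value (raw, unxored, inflated) corresponds to one field in the layered model, which is what makes a failure at a specific layer easy to locate.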
Sources
- Kaitai Struct User Guide: process (zlib/xor/rol/ror), substreams, size-eos, io:/pos, _raw_ fields.
- KSY reference: attribute keys (process, size, size-eos, type, io, pos).
- Serialization guide: how substreams are collapsed back when writing data.
- Kaitai Stream API: stream members (pos, eof, read_bytes, read_bytes_full, seek).
- Formats gallery: real specs using compression and processing (e.g. png, gzip, zip).
- kaitai-io/kaitai_struct: project README and overview.