Processing & substreams

Real binary formats rarely hand you a field you can read straight off the wire. Bytes may be compressed, lightly obfuscated, or stored in a region whose length you only learn at runtime. Kaitai Struct handles all three declaratively:

- process runs a byte-to-byte transformation (decompression, de-obfuscation) on a field's raw bytes before they are parsed.
- A substream is a smaller, bounded stream carved out of the current one, created automatically when you combine size with a type.
- _io is the stream object attached to every parsed object, which lets you read its position, detect end-of-stream, and (with io:) jump to another stream entirely.

All .ksy snippets below are written in YAML and compiled with ksc (kaitai-struct-compiler) into a parser for your target language.

The `process` key

process plugs in a predefined algorithm that "takes a stream of bytes and returns another stream of bytes". The compiler reads the field's raw bytes, runs them through the named transformation, and then parses the result. As the user guide puts it, the incoming data "should be pre-processed before actual parsing takes place, or we'll just end up with garbage getting parsed".

process is available only on raw byte arrays or user types — i.e. fields that have a size (or size-eos), not on plain integer/float primitives.

The runtime libraries ship a standard set of transformations:

| `process` value | Purpose | Argument(s) |
| --- | --- | --- |
| `zlib` | Decompress zlib-compressed data | none |
| `xor(key)` | XOR every byte with a key | a single byte, or a byte array |
| `rol(n)` | Circular bit rotate left by `n` bits | bit count `n` |
| `ror(n)` | Circular bit rotate right by `n` bits | bit count `n` |

note

process describes a byte-level transformation that happens before parsing. It is not a hook for arbitrary code in the .ksy itself: the algorithm names above are the built-in set. The expression language is allowed only inside the argument (for example, to reference a key you read earlier), not to define a new algorithm. A format that needs anything else has to name a custom processing routine, which is implemented in the target language rather than in the .ksy.

XOR de-obfuscation

XOR is the simplest case: each byte of the raw data is XORed with the key. The key can be a literal, or an expression referencing another field.

seq:
  - id: body_len
    type: u4
  - id: body
    size: body_len
    process: xor(0xaa)
    type: some_body_type

Because the argument is an expression, you can XOR against a key that was read earlier in the same structure:

seq:
  - id: key
    type: u1
  - id: payload
    size: 16
    process: xor(key)

Obfuscation, not encryption

XOR with a fixed (or trivially derivable) key is obfuscation. It hides nothing from an attacker and provides no confidentiality — it only stops naive tools from reading the bytes directly. Treat process: xor(...) as a format quirk to be undone, never as a security measure.
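
The XOR step described in this section can be sketched in plain Python. This is an illustration of the transformation only, not Kaitai's runtime code: the helper name `xor_bytes` is made up here, and a multi-byte key is assumed to repeat cyclically over the data.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Single-byte key, comparable to process: xor(0xaa)
obfuscated = xor_bytes(b"hello", b"\xaa")

# XOR is its own inverse: applying the same key again restores the input.
restored = xor_bytes(obfuscated, b"\xaa")

# Multi-byte key (hypothetical two-byte key), cycled across the data.
multi = xor_bytes(b"\x00\x00\x00", b"\x01\x02")
```

Because the same call both scrambles and unscrambles, anyone who knows or guesses the key can undo it, which is why the warning above treats XOR as a format quirk rather than protection.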

zlib decompression

process: zlib runs the raw bytes through a zlib (DEFLATE) decompressor and parses the decompressed output.

seq:
  - id: len_compressed
    type: u4
  - id: data
    size: len_compressed
    process: zlib
    type: payload

Compression specifics

zlib here means the zlib container format (RFC 1950) wrapping DEFLATE (RFC 1951) — the same envelope used inside PNG IDAT chunks and many archive formats. It is not raw DEFLATE, gzip (RFC 1952), or any other codec. If a format stores gzip or a different compression scheme, model that structure explicitly rather than reaching for process: zlib. Decompression is provided by each runtime's underlying zlib binding, so behaviour follows the standard library, not Kaitai.
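
The difference between the zlib and gzip containers is easy to see with Python's standard `zlib` and `gzip` modules. The snippet below is only a sketch of that distinction using made-up sample data; it says nothing about how any particular Kaitai runtime is implemented.

```python
import gzip
import zlib

plain = b"kaitai " * 8

zlib_blob = zlib.compress(plain)   # RFC 1950 container around DEFLATE
gzip_blob = gzip.compress(plain)   # RFC 1952 container, different header

# A zlib-wrapped stream round-trips with a plain zlib decompressor.
round_trip = zlib.decompress(zlib_blob)

# Feeding gzip data to the same decompressor fails on the header check.
try:
    zlib.decompress(gzip_blob)
    gzip_accepted = True
except zlib.error:
    gzip_accepted = False
```

Checking the first bytes of a compressed region (a zlib stream commonly starts with 0x78, a gzip member with 0x1f 0x8b) is a quick way to tell which container a format actually uses before writing the spec.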

rol / ror bit rotation

rol(n) and ror(n) apply a circular bit rotation to each byte — rol rotates left, ror rotates right, by n bits. They are exact inverses of each other: rol(n) undoes ror(n) for the same byte width.

seq:
  - id: scrambled
    size: 32
    process: rol(3)
    type: some_body_type

tip

Like XOR, rol/ror is light obfuscation, not cryptography. Reach for it when a format author has rotated bits to make data look opaque, then let Kaitai rotate them back.
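
A per-byte circular rotation, and the fact that the two directions undo each other, can be sketched as follows. The function names `rol8` and `ror8` are hypothetical helpers for this illustration, and each byte is assumed to be rotated independently within its own 8 bits.

```python
def rol8(data: bytes, n: int) -> bytes:
    """Rotate every byte left by n bits, wrapping within 8 bits."""
    n %= 8
    return bytes(((b << n) | (b >> (8 - n))) & 0xFF for b in data)


def ror8(data: bytes, n: int) -> bytes:
    """Rotate every byte right by n bits (a left rotation by 8 - n)."""
    return rol8(data, 8 - (n % 8))


every_byte = bytes(range(256))
scrambled = ror8(every_byte, 3)    # what a format author might have stored
recovered = rol8(scrambled, 3)     # comparable to process: rol(3)
```

The bits that fall off one end reappear at the other, so no information is lost and the rotation is always reversible.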

Substreams via `size` + `type`

When you give a field both a size and a user type, Kaitai does not parse the type against the whole remaining stream. Instead it:

1. reads exactly size bytes from the current stream into a raw byte array, and
2. wraps those bytes in a brand-new, bounded stream (a substream) and parses the type inside it.

seq:
  - id: joe
    type: person
    size: 20

Conceptually, the generated parser does something like this (Java shown for illustration; the same pattern appears in every target language):

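// Read exactly 20 bytes from the current stream and keep them as _raw_joe.
this._raw_joe = this._io.readBytes(20);
// Wrap those bytes in a new, bounded stream (the substream).
KaitaiStream _io__raw_joe = new ByteBufferKaitaiStream(this._raw_joe);
// Parse Person against the substream only.
this.joe = new Person(_io__raw_joe, this, _root);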

Two consequences fall out of this:

- The raw bytes are also kept on an auto-generated _raw_<id> field (here _raw_joe), which is exactly what process operates on before the substream is built.
- The substream is a hard boundary. If person tries to read past 20 bytes, it hits end-of-stream and the parse fails. If it reads fewer, the leftover bytes inside the substream are simply ignored; the outer stream still advances by the full size.

This is what makes length-prefixed records safe to parse: a malformed inner record can never run off into the bytes of the next record.
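
The hard-boundary behaviour can be imitated with `io.BytesIO`. This is a sketch of the idea, not Kaitai's runtime: `read_exact`, `parse_person`, and the `person` layout used here (a one-byte name length followed by the name) are assumptions invented for the example.

```python
import io


def read_exact(stream: io.BytesIO, n: int) -> bytes:
    """Read exactly n bytes or fail, like a read on a bounded stream."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def parse_person(sub: io.BytesIO) -> str:
    """Hypothetical person layout: u1 name length, then the name."""
    name_len = read_exact(sub, 1)[0]
    return read_exact(sub, name_len).decode("ascii")


# A well-formed 20-byte record followed by the next record.
outer = io.BytesIO(b"\x03Joe" + b"\x00" * 16 + b"NEXT")
raw_joe = read_exact(outer, 20)            # size: 20
joe = parse_person(io.BytesIO(raw_joe))    # type: person, inside the substream
# person used only 4 of the 20 bytes, yet outer now sits at offset 20.

# A malformed record claiming a 200-byte name cannot reach the next record.
bad = io.BytesIO(b"\xc8" + b"A" * 19 + b"NEXT")
try:
    parse_person(io.BytesIO(read_exact(bad, 20)))
    overran = True
except EOFError:
    overran = False
```

In both cases the outer stream ends up positioned exactly at the start of the following record, whether the inner parse read too little or tried to read too much.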

`size-eos: true`

Use size-eos: true instead of a numeric size to consume everything left in the current stream. It is most useful inside a substream, where "the end of the stream" means the end of that bounded region, not the end of the file.

seq:
  - id: len_body
    type: u4
  - id: body
    type: record_body
    size: len_body
types:
  record_body:
    seq:
      - id: tag
        type: u1
      - id: comment
        type: str
        encoding: UTF-8
        size-eos: true

Here comment grabs all remaining bytes of the len_body-sized substream — exactly the trailing region, with no separate length field needed.
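
The same length-prefixed layout can be walked by hand to show what "end of stream" means inside a substream. This is a plain-Python sketch with sample data made up for the purpose; the snippet above declares no byte order, so little-endian is assumed here, and `parse_record_body` is a hypothetical helper.

```python
import io
import struct


def parse_record_body(sub: io.BytesIO):
    """tag is a u1; comment takes whatever is left in this substream."""
    tag = sub.read(1)[0]
    comment = sub.read().decode("utf-8")   # size-eos: true
    return tag, comment


payload = bytes([7]) + "hello world".encode("utf-8")
stream = io.BytesIO(struct.pack("<I", len(payload)) + payload + b"TRAILER")

(len_body,) = struct.unpack("<I", stream.read(4))        # len_body: u4 (assumed LE)
body_io = io.BytesIO(stream.read(len_body))              # size: len_body
tag, comment = parse_record_body(body_io)

rest_of_file = stream.read()   # bytes after the record are untouched
```

The "read to the end" call stops at the end of the record because the substream contains nothing else; the trailing bytes of the outer stream are never visible to it.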

The `_io` object

Every parsed object carries an _io attribute: the stream it was read from. You can reference it in expressions to ask about position and end-of-stream.

| `_io` member | Returns | Meaning |
| --- | --- | --- |
| `_io.pos` | integer | current byte offset within this stream |
| `_io.eof` | boolean | `true` once the stream is fully consumed |
| `_io.size` | integer | total size of this stream in bytes |

Because a substream is its own stream, _io.pos and pos: are relative to the substream's start, not to the whole file:

types:
  block:
    instances:
      some_bytes_in_the_middle:
        pos: 30
        size: 16

Inside a block substream this reads bytes 30–45 of the block, regardless of where the block sits in the file.
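
To make the relative addressing concrete, here is a small Python sketch in which a 64-byte block is stored at file offset 100. The offsets and sizes are invented for illustration, and `io.BytesIO` stands in for Kaitai's stream objects.

```python
import io

# 100 bytes of padding, then a 64-byte block whose bytes equal their own index.
file_bytes = bytes(100) + bytes(range(64))

outer = io.BytesIO(file_bytes)
outer.seek(100)
block = io.BytesIO(outer.read(64))           # substream for the block

block.seek(30)                               # pos: 30, relative to the block
some_bytes_in_the_middle = block.read(16)    # size: 16

block_pos = block.tell()                     # comparable to _io.pos in the block
block_size = len(block.getvalue())           # comparable to _io.size in the block
outer_pos = outer.tell()                     # position in the enclosing stream
```

The block's position counts from the block's first byte, so the instance reads the block's bytes 30 to 45 even though they live at offsets 130 to 145 of the file.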

Escaping the substream with `io:`

Sometimes a record stores an absolute offset that points outside its own substream — for example, a directory entry that points at a body elsewhere in the file. The io: key chooses which stream an instance reads from, letting you address the root stream absolutely.

types:
  file_entry:
    seq:
      - id: file_name
        type: strz
      - id: ofs_body
        type: u4
      - id: len_body
        type: u4
    instances:
      body:
        io: _root._io
        pos: ofs_body
        size: len_body

info

The user guide is explicit about why io: matters here: without io: _root._io, body "would have been parsed inside a [...] substream (and most likely that would result in an exception)", because ofs_body is an offset into the whole file, not into the small entry substream. _root._io is the stream of the top-level object; you can also reach intermediate streams via _parent._io.
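
The effect of choosing the wrong stream can be reproduced with two `io.BytesIO` objects. This is a sketch under stated assumptions: the sample file layout, the helpers `read_strz` and `read_at`, and little-endian integers are all invented for the example. Note that `BytesIO` returns a short read where a Kaitai stream would raise an end-of-stream error.

```python
import io
import struct

body = b"BODY-DATA"
# Directory entry: file_name (strz), ofs_body (u4), len_body (u4)
entry = b"a.txt\x00" + struct.pack("<II", 32, len(body))
file_bytes = entry.ljust(32, b"\x00") + body      # body lives at file offset 32

root_io = io.BytesIO(file_bytes)
entry_io = io.BytesIO(root_io.read(len(entry)))   # the entry's own substream


def read_strz(stream: io.BytesIO) -> str:
    """Read bytes up to (and consuming) a zero terminator."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if b in (b"", b"\x00"):
            break
        out += b
    return out.decode("ascii")


def read_at(stream: io.BytesIO, pos: int, size: int) -> bytes:
    """Seek, read, and restore the position, like a pos: instance."""
    saved = stream.tell()
    stream.seek(pos)
    data = stream.read(size)
    stream.seek(saved)
    return data


file_name = read_strz(entry_io)
ofs_body, len_body = struct.unpack("<II", entry_io.read(8))

from_root = read_at(root_io, ofs_body, len_body)     # io: _root._io
from_entry = read_at(entry_io, ofs_body, len_body)   # no io: key, wrong stream
```

Offset 32 does not exist inside the 14-byte entry substream, so only the read against the root stream finds the body.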

Putting it together (illustrative example)

Illustrative only. The format below is invented to show process, substreams, and _io cooperating in one place. It is not a real-world specification — see the formats gallery for those.

meta:
  id: blob_container
  endian: le
seq:
  - id: xor_key
    type: u1
  - id: len_packed
    type: u4
  - id: packed
    # raw bytes are XOR-deobfuscated with xor_key, then the result
    # is parsed as `record` inside a substream
    size: len_packed
    process: xor(xor_key)
    type: record
types:
  record:
    seq:
      - id: count
        type: u2
      - id: comment
        type: str
        encoding: UTF-8
        size-eos: true   # rest of THIS substream

The order of operations is: read len_packed raw bytes → apply process: xor(xor_key) → wrap the transformed bytes in a substream → parse record inside it. size-eos: true then bounds comment to whatever remains of that substream.
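
That order of operations can be traced by hand in Python. The sketch below is not generated Kaitai code: `parse_blob_container` is a hypothetical hand-written equivalent of the invented format above, and the sample bytes are built in the snippet itself.

```python
import io
import struct


def parse_blob_container(data: bytes) -> dict:
    stream = io.BytesIO(data)
    xor_key = stream.read(1)[0]                            # xor_key: u1
    (len_packed,) = struct.unpack("<I", stream.read(4))    # len_packed: u4, le
    raw = stream.read(len_packed)                          # size: len_packed
    if len(raw) != len_packed:
        raise EOFError("packed field is truncated")
    decoded = bytes(b ^ xor_key for b in raw)              # process: xor(xor_key)
    sub = io.BytesIO(decoded)                              # substream for record
    (count,) = struct.unpack("<H", sub.read(2))            # count: u2, le
    comment = sub.read().decode("utf-8")                   # size-eos: true
    return {"xor_key": xor_key, "count": count, "comment": comment}


# Build a sample file: key 0x5a, an obfuscated record, then unrelated bytes.
record = struct.pack("<H", 3) + b"three items"
packed = bytes(b ^ 0x5A for b in record)
sample = bytes([0x5A]) + struct.pack("<I", len(packed)) + packed + b"ignored"

parsed = parse_blob_container(sample)
```

The trailing bytes after the packed field never reach the record parser, because the record only ever sees the decoded substream.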

note

To chain XOR and zlib you would model the layers explicitly (for example, an XORed outer field whose decoded bytes are a zlib-compressed inner field), since a single field carries one process. Keeping each transformation on its own field also makes the .ksy easier to read and debug.
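
The layering itself is simple to picture outside Kaitai: each layer is undone in turn, outermost first. The following Python sketch uses made-up sample data and a hypothetical single-byte key of 0x5a to show why the order matters.

```python
import zlib


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key."""
    return bytes(b ^ key for b in data)


plain = b"layered payload " * 4
stored = xor_bytes(zlib.compress(plain), 0x5A)   # what the file would contain

outer = xor_bytes(stored, 0x5A)    # layer 1: undo the XOR
inner = zlib.decompress(outer)     # layer 2: inflate the decoded bytes

# Skipping layer 1 hands the decompressor bytes with a corrupted header.
try:
    zlib.decompress(stored)
    wrong_order_ok = True
except zlib.error:
    wrong_order_ok = False
```

Modelling the two layers as two fields mirrors these two steps one to one, which is what makes a failure in either layer easy to locate.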

Sources

Kaitai Struct User Guide — process (zlib/xor/rol/ror), substreams, size-eos, io:/pos, _raw_ fields.
KSY reference — attribute keys (process, size, size-eos, type, io, pos).
Serialization guide — how substreams are collapsed back when writing data.
Kaitai Stream API — stream members (pos, eof, read_bytes, read_bytes_full, seek).
Formats gallery — real specs using compression and processing (e.g. png, gzip, zip).
kaitai-io/kaitai_struct — project README and overview.
