Glossary

This page defines the vocabulary you will meet throughout the Kaitai Struct documentation. Terms are grouped roughly by where you encounter them: writing a format description, compiling it, and running the generated code.

note

Kaitai Struct describes binary data structures declaratively. You write a description once and the toolchain turns it into parsing code for many languages. Most of the terms below name a piece of that description or a piece of the toolchain that consumes it.

Quick reference

Term | Short definition
Data type | The kind of value an attribute holds (built-in like u2, or user-defined).
Attribute | A named field parsed sequentially from the stream, declared under seq.
Instance | A named field that is computed or read on demand, declared under instances.
Substream | A byte-limited view of the stream, typically created by size.
KSY | The .ksy YAML file that describes a format.
ksc | The kaitai-struct-compiler, which turns a .ksy into source code.
Runtime library | The small per-language library the generated code depends on.
Format gallery | The public collection of ready-made .ksy format descriptions.
Expression language | The small DSL used inside .ksy for values, conditions, and sizes.

Data type

A data type describes what kind of value a field holds and therefore how many bytes are consumed and how they are interpreted.

Built-in types include:

  • Unsigned integers: u1, u2, u4, u8
  • Signed integers: s1, s2, s4, s8
  • Floating-point: f4, f8
  • Strings: str and strz (NUL-terminated); both require an encoding
  • Bit-sized integers: bX (for example b3 for a 3-bit field)

Multi-byte integer and floating-point types may carry an endianness suffix, for example u2le (little-endian) or u4be (big-endian). Without a suffix, the default endianness declared in meta applies.
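To make the endianness suffixes concrete, the sketch below uses Python's standard struct module (not Kaitai Struct itself) to decode the same bytes the way u2le, u2be, and u4be would. The byte values are made up for illustration.

```python
import struct

# Four arbitrary bytes, chosen so the byte order is easy to see.
data = bytes([0x01, 0x02, 0x03, 0x04])

# "<H" is a little-endian unsigned 16-bit integer, comparable to u2le.
u2le = struct.unpack_from("<H", data, 0)[0]
# ">H" is a big-endian unsigned 16-bit integer, comparable to u2be.
u2be = struct.unpack_from(">H", data, 0)[0]
# ">I" is a big-endian unsigned 32-bit integer, comparable to u4be.
u4be = struct.unpack_from(">I", data, 0)[0]

print(hex(u2le), hex(u2be), hex(u4be))  # 0x201 0x102 0x1020304
```

The same two leading bytes yield different values depending only on the declared byte order, which is why a multi-byte type needs either a suffix or a default.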

A user-defined type is a composite structure you declare under the types key. It behaves like a class: it can contain its own seq, instances, and nested types. The documentation notes that you can have several levels of subtypes, so formats can be modeled hierarchically.

# Illustrative example
seq:
  - id: magic
    type: u4be # built-in 4-byte big-endian unsigned integer
  - id: header
    type: file_header # user-defined type declared under `types`

Attribute

An attribute is a single named field listed under the seq key. Attributes are parsed sequentially: the parser reads them in order, one after another, advancing through the stream. Each attribute has an id (its name) and usually a type, and may carry extra keys such as size, repeat, if, contents, or process.

# Illustrative example
seq:
  - id: len_body
    type: u4
  - id: body
    size: len_body # length comes from a previously parsed attribute

tip

Because seq attributes are read in order, a later attribute can refer to an earlier one (as len_body is used above). This is how length-prefixed fields are expressed.
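As a rough picture of what sequential, length-prefixed parsing amounts to at run time, here is a hand-written Python sketch using only the standard library. It is not ksc output: the function name parse_record is hypothetical, and little-endian byte order is an assumption (a real .ksy would declare it in meta).

```python
import io
import struct


def parse_record(stream):
    """Hand-written stand-in for generated code: read len_body, then body."""
    # len_body: a u4, assumed little-endian for this sketch.
    (len_body,) = struct.unpack("<I", stream.read(4))
    # body: exactly len_body bytes, sized by the field parsed just before it.
    body = stream.read(len_body)
    if len(body) != len_body:
        raise EOFError("stream ended before body was complete")
    return len_body, body


# A 4-byte length (5), a 5-byte body, then unrelated trailing bytes.
stream = io.BytesIO(b"\x05\x00\x00\x00hello-trailing")
len_body, body = parse_record(stream)
print(len_body, body, stream.tell())  # 5 b'hello' 9
```

The stream position ends at 9 (4 length bytes plus 5 body bytes), so whatever attribute came next in seq would start reading right after the body.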

Instance

An instance is a named field declared under the instances key. Unlike a seq attribute, an instance is not read in stream order. Instances are lazy by default — they are evaluated only when the corresponding property is accessed in the generated code.

There are two common kinds:

  • Positional instances use a pos key to read data from an arbitrary offset in the stream (and optionally an io to choose which stream).
  • Value instances compute a derived value with the value key. They do no actual parsing, so they do not require a pos.

# Illustrative example
instances:
  footer:
    pos: _io.size - 8 # positional: jump to 8 bytes before end of stream
    type: u8be
  is_large:
    value: footer > 1000000 # value instance: computed, parses nothing

 | Attribute (seq) | Instance (instances)
Read order | Sequential | On demand (lazy)
Position | Implicit (current offset) | Explicit pos, or none for value
Parses bytes? | Yes | Only positional instances; value does not
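The lazy behavior can be pictured with a hand-written Python sketch. This is not what ksc emits verbatim: the class name Footered and the reads counter are hypothetical and exist only to make the laziness observable. The save-seek-read-restore pattern mirrors how a positional instance avoids disturbing sequential parsing.

```python
import io
import struct


class Footered:
    """Hand-written sketch of lazy instances; real generated code differs in detail."""

    def __init__(self, stream):
        self._io = stream
        self._footer = None  # cache for the positional instance
        self.reads = 0       # hypothetical counter, only to show laziness

    @property
    def footer(self):
        # Positional instance: evaluated on first access, then cached.
        if self._footer is None:
            saved = self._io.tell()                       # remember current offset
            self._io.seek(len(self._io.getvalue()) - 8)   # pos: _io.size - 8
            (self._footer,) = struct.unpack(">Q", self._io.read(8))  # u8be
            self._io.seek(saved)                          # restore the offset
            self.reads += 1
        return self._footer

    @property
    def is_large(self):
        # Value instance: derived from footer, reads no bytes itself.
        return self.footer > 1000000


obj = Footered(io.BytesIO(b"\x00" * 4 + struct.pack(">Q", 2000000)))
print(obj.reads)              # 0 (nothing has been read yet)
print(obj.is_large)           # True (first access triggers the footer read)
print(obj.footer, obj.reads)  # 2000000 1 (second access hits the cache)
```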

Substream

A substream is a byte-limited view of an underlying stream. The most common way to create one is to apply size to a user-defined type: the parser reads exactly that many bytes and hands the subtype a stream that cannot read past those bytes. This both scopes relative positioning and prevents a subtype from reading beyond its declared boundary. Applying a process transformation (for example decompression) similarly produces a new stream over the processed bytes.

# Illustrative example
seq:
  - id: record
    type: record_body
    size: 16 # record_body parses within a 16-byte substream

info

Inside generated code, the stream is exposed as the _io member. Substreams are ordinary KaitaiStream objects, so the same position-tracking and reading operations work whether you are on the top-level stream or a substream.
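The boundary a substream enforces can be sketched with Python's io.BytesIO standing in for KaitaiStream. The helper read_substream is hypothetical, and one difference is worth flagging: BytesIO returns a short read at the end of its buffer, whereas a real runtime stream would be expected to report an error instead.

```python
import io


def read_substream(parent, size):
    """Take exactly `size` bytes from parent and wrap them in a new stream."""
    chunk = parent.read(size)
    if len(chunk) != size:
        raise EOFError("parent stream too short for substream")
    return io.BytesIO(chunk)


# 16 bytes belonging to the record, followed by data that belongs to the parent.
parent = io.BytesIO(bytes(range(16)) + b"NEXT")
sub = read_substream(parent, 16)  # comparable to `size: 16` on a user-defined type

first = sub.read(4)
rest = sub.read(100)              # asks for too much, but cannot leave the 16 bytes
print(len(first) + len(rest))     # 16
print(sub.tell(), parent.tell())  # 16 16
print(parent.read(4))             # b'NEXT'
```

Positions inside the substream start from zero and are independent of the parent, which is what makes relative positioning inside a subtype predictable.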

KSY

A KSY file (extension .ksy) is the format description itself. It is written in YAML and contains sections such as meta (metadata like the format id and default endianness), seq (sequential attributes), instances, types (subtypes), and enums (named integer constants). One .ksy file fully specifies how to parse a particular binary format.

ksc

ksc is short for kaitai-struct-compiler, the command-line compiler that reads a .ksy description and emits source code in a target language. It is invoked as:

$ ksc [options] <file>.ksy

The compiler supports many target languages, including C++, C#, Go, Java, JavaScript, Lua, Nim, Perl, PHP, Python, Ruby, and Rust.

Runtime library

The runtime library is a small, language-specific library that the generated code links against. The README describes it as small and present "mostly to ensure readability of generated code." It provides the stream abstraction — for example KaitaiStream (and, in Java, ByteBufferKaitaiStream) — used for reading bytes, tracking position, and creating substreams.

The runtime library is also what makes serialization possible: when writing data back out, generated classes expose _check() to validate that the object is consistent with the format constraints, and _write(stream) to serialize the object to a stream. A failed consistency check raises a ConsistencyError.
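The check-then-write pattern can be illustrated with a hand-written Python sketch. Everything here is a stand-in: the Record class is hypothetical, ConsistencyError is defined locally rather than imported from the runtime, and little-endian byte order is an assumption. It only mirrors the method names mentioned above and is not real generated code.

```python
import io
import struct


class ConsistencyError(Exception):
    """Local stand-in for the runtime's error type, defined for this sketch only."""


class Record:
    """Hypothetical length-prefixed record: a u4 length followed by the body."""

    def __init__(self, len_body, body):
        self.len_body = len_body
        self.body = body

    def _check(self):
        # The declared length must match the actual payload.
        if self.len_body != len(self.body):
            raise ConsistencyError(
                f"len_body={self.len_body} but body has {len(self.body)} bytes"
            )

    def _write(self, stream):
        # Serialize fields in the same order they would be parsed.
        stream.write(struct.pack("<I", self.len_body))
        stream.write(self.body)


out = io.BytesIO()
rec = Record(5, b"hello")
rec._check()            # passes: the length and the body agree
rec._write(out)
print(out.getvalue())   # b'\x05\x00\x00\x00hello'

try:
    Record(9, b"hello")._check()
except ConsistencyError as err:
    print("rejected:", err)
```

Validating before writing keeps an inconsistent object from producing a file that would not parse back with the same description.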

note

As of v0.11, serialization (writing) is available for Java and Python, with other target languages to follow. Parsing (reading) is supported across all target languages.

Format gallery

The format gallery at formats.kaitai.io is a public collection of .ksy descriptions for real-world formats. Per the gallery, every entry has a formal specification in the Kaitai Struct language and can be used as a concise text reference, viewed as a Graphviz block diagram, explored as an annotated hex dump in the visualizer, or compiled into a ready-made library in any supported language.

It spans many categories, including archives (ZIP, gzip, tar), executables (ELF, PE, Mach-O), filesystems (ext2, ISO 9660, FAT), images (PNG, JPEG, GIF), media (WAV, MP3, AVI), network protocols (DNS, Ethernet), and more. The descriptions live in the kaitai_struct_formats repository, which the main project includes as a Git submodule.

Expression language

The expression language is the small embedded DSL used inside .ksy files wherever a computed value is needed — pos, size, if conditions, value instances, and repeat counts. It works over integers, floats, strings, booleans, byte arrays, and user-defined types, and supports:

  • Arithmetic: +, -, *, /
  • Bitwise: &, |, ^
  • Logical: and, or, not
  • Relational: ==, !=, <, >, <=, >=

It also offers methods such as .length (for strings and byte arrays), .size (for arrays), and .to_s for conversions, and can reference previously parsed attributes by name.

# Illustrative example
instances:
  total:
    value: header.count * 4 + 8 # arithmetic over a parsed field