Glossary
This page defines the vocabulary you will meet throughout the Kaitai Struct documentation. Terms are grouped roughly by where you encounter them: writing a format description, compiling it, and running the generated code.
Kaitai Struct describes binary data structures declaratively. You write a description once and the toolchain turns it into parsing code for many languages. Most of the terms below name a piece of that description or a piece of the toolchain that consumes it.
Quick reference
| Term | Short definition |
|---|---|
| Data type | The kind of value an attribute holds (built-in like u2, or user-defined). |
| Attribute | A named field parsed sequentially from the stream, declared under seq. |
| Instance | A named field that is computed or read on demand, declared under instances. |
| Substream | A byte-limited view of the stream, typically created by size. |
| KSY | The .ksy YAML file that describes a format. |
| ksc | The kaitai-struct-compiler, which turns a .ksy into source code. |
| Runtime library | The small per-language library the generated code depends on. |
| Format gallery | The public collection of ready-made .ksy format descriptions. |
| Expression language | The small DSL used inside .ksy for values, conditions, and sizes. |
Data type
A data type describes what kind of value a field holds and therefore how many bytes are consumed and how they are interpreted.
Built-in types include:
- Unsigned integers: u1, u2, u4, u8
- Signed integers: s1, s2, s4, s8
- Floating-point: f4, f8
- Strings: str (requires an encoding) and strz (NUL-terminated)
- Bit-sized integers: bX (for example b3 for a 3-bit field)
Multi-byte integer types may carry an endianness suffix, for example u2le
(little-endian) or u4be (big-endian).
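To make the effect of the endianness suffix concrete, here is a minimal sketch that uses only Python's standard `struct` module. It is not code generated by Kaitai Struct; the variable names are chosen for illustration, and the format codes `<H` and `>H` stand in for `u2le` and `u2be`.

```python
import struct

# Two raw bytes as they might appear in a file.
raw = bytes([0x01, 0x02])

# "<H" reads an unsigned 16-bit integer, least significant byte first (like u2le).
value_le = struct.unpack("<H", raw)[0]

# ">H" reads the same width, most significant byte first (like u2be).
value_be = struct.unpack(">H", raw)[0]

print(hex(value_le), hex(value_be))  # 0x201 0x102
```

The same two bytes yield different integers, which is why a multi-byte integer needs either a suffix or a default endianness.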
A user-defined type is a composite structure you declare under the types
key. It behaves like a class: it can contain its own seq, instances, and
nested types. The documentation notes that you can have several levels of
subtypes, so formats can be modeled hierarchically.
# Illustrative example
seq:
  - id: magic
    type: u4be         # built-in 4-byte big-endian unsigned integer
  - id: header
    type: file_header  # user-defined type declared under `types`
Attribute
An attribute is a single named field listed under the seq key. Attributes
are parsed sequentially: the parser reads them in order, one after another,
advancing through the stream. Each attribute has an id (its name) and usually a
type, and may carry extra keys such as size, repeat, if, contents, or
process.
# Illustrative example
seq:
  - id: len_body
    type: u4
  - id: body
    size: len_body  # length comes from a previously parsed attribute
Because seq attributes are read in order, a later attribute can refer to an
earlier one (as len_body is used above). This is how length-prefixed fields are
expressed.
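The sequential reading described above can be sketched by hand in Python. This is an illustration of the idea, not the code that ksc generates: `parse_record` is a hypothetical helper, and the sketch assumes `len_body` is a little-endian `u4`.

```python
import io
import struct

def parse_record(stream):
    """Hand-written stand-in for a two-attribute seq (hypothetical helper)."""
    # First attribute: len_body, assumed here to be a little-endian u4.
    (len_body,) = struct.unpack("<I", stream.read(4))
    # Second attribute: body, whose size comes from the value just read.
    body = stream.read(len_body)
    return len_body, body

# A 4-byte length prefix of 3, the 3-byte body, then unrelated trailing bytes.
data = struct.pack("<I", 3) + b"abc" + b"trailing"
stream = io.BytesIO(data)
len_body, body = parse_record(stream)
```

After the call, the stream position sits just past the body, so whatever attribute came next would start reading at the trailing bytes.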
Instance
An instance is a named field declared under the instances key. Unlike a
seq attribute, an instance is not read in stream order. Instances are
lazy by default — they are evaluated only when the corresponding property is
accessed in the generated code.
There are two common kinds:
- Positional instances use a pos key to read data from an arbitrary offset in the stream (and optionally an io key to choose which stream).
- Value instances compute a derived value with the value key. They do no actual parsing, so they do not require a pos.
# Illustrative example
instances:
  footer:
    pos: _io.size - 8        # positional: jump to 8 bytes before end of stream
    type: u8be
  is_large:
    value: footer > 1000000  # value instance: computed, parses nothing
| | Attribute (seq) | Instance (instances) |
|---|---|---|
| Read order | Sequential | On demand (lazy) |
| Position | Implicit (current offset) | Explicit pos, or none for value |
| Parses bytes? | Yes | Only positional instances; value does not |
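Lazy evaluation can be pictured with a small hand-written Python class. This is a sketch of the concept, not the shape of the code ksc emits: the class name `Example` and the `reads` counter are hypothetical, and `functools.cached_property` (Python 3.8+) stands in for whatever caching the generated code uses.

```python
import io
import struct
from functools import cached_property

class Example:
    """Hypothetical stand-in for a generated class with two instances."""

    def __init__(self, stream):
        self._io = stream
        self.reads = 0  # counts how often the footer is actually parsed

    @cached_property
    def footer(self):
        # Positional instance: remember the offset, jump, read, jump back.
        saved = self._io.tell()
        self._io.seek(-8, io.SEEK_END)
        (value,) = struct.unpack(">Q", self._io.read(8))  # like u8be
        self._io.seek(saved)
        self.reads += 1
        return value

    @property
    def is_large(self):
        # Value instance: derived from footer, reads no bytes itself.
        return self.footer > 1000000

obj = Example(io.BytesIO(b"\x00" * 4 + struct.pack(">Q", 2000000)))
reads_before = obj.reads   # nothing has been parsed yet
large = obj.is_large       # first access triggers the footer read
obj.footer                 # second access is served from the cache
```

Nothing is read until `is_large` is first accessed, and repeated access to `footer` does not parse the bytes again.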
Substream
A substream is a byte-limited view of an underlying stream. The most common
way to create one is to apply size to a user-defined type: the parser reads
exactly that many bytes and hands the subtype a stream that cannot read past
those bytes. This both scopes relative positioning and prevents a subtype from
reading beyond its declared boundary. Applying a process transformation (for
example decompression) similarly produces a new stream over the processed bytes.
# Illustrative example
seq:
  - id: record
    type: record_body
    size: 16  # record_body parses within a 16-byte substream
Inside generated code, the stream is exposed as the _io member. Substreams are
ordinary KaitaiStream objects, so the same position-tracking and reading
operations work whether you are on the top-level stream or a substream.
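The boundary behavior of a substream can be approximated with Python's standard `io.BytesIO`. This sketch does not use the Kaitai runtime: `make_substream` is a hypothetical helper, and `BytesIO` merely plays the role of the stream class.

```python
import io
import zlib

def make_substream(stream, size):
    """Read exactly `size` bytes and wrap them in an independent stream."""
    chunk = stream.read(size)
    if len(chunk) != size:
        raise EOFError(f"requested {size} bytes, only {len(chunk)} available")
    return io.BytesIO(chunk)

parent = io.BytesIO(bytes(range(32)))
sub = make_substream(parent, 16)

# Asking for more than the boundary allows returns only the 16 scoped bytes.
inner = sub.read(100)

# A processed field works the same way: transform the bytes, then wrap them.
compressed = zlib.compress(b"payload")
processed = io.BytesIO(zlib.decompress(compressed))
```

The parent stream has advanced by exactly 16 bytes, while offsets inside `sub` start again at zero, which is what scopes relative positioning to the substream.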
KSY
A KSY file (extension .ksy) is the format description itself. It is written
in YAML and contains sections such as meta (metadata like the format id
and default endianness), seq (sequential attributes), instances, types
(subtypes), and enums (named integer constants). One .ksy file fully
specifies how to parse a particular binary format.
ksc
ksc is short for kaitai-struct-compiler, the command-line compiler that
reads a .ksy description and emits source code in a target language. It is
invoked as:
$ ksc [options] <file>.ksy
The compiler supports many target languages, including C++, C#, Go, Java, JavaScript, Lua, Nim, Perl, PHP, Python, Ruby, and Rust.
Runtime library
The runtime library is a small, language-specific library that the generated
code links against. The README describes it as small and present "mostly to
ensure readability of generated code." It provides the stream abstraction —
for example KaitaiStream (and, in Java, ByteBufferKaitaiStream) — used for
reading bytes, tracking position, and creating substreams.
The runtime library is also what makes serialization possible: when writing
data back out, generated classes expose _check() to validate that the object is
consistent with the format constraints, and _write(stream) to serialize the
object to a stream. A failed consistency check raises a ConsistencyError.
As of v0.11, serialization (writing) is available for Java and Python, with other target languages to follow. Parsing (reading) is supported across all target languages.
Format gallery
The format gallery at formats.kaitai.io is a
public collection of .ksy descriptions for real-world formats. Per the gallery,
every entry has a formal specification in the Kaitai Struct language and can be
used as a concise text reference, viewed as a Graphviz block diagram, explored as
an annotated hex dump in the visualizer, or compiled into a ready-made library in
any supported language.
It spans many categories, including archives (ZIP, gzip, tar), executables (ELF,
PE, Mach-O), filesystems (ext2, ISO 9660, FAT), images (PNG, JPEG, GIF), media
(WAV, MP3, AVI), network protocols (DNS, Ethernet), and more. The descriptions
live in the kaitai_struct_formats repository, which the main project includes as a Git submodule.
Expression language
The expression language is the small embedded DSL used inside .ksy files
wherever a computed value is needed — pos, size, if conditions, value
instances, and repeat counts. It works over integers, floats, strings,
booleans, byte arrays, and user-defined types, and supports:
- Arithmetic: +, -, *, /
- Bitwise: &, |, ^
- Logical: and, or, not
- Relational: ==, !=, <, >, <=, >=
It also offers methods such as .length (for strings and byte arrays), .size (for
arrays), and .to_s for conversions, and can reference previously parsed
attributes by name.
# Illustrative example
instances:
  total:
    value: header.count * 4 + 8  # arithmetic over a parsed field