Stop Being a Soy Dev: Write Your Own Parser

TLDR: Because JSON.parse() is cheating.

JSON Parser… seriously?

If you know me personally, you’re probably surprised I’m writing a blog post about building a JSON parser. Yes, yes, me. The same person who never misses a chance to roast JSON as an inefficient excuse for a data-transfer format.

Yeah, yeah, I know. The irony isn’t lost on me. It’s not like I hated JSON from day one. For a long time, JSON and XML were basically the only formats I knew. This was barely two years ago, before I started wandering outside the warm, padded walls of Flutter and into the colder, sharper world of systems and low-level tooling.

Somewhere along the way, curiosity won. And I started wondering why certain parts of computer science are so absurdly elegant and efficient that they feel like they were handed down by literal gods, etched into stone tablets somewhere in a cold, perfectly optimized universe. Lambda calculus. Functional programming foundations. Quicksort. Red-black trees. B-trees. Even something as deceptively simple as a linked list. These are the kinds of ideas that hit you with a full-on mental orgasm… solutions so clean they make you briefly believe intelligence has a ceiling, and that I’m still clearly far from it…

And then there are other parts. Parts so wildly inefficient, so spiritually offensive, that they make you stop and ask: Am I just not smart enough to understand the 5-dimensional, intergalactic chess move that led to this design? Or did someone, at some point, look at the problem, shrug, say “eh, good enough,” and the rest of us have just been coping with the consequences ever since? Well, I don’t think I know enough yet to be the judge of that.

Tokenizing

Okay, enough rambling. Let’s start building. The first step is tokenization. Tokenization is the process of turning a raw input stream into a sequence of well-defined units called tokens that carry semantic meaning for the system you’re building.

Think of it like a competent engineer disassembling a piece of hardware. If you were taking apart a MacBook, you wouldn’t tear components apart at random. The screen comes out as a single module. The battery is removed as a unit. The keyboard stays whole. Each piece is isolated at a boundary that makes sense. Those pieces are your tokens.

No real engineer splits a keyboard in half. That doesn’t increase understanding; it destroys structure. Tokenization follows the same principle: we don’t split the input arbitrarily, and we don’t just explode strings into characters and hope for the best. We break the input at boundaries that preserve meaning, into pieces we can later assemble into something smarter.

Tokenization is how we convince ourselves that an unhinged stream of characters actually has structure. By chopping the input into tokens, keywords, symbols, and literals, we stop staring at raw text like cavemen and start pretending it has rules.

Once tokenized, the input suddenly becomes legible. We can point at things and say, “That’s a string,” “that’s a number,” or “that’s absolutely not allowed here.” Congratulations, the chaos now has labels.

Tokenization is also where error handling stops being a vague shrug. Instead of “something broke somewhere,” we get to say, “you forgot a comma,” “this token does not belong here,” or the classic “your string never ended, and now everything is on fire.” Errors become precise, accusatory, and mildly judgmental.

In short, tokenization is the phase where we stop trusting the input and start holding it accountable.

Take, for example, the JSON below

{
  "id": "u_647ceaf3657eade56f8224eb",
  "email": "user@example.com",
  "username": "kodx",
  "isActive": true,
  "age": 28,
  "roles": ["user", "admin"],
  "profile": {
    "firstName": "Olamilekan",
    "lastName": "Yusuf",
    "avatarUrl": null
  },
  "createdAt": "2024-11-09T12:45:30Z"
}

We can see it has a bit of everything: strings, numbers, a boolean, a null, an array, and a nested object.

For the JSON above, tokenization gives us the following token stream:

[
    Token(TokenType.BraceOpen, "{")
    Token(TokenType.String, "id")
    Token(TokenType.Colon, ":")
    Token(TokenType.String, "u_647ceaf3657eade56f8224eb")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "email")
    Token(TokenType.Colon, ":")
    Token(TokenType.String, "user@example.com")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "username")
    Token(TokenType.Colon, ":")
    Token(TokenType.String, "kodx")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "isActive")
    Token(TokenType.Colon, ":")
    Token(TokenType.True, "true")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "age")
    Token(TokenType.Colon, ":")
    Token(TokenType.Number, "28")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "roles")
    Token(TokenType.Colon, ":")
    Token(TokenType.BracketOpen, "[")
    Token(TokenType.String, "user")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "admin")
    Token(TokenType.BracketClose, "]")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "profile")
    Token(TokenType.Colon, ":")
    Token(TokenType.BraceOpen, "{")
    Token(TokenType.String, "firstName")
    Token(TokenType.Colon, ":")
    Token(TokenType.String, "Olamilekan")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "lastName")
    Token(TokenType.Colon, ":")
    Token(TokenType.String, "Yusuf")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "avatarUrl")
    Token(TokenType.Colon, ":")
    Token(TokenType.Null, "null")
    Token(TokenType.BraceClose, "}")
    Token(TokenType.Comma, ",")
    Token(TokenType.String, "createdAt")
    Token(TokenType.Colon, ":")
    Token(TokenType.String, "2024-11-09T12:45:30Z")
    Token(TokenType.BraceClose, "}")
]

Create a file called token.dart and define the token types there. Token types are basically how we tell the parser what kind of thing it’s looking at without guessing. Every token falls into one of two broad categories: value tokens and syntactic tokens.

Value tokens represent actual data: strings, numbers, booleans, null. These are the things that mean something.

Syntactic tokens, on the other hand, exist purely to impose structure. Braces, brackets, commas, colons. They don’t carry values; they just tell us how everything is arranged. They’re the punctuation of the format, quietly judging you when you misuse them.

Once these categories are clearly defined, the rest of the tokenizer stops being magic and starts being bookkeeping, which is exactly how you want it.

enum TokenType {
  BraceOpen,
  BraceClose,
  BracketOpen,
  BracketClose,
  String,
  Number,
  Comma,
  Colon,
  True,
  False,
  Null,
}

final class Token {
  final TokenType type;
  final String value;

  const Token(this.type, this.value);

  @override
  String toString() => 'Token($type, "$value")';
}

Next, let’s create a tokenizer.dart file and add a few things. The two main players here are the Lexer and the Tokenizer.

A Lexer is the lowest-level piece of the pipeline. It deals purely with raw characters. It walks the input one character at a time, keeps track of position, and answers mechanical questions like “what character am I currently on?”, “what’s the next character?”, and “Can I safely advance?”. It has zero concern for meaning, its only job is to consume input correctly without losing its place.

The Tokenizer lives one level above that. It uses the lexer’s character stream to group characters into tokens that actually mean something to the parser, numbers, strings, braces, keywords, and so on.

A useful mental model: the lexer is the eyes and hands; the tokenizer is the brain that interprets what’s being seen.

Keeping these two separate makes the system cleaner, more predictable, and far easier to debug, especially when JSON inevitably does something annoying.

final class Lexer {
  final String input;
  int current = 0;

  Lexer(this.input);

  String get peek => input[current];

  bool get isAtEnd => current == input.length;

  String advance() => input[current++];
}
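Just to see it in action (a throwaway check, nothing you need to keep), you can watch the lexer walk a string one character at a time:

void main() {
  final lexer = Lexer('{"a":1}');

  // Walk the raw characters, printing the position and the character at it.
  while (!lexer.isAtEnd) {
    print('${lexer.current}: ${lexer.peek}');
    lexer.advance();
  }
}

Notice the lexer never asks what any of those characters mean. That part is the tokenizer’s job.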

Next, let’s add the Tokenizer. It wraps the Lexer and exposes a single static function that returns the tokens we’ll need for the next steps.

final class Tokenizer {
  const Tokenizer();

  static List<Token> tokenize(String input) {
    final lexer = Lexer(input);
    final tokens = <Token>[];

    return tokens;
  }
}

The first tokens we want to recognize are the easy structural ones:

BraceOpen, BraceClose, BracketOpen, BracketClose, Colon, and Comma.

These are basically the skeleton of JSON, so we can handle them up front with a simple lookup table.

final class Tokenizer {
  const Tokenizer();

  static const Map<String, TokenType> _structuralTokens = {
    '{': TokenType.BraceOpen,
    '}': TokenType.BraceClose,
    '[': TokenType.BracketOpen,
    ']': TokenType.BracketClose,
    ':': TokenType.Colon,
    ',': TokenType.Comma,
  };

  static List<Token> tokenize(String input) {
    final lexer = Lexer(input);
    final tokens = <Token>[];

    while (!lexer.isAtEnd) {
      final char = lexer.peek;

      final TokenType? type = _structuralTokens[char];
      if (type != null) {
        lexer.advance();
        tokens.add(Token(type, char));
        continue;
      }


      throw FormatException('Unexpected character: $char');
    }

    return tokens;
  }
}

Now, before we go any further, we need to deal with something JSON has a lot of: whitespace.

If the JSON is formatted nicely, it’s going to contain spaces, tabs, and newlines everywhere.

But none of that actually matters for parsing. It’s purely there for readability; Clankers don’t need it.

So the tokenizer should just skip over whitespace completely.

final class Tokenizer {
  const Tokenizer();

  static const Map<String, TokenType> _structuralTokens = {
    '{': TokenType.BraceOpen,
    '}': TokenType.BraceClose,
    '[': TokenType.BracketOpen,
    ']': TokenType.BracketClose,
    ':': TokenType.Colon,
    ',': TokenType.Comma,
  };

  static List<Token> tokenize(String input) {
    final lexer = Lexer(input);
    final tokens = <Token>[];

    while (!lexer.isAtEnd) {
      final char = lexer.peek;

      final TokenType? type = _structuralTokens[char];
      if (type != null) {
        lexer.advance();
        tokens.add(Token(type, char));
        continue;
      }

      if (_isWhitespace(char)) {
        lexer.advance();
        continue;
      }

      throw FormatException('Unexpected character: $char');
    }

    return tokens;
  }

  static bool _isWhitespace(String c) => RegExp(r'\s').hasMatch(c);
}

At this point, we can already tokenize the basic structure of any JSON document.
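A quick throwaway check to convince yourself the whitespace handling works: feed it nothing but structure and whitespace, and confirm the whitespace simply disappears.

void main() {
  final tokens = Tokenizer.tokenize('{ } [ ] : , \n\t');
  print(tokens.length); // 6: only the structural tokens survive
  print(tokens.first);  // Token(TokenType.BraceOpen, "{")
}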

Next up, we’ll start handling the real payload. Let’s begin with the “literal” values: numbers, booleans, and null (strings get their own helper right after). To read those cleanly, we’ll introduce a small helper.

First, we’ll need a couple of utility checks:

static bool _isAlphaNumeric(String c) => RegExp(r'[a-zA-Z0-9]').hasMatch(c);

static bool _isNumber(String s) => RegExp(r'^\d+$').hasMatch(s);

static Token _readLiteral(Lexer lexer) {
  String value = "";
  while (!lexer.isAtEnd && _isAlphaNumeric(lexer.peek)) {
    value += lexer.advance();
  }

  if (_isNumber(value)) return Token(TokenType.Number, value);
  if (value == 'true') return Token(TokenType.True, value);
  if (value == 'false') return Token(TokenType.False, value);
  if (value == 'null') return Token(TokenType.Null, value);

  throw FormatException('Unexpected literal: $value');
}

What this does is pretty straightforward: it keeps consuming alphanumeric characters until it hits a boundary, then classifies the result. A run of digits becomes a Number token, true, false, and null become their keyword tokens, and anything else is rejected as an unexpected literal.

Now we just plug this into the main loop:

    while (!lexer.isAtEnd) {
      final char = lexer.peek;

      // .....

      if (_isAlphaNumeric(char)) {
        tokens.add(_readLiteral(lexer));
        continue;
      }

      // .....

      throw FormatException('Unexpected character: $char');
    }

So now we can correctly recognize things like 28, true, false, and null.
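One honest caveat before moving on: _isAlphaNumeric and _isNumber only cover unsigned integers, so 28 is fine but -3, 4.5, or 1e9 would die with “Unexpected character”. That’s good enough for this walkthrough, but if you want fuller JSON numbers, a dedicated reader along these lines would do it (a sketch of a hypothetical _readNumber, not part of the final code below):

static Token _readNumber(Lexer lexer) {
  String value = '';

  // Greedily consume anything that can appear in a JSON number:
  // an optional sign, digits, a decimal point, and an exponent.
  while (!lexer.isAtEnd && RegExp(r'[0-9eE+\-.]').hasMatch(lexer.peek)) {
    value += lexer.advance();
  }

  // Let Dart's own number parser be the judge of whether the shape is valid.
  if (num.tryParse(value) == null) {
    throw FormatException('Invalid number: $value');
  }

  return Token(TokenType.Number, value);
}

You’d call it from the main loop whenever the current character is a digit or a minus sign, before falling through to the generic _isAlphaNumeric branch.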

Next up: strings.

The nice thing about JSON strings is that they’re easy to spot — they always start and end with a double quote ("). So we can write a dedicated helper for that:

  static Token _readString(Lexer lexer) {
    lexer.advance(); // skip opening quote
    String value = '';

    while (!lexer.isAtEnd && lexer.peek != '"') {
      value += lexer.advance();
    }

    if (lexer.isAtEnd) throw FormatException('Unterminated string');

    lexer.advance(); // skip closing quote
    return Token(TokenType.String, value);
  }

And then add it into the tokenizer loop:

  static List<Token> tokenize(String input) {
    final lexer = Lexer(input);
    final tokens = <Token>[];

    while (!lexer.isAtEnd) {
      //...

      if (char == '"') {
        tokens.add(_readString(lexer));
        continue;
      }

      if (_isAlphaNumeric(char)) {
        tokens.add(_readLiteral(lexer));
        continue;
      }

      // ....
    }

    return tokens;
  }

At this stage, our tokenizer can break JSON down into every kind of piece we care about: structure, strings, numbers, and the keyword literals.

The full code should look like this:

final class Tokenizer {
  const Tokenizer();

  static const Map<String, TokenType> _structuralTokens = {
    '{': TokenType.BraceOpen,
    '}': TokenType.BraceClose,
    '[': TokenType.BracketOpen,
    ']': TokenType.BracketClose,
    ':': TokenType.Colon,
    ',': TokenType.Comma,
  };

  static List<Token> tokenize(String input) {
    final lexer = Lexer(input);
    final tokens = <Token>[];

    while (!lexer.isAtEnd) {
      final char = lexer.peek;

      final TokenType? type = _structuralTokens[char];
      if (type != null) {
        lexer.advance();
        tokens.add(Token(type, char));
        continue;
      }

      if (char == '"') {
        tokens.add(_readString(lexer));
        continue;
      }

      if (_isAlphaNumeric(char)) {
        tokens.add(_readLiteral(lexer));
        continue;
      }

      if (_isWhitespace(char)) {
        lexer.advance();
        continue;
      }

      throw FormatException('Unexpected character: $char');
    }

    return tokens;
  }

  static Token _readLiteral(Lexer lexer) {
    String value = "";
    while (!lexer.isAtEnd && _isAlphaNumeric(lexer.peek)) {
      value += lexer.advance();
    }

    if (_isNumber(value)) return Token(TokenType.Number, value);
    if (value == 'true') return Token(TokenType.True, value);
    if (value == 'false') return Token(TokenType.False, value);
    if (value == 'null') return Token(TokenType.Null, value);

    throw FormatException('Unexpected literal: $value');
  }

  static Token _readString(Lexer lexer) {
    lexer.advance(); // skip opening quote
    String value = '';

    while (!lexer.isAtEnd && lexer.peek != '"') {
      value += lexer.advance();
    }

    if (lexer.isAtEnd) throw FormatException('Unterminated string');

    lexer.advance(); // skip closing quote
    return Token(TokenType.String, value);
  }

  static bool _isWhitespace(String c) => RegExp(r'\s').hasMatch(c);

  static bool _isAlphaNumeric(String c) => RegExp(r'[a-zA-Z0-9]').hasMatch(c);
  static bool _isNumber(String s) => RegExp(r'^\d+$').hasMatch(s);
}

And that’s basically it for the Tokenizer.
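Well, almost. One thing worth admitting: _readString copies characters verbatim, so an escape like \n comes through as two raw characters, and an escaped quote (\") actually terminates the string early. If you want to handle escapes, the string reader grows into something like this (a sketch; \uXXXX escapes are left as an exercise):

static Token _readString(Lexer lexer) {
  lexer.advance(); // skip opening quote
  final buffer = StringBuffer();

  // The single-character escapes we translate.
  const escapes = {'n': '\n', 't': '\t', 'r': '\r', '"': '"', r'\': '\\', '/': '/'};

  while (!lexer.isAtEnd && lexer.peek != '"') {
    final char = lexer.advance();
    if (char == r'\') {
      if (lexer.isAtEnd) throw FormatException('Unterminated string');
      final escaped = lexer.advance();
      final replacement = escapes[escaped];
      if (replacement == null) {
        throw FormatException('Unsupported escape: \\$escaped');
      }
      buffer.write(replacement);
    } else {
      buffer.write(char);
    }
  }

  if (lexer.isAtEnd) throw FormatException('Unterminated string');

  lexer.advance(); // skip closing quote
  return Token(TokenType.String, buffer.toString());
}

For this walkthrough, though, the naive version above is plenty.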

Now that we have the raw input turned into a proper token stream, we can finally start building something meaningful on top of it…

An AST: an Abstract Syntax Tree.

Take this JSON:

[1, true, "hello"]

Tokenization gives us a flat list: BracketOpen, Number(1), Comma, True, Comma, String("hello"), BracketClose.

That’s useful, but it’s still just pieces lying on the floor.

The AST is when we finally assemble them into something meaningful: an array node that holds a number node, a boolean node, and a string node.

Now we’re no longer dealing with symbols and commas.

We’re dealing with actual structure:

“This is an array containing three values.”

That’s the whole point of an AST. It turns token soup into a shape the parser can actually work with. If tokens are the ingredients, the AST is the meal.

First, let’s create a file called ast_node.dart and define the core node types our parser will work with.

These nodes represent every possible value that can exist in JSON:

sealed class ASTNode {
  const ASTNode();
}

final class ObjectNode extends ASTNode {
  final Map<String, ASTNode> value;
  const ObjectNode(this.value);
}

final class ArrayNode extends ASTNode {
  final List<ASTNode> value;
  const ArrayNode(this.value);
}

final class StringNode extends ASTNode {
  final String value;
  const StringNode(this.value);
}

final class NumberNode extends ASTNode {
  final num value;
  const NumberNode(this.value);
}

final class BooleanNode extends ASTNode {
  final bool value;
  const BooleanNode(this.value);
}

final class NullNode extends ASTNode {
  const NullNode();
}

These are basically all the building blocks we need for a JSON parser.

So the AST naturally becomes a tree: objects and arrays are the interior nodes, and strings, numbers, booleans, and null are the leaves.

With these node types in place, we now have the exact shape we need to represent any JSON document as a proper AST.
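To make the shape concrete, here’s the [1, true, "hello"] array from earlier, built by hand out of these node types:

// [1, true, "hello"] expressed directly as an AST.
const ast = ArrayNode([
  NumberNode(1),
  BooleanNode(true),
  StringNode('hello'),
]);

The parser’s entire job is to produce exactly this kind of value automatically from the token stream.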

Next, we’ll write the parser that takes our token stream and actually builds this tree. Let’s create a Parser class.

final class Parser {
  final List<Token> tokens;
  Parser(this.tokens);

  int current = 0;
  Token _peek() => tokens[current];
  Token _advance() => tokens[current++];

  bool get isAtEnd => current >= tokens.length;
}

This is the core setup.

The parser receives a list of tokens, and we keep a simple pointer (current) that tells us where we are in that list.

We also add a few helpers: _peek() looks at the current token without consuming it, _advance() consumes it and moves the pointer forward, and isAtEnd tells us when we’ve run out of tokens.

This is basically the same idea as the lexer pointer, just operating at the token level instead of raw characters.

Now we add the main entry point: the parser() function.

final class Parser {
  // ....

  ASTNode parser() {
    if (tokens.isEmpty) throw StateError("Nothing to parse");

    final token = _peek();
    switch (token.type) {
      case TokenType.String:
        _advance();
        return StringNode(token.value);
      case TokenType.Number:
        _advance();
        return NumberNode(num.parse(token.value));
      case TokenType.True:
        _advance();
        return BooleanNode(true);
      case TokenType.False:
        _advance();
        return BooleanNode(false);
      case TokenType.Null:
        _advance();
        return NullNode();

      default:
        throw FormatException('Unexpected token: ${token.type}');
    }
  }
}

If there are no tokens, we throw immediately; there’s nothing to parse.

Otherwise, we look at the current token and decide what node it represents: a String token becomes a StringNode, a Number becomes a NumberNode (via num.parse), True and False become BooleanNodes, and Null becomes a NullNode.

So at this stage, we can already parse leaf values.
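A quick check, wiring it up to the tokenizer from earlier:

void main() {
  final number = Parser(Tokenizer.tokenize('28')).parser();
  print(number is NumberNode); // true

  final flag = Parser(Tokenizer.tokenize('true')).parser();
  print(flag is BooleanNode); // true
}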

But JSON isn’t just leaf values.

The real fun starts with arrays and objects.

Next, let’s handle arrays.

ArrayNode _parseArray() {
  final List<ASTNode> nodes = <ASTNode>[];
  _advance(); // Skip [

  while (!isAtEnd && _peek().type != TokenType.BracketClose) {
    nodes.add(parser());
    if (!isAtEnd && _peek().type == TokenType.Comma) {
      _advance(); // skip ','
    }
  }

  if (isAtEnd) {
    throw FormatException('Expected "]" at end of array');
  }

  _advance(); // skip ]
  return ArrayNode(nodes);
}

Here’s what’s happening: we skip the opening [, then keep parsing values until we reach the closing ], skipping the commas between elements as we go. Each element can be any JSON value, including another array or object.

That’s why we call parser() recursively.

This is how nesting works naturally.

Now we plug it into the main switch:

final class Parser {

  // ...

  ASTNode parser() {
    // ...

    switch (token.type) {

      // ...
      case TokenType.BracketOpen:
        return _parseArray();

      default:
        throw FormatException('Unexpected token: ${token.type}');
    }
  }
}

With that, we can handle arrays, including nested ones.
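For example, the small array from the AST section now parses end to end:

void main() {
  final node = Parser(Tokenizer.tokenize('[1, true, "hello"]')).parser();

  // An ArrayNode holding a NumberNode, a BooleanNode, and a StringNode.
  print((node as ArrayNode).value.length); // 3
}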

Now the final boss: objects.

ObjectNode _parseObject() {
  final node = <String, ASTNode>{};
  _advance(); // Skip {

  while (!isAtEnd && _peek().type != TokenType.BraceClose) {
    final token = _advance();
    final key = token.value;
    if (token.type != TokenType.String) {
      throw FormatException(
        'Expected string key in object, got: ${token.type} ("${token.value}")',
      );
    }

    final colon = _advance();
    if (colon.type != TokenType.Colon) {
      throw FormatException('Expected ":" after key in object');
    }

    final value = parser();
    node[key] = value;

    if (!isAtEnd && _peek().type == TokenType.Comma) {
      _advance(); // Skip ','
    }
  }

  if (isAtEnd) {
    throw FormatException('Expected "}" at end of object');
  }

  _advance(); // Skip }
  return ObjectNode(node);
}

Objects are slightly more structured than arrays.

Inside {} we always expect:

  1. A string key
  2. A colon
  3. A value
  4. A comma before the next entry (our parser treats commas as optional and simply skips them)

So the parser does exactly that: read the key, demand a colon, recursively parse the value, then skip past any comma and repeat until it reaches the closing brace.

Again, recursion is doing all the heavy lifting.

An object can contain arrays, objects, or leaf nodes, and the parser naturally builds the tree.

Now we add it into the main parser:

final class Parser {

  // ...

  ASTNode parser() {
    // ...

    switch (token.type) {

      // ...
      case TokenType.BracketOpen:
        return _parseArray();

      case TokenType.BraceOpen:
        return _parseObject();

      default:
        throw FormatException('Unexpected token: ${token.type}');
    }
  }
}

At this point, the full parser looks like this:

final class Parser {
  final List<Token> tokens;
  Parser(this.tokens);

  int current = 0;
  Token _peek() => tokens[current];
  Token _advance() => tokens[current++];

  bool get isAtEnd => current >= tokens.length;

  ASTNode parser() {
    if (tokens.isEmpty) throw StateError("Nothing to parse");

    final token = _peek();
    switch (token.type) {
      case TokenType.String:
        _advance();
        return StringNode(token.value);
      case TokenType.Number:
        _advance();
        return NumberNode(num.parse(token.value));
      case TokenType.True:
        _advance();
        return BooleanNode(true);
      case TokenType.False:
        _advance();
        return BooleanNode(false);
      case TokenType.Null:
        _advance();
        return NullNode();

      case TokenType.BracketOpen:
        return _parseArray();

      case TokenType.BraceOpen:
        return _parseObject();

      default:
        throw FormatException('Unexpected token: ${token.type}');
    }
  }

  ArrayNode _parseArray() {
    final List<ASTNode> nodes = <ASTNode>[];
    _advance(); // Skip [

    while (!isAtEnd && _peek().type != TokenType.BracketClose) {
      nodes.add(parser());
      if (!isAtEnd && _peek().type == TokenType.Comma) {
        _advance(); // skip ','
      }
    }

    if (isAtEnd) {
      throw FormatException('Expected "]" at end of array');
    }

    _advance(); // skip ]
    return ArrayNode(nodes);
  }

  ObjectNode _parseObject() {
    final node = <String, ASTNode>{};
    _advance(); // Skip {

    while (!isAtEnd && _peek().type != TokenType.BraceClose) {
      final token = _advance();
      final key = token.value;
      if (token.type != TokenType.String) {
        throw FormatException(
          'Expected string key in object, got: ${token.type} ("${token.value}")',
        );
      }

      final colon = _advance();
      if (colon.type != TokenType.Colon) {
        throw FormatException('Expected ":" after key in object');
      }

      final value = parser();
      node[key] = value;

      if (!isAtEnd && _peek().type == TokenType.Comma) {
        _advance(); // Skip ','
      }
    }

    if (isAtEnd) {
      throw FormatException('Expected "}" at end of object');
    }

    _advance(); // Skip }
    return ObjectNode(node);
  }
}

Now, if you run it with an input like this:

final String input = '''
{
  "id": "u_647ceaf3657eade56f8224eb",
  "email": "user@example.com",
  "username": "kodx",
  "isActive": true,
  "age": 28,
  "roles": ["user", "admin"],
  "profile": {
    "firstName": "Olamilekan",
    "lastName": "Yusuf",
    "avatarUrl": null
  },
  "createdAt": "2024-11-09T12:45:30Z"
}
''';

final tokens = Tokenizer.tokenize(input);
final ast = Parser(tokens).parser();
print(ast);

You’ll get an AST output like:

{
  "id": StringNode(value: u_647ceaf3657eade56f8224eb),
  "email": StringNode(value: user@example.com),
  "username": StringNode(value: kodx),
  "isActive": BooleanNode(value: true),
  "age": NumberNode(value: 28),
  "roles": [
    StringNode(value: user),
    StringNode(value: admin)
  ],
  "profile": {
    "firstName": StringNode(value: Olamilekan),
    "lastName": StringNode(value: Yusuf),
    "avatarUrl": NullNode(value: null)
  },
  "createdAt": StringNode(value: 2024-11-09T12:45:30Z)
}
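One small note on that output: the node classes as defined earlier don’t override toString, so printing the tree as-is would just give you a wall of Instance of 'ObjectNode'. To get readable output along these lines, add an override like this to each node class (a minimal sketch for StringNode; the others follow the same pattern, and the Map and List inside ObjectNode and ArrayNode take care of their own formatting):

final class StringNode extends ASTNode {
  final String value;
  const StringNode(this.value);

  @override
  String toString() => 'StringNode(value: $value)';
}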

And that’s basically it.

We’ve taken raw text…

→ tokenized it

→ parsed it

→ built a full AST representation

Now that you have this tree, you can do anything with it: collapse it back into plain Dart maps and lists (see the sketch below), pretty-print it, validate it against a schema, query it, or transform it into a different format entirely.
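As one concrete example (a small sketch built on the sealed ASTNode hierarchy above), here’s a function that collapses the tree back into plain Dart values, which is essentially what JSON.parse hands you in a single step:

// Recursively convert an ASTNode back into ordinary Dart values.
Object? toDartValue(ASTNode node) => switch (node) {
      StringNode(:final value) => value,
      NumberNode(:final value) => value,
      BooleanNode(:final value) => value,
      NullNode() => null,
      ArrayNode(:final value) => [for (final item in value) toDartValue(item)],
      ObjectNode(:final value) => {
        for (final entry in value.entries) entry.key: toDartValue(entry.value),
      },
    };

Because ASTNode is sealed, that switch is exhaustive: forget a node type and the compiler complains, which is exactly the kind of free safety you want from your own parser.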

So, there it is. We’ve successfully descended into the gutter of raw strings and clawed our way back out with a structured tree. Is it as fast as a hand-optimized C parser written by a ginger-bearded wizard in 1988? Probably not. Is JSON still an inefficient, bloated excuse for a data format that makes binary formats look like divine architecture? Absolutely.

But at least now, when you roast it, you’re doing so with the smug, unearned confidence of someone who has actually seen its skeleton. We’ve turned chaos into a tree, and in the process, we’ve moved one step closer to that intellectual ceiling. We’ve essentially pulled a Fullmetal Alchemist: we took a pile of raw materials, understood its chemical makeup, and transmuted it into something useful, though, unlike Edward Elric, the only thing we sacrificed was a few hours of sleep and our remaining respect for double quotes.

Now, if you’ll excuse me, I need to go back to reading more Slop Manhwa to cleanse my palate and remind myself what true Slop feels like. The system might be broken, but hey, the parser finally works…