Why I stopped using JSON for MQTT and started building gRPC-like communication in Zig

I've spent the last few weeks building ProtoMQ, an MQTT broker that validates and encodes messages using Protocol Buffers — with both the MQTT parser and the Protobuf engine written from scratch in Zig.

GitHub - electricalgorithm/protomq: ProtoMQ: Type-safe, bandwidth-efficient MQTT for the rest of us. Stop sending bloated JSON over the wire.

What's the problem?

I love working with IoT devices. In most setups I've seen, sensor data flows through MQTT as JSON: it's human-readable, every language has a parser, and you can inspect payloads with mosquitto_sub when something goes wrong. Unless your use case is unusually demanding, this works fine.

But JSON has real costs at the edge:

  • A 12-field sensor reading takes 310 bytes in JSON. The same data in Protobuf: 82 bytes. That's a 74% reduction. A custom binary format can shrink it even further.
  • On a cellular-connected device sending data every 5 seconds (17,280 messages a day), the 228-byte difference adds up to ~3.9 MB/day per device. Multiply by a fleet of 10,000 devices and you're looking at ~39 GB/day of unnecessary data transfer.
  • More bytes on the wire means more radio time, which means shorter battery life.
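
As a quick back-of-the-envelope check, the sketch below plugs the payload sizes, send interval, and fleet size quoted above into a small calculation. It counts payload bytes only; MQTT, TCP, and TLS framing are ignored, and the function name is just for illustration.

```python
# Figures quoted in the list above.
JSON_BYTES = 310        # 12-field reading as JSON
PROTOBUF_BYTES = 82     # same reading as Protobuf
SEND_INTERVAL_S = 5     # one message every 5 seconds
FLEET_SIZE = 10_000

SECONDS_PER_DAY = 24 * 60 * 60


def daily_savings_bytes(json_bytes: int, proto_bytes: int, interval_s: int) -> int:
    """Payload bytes saved per device per day (framing overhead ignored)."""
    messages_per_day = SECONDS_PER_DAY // interval_s
    return (json_bytes - proto_bytes) * messages_per_day


reduction = 1 - PROTOBUF_BYTES / JSON_BYTES
per_device = daily_savings_bytes(JSON_BYTES, PROTOBUF_BYTES, SEND_INTERVAL_S)
fleet = per_device * FLEET_SIZE

print(f"size reduction: {reduction:.0%}")
print(f"per device:     {per_device / 1e6:.1f} MB/day")
print(f"fleet:          {fleet / 1e9:.1f} GB/day")
```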

Protobuf solves the size problem, but introduces its own drawbacks: you need code generation in every client language, you need to keep generated stubs in sync when schemas change, and debugging becomes harder because payloads are opaque binary.

Why not go with a custom binary packet format?

This is the first thing most embedded teams try (I've done it too). You define a packed struct, serialize it on the device, deserialize it on the server. It's compact — often even smaller than Protobuf — and dead simple to implement.

The problem shows up about three months in, when you need to add a field.

With a custom binary format, the server and every device in the field must agree on the exact byte layout. There are no field tags, no length prefixes, no way to tell what version of the struct you're looking at. Adding a field means every device needs an OTA update before the server can start accepting the new format, or you end up maintaining parallel parsers for v1, v2, v3, each with their own offsets and sizes.

This has a few consequences:

  • OTA updates become critical. Every schema change requires a firmware update pushed to the entire fleet. If even one device misses the update, its data is silently corrupted or rejected.
  • End-to-end ownership is mandatory. You need one team (or one person) who controls the struct definition, the firmware serializer, the server deserializer, and the OTA pipeline. In practice this means the device team and the backend team are tightly coupled on every field change.
  • No schema introspection. A new developer joining the project can't look at the data on the wire and understand it. There's no .proto file to read, no schema registry to query — just a header file somewhere in the firmware repo that may or may not match what's actually deployed.
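
To make the silent-corruption case concrete, here is a minimal sketch using Python's struct module. Both layouts and all field names are hypothetical; they only stand in for a packed C struct whose v2 inserts one field in the middle.

```python
import struct

# Hypothetical v1 layout (little-endian, no padding):
#   uint32 device_id | float32 temperature_c | uint16 battery_mv
V1 = struct.Struct("<IfH")

# Hypothetical v2 layout: a float32 humidity field inserted before battery_mv.
V2 = struct.Struct("<IffH")


def decode_v1(buf: bytes) -> dict:
    """A server that only knows the v1 layout; it reads the first V1.size bytes."""
    device_id, temperature_c, battery_mv = V1.unpack_from(buf)
    return {
        "device_id": device_id,
        "temperature_c": temperature_c,
        "battery_mv": battery_mv,
    }


old_frame = V1.pack(42, 21.5, 3700)
new_frame = V2.pack(42, 21.5, 55.0, 3700)

print(decode_v1(old_frame))  # battery_mv decodes as sent
print(decode_v1(new_frame))  # no error, but battery_mv is read from the humidity bytes
```

Nothing in the second frame tells the v1 parser that the layout changed, so the bad value flows downstream without any exception.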

Protobuf avoids all of this because fields are tagged and the format is self-describing enough to be forward- and backward-compatible. You can add fields without breaking old producers or consumers. ProtoMQ takes it a step further by making the broker the single owner of the schema, so there's one place to look and one place to update. But that's for the next sections!
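
The forward-compatibility part can be sketched in a few lines: because every field carries its number and wire type, an old consumer can skip fields it has never heard of. The schema below (field numbers and names) is hypothetical, and this is not ProtoMQ's decoder, just an illustration of the wire-format rule.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


# Fields the *old* consumer knows about: field number -> name (hypothetical).
KNOWN_FIELDS = {1: "device_id", 2: "battery_mv"}


def decode_known(buf: bytes) -> dict:
    """Decode known varint fields and skip everything else by wire type."""
    out, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_no, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                  # varint
            value, pos = read_varint(buf, pos)
            if field_no in KNOWN_FIELDS:
                out[KNOWN_FIELDS[field_no]] = value
        elif wire_type == 1:                # 64-bit fixed: skip 8 bytes
            pos += 8
        elif wire_type == 2:                # length-delimited: skip the payload
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:                # 32-bit fixed: skip 4 bytes
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return out


# A newer producer added field 3 (a string) that this consumer has never seen:
#   08 2A          field 1, varint 42
#   10 F4 1C       field 2, varint 3700
#   1A 02 68 69    field 3, length-delimited "hi"
new_message = bytes([0x08, 0x2A, 0x10, 0xF4, 0x1C, 0x1A, 0x02, 0x68, 0x69])
print(decode_known(new_message))
```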

Solution Proposal

So the goal became: get payloads as compact as a custom binary format, but keep clients and the server completely decoupled.

The design I landed on is that the broker owns the .proto schemas and maps them to MQTT topics. Clients don't need to ship with hardcoded struct definitions or pre-compiled stubs. Instead, they discover what schemas exist by querying the broker's service discovery topic ($SYS/discovery/request), fetch the .proto definition at runtime, and construct their payloads from it. It's an approach I picked up while working with gRPC and service discovery. If a schema changes, clients pick up the new version on the next discovery, with no OTA, no redeployment, and no version negotiation.
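
A toy in-memory model may help picture the ownership split. Everything in it is an assumption made for illustration: the class names, the version counter, and the shape of the discovery result are invented here and are not ProtoMQ's actual API or wire format.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaEntry:
    message_type: str
    proto_source: str
    version: int


@dataclass
class SchemaRegistry:
    """Hypothetical broker-side mapping of MQTT topic -> schema."""
    entries: dict = field(default_factory=dict)

    def register(self, topic: str, message_type: str, proto_source: str) -> None:
        # Re-registering a topic bumps its version (an assumption of this sketch).
        previous = self.entries.get(topic)
        version = previous.version + 1 if previous else 1
        self.entries[topic] = SchemaEntry(message_type, proto_source, version)

    def discover(self) -> dict:
        """What a client might learn from a discovery request."""
        return {t: (e.message_type, e.version) for t, e in self.entries.items()}

    def fetch(self, topic: str) -> str:
        """The .proto text a client would build its payloads from."""
        return self.entries[topic].proto_source


registry = SchemaRegistry()
registry.register("sensors/temp", "SensorReading",
                  "message SensorReading { float temperature_c = 1; }")
first_view = registry.discover()

# The schema gains a field on the broker; no client binary changes.
registry.register("sensors/temp", "SensorReading",
                  "message SensorReading { float temperature_c = 1; uint32 battery_mv = 2; }")
second_view = registry.discover()

print(first_view)
print(second_view)
```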

Everything on the wire is binary Protobuf: it's compact enough, there are plenty of tools to debug it, and there's no need to invent yet another data format. The broker validates every PUBLISH against the registered schema, so malformed messages get rejected before reaching any subscriber.

This gives you the compactness of a custom binary protocol with the decoupling of a schema registry. The broker is the single source of truth, and clients are as thin as possible.

Why Ziglang?

I picked Zig for a few reasons, most of them practical rather than ideological.

  • A YouTube video I saw comparing Rust, Zig, and Go on the same workload showed Zig using notably little memory, which got me excited. After reading more about the language, I found that it gives you explicit control over memory allocation.
  • The Zig toolchain makes it easy for anyone to cross-compile. zig build -Dtarget=aarch64-linux gives me a Raspberry Pi binary from my Mac. No toolchain setup, no Docker gymnastics, no fiddling with cross-compilers.
  • I needed a language with no garbage collector, goroutine-style scheduler, or other hidden runtime doing work behind my back and costing performance.

Let's admit it before going further: I was super curious about the Zig language and wanted to use it in a project to understand its limits. That's the ideological part 😄.

Engineering Decisions

A Run-Time Protobuf Engine

This is the part I'm most proud of and most nervous about showing people.

I wrote the .proto file parser, wire format encoder, and decoder from scratch in Zig. No protoc, no protobuf-c, no nanopb — nothing external.

Three reasons:

  1. Zero external dependencies is a design goal. The entire broker builds with zig build and nothing else. No system libraries, no package manager downloads, no vendored C code. I wanted to keep it that way.
  2. Protobuf's standard toolchain is a static compilation step. You run protoc, it generates code, you compile that code into your binary. That's fine for most use cases, but I needed runtime schema loading — drop a .proto file in a directory, the broker parses it at startup, done. That doesn't fit the code generation model.
  3. I wanted to learn how serializers work. Writing an encoder/decoder from the wire format spec is one of those things that sounds intimidating but teaches you a lot about how data actually lives on the wire. Field tags, varint encoding, length-delimited records — it's surprisingly elegant once you dig in.
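
For a feel of those three pieces (field tags, varints, length-delimited records), here is a minimal encoder sketch. The helper names are made up for this example; the two printed byte strings match the worked examples in the official Protobuf encoding guide.

```python
VARINT, LEN = 0, 2  # Protobuf wire types used in this sketch


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, MSB set on all but the last byte."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_no: int, wire_type: int) -> bytes:
    """A field tag is the varint of (field_number << 3) | wire_type."""
    return encode_varint((field_no << 3) | wire_type)


def encode_uint(field_no: int, value: int) -> bytes:
    return encode_tag(field_no, VARINT) + encode_varint(value)


def encode_string(field_no: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_tag(field_no, LEN) + encode_varint(len(data)) + data


print(encode_uint(1, 150).hex())          # 089601
print(encode_string(2, "testing").hex())  # 120774657374696e67
```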

So the engine first parses .proto files at startup (a recursive descent parser that handles syntax, package, message, and all scalar field types), then builds a field-type table for each message type. At publish time, it validates the payload against the schema and decodes or encodes the Protobuf wire-format bytes field by field. As of the second release, parsing and topic mapping can also run at runtime, which means you can register new .proto definitions with the broker on the fly to support new data types.
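
The parse-then-build-a-table step might look roughly like the sketch below. It is a much-simplified, token-based stand-in written in Python, not the broker's Zig parser: it only understands syntax, package, and flat messages with scalar fields, and the table layout is an assumption of this example.

```python
import re

# Scalar type -> wire type (0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit).
WIRE_TYPES = {
    "int32": 0, "int64": 0, "uint32": 0, "uint64": 0,
    "sint32": 0, "sint64": 0, "bool": 0,
    "fixed64": 1, "sfixed64": 1, "double": 1,
    "string": 2, "bytes": 2,
    "fixed32": 5, "sfixed32": 5, "float": 5,
}

TOKEN = re.compile(r'"[^"]*"|[A-Za-z_][\w.]*|\d+|[{}=;]')


def parse_proto(source: str) -> dict:
    """Return {message_name: {field_number: (name, type, wire_type)}}."""
    source = re.sub(r"//[^\n]*", "", source)    # strip line comments
    tokens = TOKEN.findall(source)
    pos = 0
    tables = {}

    def expect(tok: str) -> None:
        nonlocal pos
        if tokens[pos] != tok:
            raise SyntaxError(f"expected {tok!r}, got {tokens[pos]!r}")
        pos += 1

    while pos < len(tokens):
        keyword = tokens[pos]
        if keyword in ("syntax", "package"):
            pos = tokens.index(";", pos) + 1    # skip the whole statement
        elif keyword == "message":
            name = tokens[pos + 1]
            pos += 2
            expect("{")
            fields = {}
            while tokens[pos] != "}":
                ftype, fname = tokens[pos], tokens[pos + 1]
                pos += 2
                expect("=")
                number = int(tokens[pos])
                pos += 1
                expect(";")
                if ftype not in WIRE_TYPES:
                    raise SyntaxError(f"unsupported field type {ftype!r}")
                fields[number] = (fname, ftype, WIRE_TYPES[ftype])
            expect("}")
            tables[name] = fields
        else:
            raise SyntaxError(f"unexpected token {keyword!r}")
    return tables


SOURCE = """
syntax = "proto3";
package telemetry;
message SensorReading {
  uint32 device_id = 1;
  float temperature_c = 2;   // degrees Celsius
  string label = 3;
}
"""
tables = parse_proto(SOURCE)
print(tables["SensorReading"])
```

With a table like this in hand, validating a payload comes down to checking that each incoming field number exists and arrives with the expected wire type.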

It's not a complete Protobuf implementation — I don't support nested messages, oneof, maps, or extensions yet. But for flat sensor telemetry messages, it handles most of the cases. If you see something missing, don't forget to create an issue ;).

Direct Implementation of the Network Layer

The network layer uses epoll on Linux and kqueue on macOS — no libuv, no tokio, no abstraction layers. Part of the motivation here was personal: as someone who works close to the Linux kernel, I wanted to actually use the epoll interface directly rather than through three layers of indirection. Building something real on top of it is the best way to understand how it behaves under load.

The event loop is a single-threaded loop that:

  1. Waits for I/O readiness events
  2. Reads MQTT packets from ready sockets
  3. Parses the MQTT fixed header to determine packet type
  4. Dispatches to the appropriate handler (CONNECT, PUBLISH, SUBSCRIBE, etc.)
  5. For PUBLISH: validates the payload against the schema, then fans out to all matching subscribers
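
Steps 3 and 4 can be sketched as follows. The fixed-header layout (packet type in the high nibble, flags in the low nibble, then a 1 to 4 byte Remaining Length) comes from the MQTT 3.1.1 specification; the function names and the handler signature are assumptions of this example, not the broker's code.

```python
# MQTT 3.1.1 control packet types (values 0 and 15 are reserved there).
PACKET_TYPES = {
    1: "CONNECT", 2: "CONNACK", 3: "PUBLISH", 4: "PUBACK", 5: "PUBREC",
    6: "PUBREL", 7: "PUBCOMP", 8: "SUBSCRIBE", 9: "SUBACK",
    10: "UNSUBSCRIBE", 11: "UNSUBACK", 12: "PINGREQ", 13: "PINGRESP",
    14: "DISCONNECT",
}


def parse_fixed_header(buf: bytes):
    """Return (type, flags, remaining_length, header_size), or None if incomplete."""
    if len(buf) < 2:
        return None
    packet_type = PACKET_TYPES.get(buf[0] >> 4)
    if packet_type is None:
        raise ValueError(f"reserved packet type {buf[0] >> 4}")
    flags = buf[0] & 0x0F
    remaining, multiplier = 0, 1
    for i in range(1, 5):               # Remaining Length uses at most 4 bytes
        if i >= len(buf):
            return None                 # header not fully received yet
        byte = buf[i]
        remaining += (byte & 0x7F) * multiplier
        if not byte & 0x80:             # continuation bit clear: last length byte
            return packet_type, flags, remaining, i + 1
        multiplier *= 128
    raise ValueError("malformed Remaining Length")


def dispatch(buf: bytes, handlers: dict):
    """Call the handler for one complete packet; return None if more bytes are needed."""
    header = parse_fixed_header(buf)
    if header is None:
        return None
    packet_type, flags, remaining, header_size = header
    if len(buf) < header_size + remaining:
        return None
    body = buf[header_size:header_size + remaining]
    return handlers[packet_type](flags, body)


handlers = {"PINGREQ": lambda flags, body: "PINGRESP"}
print(parse_fixed_header(bytes([0x30, 0xC1, 0x02])))  # PUBLISH, remaining length 321
print(dispatch(bytes([0xC0, 0x00]), handlers))
```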

Single-threaded might sound limiting, but it avoids all synchronization overhead and keeps the per-message path as short as possible. The topic broker uses a trie structure for wildcard matching, so the cost of routing a message scales with topic depth (plus any wildcard branches along the path) rather than with the total number of subscriptions.
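
A subscription trie with MQTT's two wildcards ("+" for one level, "#" for the rest) can be sketched like this. It illustrates the general technique rather than ProtoMQ's data structure; the class and method names are made up, and special handling of $-prefixed topics is left out.

```python
class TopicTrie:
    """One trie node per topic level; subscribers live on the node where a filter ends."""

    def __init__(self):
        self.children = {}
        self.subscribers = set()

    def subscribe(self, topic_filter: str, client_id: str) -> None:
        node = self
        for level in topic_filter.split("/"):
            node = node.children.setdefault(level, TopicTrie())
        node.subscribers.add(client_id)

    def match(self, topic: str) -> set:
        found = set()
        self._match(topic.split("/"), 0, found)
        return found

    def _match(self, levels: list, i: int, found: set) -> None:
        # '#' matches all remaining levels, including none (so "a/#" matches "a").
        multi = self.children.get("#")
        if multi is not None:
            found |= multi.subscribers
        if i == len(levels):
            found |= self.subscribers
            return
        # At most two branches per level: the literal level and the '+' wildcard.
        for key in (levels[i], "+"):
            child = self.children.get(key)
            if child is not None:
                child._match(levels, i + 1, found)


trie = TopicTrie()
trie.subscribe("sensors/+/temp", "a")
trie.subscribe("sensors/#", "b")
trie.subscribe("sensors/kitchen/temp", "c")
trie.subscribe("sensors/kitchen/humidity", "d")

print(sorted(trie.match("sensors/kitchen/temp")))
```

The walk only visits nodes along the published topic's path, so unrelated subscriptions elsewhere in the tree are never touched.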

Performance as the Gate-Keeper

I take performance seriously enough that I built a dedicated benchmarking suite as part of the project. It's written in Python and runs 7 scenarios, each designed to stress a different part of the broker:

| Scenario | What it tests |
|---|---|
| B1: Baseline Concurrency | 100 concurrent connections: measures p50/p99 latency, connection time, and memory usage |
| B2: Thundering Herd | 10,000 clients connecting simultaneously: connection time, fan-out latency, message loss |
| B3: Sustained Throughput | 10-minute continuous load: throughput stability, memory growth, CPU usage, late-stage latency |
| B4: Wildcard Explosion | 1,000 wildcard subscribers on overlapping topic trees: matching correctness and per-match latency |
| B5: Protobuf Load | Protobuf validation and encoding overhead: decoding latency, bandwidth savings, CPU cost |
| B6: Connection Churn | 100,000 connect/disconnect cycles: memory leaks, file descriptor leaks, error rate |
| B7: Message Sizes | Throughput across payload sizes from 10 bytes to 64 KB: finds where the broker becomes I/O bound |

Each scenario has a thresholds.json file that defines pass/fail criteria per metric. Every threshold specifies a direction (lower for things like latency and memory, higher for throughput and connection counts), a hard max or min limit, and a warn level for early regression detection. At the end of each run, the suite checks every metric against its threshold and prints a summary: [PASS], [WARN], or [FAIL] per metric.

For example, B1's thresholds require p99 latency under 0.8 ms (warn at 0.6 ms), memory under 3.5 MB (warn at 3.0 MB), and all 100 connections to succeed. B6 requires that memory growth over 100k connection cycles stays under 10 MB (warn at 5 MB) and file descriptor leaks are zero. If any metric crosses the max/min boundary, the benchmark fails — it's a hard gate, not a suggestion.
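
The gate logic described above could be sketched as below. The JSON key names and metric names are guesses made for illustration (the real thresholds.json may be shaped differently), as is the choice to treat a value exactly at the limit as passing; the limits are the B1 numbers just quoted and the measured values are the M2 Pro results reported in this post.

```python
import json

# Hypothetical file shape; limits taken from the B1 description above.
THRESHOLDS_JSON = """
{
  "p99_latency_ms": {"direction": "lower",  "max": 0.8, "warn": 0.6},
  "memory_mb":      {"direction": "lower",  "max": 3.5, "warn": 3.0},
  "connections":    {"direction": "higher", "min": 100, "warn": 100}
}
"""


def evaluate(value: float, rule: dict) -> str:
    """Classify one metric as PASS, WARN, or FAIL against its threshold rule."""
    if rule["direction"] == "lower":        # smaller is better
        if value > rule["max"]:
            return "FAIL"
        if value > rule["warn"]:
            return "WARN"
    elif rule["direction"] == "higher":     # larger is better
        if value < rule["min"]:
            return "FAIL"
        if value < rule["warn"]:
            return "WARN"
    else:
        raise ValueError(f"unknown direction {rule['direction']!r}")
    return "PASS"


thresholds = json.loads(THRESHOLDS_JSON)
measured = {"p99_latency_ms": 0.44, "memory_mb": 2.6, "connections": 100}

results = {name: evaluate(measured[name], rule) for name, rule in thresholds.items()}
for name, status in results.items():
    print(f"[{status}] {name} = {measured[name]}")

# Any FAIL turns the whole run red: the hard gate.
gate_ok = "FAIL" not in results.values()
```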

Results from two machines — an Apple M2 Pro and a Raspberry Pi 5:

| Metric | M2 Pro | RPi 5 |
|---|---|---|
| Throughput (10-byte messages) | 208k msg/s | 147k msg/s |
| p99 latency (100 clients) | 0.44 ms | 0.13 ms |
| Memory (100 connections) | 2.6 MB | 2.5 MB |
| Connection churn (100k cycles) | 0 MB leaked | 0 MB leaked |
| Sustained throughput (10 min) | 8,981 msg/s | 9,012 msg/s |

The Raspberry Pi numbers are the ones I care about most. If the broker runs well on a $60 ARM board, it can run on many SBCs.

A few things I found interesting:

  • The RPi 5 actually has lower p99 latency (0.13 ms vs 0.44 ms on M2 Pro). My theory: macOS kqueue has more scheduling jitter than Linux epoll, possibly due to QoS classes in the Darwin scheduler.
  • Sustained throughput is nearly identical on both platforms (~9k msg/s). This tells me the bottleneck is the Python test harness, not the broker.
  • Zero memory growth over 100,000 connection cycles. I'm paranoid about leaks, so the test setup checks this obsessively.

What's next?

The broker is functional but still early. Things on my list:

  • QoS 1 and 2: I need to add packet acknowledgment and the retry state machine. It should be straightforward, but it's important.
  • Use as a C library: Zig interoperates with C, and I want to take advantage of that. Most embedded applications are written in C, and missing the chance to integrate with them when Zig provides the means would be a huge loss.
  • TLS: It shouldn't be hard (I hope); I just haven't gotten to it. I'm still unsure whether I can preserve the "no-deps" strategy once TLS is in.
  • Nested Protobuf messages: The engine currently handles flat messages only. I'll hold off on this feature until someone requests it.
  • Multi-node federation: According to LLMs, this should be a really cool feature. I don't see a use case (yet), so I might not do it at all.

How to try it?

Clone the repository and run it in a Docker container or, better, directly on your host machine if Zig is installed.

git clone https://github.com/electricalgorithm/protomq.git
cd protomq
docker compose up

Or if you have Zig 0.15.2 installed:

zig build run-server

The full benchmark suite, results, and methodology are in the repo under benchmarks/.


If you're interested in MQTT internals, Protobuf wire format encoding, or systems programming in Zig, I'd genuinely appreciate feedback on the code. I'm one person working on this, and I know there's plenty of room for improvement.