When I wrote the first iteration of the Lixie tool about a year ago (early 2024), my idea was to identify which logs were boring (most of them), interesting (very few of them), and unknown (not yet classified).

At the time I chose not to use ‘AI’ (LLMs) for it, and I am still not convinced they are the best way to approach that particular problem. Ultimately it boils down to this: human judgment of what is useful is much more realistic (at least in my context) than what the LLMs ‘know’ (absent fine-tuning and/or extensive example sets, which I by definition do not have for my personal logs). After choosing not to use LLMs, it was just a matching exercise - structured log messages against an ordered set of rules.

To actually do that, I needed to parse the log messages so I could get useful structures out of them for rule matching.
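To make that "matching exercise" concrete, here is a toy sketch in Go of the general idea - not Lixie's actual data model or rule format - where a parsed log record is walked against an ordered rule list and the first match wins:

```go
package main

import "fmt"

type verdict string

const (
	boring      verdict = "boring"
	interesting verdict = "interesting"
	unknown     verdict = "unknown"
)

// rule classifies a record when all of its match fields are equal
// to the corresponding fields of the record.
type rule struct {
	match   map[string]string
	verdict verdict
}

// classify walks the rules in order and returns the verdict of the
// first matching rule; anything that matches nothing stays unknown.
func classify(fields map[string]string, rules []rule) verdict {
	for _, r := range rules {
		matched := true
		for k, v := range r.match {
			if fields[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return r.verdict
		}
	}
	return unknown
}

func main() {
	// Illustrative rules and records; the field names are made up.
	rules := []rule{
		{match: map[string]string{"app": "frigate", "level": "info"}, verdict: boring},
		{match: map[string]string{"level": "error"}, verdict: interesting},
	}
	fmt.Println(classify(map[string]string{"app": "frigate", "level": "info"}, rules))
	fmt.Println(classify(map[string]string{"app": "postgres", "level": "error"}, rules))
	fmt.Println(classify(map[string]string{"app": "gatus", "level": "warn"}, rules))
}
```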

This led me to look at the different logging formats used by various daemons within my home infrastructure.

Logging at home - the good, the bad, the ugly?

By and large, there were three different types of logs:

  • the bad: entirely ad hoc
    • arbitrarily encoded time, message, key-value
    • (possibly stored in e.g. journald or transmitted across syslog)
  • the ugly: syslog
    • level, rest of content mostly arbitrary
  • the good: JSON-encoded
    • somewhat varying time formats
    • otherwise standard

Examples of ad-hoc textual logging formats

The number of formats, especially in terms of how timestamps are encoded, is mind-boggling. Here are some (non-exhaustive) examples from my Vector configuration, which are parsed into structured logs using regular expressions; a sketch of parsing one of them follows the list:

  • Frigate: yyyy-mm-dd hh:mm:ss.msec text
  • Python?: [yyyy-mm-dd hh:mm:ss] module level: text
  • Rust, some node stuff: yyyy-mm-ddThh:mm:ss.msecZ LEVEL text or yyyy-mm-ddThh:mm:ss.msecZ level: text
  • Gatus: yyyy/mm/dd hh:mm:ss [module] text
  • HomeAssistant: yyyy-mm-dd hh:mm:ss.msec level text
  • Jellyfin: [hh:mm:ss] [something] [digits] text
  • Podman: yyyy-mm-dd hh:mm:ss.msec +-dddd something m=+-d.d+ text[something] [digits] text
  • Uptime Kuma: yyyy-mm-ddThh:mm:ss+-dd:dd [module] level: text
  • Vaultwarden: [yyyy-mm-dd hh:mm:ss.msec][module][level] text
  • Vector and LLDAP: yyyy-mm-ddThh:mm:ss.msec(Z|+-tz:tz) level text
  • Redis: pid:role day monthname year hh:mm:ss.fffff text
  • PG: yyyy-mm-dd hh:mm:ss.msec TZ [pid] level: text
  • Zigbee2MQTT: [yyyy-mm-dd hh:mm:ss] level: module: text
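As an illustration - not my actual Vector configuration, which does this parsing inside Vector itself - here is roughly what handling one of these formats, the HomeAssistant-style yyyy-mm-dd hh:mm:ss.msec level text lines, looks like with a regular expression in Go. The field names and the sample line are made up:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// Matches lines like "2024-05-01 12:34:56.789 WARNING something happened".
var haLine = regexp.MustCompile(`^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (\w+) (.*)$`)

// record is an illustrative structured form of one log line.
type record struct {
	Time    time.Time
	Level   string
	Message string
}

func parse(line string) (record, error) {
	m := haLine.FindStringSubmatch(line)
	if m == nil {
		return record{}, fmt.Errorf("unrecognized line: %q", line)
	}
	// time.Parse accepts a fractional second even when the layout omits it.
	ts, err := time.Parse("2006-01-02 15:04:05", m[1])
	if err != nil {
		return record{}, err
	}
	return record{Time: ts, Level: m[2], Message: m[3]}, nil
}

func main() {
	r, err := parse("2024-05-01 12:34:56.789 WARNING something took longer than expected")
	fmt.Printf("%+v %v\n", r, err)
}
```

And that is just one format; each of the entries in the list above needs its own expression and its own timestamp layout.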

Dealing with this is incredibly painful, and error-prone too. Additionally, in some of the tools you can specify your own log format, or, worse yet, they spew ANSI escape codes into their textual logs because they want to be pretty (who wouldn’t want to be?).

So, how to be good?

The only way to stay sane is to use proper structured logging with a sane standard for encoding the data (JSON, XML, or ASN.1 if you must), and to use standards for encoding dates too - ISO 8601 exists, please use it.
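For example, Go's standard log/slog JSON handler produces this kind of output out of the box: one JSON object per line, an RFC 3339 (ISO 8601-compatible) timestamp, a level, and arbitrary extra fields. The field values below are only illustrative:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON handler: one object per line, RFC 3339 timestamp, level, message,
	// plus whatever key-value fields the call site adds.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logger.Info("backup finished", "module", "backup", "duration_ms", 1234)
	// Prints roughly:
	// {"time":"2024-05-01T12:34:56.789+03:00","level":"INFO","msg":"backup finished","module":"backup","duration_ms":1234}
}
```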

Does it have to be text? Not always

In my current startup, we are using protocol buffers for structured logging, as they are more efficient to handle than JSON and have better schema support, and since we handle the logs programmatically first, it is still ‘fine’ for them to be only machine-readable.

We eventually convert them to JSON to store them in Loki though. All in all, it works quite well (and efficiently).
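As a rough sketch of that conversion step - not our actual schema, which uses dedicated generated message types - the well-known Struct type is enough to show a protobuf message being marshalled to JSON with protojson:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// Illustrative fields; a real entry would come from a generated proto type.
	entry, err := structpb.NewStruct(map[string]interface{}{
		"time":  "2024-05-01T12:34:56.789Z",
		"level": "info",
		"msg":   "backup finished",
	})
	if err != nil {
		panic(err)
	}
	// protojson turns the binary-friendly representation into plain JSON
	// that Loki (and humans) can work with.
	out, err := protojson.Marshal(entry)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```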

Levels and their meaning

The set of log levels varies a bit by programming language and logging system. In general, you might not even want to forward all logs to infrastructure that costs money per log line, and instead filter them early on. The log level is a good first approximation of what to filter by. For example, debug or trace logging might not always be worth forwarding - only if you know it is relevant (e.g. you are in some sort of error condition and want a full trace of what happened).

We have settled on not forwarding (Go/Swift) debug-level logs at all by default, and just storing them locally. They are still available for debugging and in e.g. unit tests, but the cost of running the log infrastructure is a lot lower thanks to the filter.
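As a sketch of how that kind of split can be done in Go - an illustration, not our actual implementation - a small slog.Handler can fan records out to a local sink at debug level and to a forwarding sink at info and above:

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// teeHandler sends every record to the local handler and only the records the
// forwarding handler is enabled for (info and above here) to the forwarder.
type teeHandler struct {
	local   slog.Handler // gets everything, including debug
	forward slog.Handler // gets info and above only
}

func (t *teeHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return t.local.Enabled(ctx, level) || t.forward.Enabled(ctx, level)
}

func (t *teeHandler) Handle(ctx context.Context, r slog.Record) error {
	if t.local.Enabled(ctx, r.Level) {
		if err := t.local.Handle(ctx, r.Clone()); err != nil {
			return err
		}
	}
	if t.forward.Enabled(ctx, r.Level) {
		return t.forward.Handle(ctx, r.Clone())
	}
	return nil
}

func (t *teeHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &teeHandler{local: t.local.WithAttrs(attrs), forward: t.forward.WithAttrs(attrs)}
}

func (t *teeHandler) WithGroup(name string) slog.Handler {
	return &teeHandler{local: t.local.WithGroup(name), forward: t.forward.WithGroup(name)}
}

func main() {
	// "local.log" stands in for local storage; stdout stands in for the forwarder.
	localFile, err := os.OpenFile("local.log", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	h := &teeHandler{
		local:   slog.NewJSONHandler(localFile, &slog.HandlerOptions{Level: slog.LevelDebug}),
		forward: slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo}),
	}
	logger := slog.New(h)
	logger.Debug("stays local only")
	logger.Info("gets forwarded too")
}
```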

When dealing with logs you are not producing yourself, you will wind up with issues unless you know which log levels are actually used and how. For example, one Rust service in my home infrastructure likes logging normal network authentication issues at warning level, which is simply noise. Filtering out everything non-error is a bad idea though, so in Lixie I have filtered only specific modules’ warning messages for that particular service.

Conclusion

No matter what you do, please do not just log lines - log structures in some standard format. Having standards-based timestamps and levels is a good start, and having other metadata as separate fields (e.g. module, and pid if relevant) helps too.