First impressions: GPT 5.4 + GPT 5.3 codex spark!

After a brew upgrade (and the GPT 5.4 announcement last night) I was excited to see what was up. Interesting.

Try 1

I did one minor feature using planning mode (gpt-5.4 high) and implemented it using gpt-5.3 codex spark. Codex spark is backed by super-fast Cerebras megachips, so I was optimistic about the speed (if not the quality). Observations:

- REALLY fast (a couple of seconds to implement something that was a couple of hundred lines long)
- A bit stupid, or literal? It did not run unit tests by default
- Unit tests had an error, which it fixed
- It had used 62% of its context by this time(!) - seems like an order of magnitude less context than 5.3 Codex

Try 2

Without planning mode, on high, I asked it to refactor logging in the app I am working on. It actually did what was asked for, but again did not run tests, as I did not specifically ask it to run them too. The outcome was also pretty ugly, as it did not look at the API surface of the logging library (or know it), so I had to send gpt-5.4 after it to clean up. ...

6.3.2026 · 2 min · 339 words · Markus Stenberg

Another day, another vibe coded tool (ta-export)

I spent some time today finally creating (publicly available) export handling code for the Time Atlas app. While I enjoyed using (and developing) the app to some extent, it no longer fulfills some of my wishes and I no longer work there, so I deleted the app (and no longer have access to internal tooling for dealing with GPX exports).

What could be better in the app

- I miss the old comments on journal notes that pre-1.0 versions had (this export tool preserves them)
- The battery usage was and is a bit janky (and could be optimized, but has not been)
- The GPS track (when actually tracking) is suboptimal compared to what is available (I'd rather filter the data myself than get filtered data for my long-term storage; storing infrequent points isn't really helping battery usage much, compared to just keeping the radio off when it is not really needed). Storage is cheap, but if you don't have the (original) data, you don't have the option to refine it further later on if you come up with a better analysis algorithm or more ways to use the raw data.

Switch .. somewhere ..

This week I figured I might as well switch to another GPS tracker app, and move my journal entries back to Bear, where I have 27 years of journal entries and some posts from some of my blogs as well. ...

5.3.2026 · 3 min · 440 words · Markus Stenberg

Developing stuff with LLMs on your own credit card

After a week of agentic coding on my own credit card, here are some notes, mostly to organize my own thoughts. I don't like the term vibe coding, although most of this code is only cursorily reviewed, as the 29k LoC produced this week would be a full-time job just to read through once or twice.

OpenCode Zen (free tier)

As of today, it has 3 models:

- Big Pickle (unknown, presumably older GLM?)
- Minimax M2.5 - 230B MoE model
- Trinity Large Preview from Arcee AI - 400B MoE model

I am not sure if they are quantized for economic reasons or not, but at least the earlier free GLM 5 did not perform very well in my tests last week. Trinity did not impress me either (it seemed quite slow and less capable than Minimax), but Minimax I am using as my daily driver when I don't need a more powerful model. ...

3.3.2026 · 7 min · 1351 words · Markus Stenberg

filemirror vibe coding

This is my experiment to see how far you can get using the free models that are currently available in OpenCode Zen. As they are (mostly) open weights, I could also run them locally if I felt like it, someday. What was available today:

- GLM-5 (with some rate limiting, unfortunately)
- MiniMax M2.5
- Big Pickle (?)
- Trinity Large Preview (?)

Scaffolding

I copy pasta'd some configuration files (e.g. Makefile, .golangci-lint.yml) from another project, wrote a README.md (which is a bit inconsistent with the prompt by design, to see which model honors it or whether it asks questions), and also left one error in place in the Makefile (the lint target doesn't work because it depends on another target which is no longer in the file). ...

23.2.2026 · 4 min · 771 words · Markus Stenberg

Beer consumption analysis using LLMs

I have been working on a life tracking app since last year. To analyze the data I have logged using it, I queried it for 'beer in 2025' and analyzed the results. The dataset itself I will not publish here, but there are three types of relevant data in it (in parentheses, how they are encoded in the Markdown output that I pass to the LLMs):

- Place visits involving beer (e.g. * 2 hours spent in <insert pub here>)
- Journal entries mentioning beer (e.g. I had beer and pizza for lunch)
- Explicitly counted beer logging (e.g. - 3 count beer)

Baseline - shell

egrep 'count beer$' 20250528-beer.md | cut -d ' ' -f 2 | awk '{sum += $1} END {print sum}'
17

So the expectation is that the number should be AT LEAST 17 beers, but ideally more, as there are some journal entries which mention beer. ...
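The shell baseline only sums the explicitly counted '- N count beer' lines. A minimal Python sketch of the same count (the line format is from the post; the helper function and sample data are mine, not the author's tooling):

```python
import re

def count_beers(markdown_text: str) -> int:
    """Sum the N in explicitly counted '- N count beer' log lines."""
    total = 0
    for line in markdown_text.splitlines():
        # Matches lines like '- 3 count beer', the same lines the egrep baseline picks up
        m = re.match(r"-\s+(\d+)\s+count beer$", line.strip())
        if m:
            total += int(m.group(1))
    return total

sample = """\
* 2 hours spent in <insert pub here>
I had beer and pizza for lunch
- 3 count beer
- 2 count beer
"""
print(count_beers(sample))  # prints 5
```

Like the pipeline, this deliberately ignores place visits and journal mentions; extracting those is exactly the fuzzy part being delegated to the LLMs.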

28.5.2025 · 4 min · 726 words · Markus Stenberg

April vibe coding summary

This will be the last post on vibe coding for now, I promise.. (at least about Google Gemini 2.5 Pro Exp). I did some vibe coding every weekend in April, just to get a change of pace from work (and for science), starting with a 'what if I could not code' experiment (not a great success), and finishing with two probably useful tools that I wanted. Last week Google made Gemini 2.5 Pro Exp flash available commercially, and reduced the free input token rate limit per day quite a lot. The new limits are (as of now) a million input tokens and 25 requests per day (no idea about output tokens). The single-request maximum size is probably still(?) 250k tokens (I hit it a couple of times earlier; not sure if it was reduced, as the most recent project was smaller and I didn't get beyond 100k-token requests). ...

28.4.2025 · 5 min · 862 words · Markus Stenberg

Vibe coding try 2: feat. Gemini 2.5 pro exp

I was not particularly satisfied with my experience of fully hands-off vibe coding, but I also wanted to see what I could do if I spent a bit more time thinking and instructing the LLM before hitting the 'send' button. So, another Sunday spent 'usefully'.

Gemini 2.5 pro exp is free(!) (for now)

The shocking part is that Gemini 2.5 Pro is currently available in the free tier of Google AI Studio (and to chat with at Gemini). The quota is quite generous - you can do essentially up to 25M tokens per day (25-request limit per day, 1M context size - I did not get quite that far, as my requests were <= 100k context size). ...

13.4.2025 · 4 min · 699 words · Markus Stenberg

Aider 0.8.1 and me

I have been using Aider on and off for a couple of months now. I have found its defaults to be pretty bad (at least for me), so I decided to write up how I use it and the configuration I use with it.

Note: 'model' in this text refers to large language models (LLMs), and more specifically those that are reasonably good at reasoning/coding tasks. Currently I am mainly using Claude 3.7 Sonnet, but the model I use seems to change every month (o3-mini high-reason was the one I used last month), and the recent Deepcoder release makes it possible that I will soon try a local model again as my main model. ...

10.4.2025 · 7 min · 1395 words · Markus Stenberg

Vibe coding try 1 .. spoiler: not great success

Vibe coding has been frequently touted on the internet, and, not wanting to feel left out, I spent half a day working on 'something' I picked from the depths of my todo list: a Python utility to convert from format X to format Y (the particular formats are not relevant, so omitted here - nested data structures with tags, and keyword-values).

The vision

I decided I wanted to pretend I don't know how to code. So, for the most part, I chose not to write any code myself, but instead to guide (a set of) LLMs to produce what I wanted, mostly just specifying which files I wanted touched and what to do. ...

6.4.2025 · 6 min · 1119 words · Markus Stenberg

NVidia L40S - reasonably priced LLM runner in the cloud?

As we are currently doing things in AWS, I wanted to evaluate the AWS EC2 g6e.xlarge (32 GB RAM, 4 EPYC cores, with a 48 GB NVidia L40S GPU), as it seems to be the only AWS offering that is even moderately competitive, at around $1.8/hour. The other instance types wind up either with lots of (unneeded) compute relative to the GPU, or with a 'large' number of GPUs, and in general the pricing seems quite depressing compared to their smaller competitors (e.g. https://datacrunch.io/ provides 2 L40S at $1.8/hour, and 1 A100 is similarly priced). ...
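One way to make the pricing comparison above concrete is price per GB of GPU memory per hour. A back-of-the-envelope sketch, using only the figures quoted in the post (not current pricing):

```python
# Hourly price per GB of GPU memory, from the figures quoted in the post:
# AWS g6e.xlarge: 1x L40S (48 GB) at ~$1.8/hour
# datacrunch.io:  2x L40S (96 GB) at ~$1.8/hour
offerings = {
    "AWS g6e.xlarge (1x L40S, 48 GB)": (1.8, 48),
    "datacrunch.io (2x L40S, 96 GB)": (1.8, 96),
}

for name, (usd_per_hour, vram_gb) in offerings.items():
    print(f"{name}: ${usd_per_hour / vram_gb:.4f} per GB-hour")
```

By this (admittedly crude) metric, the datacrunch option delivers twice the GPU memory for the same hourly price, which matches the 'quite depressing' comparison above.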

8.1.2025 · 4 min · 849 words · Markus Stenberg