April vibe coding summary

This will be the last post on vibe coding for now, I promise (at least about Google Gemini 2.5 Pro Exp). I did some vibe coding every weekend in April, just to get a change of pace from work (and for science), starting with a ‘what if I could not code’ experiment (not a great success) and finishing with two probably useful tools that I wanted. Last week Google made Gemini 2.5 Pro Exp Flash available commercially, and reduced the free input-token rate limit per day quite a lot. The new limits are (as of now) a million input tokens and 25 requests per day (no idea about output tokens). The single-request maximum size is probably still 250k tokens? (I hit it a couple of times earlier; I am not sure if it was reduced, as the most recent project was smaller and I didn’t get beyond 100k-token requests.) ...

28.4.2025 · 5 min · 862 words · Markus Stenberg

Vibe coding try 2: feat. Gemini 2.5 pro exp

I was not particularly satisfied with my experience of doing fully hands-off vibe coding, but I also wanted to see what I could do if I spent a bit more time thinking and instructing the LLM before hitting the ‘send’ button. So, another Sunday spent ‘usefully’. Gemini 2.5 Pro Exp is free(!) for now. The shocking part is that Gemini 2.5 Pro is currently available in the free tier of Google AI Studio (and to chat with at Gemini). The quota is quite generous: you can use essentially up to 25M tokens per day (25-request limit per day, 1M context size; I did not get quite that far, as my requests were <= 100k context size). ...

13.4.2025 · 4 min · 699 words · Markus Stenberg

Aider 0.8.1 and me

I have been using Aider on and off for a couple of months now. I have found its defaults to be pretty bad (at least for me), so I decided to write up how I use it and the configuration I use with it. Note: ‘model’ in this text refers to large language models (LLMs), and more specifically those that are reasonably good at reasoning/coding tasks. Currently I am mainly using Claude 3.7 Sonnet, but the model I use seems to change every month (o3-mini high-reason was the one I used last month), and the recent Deepcoder release makes it possible that I will soon try using a local model as my main model again. ...
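For illustration only (these are not necessarily the settings discussed in the post), overriding a few aider defaults might look roughly like this:

# Hypothetical example: override some aider defaults on the command line
aider --model sonnet --no-auto-commits --edit-format diff
#
# ...or persist the same choices in .aider.conf.yml at the repo root:
#   model: sonnet
#   auto-commits: false
#   edit-format: diff

These flags exist in recent aider versions; the values are placeholders rather than a recommendation.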

10.4.2025 · 7 min · 1395 words · Markus Stenberg

Vibe coding try 1 .. spoiler: not great success

Vibe coding has been frequently touted on the internet, and not wanting to feel left out, I spent half a day working on ‘something’ I picked from the depths of my todo list: a Python utility to convert from format X to format Y (the particular format is not relevant, so it is omitted here; nested data structures with tags, and keyword-values). The vision: I decided I wanted to pretend I don’t know how to code. So, for the most part, I chose not to write any code myself, but instead guided (a set of) LLMs to produce what I wanted, mostly just specifying which files I wanted them to touch and what to do. ...

6.4.2025 · 6 min · 1119 words · Markus Stenberg

NVidia L40S - reasonably priced LLM runner in the cloud?

As we are currently doing things in AWS, I wanted to evaluate the AWS EC2 g6e.xlarge (4 EPYC cores, 32 GB RAM, with a 48 GB NVIDIA L40S GPU), as it seems to be the only AWS offering that is even moderately competitive at around $1.8/hour. The other instance types either wind up with lots of (unneeded) compute relative to the GPU, or have a ‘large’ number of GPUs, and in general the pricing seems quite depressing compared to their smaller competitors (e.g. https://datacrunch.io/ provides 2× L40S at $1.8/hour, and 1× A100 is similarly priced). ...

8.1.2025 · 4 min · 849 words · Markus Stenberg

M1 Pro vs M4 Max

New work laptop. So of course I had to benchmark its speed at running local LLMs. These results use the default 4-bit quantization, with ollama version 0.4.1.

Apple MacBook Pro M1 Pro (32GB RAM, 2021 model):
- gemma2:9b: eval rate 24.17 tokens/s
- gemma2:27b: eval rate 10.06 tokens/s
- llama3.2:3b: eval rate 52.10 tokens/s
- llama3.1:8b: eval rate 31.69 tokens/s

Apple MacBook Pro M4 Max (36GB RAM, 2024 model):
- gemma2:9b: eval rate 46.49 tokens/s
- gemma2:27b: eval rate 20.06 tokens/s
- llama3.2:3b: eval rate 99.66 tokens/s
- llama3.1:8b: eval rate 59.98 tokens/s

Conclusions: The 2024 laptop is roughly twice as fast as the 2021 one, and almost exactly the speed of an RTX 3080 (a 3-year-old NVIDIA GPU), with more VRAM to play with, so quite nice. Still, cloud providers are an order of magnitude faster. ...
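The eval rates come from ollama itself; as a minimal sketch (essentially the same approach as the script in the gemma2 post further down), a single measurement can be reproduced roughly like this:

# Print the load duration and eval rate for one model (model name is just an example)
ollama run llama3.1:8b --verbose 'Why is sky blue?' 2>&1 \
  | grep -E '^(load duration|eval rate)'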

14.11.2024 · 1 min · 130 words · Markus Stenberg

In the trenches with small LLMs, or, we need a (prompt) hero

TL;DR: The smaller the model, the stupider it is, and by a lot. gemma2 is where it is at, even in its 2b version, but at least for me, prompt engineering produced better results with it than tool calling. I decided to do a write-up about this particular experience, as I spent quite a bit of time recently staring at results, and writing things down usually helps advance my own thinking. I did something similar in July, but with less scope and less data. The outcome is still the same, though. ...
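As a rough sketch of what ‘prompt engineering instead of tool calling’ can mean with a small model (the prompt and field names below are made up for illustration, not taken from the post), you can simply ask the model for JSON and parse it yourself:

# Hypothetical illustration: get structured output from a small model via the prompt
# rather than tool calling; --format json constrains ollama's response to valid JSON.
ollama run gemma2:2b --format json \
  'Extract the person, city and weekday from: "Meet Anna in Helsinki on Friday." Reply as JSON with keys "person", "city", "weekday".' \
  | jq .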

13.9.2024 · 7 min · 1380 words · Markus Stenberg

Playing with local gemma2

I tinkered a bit with Google’s new gemma2 model on my 32GB RAM M1 Pro. It seems quite useful so far, although I have dabbled with it for only a day or two. Here’s a summary of some of the things I tested with it. Benchmarking: Using the script from earlier iterations:

for MODEL in gemma2:27b-instruct-q5_K_M gemma2:27b \
             gemma2:9b-instruct-fp16 gemma2:9b-instruct-q8_0 gemma2 \
             llama3:8b-instruct-q8_0 llama3
do
  echo ${MODEL}:
  ollama run $MODEL --verbose 'Why is sky blue?' 2>&1 \
    | grep -E '^(load duration|eval rate)'
  echo
done

with the following models: ...

2.7.2024 · 3 min · 536 words · Markus Stenberg

Playing with local LLMs (or not so local), part 2

This is a really brief follow-up to the earlier local LLM performance benchmarking post. Nvidia RTX 3080: Today I decided to also check out the performance of the RTX 3080. Now that the Windows beta of ollama is available, testing it out was straightforward. As it turns out, it was almost exactly double the speed of the Apple Silicon hardware; e.g. the llama2:7B model produced around 80 tokens per second, but model load duration was a bit slower (3-4s -> 4.9s). ...

18.6.2024 · 1 min · 190 words · Markus Stenberg

Playing with local LLMs

I have been somewhat interested in LLM performance for years, and it used to be that playing with LLMs was quite painful (e.g. the conda ecosystem in general sucks, and it used to be that a GPU was mandatory), but now with ollama ( https://ollama.com/ ) they are quite trivial to benchmark across different devices without needing to set up a complex stack. So this morning I indulged. I have not yet gotten around to checking the numbers on a real GPU card, but here’s what I found out at home (without starting the gaming PC). ...

25.4.2024 · 3 min · 634 words · Markus Stenberg