Vibe coding try 1 .. spoiler: not great success

Vibe coding has been frequently touted in the internet, and not wanting to feel left out, I spent half a day working on ‘something’ I picked from depths of my todo list: a Python utility to convert from format X to format Y (particular format is not relevant so omitted here - nested data structures with tags, and keyword-values).

The vision

I decided I wanted to pretend I don’t know how to code. So I for most part chose not to write any code myself, but instead guide (a set of) LLMs to produce what I wanted, mostly just specifying which files I want to touch and to do what.

Tool choice

I wondered about which tool to use. There seems to be large number of options available, but as I have used aider - AI Pair Programming in Your Terminal as stupid autocomplete (--watch-files, and then in Emacs just say do something AI! and it does something!) so I figured I’d try to use it in chat mode instead.

Model choice

I looked at Aider LLM Leaderboards to choose models to use. In the end, I picked 5:

claude 3.7 sonnet
gpt-o3-mini
gpt-4o
llama 3.3-70b (at groq)
qwen2.5-coder:32b (local)

I used them with following:

alias aider='aider --cache-prompts --no-check-update'
alias aider-claude='aider --model anthropic/claude-3-7-sonnet-latest'
alias aider-gpt-o3='aider --model o3-mini --reasoning-effort high'
alias aider-gpt-4o='aider --model chatgpt-4o-latest'
alias aider-groq-llama='aider --model groq/llama-3.3-70b-versatile'
alias aider-local='aider --model ollama_chat/qwen2.5-coder:32b'

and off I went. In hindsight I should have probably annotated retained commits with which model produced them, but I didn’t and that is .. life. Aider does not seem to have option for that either, which is a shame.

The start looked promising

I started with Claude. I gave Aider (relatively lame) ebnf specification of one format, and (partially correct) ebnf specification of the other, as well as some sample files, and asked Aider to create converter tool from one format to the other.

The initial few prompts produced ~420 lines of content to the repository (converter and its testsuite), and some hello world examples were actually converted correctly.

Devil is in the details

Once I started adding some real, larger test files, the illusion broke. I still tried to avoid reading the code, but no matter how I prompted Claude, it could not fix bugs in the tests that it had itself created, that read the sample files I provided, so that they would match sample outputs I also created.

At this point I switched to gpt-o3-mini and noted it is unusably slow in comparison to Claude, and so I gave up on that. gpt-4o turned out to deal with code reasonably fast too, but again, no breakthroughs although perhaps few minor improvements that Claude didn’t do.

Local Qwen was also too slow for me to bother dealing with it, and aider used too many tokens for (free tier) Groq to be useful.

So outcome is that I did not get what I wanted despite spending couple of hours tinkering with aider and feeding in some money to Anthropic/OpenAI pockets.

Lessons learned

For rapid ‘this does something’, there was definitely wow feeling. But later on I started noticing that the subsequent iterations actually started cheating:

changing input data files I told model explicitly not to change (gpt-4o was bit more keen to do that, but claude did it too unless I told in prompt not to do that)
hardcoding special cases to the converter

Also, none of the models produced nested data structures in the conversion correctly - despite having even examples that there is nesting, and providing example of how it is done (produced by me manually), the models simply failed at producing correct conversion for different data with very similar pattern.

Similarly some other things about data formats remained opaque to the models - the described objects in the formats had tags of sorts, and while I provided description of them, as well as examples, the models just hardcoded the examples to the generator instead of producing generic code.

Producing recursive code seemed to be hard for the models too.

In the end, I spent probably more time tinkering with the tool than it would have taken me to write the thing - and I still don’t have working version of it.

Model specific notes

Claude was unavailable couple of times, and default tier rate-limit was hit sometimes (not too bad)
gpt-4o was fast and mostly correct - it was not very keen to run tests per iteration though, unless told to do so explicitly
- aider has option to force tests, though, but Claude did that proactively
gpt-o3-mini was too slow for general case, and didn’t solve the final mess other models had created anyway

Commits / line numbers

In half an hour, I had 400 lines of code and some other stuff in the repo. After perhaps 4 hours when I gave up on this project, I had 800 lines of code and 420 lines of other stuff (mainly test data that I had inserted). In terms of commits:

    18  Markus Stenberg
lines added: 799 lines removed: 391

    75  Markus Stenberg (aider)
lines added: 2427 lines removed: 1259

I removed some generated broken code too at some point in frustration, but otherwise I didn’t touch the code at all.

Bonus: GPT 4.5 to the rescue?

I tried few different prompts with GPT 4.5 too. It understood recursion bit better, but even then, the tag/alias handling did not improve much. Funnily enough, the three iterations I did with it cost 7$, and number of failing tests actually increased by one (although possibly with further prompting it could have fixed them, as the code structure looked better).

Two commits produced by GPT 4.5 (with lackluster results) cost less than all the ~60 I did with Claude 3.7 Sonnet (6$).

Conclusions

The more you help the model the better the outcome is. But if I have to specify everything, and iterate on it (if and when the model gets things wrong the first time around), the time savings are not that spectacular, at least for me. Of the ‘final’ version I have now (~100 commits), some of it is ‘ok’ quality, but there is some redundant code, some ugly code, and definitely also some broken/missing code. I could have done better myself by simply writing the converter (with some code assist from aider possibly if there was something that I felt was simple, but lots of typing), and then perhaps asking some model to produce tests for it.

I don’t think this is the end of experimentation for me though - there is always room for improvements on how you craft hobby projects (where quality does not matter so much) or actual production stuff at $DAYJOB.

The vision#

Tool choice#

Model choice#

The start looked promising#

Devil is in the details#

Lessons learned#

Model specific notes#

Commits / line numbers#

Bonus: GPT 4.5 to the rescue?#

Conclusions#