I have been working on a life tracking app since last year. To analyze the data I have logged with it, I queried it for ‘beer in 2025’ and looked at the results. I will not publish the dataset itself here, but it contains three types of relevant data (in parentheses, how each is encoded in the Markdown output that I pass to the LLMs):
- Place visits involving beer (e.g. `* 2 hours spent in <insert pub here>`)
- Journal entries mentioning beer (e.g. `I had beer and pizza for lunch`)
- Explicitly counted beer logging (e.g. `- 3 count beer`)
Baseline - shell
```
egrep 'count beer$' 20250528-beer.md | cut -d ' ' -f 2 | awk '{sum += $1} END {print sum}'
17
```
So the expectation is that the answer should be at least 17 beers, and ideally more, as some journal entries mention beer without an explicit count.
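To get a feel for that lower bound, a rough way to surface the un-counted mentions is to list every beer-related line that is not an explicit count (a sketch only; it only catches lines that literally contain the word ‘beer’, so it is for eyeballing candidates rather than producing a number):

```sh
# List lines mentioning beer that are NOT explicit "- N count beer" entries;
# these are the journal entries (and some place visits) that could push the total above 17.
grep -i 'beer' 20250528-beer.md | grep -vE 'count beer$'
```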
Open source models
The prompt I used was `How many beers have I had this year based on this log - just respond with the count:` followed by the large (27k tokens) Markdown input. The outcomes varied a bit, but were universally bad.
The results are from Q4-quantized models, running on a Mac Studio M3 Ultra with Ollama version 0.7.1.
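For reference, a minimal sketch of how such a run can be done with the Ollama CLI (an illustration, not necessarily the exact invocation used here; `--verbose` is what makes Ollama print the timing statistics quoted below):

```sh
# Prepend the question to the Markdown log and pipe the whole thing into a model.
# --verbose prints prompt eval count/rate, eval rate and total duration afterwards.
{
  echo 'How many beers have I had this year based on this log - just respond with the count:'
  cat 20250528-beer.md
} | ollama run gemma3:4b --verbose
```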
gemma3:4b
prompt eval count: 27347 token(s)
prompt eval rate: 1656.88 tokens/s
eval rate: 67.59 tokens/s
total duration: 22.9707945s
The result seemed almost right? (over 10, but not quite 17)
gemma3:12b
total duration: 56.498373125s
prompt eval rate: 542.61 tokens/s
The result was badly wrong (309, definitely far from the truth).
granite3.3 (8b MoE)
total duration: 1m30.882190625s
prompt eval rate: 400.75 tokens/s
eval rate: 11.25 tokens/s
Granite didn’t find any beer at all (not even the explicit counts noted above; possibly a prompting problem?).
qwen3:30b (MoE, 3b active)
total duration: 3m52.465156791s
prompt eval rate: 457.52 tokens/s
eval rate: 10.50 tokens/s
In the default thinking mode it was quite bad, with too much overthinking. The final result was 14, which seems almost right, although the math is a bit off.
Then, with thinking disabled (`/no_think`):
total duration: 1m0.823253625s
Result is 0. :p
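For context, Qwen3's thinking is toggled by a soft switch inside the prompt itself, so disabling it looks roughly like this (same assumed invocation style as above):

```sh
# Qwen3 soft switch: putting /no_think in the prompt disables the thinking phase.
{
  echo '/no_think How many beers have I had this year based on this log - just respond with the count:'
  cat 20250528-beer.md
} | ollama run qwen3:30b --verbose
```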
llama4 scout (MoE)
total duration: 3m58.287042625s
load duration: 37.466076875s
prompt eval count: 25252 token(s)
prompt eval duration: 2m53.690596583s
prompt eval rate: 145.38 tokens/s
eval count: 219 token(s)
eval duration: 27.112690708s
eval rate: 8.08 tokens/s
A nonsensical answer: ‘I can help with the data that seems to contain..’ (did it lose the actual question in all the context that followed it?)
phi4-mini-reasoning (3.8b)
total duration: 5m57.690600583s
load duration: 4.359488541s
prompt eval count: 24944 token(s)
prompt eval duration: 27.874653s
prompt eval rate: 894.86 tokens/s
eval count: 5993 token(s)
eval duration: 5m25.455383458s
eval rate: 18.41 tokens/s
.. lots of thinking, and not a particularly good result (it identified only a handful of beers).
phi4-reasoning:plus (14b)
total duration: 19m51.628240875s
load duration: 2.760024709s
prompt eval count: 25741 token(s)
prompt eval duration: 1m21.037061958s
prompt eval rate: 317.64 tokens/s
eval count: 9569 token(s)
eval duration: 18m27.829909375s
eval rate: 8.64 tokens/s
.. A LOT of thinking. Final result: 20 (very plausible, because some beers I had not explicitly tagged with ‘count beer’).
gemma3:27b
total duration: 1m52.313133959s
load duration: 5.437922417s
prompt eval count: 27351 token(s)
prompt eval duration: 1m46.692822291s
prompt eval rate: 256.35 tokens/s
eval count: 4 token(s)
eval duration: 178.602917ms
eval rate: 22.40 tokens/s
187 (.. not a great answer).
Commercial options
Claude 4.0 Opus (via MCP in Claude Desktop)
17, in a couple of seconds. Presumably only the explicitly counted ones.
Claude 4.0 Sonnet (via MCP in Claude Desktop)
12 (+1), again in a couple of seconds.
OpenAI GPT 4.1 (via MCP in AnythingLLM)
12; still fast, but also not correct.
OpenAI GPT 4.1-mini (via MCP in AnythingLLM)
9 (a bit faster than 4.1, but also less correct).
Conclusions
Claude 4.0 seems to be the winning choice right now for analyzing (loosely) formatted data. The token-count-friendly Markdown format might be harder for models to process, but repeating the experiment with JSON-encoded output did not improve the results of the (subset of) open source models I retried, so it seems that, at least for small models, a 30k-token context from which only ‘some’ of the information needs to be extracted is simply too hard.
Another observation is that while the Mac Studio has reasonable output (generation) speed, its prompt processing speed leaves a lot to be desired.
Honorable mention to Phi 4 reasoning plus: I think it had the most correct answer, but taking 20 minutes to produce it seems a bit overkill.