As we are currently doing things in AWS, I wanted to evaluate the AWS EC2 g6e.xlarge (32 GB RAM, 4 EPYC cores, and a 48 GB NVIDIA L40S GPU), as it seems to be the only AWS offering that is even moderately competitive at around $1.8/hour. The other instance types wind up either with lots of (unneeded) CPU compute compared to GPU, or with a ‘large’ number of GPUs, and in general the pricing seems quite depressing compared to their smaller competitors (e.g. https://datacrunch.io/ provides 2 L40S at $1.8/hour, and 1 A100 is similarly priced).

The models I was curious about were mainly the 8b and 70b llama models from Meta; all have been tested at 4-bit (q4) quantization, with the exception of Groq. The runtimes of interest were ollama (essentially llama.cpp) and vllm ( https://docs.vllm.ai/en/stable/ ).

To provide some contrast, I also included numbers for my Macbook and Groq.

Overview of the results

Here is a summary of the results (time to first token, input tokens/sec, output tokens/sec) for llama 3.1 8b:

  • Macbook M1 Pro - ollama:
    • 0.29s 209.60 input/sec 26.07 output/sec
  • AWS EC2 g6e.xlarge - ollama 0.5.4:
    • 0.06s 6771.84 input/sec 105.30 output/sec

The 8b model is not practical for many use cases though, so I also tested the llama 3.3 70b model:

  • AWS EC2 g6e.xlarge - ollama 0.5.4:
    • 0.67s 1254.32 input/sec 15.12 output/sec
  • AWS EC2 g6e.xlarge - vllm 0.6.6:
    • 0.11s 1180.22 input/sec 17.84 output/sec
  • Groq - llama 3.3 70b versatile
    • ~1s ??? (hidden by request latency but apparently very fast) ~250 output/sec

So running on your own cheapo L40S, the speed is less than 1/10 of Groq’s, even at 4-bit quantization (presumably Groq runs at 8-bit, but I do not think that has been documented anywhere). With this configuration the L40S does handle 3 or more requests in parallel with similar per-request performance (with vllm), so the aggregate throughput is a bit higher. ollama parallelism was not tested.

The rest of the post is just details about what is being measured, plus the raw data - it can be skipped if that is not interesting.

Llama 3.1 8b:4q

The raw data noted is: test case, duration in seconds, number of input tokens, number of output tokens. Sadly, Groq is beaten only in time to first token (short prompt + short output); once there is any substantial amount of input, the outcome for the local setups is quite sad.
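
For context, here is a rough sketch of how such a test case can be timed against any OpenAI-compatible endpoint (ollama, vllm and Groq all expose one). This is not the exact script used for the runs below; the base URL, model name, prompt and the stream_options usage reporting are assumptions that may need adjusting per server:

import time
from openai import OpenAI

# Point this at the server under test, e.g. http://localhost:11434/v1 for
# ollama or http://localhost:8000/v1 for vllm; the model name is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

def run_case(name: str, prompt: str, max_tokens: int) -> None:
    start = time.monotonic()
    first_token_at = None
    usage = None
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
        # Ask for token counts in the final streamed chunk; not every
        # OpenAI-compatible server honors this option.
        stream_options={"include_usage": True},
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.monotonic()
        if chunk.usage is not None:
            usage = chunk.usage
    duration = time.monotonic() - start
    ttft = (first_token_at - start) if first_token_at else float("nan")
    print(name, duration,
          usage.prompt_tokens if usage else "?",
          usage.completion_tokens if usage else "?",
          f"ttft={ttft:.2f}s")

run_case("short_prompt_short_output", "Say hi.", 16)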

Macbook ( M1 Pro - ollama latest )

short_prompt_short_output 0.6767678339965641 16 8
long_prompt_short_output 15.856207665987313 2048 151
short_prompt_medium_output 17.927224125014618 19 483
short_prompt_long_output 27.094048583181575 19 722

AWS EC2 ( g6e.xlarge - ollama 0.5.4 )

(Aka llama3.1:8b in ollama naming)

short_prompt_short_output 0.14124418500000502 16 8
long_prompt_short_output 3.366259947000117 2048 316
short_prompt_medium_output 2.1779543770001055 19 228
short_prompt_long_output 6.185515323000118 19 650

Llama 3 series 70B:4Q

I could not immediately find a 4q quant of 3.3 on HuggingFace, so I replaced it with 3.1 for the vllm run; this is unlikely to affect the speed much (even if it does affect the quality of the results).

AWS EC2 ( g6e.xlarge - ollama 0.5.4 - llama 3.3 70b:4q)

(Aka llama3.3:70b in ollama naming)

short_prompt_short_output 1.3487983049999457 16 10
long_prompt_short_output 28.636835034999876 2048 398
short_prompt_medium_output 12.470829341000126 19 190
short_prompt_long_output 58.448369457000126 19 885

AWS EC2 (g6e.xlarge - vllm 0.6.6 - llama 3.1 70b:4q AWQ, standing in for 3.3)

Note: This is a really tight configuration in terms of memory; the default context length (128k tokens) would actually OOM the GPU, so I had to limit it to 4096 tokens, and even then request parallelism with long inputs would not be great (less than 2, so in practice something like 2048 would probably work). Setting --gpu-memory-utilization higher than 0.9 caused OOM, so that is probably not an option either.

How it was started:

sudo docker run --runtime nvidia --gpus all \
    -v /mnt/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=..." \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --max-model-len 4096 \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4

short_prompt_short_output 0.3056405600000289 37 3
long_prompt_short_output 6.556634896000105 3446 63
short_prompt_medium_output 22.446355006999966 41 400
short_prompt_long_output 35.61636655299981 40 635

Running 3 speed tests in parallel did not degrade performance much; beyond that there was not enough VRAM (with long input), so vllm queued them (and e.g. 100 runs took quite a long time to complete ‘in parallel’).
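
As an illustration of what “running 3 speed tests in parallel” means here, a minimal sketch (assuming the run_case helper from the earlier timing snippet; the prompt, output length and concurrency level are arbitrary):

from concurrent.futures import ThreadPoolExecutor

PARALLEL = 3  # beyond roughly this, long-input requests started queueing on the 48 GB card

with ThreadPoolExecutor(max_workers=PARALLEL) as pool:
    futures = [
        pool.submit(run_case, f"parallel_{i}", "Write a long story.", 512)
        for i in range(PARALLEL)
    ]
    for f in futures:
        f.result()  # per-request timings print as each case finishes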

Overall, it was quite a positive experience, although it seems there is no way to limit the queue length of vllm (the Controlling max queue time · Issue #2901 · vllm-project/vllm issue was autoclosed at some point, and the implementation PR feat: controlling max queue time by KrishnaM251 · Pull Request #5884 · vllm-project/vllm is not going anywhere).
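
Lacking a server-side limit, the queue can at least be bounded on the client side; a minimal sketch (the slot count and timeout are arbitrary placeholders, not anything vllm-specific, and run_case is again the hypothetical helper from above):

import threading

MAX_IN_FLIGHT = 4
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def guarded_run(name: str, prompt: str, max_tokens: int) -> None:
    # Fail fast instead of letting requests pile up in the server-side queue.
    if not _slots.acquire(timeout=5):
        raise RuntimeError("too many requests in flight, rejecting client-side")
    try:
        run_case(name, prompt, max_tokens)
    finally:
        _slots.release()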

Groq (llama 3.3 versatile 70b)

short_prompt_short_output 1.365323374979198 37 10
long_prompt_short_output 1.25455483305268 3446 65
short_prompt_medium_output 1.717046125093475 41 397
short_prompt_long_output 4.310799000086263 40 1062

This is left here just to note that yes, it is fast, and the randomness in the API (typically all requests take at least a second no matter what) probably hides the real performance, except perhaps for the long output case.

Bonus content

I was also planning to try Aphrodite Engine, but unfortunately its built-in quantization did not work out, and as it is mostly convenience tooling on top of vllm, it did not seem worth running the same models I had already run with vllm. For some reason the same AWQ model I used with vllm did not work with it either, and I gave up on it at that point.