Using llama.cpp with AWS instances #4225

ggerganov · 2023-11-26T16:15:14Z

ggerganov
Nov 26, 2023
Maintainer

Description

The llama.cpp project offers unique ways of utilizing cloud computing resources. Here we will demonstrate how to deploy a llama.cpp server on a AWS instance for serving quantum and full-precision F16 models to multiple clients efficiently.

Select an instance

Go to AWS instance listings: https://aws.amazon.com/ec2/pricing/on-demand/
Sort by price and find the cheapest one with NVIDIA GPU:
Check the specs:
The g4dn.xlarge instance has 1x T4 Tensor Core GPU with 16GB VRAM. Here are the NVIDIA specs for convenience:

Click me
Start the instance and login over SSH
Also, make sure to enable inbound connections to port 8888 - we will need it later for the HTTP server

Select a model and prepare `llama.cpp`

We have just 16GB VRAM to work with, so we likely want to choose a 7B model. Lately, the OpenHermes-2.5-Mistral-7B model is getting some traction so let's go with it.

We will clone the latest llama.cpp repo, download the model and convert it to GGUF format:

sudo apt update
sudo apt install make g++ git-lfs

# get the model data
git lfs install
git clone https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B

# clone llama.cpp and setup python conversion stuff
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
ln -sfn ../OpenHermes-2.5-Mistral-7B ./models/openhermes-7b-v2.5

# convert to F16 GGUF
python3 convert.py ./models/openhermes-7b-v2.5 --outfile ./models/openhermes-7b-v2.5/ggml-model-f16.gguf --outtype f16

# quantize to Q8_0 and Q4_K
./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q8_0.gguf q8_0
./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf q4_k

Do some performance benchmarks

The T4 GPUs have just 320GB/s memory bandwidth, so we cannot expect huge tok/s numbers, but let's work with what we have.
We want to be serving requests in parallel, so we have to have an idea about the types of queries that we are going to be processing in order to setup some limits. Let's do the following assumptions:

Max clients	Max prompt	Max Len
4	2048	512

We assume that at any moment in time, there will be a maximum of 4 queries being processed in parallel. Each query can have a maximum individual prompt of 2048 tokens and each query can generate a maximum of 512 tokens. So in order to support this scenario, we need to have a KV cache of size 4*(2048 + 512) = 10240 tokens (1280 MiB, F16).

Let's benchmark stock llama.cpp using the F16 model:

# build the benchmark tool
LLAMA_CUBLAS=1 make -j batched-bench

# bench the F16 model
./batched-bench ./models/openhermes-7b-v2.5/ggml-model-f16.gguf 4096 0 99 0 2048 128,512 1,2,3,4

llama_new_context_with_model: total VRAM used: 14363.04 MiB (model: 13563.03 MiB, context: 800.00 MiB)

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  2048 |    128 |    1 |   2176 |    1.914 |  1070.20 |    8.108 |    15.79 |   10.022 |   217.13 |
|  2048 |    512 |    1 |   2560 |    1.828 |  1120.61 |   32.709 |    15.65 |   34.537 |    74.12 |

Legend

PP: prompt size in tokens
TG: text to generate in tokens
B: number of batches (i.e. parallel requests)
N_KV: required size of the KV cache in tokens
T_PP: time to process the prompts
S_PP: speed of processing the prompts in tok/s
T_TG: time to generate the response
S_TG: speed of the generation in tok/s
T: total time to process the batches (i.e. requests)
S: total speed (including prompt and text) in tok/s

We immediately notice that there is not enough VRAM to load both the F16 model and the 10240 tokens KV cache. This means the maximum clients we can serve in this case is just 1. The TG speed is also not great as we expected: ~16 t/s.

Let's for a moment relax the requirements and say that the max prompt size would be 512 instead of 2048. This scenario now fits in the available VRAM and here are the results:

# bench the F16 model using small prompt size of 512
LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/openhermes-7b-v2.5/ggml-model-f16.gguf 4096 0 99 0 512 128,512 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.398 |  1288.00 |    7.589 |    16.87 |    7.987 |    80.13 |
|   512 |    128 |    2 |   1280 |    0.798 |  1283.52 |    8.554 |    29.93 |    9.352 |   136.87 |
|   512 |    128 |    3 |   1920 |    1.298 |  1183.32 |    9.053 |    42.42 |   10.351 |   185.49 |
|   512 |    128 |    4 |   2560 |    1.838 |  1114.15 |    9.277 |    55.19 |   11.115 |   230.33 |
|   512 |    512 |    1 |   1024 |    0.373 |  1372.17 |   30.693 |    16.68 |   31.066 |    32.96 |
|   512 |    512 |    2 |   2048 |    0.784 |  1306.16 |   34.753 |    29.47 |   35.537 |    57.63 |
|   512 |    512 |    3 |   3072 |    1.274 |  1205.64 |   37.243 |    41.24 |   38.517 |    79.76 |
|   512 |    512 |    4 |   4096 |    1.846 |  1109.59 |   38.492 |    53.21 |   40.338 |   101.54 |

llama.cpp supports efficient quantization formats. By using a quantum model, we can reduce the base VRAM required to store the model in memory and thus free some VRAM for a bigger KV cache. This will allow us to serve more clients with the original prompt size of 2048 tokens. Let's repeat the same benchmark using Q8_0 and Q4_K quantum models:

# bench the Q8_0 model
./batched-bench ./models/openhermes-7b-v2.5/ggml-model-q8_0.gguf 10240 0 99 0 2048 128,512 1,2,3,4

llama_new_context_with_model: total VRAM used: 9169.84 MiB (model: 7205.84 MiB, context: 1964.00 MiB)

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  2048 |    128 |    1 |   2176 |    2.596 |   788.77 |    4.974 |    25.73 |    7.570 |   287.44 |
|  2048 |    128 |    2 |   4352 |    6.247 |   655.65 |    7.614 |    33.62 |   13.861 |   313.97 |
|  2048 |    128 |    3 |   6528 |   11.200 |   548.56 |    9.499 |    40.43 |   20.699 |   315.38 |
|  2048 |    128 |    4 |   8704 |   17.268 |   474.42 |   10.880 |    47.06 |   28.148 |   309.23 |
|  2048 |    512 |    1 |   2560 |    2.514 |   814.65 |   20.158 |    25.40 |   22.672 |   112.92 |
|  2048 |    512 |    2 |   5120 |    6.221 |   658.39 |   31.274 |    32.74 |   37.495 |   136.55 |
|  2048 |    512 |    3 |   7680 |   11.136 |   551.71 |   39.606 |    38.78 |   50.743 |   151.35 |
|  2048 |    512 |    4 |  10240 |   17.295 |   473.65 |   46.026 |    44.50 |   63.321 |   161.72 |

# bench the Q4_K model
./batched-bench ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf 10240 0 99 0 2048 128,512 1,2,3,4

llama_new_context_with_model: total VRAM used: 6059.07 MiB (model: 4095.06 MiB, context: 1964.00 MiB)

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  2048 |    128 |    1 |   2176 |    2.409 |   849.99 |    3.938 |    32.50 |    6.348 |   342.79 |
|  2048 |    128 |    2 |   4352 |    5.864 |   698.52 |    6.554 |    39.06 |   12.417 |   350.48 |
|  2048 |    128 |    3 |   6528 |   10.635 |   577.70 |    8.433 |    45.53 |   19.068 |   342.35 |
|  2048 |    128 |    4 |   8704 |   16.522 |   495.82 |    9.795 |    52.27 |   26.317 |   330.74 |
|  2048 |    512 |    1 |   2560 |    2.301 |   890.02 |   16.083 |    31.83 |   18.384 |   139.25 |
|  2048 |    512 |    2 |   5120 |    5.804 |   705.69 |   27.014 |    37.91 |   32.818 |   156.01 |
|  2048 |    512 |    3 |   7680 |   10.540 |   582.93 |   35.224 |    43.61 |   45.764 |   167.82 |
|  2048 |    512 |    4 |  10240 |   16.542 |   495.22 |   41.611 |    49.22 |   58.153 |   176.09 |

Using the quantum models and a KV cache of size 4*(2048 + 512) == 10240 we can now successfully serve 4 clients in parallel and have plenty of VRAM left. The prompt processing speed is not as good as F16, but the text generation is better or similar.

Note that llama.cpp supports continuous batching and sharing a common prompt. A sample implementation is demonstrated in the parallel.cpp example. Here is a sample run with the Q4_K quantum model, simulating 4 clients in parallel, asking short questions with a shared assistant prompt of 300 tokens, for a total of 64 requests:

LLAMA_CUBLAS=1 make -j parallel && ./parallel -m ./models/openhermes-7b-v2.5/ggml-model-f16.gguf -n -1 -c 4096 --cont_batching --parallel 4 --sequences 64 --n-gpu-layers 99 -s 1

Results from `parallel`

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name   = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   70.42 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.06 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 512.00 MiB
llama_new_context_with_model: kv self size  =  512.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4895.07 MiB (model: 4095.06 MiB, context: 800.00 MiB)

No new questions so proceed with build-in defaults.


main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 64, cont_batching = 1, system tokens = 299

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, started decoding ...
Client   1, seq    1, started decoding ...
Client   2, seq    2, started decoding ...
Client   3, seq    3, started decoding ...
Client   1, seq   1/ 64, prompt   15 t, response   22 t, time  1.20 s, speed 30.83 t/s, cache miss 0  
Input:    What is the best way to cook a steak?
Response: The best way to cook a steak depends on your personal preference, but here is a general guideline:

Client   1, seq    4, started decoding ...
Client   1, seq   4/ 64, prompt   13 t, response   12 t, time  0.69 s, speed 36.18 t/s, cache miss 0  
Input:    Recommend some interesting books to read.
Response: Here are some books that I recommend for your reading pleasure:

Client   1, seq    5, started decoding ...
Client   3, seq   3/ 64, prompt   22 t, response   56 t, time  3.06 s, speed 25.48 t/s, cache miss 0  
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: I am familiar with the Special Theory of Relativity and I would be happy to explain it to you. The Special Theory of Relativity is a theory proposed by Albert Einstein in 1905 that explains the relationship between space and time. It has two postulates:

...

Client   2, seq  63/ 64, prompt   22 t, response   46 t, time  2.36 s, speed 28.80 t/s, cache miss 0  
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, I am familiar with the Special Theory of Relativity. It was developed by Albert Einstein in 1905 and it describes the physical laws that govern objects in motion. The theory is based on two postulates:

Client   3, seq  61/ 64, prompt   22 t, response   96 t, time  4.52 s, speed 26.13 t/s, cache miss 0  
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, I am familiar with the Special Theory of Relativity. It was developed by Albert Einstein in 1905 and it is based on two postulates or principles. The first postulate is that the laws of physics are the same for all observers who are moving at a constant velocity relative to each other. The second postulate is that the speed of light in a vacuum is always the same, regardless of the motion of the source of light or the observer.

main: clearing the KV cache

run parameters as at 2023-11-26 14:56:55

main: n_parallel = 4, n_sequences = 64, cont_batching = 1, system tokens = 299
External prompt file: used built-in defaults
Model and path used:  ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf

Total prompt tokens:    956, speed: 17.48 t/s
Total gen tokens:      3794, speed: 69.39 t/s
Total speed (AVG):           speed: 86.87 t/s
Cache misses:             0


llama_print_timings:        load time =   12698.38 ms
llama_print_timings:      sample time =    1845.35 ms /  3858 runs   (    0.48 ms per token,  2090.66 tokens per second)
llama_print_timings: prompt eval time =   51162.52 ms /  5021 tokens (   10.19 ms per token,    98.14 tokens per second)
llama_print_timings:        eval time =     750.60 ms /    28 runs   (   26.81 ms per token,    37.30 tokens per second)
llama_print_timings:       total time =   54680.08 ms

Running a demo HTTP server

The llama.cpp server example can be build and started like this:

# start llama.cpp server, max 4 clients in parallel, prompt size 2048, max seq 512, listen on port 8888
LLAMA_CUBLAS=1 make -j server && ./server -m models/openhermes-7b-v2.5/ggml-model-q4_k.gguf --port 8888 --host 0.0.0.0 --ctx-size 10240 --parallel 4 -ngl 99 -n 512

# send a completion request via curl
curl -s http://XXX.XXX.XXX.XXX:8888/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
            },
            {
                "role": "user",
                "content": "Write a limerick about python exceptions"
            }
        ]
    }' | jq

# ... result:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "There once was a coder named Sue,\nWho worked with Python, quite true.\nShe'd write her code,\nBut exceptions would load,\nAnd leave her with feelings of askew.",
        "role": "assistant"
      }
    }
  ],
  "created": 1701012712,
  "id": "chatcmpl-sHBoOZIbYDI3M6vfzWREuNJxRJ0WuBqN",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 43,
    "prompt_tokens": 48,
    "total_tokens": 91
  }
}

An alternative way for really quick deployment of llama.cpp for demo purposes is to use the server-llm.sh helper script:

bash -c "$(curl -s https://ggml.ai/server-llm.sh)"

For more info, see: #3868

Final notes

This was a short walkthrough of how to setup and bench llama.cpp in the cloud that I hope would be useful for people looking for a simple and efficient LLM solution. There are many details not covered here and one needs to understand some of the intricate details of the llama.cpp and ggml implementations in order to take full advantage of the available compute resources. Knowing when to use a quantum model vs F16 model for example requires understanding of the existing CUDA kernels and their limitations. The code base is still relatively simple and allows to easily customize the implementation according to the specific needs of a project. Such customizations can yield significant performance gains compared to the stock llama.cpp implementation that is available out-of-the-box from master.

arch-btw · 2023-11-26T19:03:55Z

arch-btw
Nov 26, 2023

Thank you!

0 replies

gerred · 2023-11-27T20:36:30Z

gerred
Nov 27, 2023

this is gold. thank you.

0 replies

jpalacios84 · 2023-11-28T01:49:37Z

jpalacios84
Nov 28, 2023

Thanks for sharing! Here's a side quest for those of you using llama.cpp via Python bindings and CUDA. This is a minimalistic example of a Docker container you can deploy in smaller cloud providers like VastAI or similar.

To build the image: docker build -f Dockerfile_llamacpp -t mistral7b-llamacpp .
To run the container: docker run --gpus all mistral7b-llamacpp

Model: https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF

Dockerfile_llamacpp

You can change CUDA's version number as required. Literally, just change the number on line 1 to a valid version compatible with your hardware.

FROM nvidia/cuda:12.0.0-devel-ubuntu22.04

RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

ENV CUDA_DOCKER_ARCH=all
ENV LLAMA_CUBLAS=1

RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

WORKDIR /home/myapp

COPY ./app.py ./app.py
COPY ./mistral-7b-openorca.Q5_K_M.gguf ./mistral-7b-openorca.Q5_K_M.gguf

CMD python3 app.py

app.py

from llama_cpp import Llama

__VERBOSE = True
__N_CTX = 1024
__MAX_TOKENS = 4096 - __N_CTX

model = Llama(
    model_path="mistral-7b-openorca.Q5_K_M.gguf",
    n_gpu_layers=35,
    n_ctx=__N_CTX,
    n_threads=1,
    verbose=__VERBOSE,
)

template = """
<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""

prompt = "How did the Universe start?"

llm_res = model(
    template.replace(
        "{system}",
        "You are answering the questions of someone with a Physics degree. Please be very technical.",
    ).replace(
        "{prompt}",
        prompt,
    ),
    max_tokens=__MAX_TOKENS,
)

print("")
print("User:", prompt)
print("Mistral-7B:", llm_res["choices"][0]["text"])

When you run it you should see something like this at the very top (if verbose is True):

Happy building/hacking

1 reply

Galunid Nov 29, 2023
Collaborator

I'm quite sure that's not a good idea:
COPY ./mistral-7b-openorca.Q5_K_M.gguf ./mistral-7b-openorca.Q5_K_M.gguf

I think ideally you'd want to mount the model as a volume inside the container, so that you can swap to different models and your images don't have those extra gigabytes.

openmarmot · 2023-11-28T16:31:47Z

openmarmot
Nov 28, 2023

I created a example of doing this in cloudformation - although I geared my example for CPU only in order to keep total costs down

https://github.com/openmarmot/aws-cft-llama-cpp

The template creates a EC2 instance, installs llama.cpp and runs the server, and attaches an IAM role with the permissions necessary to enable the AWS web console (Connect) for linux console access so you don't have to SSH to the instance to connect

0 replies

cmp-nct · 2023-11-29T16:56:45Z

cmp-nct
Nov 29, 2023

A word of caution on Amazon Cloud:
They are usually quite cheap to get into, free Credits and sometimes even half a year or more of free usage.
But in general they are extremely high priced as soon as you actually have production loads, not just a bit more than competition but a multiple and their ecosystem is unique so getting out of it isn't that easy.

If you run something productive there you'll find moving out difficult as everything needs to adapt to their cloud.
I fell into that trap with open eyes and overpaid almost 200 grand (for stuff that's worth 15 grand), took me months and months of time to slide out again into normal hosting solutions.

Not saying to stay away, it's a good way to get started and quite cheap when using at low scale or only for a couple hours.
But if you start to scale up the trap opens.

1 reply

tescao Feb 22, 2024

What did you end up using instead?

geekgao · 2023-11-30T03:22:31Z

geekgao
Nov 30, 2023

Awesome, thanks!

0 replies

ei-grad · 2023-12-03T16:22:04Z

ei-grad
Dec 3, 2023

Did someone tried llama.cpp on the m6g/t4g? (cpu-only, ARM; t4g - burstable)

Though it is 5 times cheaper, CPU-inference should be many times slower, but Nvidia T4 is not a very powerfull GPU, so it still can make sense to compare.

(beware, so many different t/g-4's)

1 reply

koskoakos Dec 4, 2023

Tried on c6g.2xlarge (they too use graviton) – for my scenario (medium length prompt, short reply) it was about 10x slower, in the range of 3-5 tokens per second.

srikrish2812 · 2023-12-05T03:14:34Z

srikrish2812
Dec 5, 2023

I am unable to run quantize in the last two lines. I downloaded the repo on 5th December 2023. Could you please guide me in executing the code? I am facing the following error:

./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q8_0.gguf q8_0
zsh: no such file or directory: ./quantize

1 reply

Jeff-Jc Dec 5, 2023

first - make LLAMA_CLBLAST=1
And then use quantize

d-z-m · 2023-12-13T19:19:12Z

d-z-m
Dec 13, 2023

Hi there. I wanted to clarify a couple of things about this tutorial.

What AMI is being used? I assume a recent version of Ubuntu or Debian, based on the use of the apt package manager.
I'm unable to compile any of the examples(batched-bench, server, parallel). because of an error from nvcc
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
Having fixed that by changing:
MK_NVCCFLAGS += -arch=native
to:
MK_NVCCFLAGS += -arch=all

running the executables fails with:

CUDA error 100 at ggml-cuda.cu:487: no CUDA-capable device is detected

TL;DR:

Is this tutorial omitting some steps? I'm having some trouble getting it to work with a pretty much identical configuration.

5 replies

koskoakos Dec 13, 2023

That means you’re running it on a non-nvidia GPU instance (probably graviton?)
Which means you can’t use CUBLAS.
If you use the instance type from the guide (g4dn) it comes with nvidia T4 (you can verify by running nvidia-smi command)

I personally run ML on AWS on PyTorch Ubuntu becaue it has nvidia drivers installed and such.

d-z-m Dec 13, 2023

I'm using a g4dn.2xl.

nvidia-smi isn't installed. apt suggests many different nvidia-utils-NNN packages which provide it.

openmarmot Dec 13, 2023

Amazon has some AMIs with the Nvidia drivers installed. They will have a DLAMI suffix (?) which stands for Deep learning AMI.

Otherwise you will want to install and setup the Nvidia drivers yourself (you will also have to do some configuration).
see here for more details on installing the NVIDIA drivers : https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

Note that each major instance type generally has a different NVIDIA GPU - which corresponds to a different driver family

For ease of use I would suggest starting with Amazon Linux 2 or (the newer) Amazon Linux 2023 - as these are what the Amazon documentation is generally written for.

If you are interested in a CPU only install you can follow along with the bash commands, or run the cloudformation file that I have here : https://github.com/openmarmot/aws-cft-llama-cpp

koskoakos Dec 13, 2023

In this case I would advise either choosing different AMI (like pytorch or from the answer above), because installing nvidia drivers on ubuntu is either super easy, or a fun way to spend an unspecified amount of extra time.
Also be careful with amazon linux 2, because some time ago (maybe it’s irrelevant now) I couldn’t build ggml on it because of very old kernel version and compatibility issues

d-z-m Dec 13, 2023

Amazon has some AMIs with the Nvidia drivers installed. They will have a DLAMI suffix (?) which stands for Deep learning AMI.

Thanks, this helped a lot.

openmarmot · 2024-01-04T01:09:16Z

openmarmot
Jan 4, 2024

I've gone through the process of getting the Nvidia drivers installed silently on the P3 series if anyone is still looking for help on this

note - this is a aws cloudformation template that can be run from the aws cloudformation console. You can also just grab the linux shell commands from the userdata area and do it yourself manually if you want.

The Nvidia linux drivers are ultra picky. This template is designed for Amazon Linux 2 and a P3 series - it will likely need modification if you are using a different instance or instance type.

https://github.com/openmarmot/aws-ec2-nvidia-drivers/blob/main/cft-al2-p3-series.yaml

2 replies

openmarmot Mar 6, 2024

update - also added a template for Amazon Linux 2023. This OS is Amazon's newest. Amazon Linux 2 is reaching EOL
https://github.com/openmarmot/aws-ec2-nvidia-drivers

openmarmot Jun 18, 2024

update 2 - NVIDIA has official repos for AL2023 now so I was able to create a GPU accelerated version. I haven't really optimized the server settings but it is extremely fast in my testing
https://github.com/openmarmot/aws-cft-llama-cpp/blob/main/cft-llama-cpp-gpu.yaml

maccarini · 2024-02-21T17:46:19Z

maccarini
Feb 21, 2024

How do you come up with the KV cache size? (1280 MiB)

0 replies

Using llama.cpp with AWS instances #4225

ggerganov Nov 26, 2023 Maintainer

Description

Select an instance

Select a model and prepare llama.cpp

Do some performance benchmarks

Running a demo HTTP server

Final notes

Replies: 11 comments · 11 replies

Dockerfile_llamacpp

app.py

Galunid Nov 29, 2023 Collaborator

ggerganov
Nov 26, 2023
Maintainer

Select a model and prepare `llama.cpp`

Replies: 11 comments 11 replies

Galunid Nov 29, 2023
Collaborator