In this post, you will take a closer look at LocalAI, an open source alternative to OpenAI which allows you to run LLMs on your local machine. No GPU is needed; consumer-grade hardware will suffice. Enjoy!

1. Introduction

OpenAI is a great tool. However, you may not be allowed to use it due to company policies, because you might send sensitive information to OpenAI. Besides that, you might want to experiment with different kinds of LLMs (Large Language Models). Wouldn’t it be great if you could run models locally using the same REST API as for OpenAI? Well, that is exactly what LocalAI has to offer! LocalAI is an open source alternative to OpenAI with a REST API that is compatible with the OpenAI API specifications. Besides that, no GPU is needed; you can run it on consumer-grade hardware. It is advised, however, to use a GPU, because it will be approximately 20 times faster.

2. Prerequisites

Actually, there are no prerequisites for reading this blog. As I am at the beginning of learning more about AI applications, this blog is at entry level. There is no need to know how LLMs work internally; we will just make use of them.

You do need the following tools:

  • Git
  • Docker Compose
  • curl or equivalent (Postman for example)
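
If you are unsure whether these tools are available on your machine, you can quickly check them, for example:

$ git --version
$ docker compose version
$ curl --version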

3. Installation

The installation of LocalAI for CPU is described here. This section lists the steps and the changes I made in order to install LocalAI.

Clone the LocalAI git repository.

$ git clone https://github.com/go-skynet/LocalAI

Navigate into the repository directory.

$ cd LocalAI

The repository contains a .env file which you need to customize.

  • Uncomment THREADS and adjust the number to the number of physical cores you have (12 in my case);
  • Uncomment GALLERIES and adjust it to the galleries as described in the installation guide.

The top of the file looks as follows:

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=12

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Default models context size
# CONTEXT_SIZE=512
#
## Define galleries.
## Models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]

Start the Docker container. The Docker image refers to the latest tag; at the time of writing, v2.0.0 is the latest version of LocalAI. You can verify this by navigating to the image repository, searching for the latest tag, copying its manifest hash, and searching for that hash among the versioned tags.

$ docker compose up -d --pull always

Be patient; this takes some time. The image is about 70GB. The previous version, v1.40.0, was approximately 14GB.
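
If you want to check the size of the image on your machine, you can, for example, list the Docker images (the exact image name depends on the compose file you use):

$ docker images | grep local-ai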

When the container has started successfully, you should be able to retrieve the available models:

$ curl http://localhost:8080/models/available
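
The list of available models is rather long. Assuming you have jq installed, you can filter it, for example to look for the model used later in this post (the exact field names depend on the gallery index):

$ curl -s http://localhost:8080/models/available | jq '.[] | select(.name | contains("luna"))'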

4. Install a Model

First, you need to install a model. You can do so via the model gallery API, but at the time of writing this is still experimental. I prefer to add the model manually. The instructions can be found here, but be aware that they might change over time, so do not rely solely on the contents of this section.
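
For reference, a sketch of the experimental gallery route: you POST a gallery model id to the /models/apply endpoint. The id below is only a placeholder; the available ids can be found in the output of /models/available.

$ curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "id": "model-gallery@<model-name>"
   }'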

Create a file lunademo.yaml in the models directory. Change the threads setting to the number of physical cores on your machine.

name: lunademo
parameters:
  model: luna-ai-llama2-uncensored.Q5_K_M.gguf
  top_k: 80
  temperature: 0.2
  top_p: 0.7
context_size: 1024
threads: 12
backend: llama
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
template:
  chat: lunademo-chat
  completion: lunademo-completion

The model parameter refers to a file containing the model. Download this file from HuggingFace to the models directory. HuggingFace hosts many open source models you can use, but in this example you will use a model based on Llama 2, the AI model created by Meta. Note that the Model Card lists the models together with their use case, and the use case indicates which models are recommended. Make sure to use only GGUF models; GGML is no longer supported for Llama 2.
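
As an illustration, downloading the model file could look as follows. This assumes TheBloke's Luna-AI-Llama2-Uncensored-GGUF repository on HuggingFace; verify the exact URL on the Model Card before using it.

$ wget -P models https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q5_K_M.gguf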

Also note that two templates are defined in the configuration file of the model: a chat template and a completion template.

Create a file lunademo-chat.tmpl in the models directory. The template is derived from the Model Card at HuggingFace (search for Prompt template).

USER: {{.Input}}

ASSISTANT:

Create a file lunademo-completion.tmpl in the models directory.

Complete the following sentence: {{.Input}}
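
The chat template is used by the chat completions endpoint, which is demonstrated in the next section; the completion template is used by the completions endpoint. Once the container has been restarted (next step), you could exercise it with something like the following (the prompt is just an example):

$ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "prompt": "The capital of the Netherlands is",
     "temperature": 0
   }'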

Restart the Docker container in order to load the model.

$ docker compose restart
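
To verify that the model is known to LocalAI, you can list the installed models; lunademo should appear in the response (the exact response format may differ per version):

$ curl http://localhost:8080/v1/models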

5. Ask Questions

Now that a model has been loaded, you can start asking questions. You can take a look at the OpenAPI specification; below, some examples are shown in order to verify how the local model responds and how accurate it is.

5.1 How Are You?

As a first simple example, you ask the model how it is feeling. In the request, you specify the model to be used and the message, and you can set the temperature. A high temperature allows the model to be more creative. The model answers that it is doing well.

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"I'm doing well, thank you. How about yourself?"
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

5.2 Fact About Famous Actor

Let’s ask the model whether it knows who Leonardo di Caprio is. You set the temperature to zero because you only want facts. The answer is short but correct. Also note that it corrected the spelling of the name in the response.

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "who is leonardo di caprio?"}],
     "temperature": 0 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"Leonardo DiCaprio is an American actor and film producer. He has appeared in numerous films, including \"Titanic,\" \"The Revenant,\" and \"The Wolf of Wall Street.\""
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

5.3 Facts About Famous Soccer Player

Let’s verify whether it also knows the famous Dutch soccer player Johan Cruijff. The answer is correct in this case as well.

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "who is Johan Cruijff?"}],
     "temperature": 0 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"Johan Cruyff was a Dutch professional football player and coach. He played as a forward for Ajax, Barcelona, and the Netherlands national team. He is widely regarded as one of the greatest players of all time and was known for his creativity, skill, and ability to score goals from any position on the field."
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

5.4 A Fairy Tale

Ask the model to write a short fairy tale and set the temperature to a high value again. The answer is a nice fairy tale including goblins and fairies.

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "write a short fairy tale including goblins and fairies"}],
     "temperature": 0.9 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"Once upon a time, in a faraway forest lived a group of mischievous goblins. They loved to cause trouble for the creatures living in the woods. One day, they decided to prank the fairies who lived on the other side of the forest.\n\nThe goblins snuck into the fairy's garden and stole all their flowers. The fairies were devastated and knew they had to get their flowers back. They searched high and low but couldn't find any trace of the goblins.\n\nJust when they thought all hope was lost, a wise old fairy appeared before them. She told them that she knew where the goblins had taken the flowers and gave them a clue to find them.\n\nThe fairies followed the clue and found themselves in front of a cave. They were hesitant to enter but the wise fairy encouraged them to be brave. Inside the cave, they saw the goblins playing with the stolen flowers.\n\nThe fairies were furious and ready to punish the goblins for their mischief. But the wise old fairy stopped them and suggested a different approach. She told them that instead of getting angry, they should try to understand why the goblins had done this.\n\nThe fairies listened to her and decided to talk to the goblins. They explained to them how much the flowers meant to them and how they were used for healing and magic spells. The goblins were sorry for their prank and promised never to do it again.\n\nFrom that day on, the fairies and goblins became friends and would often play together in the forest. They learned to appreciate each other's differences and lived happily ever after."
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

5.5 Other Languages

Up until now, you have interacted in English. But what if you need to use a different language, Dutch for example? Is the model able to understand and answer in Dutch? Let’s find out!

Ask the model ‘Do you understand Dutch?’. The answer is yes.

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "begrijp je nederlands?"}],
     "temperature": 0 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"Ja, ik kan Nederlands begrijpen."
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

Ask the model in Dutch who Johan Cruijff is. The model answers correctly, but it answers in English.

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "wie is johan cruijff?"}],
     "temperature": 0 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"Johan Cruyff was a Dutch professional football player and manager. He played as a forward for Ajax, Barcelona, and the Netherlands national team. He is widely regarded as one of the greatest players of all time and is known for his innovative playing style."
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

You can fix this by instructing the assistant to always answer in Dutch. You can do so by adding a system message with this instruction to the request. The answer is in Dutch and correct. This is quite amazing, isn’t it? Dutch is not a widely spoken language, and you are running the model locally!

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [
             {"role": "system", "content": "You are a helpful assistant. Antwoord altijd in het Nederlands."},
             {"role": "user", "content": "wie is Johan Cruijff?"}],
     "temperature": 0 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"Johan Cruijff was een Nederlandse voetballer die bekendstond om zijn technische vaardigheden en zijn snelle, creatieve spel. Hij speelde als middenvelder voor onder andere Ajax, Barcelona en het Nederlands elftal."
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

5.6 Stream the Response

Sometimes, the answer takes a while. However, by adding the stream parameter to the request, you do not have to wait for the complete response: the answer is streamed piece by piece, so you can already display it to the user while it is being generated. This makes for a better user experience.

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "who is Johan Cruijff?"}],
     "temperature": 0,
     "stream": true
   }'
data: {"created":1700993538,"object":"chat.completion.chunk","id":"2fe33052-f4be-4724-8b53-fdade80b49de","model":"lunademo","choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

data: {"created":1700993538,"object":"chat.completion.chunk","id":"2fe33052-f4be-4724-8b53-fdade80b49de","model":"lunademo","choices":[{"index":0,"delta":{"content":"J"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
...
data: {"created":1700993538,"object":"chat.completion.chunk","id":"2fe33052-f4be-4724-8b53-fdade80b49de","model":"lunademo","choices":[{"index":0,"finish_reason":"stop","delta":{"content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

data: [DONE]

5.7 Format Response as JSON

Verify whether the answer can be formatted as a JSON object.

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "who is Johan Cruijff? Format the response as a JSON object containing firstName, lastName and clubs"}],
     "temperature": 0 
   }'
{
   "created":1700993538,
   "object":"chat.completion",
   "id":"2fe33052-f4be-4724-8b53-fdade80b49de",
   "model":"lunademo",
   "choices":[
      {
         "index":0,
         "finish_reason":"stop",
         "message":{
            "role":"assistant",
            "content":"{\n \"firstName\": \"Johan\",\n \"lastName\": \"Cruijff\",\n \"clubs\": [\n {\n \"name\": \"Ajax Amsterdam\",\n \"startYear\": 1957,\n \"endYear\": 1968\n },\n {\n \"name\": \"Barcelona\",\n \"startYear\": 1968,\n \"endYear\": 1973\n },\n {\n \"name\": \"Manchester United\",\n \"startYear\": 1973,\n \"endYear\": 1974\n }\n ]\n"
         }
      }
   ],
   "usage":{
      "prompt_tokens":0,
      "completion_tokens":0,
      "total_tokens":0
   }
}

The content is a JSON object and it is formatted just as you asked.

{
   "firstName":"Johan",
   "lastName":"Cruijff",
   "clubs":[
      {
         "name":"Ajax Amsterdam",
         "startYear":1957,
         "endYear":1968
      },
      {
         "name":"Barcelona",
         "startYear":1968,
         "endYear":1973
      },
      {
         "name":"Manchester United",
         "startYear":1973,
         "endYear":1974
      }
   ]
}

However, do note that you also asked for the clubs Johan Cruijff played for. Although this seems to be correct, Johan Cruijff never played for Manchester United, and the start and end years for Ajax and Barcelona are incorrect. The model is hallucinating here, even with the temperature set to 0. See Wikipedia for the details.

6. Conclusion

Running an LLM locally is possible by means of LocalAI, and you can run it even if you do not have a GPU. This is very promising and opens the door to using LLMs even if your company policies do not allow you to use cloud-hosted LLMs.