Motivation
After installing a graphics card in my server, I wanted to make full use of it, so I decided to deploy a large language model. This way I don’t need to subscribe to ChatGPT every month; I don’t use ChatGPT very often, and the 20 USD monthly subscription fee is a bit expensive. With a private deployment I can use all kinds of large language models, not just ChatGPT. Even if open-source models are not as good as the paid version of ChatGPT, I can still call ChatGPT through OpenAI’s API, which achieves almost the same effect as the paid subscription, and the monthly cost should be less than 20 USD.
Prerequisites
- Already have a server or local computer
- Docker and docker-compose are installed
- If you have a graphics card, the relevant driver is installed (for NVIDIA cards, also the NVIDIA Container Toolkit, so that Docker containers can access the GPU)
Background Knowledge
A large language model (LLM) is a deep-learning-based natural language processing model that can generate natural language text. Training a large language model requires a lot of computing resources, so it is usually done on GPUs. There are many open-source large language models, such as GPT-2, T5, and Llama. These models are based on the Transformer architecture and have achieved good results in natural language processing tasks.
To deploy a large language model by yourself, you generally need three components:
- Trained model: Large language models trained by companies or research institutions are usually saved as PyTorch or TensorFlow model files. Model files are usually large and require several GB to tens of GB of storage space. Generally, choose according to the size of your graphics card memory; for example, many models are 7B, which means the model has 7 billion parameters. Quantized to 4 bits, the weights of such a model take roughly 4–5 GB, so it can generally run on an 8GB graphics card.
- Framework for running the model: A program that can load model files and run the model on a graphics card. Here we use ollama, which can download and load a wide range of models and serve them locally.
- User interaction front-end interface: An interface that can interact with users, users can enter text, and the model will generate a reply. Here we use LobeChat, which is a web-based user interaction interface that can be integrated with ollama.
ollama
ollama is an open-source framework for running large language models. It can download and load a wide range of models and lets users run, customize, and deploy them locally. ollama runs models in the GGUF format used by llama.cpp (models in other formats, such as Safetensors, can be imported via a Modelfile) and works on both CPU and GPU. ollama provides a RESTful API, so users can call the model through HTTP requests.
ollama supports Windows, Linux, and macOS systems and can run on local computers or servers, as well as in Docker containers. Here we use Docker containers to run ollama.
Installation
We deploy ollama in a Docker container, using the docker-compose approach introduced in the article “Docker Best Practices Guide - docker-compose and Portainer”.
- Create a subdirectory `ollama` under `~/docker` to store ollama-related files.

- Create a `docker-compose.yml` file in the `~/docker/ollama` directory to define the ollama configuration:

  ```yaml
  services:
    ollama:
      container_name: ollama
      image: ollama/ollama
      environment:
        - OLLAMA_ORIGINS=*
        - OLLAMA_HOST=0.0.0.0
        - OLLAMA_MODELS=/root/.ollama/models
      ports:
        - "11434:11434"
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                capabilities: ["gpu"]
                count: all
      volumes:
        - ollama:/root/.ollama
      restart: always

  volumes:
    ollama:
  ```
- Run the `docker-compose up -d` command to start the ollama container.
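After `docker-compose up -d`, it is worth checking that the container came up correctly and that the GPU is visible inside it (for the GPU reservation in the compose file to work, the NVIDIA Container Toolkit needs to be installed on the host). A minimal sketch of such a check, assuming the compose file above:

```bash
# Confirm the ollama container is running and look at its recent logs for errors
docker ps --filter name=ollama
docker logs ollama | tail -n 20

# The root endpoint of the ollama API replies with a short "Ollama is running" message
curl http://127.0.0.1:11434/

# If GPU passthrough works, nvidia-smi inside the container should list the card
docker exec ollama nvidia-smi
```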
Usage
Common commands in ollama include:
- `ollama pull`: download a pre-trained model from the ollama model library.
- `ollama run`: run a model and chat with it interactively.
- `ollama serve`: start the ollama service (the Docker image already runs this for us).
- We can enter the ollama container by executing the `docker exec -it ollama bash` command and then run these commands inside it.

- First, we need to download a pre-trained model; for example, we can download the llama3.1 model:

  ```bash
  ollama pull llama3.1
  ```

  Of course, there are many other models; you can check them in the ollama model library.
- Then we test whether we can use this model:

  ```bash
  curl http://127.0.0.1:11434/api/generate -d '{
    "model": "llama3.1",
    "prompt": "Why is the sky blue?",
    "options": {
      "num_ctx": 4096
    }
  }'
  ```
We send an HTTP POST request to ollama’s RESTful API to ask llama3.1 why the sky is blue. If our ollama deployment is working properly, the API will stream back JSON responses containing the generated text.
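By default, `/api/generate` streams the answer as a series of JSON objects. If a single response object is easier to inspect, streaming can be disabled; the sketch below assumes the same llama3.1 model pulled above.

```bash
# Same request with streaming disabled: the whole answer comes back as a single
# JSON object whose "response" field contains the generated text
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```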
LobeChat
LobeChat is a web-based user interaction interface that can be integrated with ollama. Users can enter text in LobeChat, and ollama will generate a reply. LobeChat provides a simple interface that users can use in a browser.
LobeChat also supports voice synthesis, image recognition, multimodal, plugins, and other functions, allowing users to interact with ollama in multiple ways.
Installation
LobeChat is a Node.js-based application that can run on various operating systems. We can use Docker containers to deploy LobeChat.
- Create a subdirectory `lobe-chat` under `~/docker` to store LobeChat-related files.

- Create a `docker-compose.yml` file in the `~/docker/lobe-chat` directory to define the LobeChat configuration:

  ```yaml
  services:
    lobe-chat:
      container_name: lobe-chat
      image: lobehub/lobe-chat
      environment:
        # Address of the ollama API; adjust if ollama runs on a different host or port
        - OLLAMA_PROXY_URL=http://127.0.0.1:11434
      ports:
        - "3000:3000"
      restart: always
  ```
- Run the `docker-compose up -d` command to start the LobeChat container.
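Before opening the page, we can quickly confirm that LobeChat is serving and that the ollama API is reachable from the machine where the browser runs, since LobeChat talks to ollama over HTTP. A minimal sketch, using the ports mapped in the two compose files:

```bash
# LobeChat's web UI should respond on port 3000
curl -I http://localhost:3000

# ollama's API should respond on port 11434; /api/tags lists the downloaded models
curl http://localhost:11434/api/tags
```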
Usage
After completing the above steps, we can access the LobeChat interface by visiting http://localhost:3000 in the browser (the first time the page is opened, it may take a few seconds to finish initialization).
Then, we can use LobeChat to interact with ollama, just like using ChatGPT. It should be noted that we need to select the model that has been downloaded in ollama. For example, we downloaded the Llama3.1 8B model, so we can choose the Llama3.1 model in LobeChat.
Configure Other Models
LobeChat also supports many other models, such as OpenAI, Google’s Gemini, Tongyi Qianwen (Qwen), etc. You can click on the avatar in the upper left corner, select “Settings” in the menu, open the settings page, and then select the “Language Model” tab to see all supported models. You can enable or disable a model, or enter the API Key a model requires in order to use it.
Remote Access
If you want to use LobeChat remotely, you can set up a reverse proxy with Nginx according to the method introduced in the article “Access Personal Website from the Public Network - Nginx Reverse Proxy Configuration”, bind it with your domain name, and then access LobeChat through the domain name.
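As a rough illustration, the sketch below writes a minimal Nginx reverse-proxy site for LobeChat; the domain chat.example.com and the config file path are placeholders, and TLS (for example via certbot) is left out for brevity.

```bash
# Minimal sketch: reverse-proxy a placeholder domain to LobeChat on port 3000
sudo tee /etc/nginx/conf.d/lobe-chat.conf > /dev/null <<'EOF'
server {
    listen 80;
    server_name chat.example.com;          # replace with your own domain

    location / {
        proxy_pass http://127.0.0.1:3000;  # LobeChat port mapped in docker-compose
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # keep long-lived streaming responses (chat replies) working
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
    }
}
EOF

# Validate the configuration and reload Nginx
sudo nginx -t && sudo systemctl reload nginx
```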