
Deploy a Private Large Language Model Locally or on a Server with ollama and LobeChat

Deploy a private large language model on a local machine or a server with ollama and LobeChat, as an alternative to ChatGPT

Motivation

After installing a graphics card in my server, I wanted to make full use of it, so I thought of deploying a large language model. That way, I don’t need to subscribe to ChatGPT every month; I don’t use ChatGPT very often, and the 20 USD monthly subscription fee is a bit expensive. With a private deployment I can use various large language models, not just ChatGPT. Even if open-source models are not as good as the paid version of ChatGPT, I can still call ChatGPT through OpenAI’s API, which gives almost the same experience as the paid subscription, but usually for less than 20 USD per month.

Prerequisites

  • Already have a server or local computer
  • Docker and docker-compose are installed
  • If you have a graphics card, the corresponding driver is installed (for GPU access from Docker containers, the NVIDIA Container Toolkit is also required)
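
A quick way to confirm these prerequisites is to check them from a shell. A minimal sketch (the nvidia-smi check only applies to NVIDIA graphics cards):

    # Check that Docker and Compose are installed
    docker --version
    docker compose version   # or: docker-compose --version

    # Check that the GPU driver is installed and the card is visible (NVIDIA cards only)
    nvidia-smi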

Background Knowledge

A large language model (LLM) is a deep learning-based natural language processing model that can generate natural language text. Training a large language model requires a lot of computing resources, so it is usually done on GPUs. There are many openly available large language models, such as GPT-2, T5, and Llama. These models are based on the Transformer architecture and have achieved good results on natural language processing tasks.

To deploy a large language model by yourself, you generally need three components:

  • Trained model: Large language models trained by companies or research institutions are usually released as weight files (for example PyTorch/Safetensors checkpoints, or GGUF files in the case of ollama). These files are large, requiring several GB to tens of GB of storage space. Generally, choose a model according to the size of your graphics card memory; for example, many models come in a 7B variant, meaning the model has 7 billion parameters, which in quantized form can generally run on a graphics card with 8 GB of memory.
  • Framework for running the model: A program that can load the model files and run the model on a graphics card. Here we use ollama, which can load a wide range of models and lets users download, customize, and deploy them locally.
  • User interaction front end: An interface through which users can enter text and read the model’s replies. Here we use LobeChat, a web-based chat interface that can be integrated with ollama.

ollama

ollama is an open-source framework for running large language models locally. It can load a wide range of models (distributed in the GGUF format, with support for importing weights from other formats), lets users download, customize, and deploy them, and can run on both CPU and GPU. ollama exposes a RESTful API, so users can call a model through plain HTTP requests.
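
Once ollama is deployed (the steps follow below), this API can be exercised with plain curl. A minimal sketch using two of ollama’s HTTP endpoints:

    # Health check: the root endpoint answers with "Ollama is running"
    curl http://127.0.0.1:11434/

    # List the models that have been downloaded locally
    curl http://127.0.0.1:11434/api/tags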

ollama supports Windows, Linux, and macOS systems and can run on local computers or servers, as well as in Docker containers. Here we use Docker containers to run ollama.

Installation

We deploy ollama in a Docker container, following the approach introduced in the article “Docker Best Practices Guide - docker-compose and Portainer”.

  1. Create a subdirectory ollama under ~/docker to store ollama-related files.

  2. Create a docker-compose.yml file in the ~/docker/ollama directory to define the ollama configuration.

    services:
      ollama:
        container_name: ollama
        image: ollama/ollama
        environment:
          - OLLAMA_ORIGINS=*
          - OLLAMA_HOST=0.0.0.0
          - OLLAMA_MODELS=/root/.ollama/models
        ports:
          - "11434:11434"
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  capabilities: ["gpu"]
                  count: all
        volumes:
          - ollama:/root/.ollama
        restart: always

    volumes:
      ollama:

  3. Run the docker-compose up -d command to start the ollama container.
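
After the container starts, it is worth verifying that the service is up and, if applicable, that the graphics card is visible inside the container. A minimal sketch (the nvidia-smi check assumes an NVIDIA card and the NVIDIA Container Toolkit):

    # Check the container status and logs
    docker ps --filter name=ollama
    docker logs ollama

    # The API answers with "Ollama is running" when it is healthy
    curl http://127.0.0.1:11434/

    # Confirm that the GPU is visible inside the container (NVIDIA cards only)
    docker exec -it ollama nvidia-smi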

Usage

Common commands in ollama include:

  • ollama pull: Used to download pre-trained models from the ollama model library.
  • ollama run: Used to run a model and chat with it interactively in the terminal (downloading it first if necessary).
  • ollama list: Used to list the models that have already been downloaded.
  • ollama serve: Used to start the ollama service (the Docker image does this automatically).
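
Because ollama runs inside a Docker container here, these commands can also be invoked through docker exec without opening a shell first. A minimal sketch (the container name matches the compose file above):

    # List the models that have already been downloaded
    docker exec -it ollama ollama list

    # Download a model and chat with it interactively in the terminal
    docker exec -it ollama ollama run llama3.1
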
  1. We can also enter the ollama container by executing the docker exec -it ollama bash command and then run these commands inside it.

  2. First, we need to download a pre-trained model; for example, the llama3.1 model:

    ollama pull llama3.1

    Of course, there are many other models; you can browse them in the ollama model library.

  3. Then we test whether we can use this model:

    curl http://127.0.0.1:11434/api/generate -d '{
        "model": "llama3.1",
        "prompt": "Why is the sky blue?",
        "options": {
            "num_ctx": 4096
        }
    }'

    We send an HTTP POST request to ollama’s RESTful API to ask llama3.1 why the sky is blue. If the deployment is working, ollama streams back a series of JSON responses containing the generated text (a non-streaming variant is sketched below).
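
By default, the /api/generate endpoint streams its output as a series of JSON objects, one per generated chunk. For quick manual testing it can be more convenient to ask for a single response by setting the API’s stream field to false; a minimal sketch, assuming python3 is available on the host for pretty-printing:

    curl -s http://127.0.0.1:11434/api/generate -d '{
        "model": "llama3.1",
        "prompt": "Why is the sky blue?",
        "stream": false
    }' | python3 -m json.tool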

LobeChat

LobeChat is a web-based user interaction interface that can be integrated with ollama. Users can enter text in LobeChat, and ollama will generate a reply. LobeChat provides a simple interface that users can use in a browser.

LobeChat also supports voice synthesis, image recognition, multimodal, plugins, and other functions, allowing users to interact with ollama in multiple ways.

Installation

LobeChat is a Node.js-based application that can run on various operating systems. We can use Docker containers to deploy LobeChat.

  1. Create a subdirectory lobe-chat under ~/docker to store LobeChat-related files.

  2. Create a docker-compose.yml file in the ~/docker/lobe-chat directory to define the LobeChat configuration.

    services:
      lobe-chat:
        container_name: lobe-chat
        # The official image is published on Docker Hub as lobehub/lobe-chat
        image: lobehub/lobe-chat
        environment:
          # LobeChat's documented variable for the ollama address is OLLAMA_PROXY_URL.
          # Inside the container, 127.0.0.1 refers to the container itself, so if requests
          # are proxied server-side, the host's IP address may be needed here instead.
          - OLLAMA_PROXY_URL=http://127.0.0.1:11434
        ports:
          # LobeChat listens on port 3210 inside the container by default
          - "3000:3210"
        restart: always

  3. Run the docker-compose up -d command to start the LobeChat container.
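
After the container starts, it is worth checking that LobeChat is actually reachable before opening it in the browser. A minimal sketch:

    # Watch the startup logs of the LobeChat container
    docker logs lobe-chat

    # Check that the web interface answers on the published port
    curl -I http://localhost:3000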

Usage

After completing the above steps, we can open the LobeChat interface in a browser at http://localhost:3000 (opening this page for the first time may take a few seconds while it completes initialization):

LobeChat Init

Then, we can use LobeChat to interact with ollama, just like using ChatGPT. Note that we need to select a model that has already been downloaded in ollama: for example, since we downloaded the Llama3.1 8B model, we choose the Llama3.1 model in LobeChat.

LobeChat

Configure Other Models

LobeChat also supports many other models, such as OpenAI’s GPT models, Google’s Gemini, Tongyi Qianwen, etc. You can click the avatar in the upper left corner, select “Settings” in the menu to open the settings page, and then select the “Language Model” tab to see all supported providers. There you can enable or disable each provider and enter the API key it requires.

LobeChat Settings

Remote Access

If you want to use LobeChat remotely, you can set up a reverse proxy with Nginx according to the method introduced in the article “Access Personal Website from the Public Network - Nginx Reverse Proxy Configuration”, bind it to your domain name, and then access LobeChat through that domain.
