Running LLM models

May 20, 2025 · 5 min read

We can run open-source large language models on our own computers.

Ollama

Install Ollama.

Run Ollama

# List the available models
ollama list
NAME               ID              SIZE      MODIFIED    
gemma3:27b         a418f5838eaf    17 GB     4 weeks ago
llama3.3:latest    a6eb4748fd29    42 GB     4 weeks ago
llama3:latest      365c0bd3c000    4.7 GB    4 weeks ago
llama3:8b          365c0bd3c000    4.7 GB    4 weeks ago

# Run a model
ollama run llama3:latest

# help
ollama
Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Once a model is running in the terminal, we can ask it questions and get answers right there.

$ ollama run llama3:8b
>>> what is capital of Vietnam
The capital of Vietnam is Hanoi ( Vietnamese: Hà Nội).

>>> tell a recap of 2 sentences about Vietnam
Vietnam is a country located in Southeast Asia, known for its rich cultural heritage, stunning natural beauty, and complex history that includes the ancient Champa and Khmer kingdoms, French colonialism, and the Vietnam War. Today, Vietnam is a thriving economy and popular tourist destination, with iconic cities like Hanoi and Ho Chi Minh City, beautiful beaches, and vibrant cities filled with street food, markets, and temples.

>>> Send a message (/? for help)

info

My machine has an NVIDIA GeForce RTX 3060 GPU and 16 GB of RAM.

Each time the model runs to produce an answer, GPU utilization usually spikes. The 8-billion-parameter model llama3:8b (4.7 GB) runs fairly smoothly on this machine, but larger models consume more resources and answer more slowly.

A model can even fail right at startup because there are not enough resources for it, as happened when I ran llama3.3:latest (42 GB).
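As a rough pre-check before starting a large model, the sketch below lists the installed models through Ollama's `/api/tags` endpoint (the HTTP counterpart of `ollama list`) and keeps only those whose size is within a memory budget. The helper names `fitsBudget` and `listFittingModels` and the budget value are hypothetical, and the on-disk size is only an approximation of what a model needs at runtime.

```typescript
// Subset of one entry returned by GET /api/tags (size is in bytes)
interface InstalledModel {
  name: string;
  size: number;
}

// Hypothetical helper: keep only the models whose size is within the budget
function fitsBudget(models: InstalledModel[], budgetBytes: number): InstalledModel[] {
  return models.filter(m => m.size <= budgetBytes)
}

// Hypothetical helper: fetch the installed models from a local Ollama server and filter them
async function listFittingModels(budgetBytes: number, baseUrl = 'http://localhost:11434'): Promise<InstalledModel[]> {
  const response = await fetch(`${baseUrl}/api/tags`)
  if (!response.ok) {
    throw new Error(`HTTP error! Status: ${response.status}`)
  }
  const data = (await response.json()) as { models: InstalledModel[] }
  return fitsBudget(data.models, budgetBytes)
}
```

With a 16 GB budget, for example, `fitsBudget` would keep llama3:8b and drop llama3.3:latest from the list shown earlier.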

API

Because answers from an LLM (large language model) are usually formatted as Markdown, it is more convenient to install a frontend application that talks to the locally running model and renders its answers.

API Endpoint: http://localhost:11434/api/chat.

Ollama provides two question-answering APIs:

Chat: /api/chat
Generate: /api/generate

Generate

Each request contains a single prompt and an option that controls whether the answer is streamed.

{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": true
}

{"response":"The sky is blue because...","done":false}
{"response":" Rayleigh scattering...","done":false}
...
{"done":true}
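The streamed reply above is newline-delimited JSON, so the full answer is the concatenation of the `response` fields. A minimal sketch of that, assuming the request body and endpoint shown in this post; the helper names `collectGenerateText` and `generate` are hypothetical:

```typescript
// Shape of one streamed line from /api/generate (only the fields used here)
interface GenerateChunk {
  response?: string;
  done: boolean;
}

// Hypothetical helper: join the "response" pieces of a newline-delimited JSON body
function collectGenerateText(ndjson: string): string {
  let text = ''
  for (const line of ndjson.split('\n')) {
    if (line.trim() === '') continue
    const chunk = JSON.parse(line) as GenerateChunk
    text += chunk.response ?? ''
    if (chunk.done) break
  }
  return text
}

// Hypothetical helper: send one prompt and return the full answer
async function generate(prompt: string, url = 'http://localhost:11434/api/generate'): Promise<string> {
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'llama3', prompt, stream: true })
  })
  if (!response.ok) {
    throw new Error(`HTTP error! Status: ${response.status}`)
  }
  // For brevity the whole body is read at once; a UI would read it chunk by chunk instead
  return collectGenerateText(await response.text())
}
```

Reading the body in one go gives up the benefit of streaming; it is done here only to keep the parsing step easy to follow.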

Chat

Each request contains multiple messages as an array. The last message is the user's question to be answered. The preceding messages are the history of questions and answers between the user and the AI, which gives the AI the context of the conversation so that it can produce a fitting answer that continues the discussion.

{
  "model": "llama3",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What's the capital of France?" }
  ],
  "stream": true
}
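To make the role of the history concrete, here is a small sketch of how a client might carry the conversation from one request to the next. The helper names `buildChatRequest` and `recordTurn` are hypothetical; the `assistant` role is the one Ollama uses for the model's own replies.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical helper: build the next request body from the history plus a new question
function buildChatRequest(history: ChatMessage[], question: string, model = 'llama3') {
  return {
    model,
    messages: [...history, { role: 'user' as const, content: question }],
    stream: true
  }
}

// Hypothetical helper: record a finished exchange so the next request carries it as context
function recordTurn(history: ChatMessage[], question: string, answer: string): ChatMessage[] {
  return [
    ...history,
    { role: 'user', content: question },
    { role: 'assistant', content: answer }
  ]
}
```

Both helpers return new arrays instead of mutating the history, which tends to fit well with UI frameworks that track state by reference.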

Frontend

TypeScript code that sends a request to an LLM served by Ollama and receives the answer as a stream.

export interface Message {
  // 'assistant' marks earlier model replies when the chat history is sent back
  role: 'user' | 'system' | 'assistant';
  content: string;
}

export async function streamChatMessage(messages: Message[], msgHandler: (msg: string) => void, url: string): Promise<string> {
  const payload = {
    model: 'llama3',
    messages,
    stream: true
  }

  try {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(payload)
    })

    if (!response.ok || !response.body) {
      throw new Error(`HTTP error! Status: ${response.status}`)
    }

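    // Get the readable stream from the response body
    const reader = response.body.getReader()
    const decoder = new TextDecoder('utf-8')
    let result = ''
    let buffer = '' // Holds a trailing partial line until the rest of it arrives

    while (true) {
      const { done, value } = await reader.read()
      if (done) {
        console.log('Stream complete')
        break
      }

      // Decode the chunk; { stream: true } keeps multi-byte characters split across chunks intact
      buffer += decoder.decode(value, { stream: true })
      // Each complete line is one JSON object; keep the last (possibly incomplete) piece in the buffer
      const parts = buffer.split('\n')
      buffer = parts.pop() ?? ''
      const lines = parts.filter(line => line.trim() !== '')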

      // Process each line as a JSON object
      for (const line of lines) {
        try {
          const parsed = JSON.parse(line)
          if (parsed.message && parsed.message.content) {
            // Append the content chunk to the result
            result += parsed.message.content
            // Update UI or log in real-time
            // console.log(parsed.message.content) // Or append to a DOM element
            msgHandler(result) // Call the message handler with the text accumulated so far
          }
          if (parsed.done) {
            console.log('Response fully received:', result)
          }
        } catch (e) {
          console.error('Error parsing JSON chunk:', e)
        }
      }
    }

    return result // Final accumulated response
  } catch (error) {
    console.error('Error during streaming:', error)
  }

  return '' // Return empty string in case of error
}

Remember to configure CORS for Ollama (for example through the OLLAMA_ORIGINS environment variable) so that a web app can talk to the model.

Demo

Once Ollama is running with the CORS setup described above, you can interact with it directly through the demo app below.

Hi! Can you tell me a joke?

Why did the computer show up at work late? It had a hard drive!
