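Ollama: A Large Model Management Platform

Introduction to Ollama, a Large Model Management Tool

About Ollama

Ollama is a large-model management platform built on llama.cpp. It provides model management, a command-line interface, and an HTTP API. Model downloads are served through Cloudflare's global CDN, so pulling models is usually fast from within mainland China as well.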


Client Installation

Mac: supports the Mac's built-in GPU
Windows

Docker Deployment

# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# NVIDIA GPU
# Requires the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation
docker run -d --runtime=nvidia --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# AMD GPU (for reference; uses the rocm image tag)
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
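
The three commands above share the same volume, port, and container name, and differ only in the GPU flags and the image tag. A minimal Python sketch that makes this explicit is shown below; the function name `docker_run_args` and the `gpu` labels are hypothetical helpers introduced here, not part of Ollama or Docker.

```python
def docker_run_args(gpu: str = "cpu") -> list[str]:
    """Build the `docker run` argument list for one of the three variants above."""
    args = ["docker", "run", "-d"]
    image = "ollama/ollama"
    if gpu == "nvidia":
        # Needs the NVIDIA Container Toolkit installed on the host.
        args += ["--runtime=nvidia", "--gpus=all"]
    elif gpu == "amd":
        # The ROCm image needs the /dev/kfd and /dev/dri devices passed through.
        args += ["--device", "/dev/kfd", "--device", "/dev/dri"]
        image = "ollama/ollama:rocm"
    elif gpu != "cpu":
        raise ValueError(f"unknown gpu type: {gpu!r}")
    # Shared by every variant: model volume, API port, container name.
    args += ["-v", "ollama:/root/.ollama", "-p", "11434:11434",
             "--name", "ollama", image]
    return args


print(" ".join(docker_run_args("amd")))
```

The returned list can be passed to `subprocess.run` on a host where Docker is installed.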

Directory Structure

$ find .ollama -maxdepth 5 | grep -v /blobs/
.ollama/models/blobs
.ollama/models/manifests/registry.ollama.ai/milkey/dmeta-embedding-zh
.ollama/models/manifests/registry.ollama.ai/library/starcoder2
.ollama/models/manifests/registry.ollama.ai/library/qwen2.5
.ollama/models/manifests/registry.ollama.ai/library/llama3.1
.ollama/models/manifests/registry.ollama.ai/library/mxbai-embed-large
.ollama/models/manifests/registry.ollama.ai/library/llama3.2
.ollama/models/manifests/registry.ollama.ai/library/llava
.ollama/models/manifests/hf.co/second-state/gte-Qwen2-1.5B-instruct-GGUF
.ollama/logs/server.log
.ollama/history

Ollama Official Model Library

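Introduction to Quantized Models

The GGUF Format

GGUF is a binary format optimized for fast loading and saving of models, which makes it efficient for inference. Models developed in frameworks such as PyTorch can be converted to GGUF. GGUF brings the existing quantization formats together in one container, aiming for a more general and unified approach to quantization and inference so that a model runs compatibly across different hardware and inference frameworks.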

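Model Quantization

Quantization converts a model's parameters from a high-precision format (such as 32-bit floating point) to a low-precision format (such as 8-bit integers). This significantly reduces the model's memory footprint and compute requirements, and so speeds up inference without significantly degrading model quality.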
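
To make the memory saving concrete, the weights alone take roughly (number of parameters × bits per weight ÷ 8) bytes. The sketch below applies that arithmetic to a hypothetical 7B-parameter model; it is a lower bound only, since real quantized files also store scales and metadata, and inference needs extra memory for activations and the KV cache.

```python
def estimate_weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return num_params * bits_per_weight / 8


params = 7_000_000_000  # hypothetical 7B-parameter model
for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    gib = estimate_weight_bytes(params, bits) / 2**30
    print(f"{label:>5}: {gib:.1f} GiB")
```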


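Naming and Creating Quantized Models

Q: followed by the number of bits per weight (Q4 is 4-bit); it sets the quantization precision and how much the model is compressed.
K: marks the k-quant family of schemes in llama.cpp, which quantize weights block by block and trade precision against size.
M: the "medium" variant of a k-quant mix; it keeps more tensors at higher precision than the S variant.
S: the "small" variant of a k-quant mix; it gives the smallest file in the family at the lowest precision.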
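
Tags such as Q4_K_M can be split mechanically into the bit width, the marker that follows it, and the trailing letter. The following is a minimal sketch under that assumption; the function name and the dictionary keys are hypothetical, and the pattern only covers tags shaped like Q4_K_M, Q6_K, or Q8_0, not every quantization type that exists.

```python
import re

# Q<bits>_<K or digit>[_<S|M|L>]
_QUANT_RE = re.compile(r"^Q(\d+)_(K|\d+)(?:_([SML]))?$")


def parse_quant_tag(tag: str) -> dict:
    """Split a quantization tag into its bit width, scheme marker, and suffix."""
    m = _QUANT_RE.match(tag.upper())
    if m is None:
        raise ValueError(f"unrecognized quantization tag: {tag!r}")
    bits, scheme, suffix = m.groups()
    return {"bits": int(bits), "scheme": scheme, "suffix": suffix}


print(parse_quant_tag("Q4_K_M"))
```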


Pulling Models with Ollama

# Official Ollama library
ollama pull llama3.2
# Use a model directly from Hugging Face
ollama pull hf.co/Qwen/Qwen2-7B-Instruct-GGUF:Q4_K_M

# Use a model from the ModelScope community
ollama pull modelscope.cn/Qwen/Qwen2.5-3B-Instruct-GGUF
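
The model references used with `ollama pull` line up with the manifest directories in the listing shown earlier: a bare name lands under `registry.ollama.ai/library/`, while a reference with a host prefix such as `hf.co/` gets its own top-level directory. The sketch below reproduces that mapping. It is inferred from the directory listing rather than taken from Ollama's source; the function names are hypothetical, the `latest` default tag is an assumption, and registries that include a port number are not handled.

```python
DEFAULT_REGISTRY = "registry.ollama.ai"
DEFAULT_NAMESPACE = "library"
DEFAULT_TAG = "latest"  # assumed default when no tag is given


def parse_model_ref(ref: str) -> dict:
    """Split a model reference into registry, namespace, name, and tag."""
    path, _, tag = ref.partition(":")
    parts = path.split("/")
    if len(parts) == 1:      # e.g. llama3.2
        registry, namespace, name = DEFAULT_REGISTRY, DEFAULT_NAMESPACE, parts[0]
    elif len(parts) == 2:    # e.g. milkey/dmeta-embedding-zh
        registry = DEFAULT_REGISTRY
        namespace, name = parts
    elif len(parts) == 3:    # e.g. hf.co/Qwen/Qwen2-7B-Instruct-GGUF
        registry, namespace, name = parts
    else:
        raise ValueError(f"unsupported model reference: {ref!r}")
    return {"registry": registry, "namespace": namespace,
            "name": name, "tag": tag or DEFAULT_TAG}


def manifest_dir(ref: str) -> str:
    """Directory (relative to the home directory) that holds the manifests."""
    p = parse_model_ref(ref)
    return f".ollama/models/manifests/{p['registry']}/{p['namespace']}/{p['name']}"


print(manifest_dir("hf.co/Qwen/Qwen2-7B-Instruct-GGUF:Q4_K_M"))
```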

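Deploying DeepSeek Models

ollama run deepseek-r1

# 1.5B model
ollama run deepseek-r1:1.5b

# 14B model
ollama run deepseek-r1:14b

DeepSeek V3

DeepSeek V3 is very hard to deploy and run on a personal computer, so size your choice to your hardware. The DeepSeek R-series models (the deepseek-r1 tags above) are the recommended alternative.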

Deploying Qwen Models

# Defaults to the 7B model; choose a size as needed
ollama run qwen2.5

ollama run qwen2.5:0.5b

ollama run qwen2.5:14b

Running Models

Running from the Command Line

ollama run llama3.2
ollama run qwen2.5

Running the Ollama Server and Its Environment Variables

$ ollama serve --help
Start ollama

Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help   help for serve

Environment Variables:
      OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
      OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
      OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
      OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU
      OLLAMA_MAX_QUEUE           Maximum number of queued requests
      OLLAMA_MODELS              The path to the models directory
      OLLAMA_NUM_PARALLEL        Maximum number of parallel requests
      OLLAMA_NOPRUNE             Do not prune model blobs on startup
      OLLAMA_ORIGINS             A comma separated list of allowed origins
      OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs
      OLLAMA_TMPDIR              Location for temporary files
      OLLAMA_FLASH_ATTENTION     Enabled flash attention
      OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
      OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)
      OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")
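
The variables in the help output are read from the environment of the `ollama serve` process. A minimal sketch of preparing such an environment from Python follows; `build_serve_env` is a hypothetical helper and the values are illustrative only. Note that binding `OLLAMA_HOST` to `0.0.0.0` exposes the API on all network interfaces.

```python
import os


def build_serve_env(host="0.0.0.0:11434", keep_alive="30m",
                    models_dir=None, base_env=None):
    """Return a copy of the environment with Ollama server settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_HOST"] = host              # listen address of the server
    env["OLLAMA_KEEP_ALIVE"] = keep_alive  # how long models stay loaded
    if models_dir is not None:
        env["OLLAMA_MODELS"] = models_dir  # where model files are stored
    return env


# To launch the server with these settings (needs the ollama binary on PATH):
#   import subprocess
#   subprocess.Popen(["ollama", "serve"], env=build_serve_env())
print(build_serve_env(base_env={}))
```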

Ollama API

RESTful API
OpenAI-compatible

curl http://localhost:11434/api/tags

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt":"Why is the sky blue?"
}'


curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'

curl http://localhost:11434/api/embed -d '{
  "model": "mxbai-embed-large",
  "input": "Why is the sky blue?"
}'

curl http://localhost:11434/api/embeddings -d '{
  "model": "mxbai-embed-large",
  "prompt": "Here is an article about llamas..."
}'
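
The same endpoints can be called from Python with only the standard library. The sketch below targets `/api/generate`; it sets `"stream": false` so the server returns a single JSON object instead of a stream of JSON lines, which is the default behaviour of the curl calls above. The helper names are hypothetical, and `generate` needs a running Ollama server with the model already pulled.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running server):
#   print(generate("llama3.2", "Why is the sky blue?"))
```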

Ollama Client

# pip install ollama

import ollama
response = ollama.chat(model='llama3.1', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
])
print(response['message']['content'])
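
For long answers it is often nicer to receive the reply incrementally. The `ollama` package accepts `stream=True` on `chat`, in which case it yields chunks that each carry a fragment of the message. A minimal sketch, with hypothetical helper names, that joins those fragments back into one string:

```python
def collect_stream(chunks) -> str:
    """Concatenate the message fragments of a streamed chat response."""
    return "".join(chunk["message"]["content"] for chunk in chunks)


def chat_streaming(model: str, prompt: str) -> str:
    """Run a streamed chat and return the full reply (needs a running server)."""
    import ollama  # imported lazily so collect_stream works without the package

    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)


# Example (requires a running server):
#   print(chat_streaming("llama3.1", "Why is the sky blue?"))
```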

Ollama Langchain

#pip install -qU langchain-ollama

from langchain_ollama import ChatOllama
from langchain_core.messages import AIMessage

llm = ChatOllama(
    model="llama3.1",
    temperature=0,
    # other params...
)

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
ai_msg = llm.invoke(messages)
print(ai_msg.content)

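OpenAI-Compatible API

from openai import OpenAI

# Point the OpenAI client at the local Ollama server.
# Ollama ignores the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

messages = [{"role": "user", "content": "Why is the sky blue?"}]

# To enable tool calling, also pass tools=[...] with function schemas.
response = client.chat.completions.create(
    model="llama3.1",
    messages=messages,
)
print(response.choices[0].message.content)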
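
The `tools` parameter of `chat.completions.create` takes a list of function descriptions in the OpenAI function-calling format, each with a JSON Schema for its arguments. A minimal sketch of building one is given below; the `function_tool` helper and the `get_weather` function it describes are hypothetical.

```python
def function_tool(name: str, description: str,
                  properties: dict, required: list) -> dict:
    """Build one entry for the `tools` list in the OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


tools = [
    function_tool(
        "get_weather",  # hypothetical function name
        "Look up the current weather for a city.",
        {"city": {"type": "string", "description": "City name"}},
        ["city"],
    )
]
print(tools[0]["function"]["name"])
```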

Ollama Model Categories

Searching Ollama Models by Category

Large language models
Embedding models
Vision models
Models with tool-calling support

Deploying Chinese Embedding Models

Common English and Chinese Embedding Models

# English embedding models
ollama pull mxbai-embed-large
ollama pull nomic-embed-text

# Chinese embedding models
ollama pull hf.co/second-state/gte-Qwen2-1.5B-instruct-GGUF:Q4_K_M
ollama pull hf.co/RichardErkhov/Alibaba-NLP_-_gte-Qwen2-7B-instruct-gguf:Q4_K_M
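
Embedding models return one vector per input text, and texts are usually compared by the cosine similarity of their vectors. The sketch below shows that comparison step in plain Python, using small made-up vectors in place of real embeddings; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs) -> list:
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Made-up 2-D vectors standing in for real embeddings.
print(rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```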

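Vision Recognition

>>> What is in this picture? /Users/seveniruby/霍格沃兹测试学院/banner/斯内普.jpg

Added image '/Users/seveniruby/霍格沃兹测试学院/banner/斯内普.jpg'
 In this picture, two young people are sitting together at a table. They appear to be writing or drawing something. Behind them, a teacher or professor is guiding their work. Their positions and clothing suggest this may be a school or classroom. Books, notebooks, and other teaching materials are visible in the surroundings, adding to the atmosphere of learning.


This picture comes from a film or TV series, because next to them is an actor resembling the magic teacher (Dumbledore) from Harry Potter. His clothing and build are characteristic of that character. Such a scene is usually associated with education, learning, and the passing on of knowledge, and may also have a humorous or fantasy theme.

>>> Give the position coordinates of each person in JSON format; each person has the attributes description, x, y, width, height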
 ```json
[
  {
    "description": "Teacher standing behind students",
    "x": 0.231,
    "y": 0.578,
    "width": 0.492,
    "height": 0.739
  },
  {
    "description": "Student sitting at desk with hand on head",
    "x": 0.469,
    "y": 0.582,
    "width": 0.613,
    "height": 0.752
  },
  {
    "description": "Student standing next to teacher",
    "x": 0.459,
    "y": 0.582,
    "width": 0.620,
    "height": 0.753
  }
]
```

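Note that these coordinates are based only on the positions of the people in the picture and may contain errors. Also note that because the picture is a two-dimensional space, the positions of objects in three-dimensional space cannot be determined, so only two-dimensional coordinates are provided.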
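
The interactive session above attaches the image by file path. Over the REST API, a request to `/api/generate` carries images as base64 strings in an `images` array instead. A minimal sketch of building such a payload follows; the model name `llava` is an assumption (the session does not say which vision model was used), `build_vision_payload` is a hypothetical helper, and the bytes are a stand-in for a real image file.

```python
import base64
import json


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/generate payload with one attached image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


# Stand-in bytes; in practice use open(path, "rb").read().
payload = build_vision_payload("llava", "What is in this picture?",
                               b"\x89PNG fake image bytes")
print(json.dumps(payload)[:60])
```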

[Image: Snape]

Models That Support Tool Calling

llama3.2 llama3.1
qwen2.5 qwen2
mistral

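import json

from langchain_ollama import ChatOllama


# Tool function; its docstring ("look up the weather") is passed to the model
# as the tool description. Canned replies stand in for a real weather service.
def weather(city: str):
    """
    查询天气
    """
    if city == '北京':  # Beijing
        return '北京晴朗'  # "Beijing is sunny"
    elif city == '上海':  # Shanghai
        return '上海多云'  # "Shanghai is cloudy"
    else:
        return f'不知道 {city}.'  # "Don't know {city}."


def test_agent():
    from langgraph.prebuilt import create_react_agent

    llm = ChatOllama(model='qwen2.5', temperature=0)
    tools = [weather]
    langgraph_agent_executor = create_react_agent(llm, tools)

    query = '北京天气如何'  # "What is the weather like in Beijing?"

    messages = langgraph_agent_executor.invoke({"messages": [("human", query)]})
    # Dump every message (human, AI tool call, tool result, final answer) as JSON.
    print(json.dumps([message.model_dump() for message in messages['messages']], indent=2, ensure_ascii=False))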

# Output
[
  {
    "content": "北京天气如何",
    "additional_kwargs": {},
    "response_metadata": {},
    "type": "human",
    "name": null,
    "id": "0d312e83-dd8a-4638-af7d-c8e47585e380",
    "example": false
  },
  {
    "content": "",
    "additional_kwargs": {},
    "response_metadata": {
      "model": "qwen2.5",
      "created_at": "2024-10-30T10:35:51.32219Z",
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "function": {
              "name": "weather",
              "arguments": {
                "city": "北京"
              }
            }
          }
        ]
      },
      "done_reason": "stop",
      "done": true,
      "total_duration": 660820709,
      "load_duration": 33652000,
      "prompt_eval_count": 151,
      "prompt_eval_duration": 216216000,
      "eval_count": 19,
      "eval_duration": 408378000
    },
    "type": "ai",
    "name": null,
    "id": "run-44766d01-3e92-4688-9e3c-3a7557338bce-0",
    "example": false,
    "tool_calls": [
      {
        "name": "weather",
        "args": {
          "city": "北京"
        },
        "id": "99331cef-b8f5-481a-9e10-1ea5e5c0b4f2",
        "type": "tool_call"
      }
    ],
    "invalid_tool_calls": [],
    "usage_metadata": {
      "input_tokens": 151,
      "output_tokens": 19,
      "total_tokens": 170
    }
  },
  {
    "content": "北京晴朗",
    "additional_kwargs": {},
    "response_metadata": {},
    "type": "tool",
    "name": "weather",
    "id": "b74bfc72-8dd1-4541-a653-da6e8e299725",
    "tool_call_id": "99331cef-b8f5-481a-9e10-1ea5e5c0b4f2",
    "artifact": null,
    "status": "success"
  },
  {
    "content": "北京现在是晴朗的天气。请注意防晒哦！",
    "additional_kwargs": {},
    "response_metadata": {
      "model": "qwen2.5",
      "created_at": "2024-10-30T10:35:51.727177Z",
      "message": {
        "role": "assistant",
        "content": "北京现在是晴朗的天气。请注意防晒哦！"
      },
      "done_reason": "stop",
      "done": true,
      "total_duration": 399748125,
      "load_duration": 11164250,
      "prompt_eval_count": 190,
      "prompt_eval_duration": 112236000,
      "eval_count": 13,
      "eval_duration": 270630000
    },
    "type": "ai",
    "name": null,
    "id": "run-bafa67a6-9a7d-4fa5-b240-861e9b39e99b-0",
    "example": false,
    "tool_calls": [],
    "invalid_tool_calls": [],
    "usage_metadata": {
      "input_tokens": 190,
      "output_tokens": 13,
      "total_tokens": 203
    }
  }
]
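
The output shows the agent loop in four messages: the human question, an AI message whose `tool_calls` names `weather` with its arguments, a `tool` message with the function's result, and the final AI answer. The step between the second and third message can be sketched by hand as follows. This is a simplified illustration, not how LangGraph implements it; `run_tool_calls` and the `TOOLS` registry are hypothetical names.

```python
def weather(city: str) -> str:
    """Canned stand-in for the weather tool used in the agent example."""
    return {"北京": "北京晴朗", "上海": "上海多云"}.get(city, f"不知道 {city}.")


# Registry mapping tool names (as the model emits them) to Python callables.
TOOLS = {"weather": weather}


def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested tool and build the matching tool messages."""
    results = []
    for call in tool_calls:
        func = TOOLS[call["name"]]
        results.append({
            "type": "tool",
            "name": call["name"],
            "tool_call_id": call["id"],       # links the result to the request
            "content": func(**call["args"]),  # call the tool with the model's arguments
        })
    return results


# The tool call copied from the AI message in the output.
calls = [{"name": "weather", "args": {"city": "北京"},
          "id": "99331cef-b8f5-481a-9e10-1ea5e5c0b4f2", "type": "tool_call"}]
print(run_tool_calls(calls))
```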

Configuring an Agent in Dify

llama3.1:8b
google search tool