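Jieba

The jieba tokenizer processes Chinese text by breaking it down into its component words.

The jieba tokenizer preserves punctuation marks as separate tokens in the output. For example, "你好！世界。" becomes ["你好", "！", "世界", "。"]. To remove these standalone punctuation tokens, use the removepunct filter.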
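
The effect of dropping standalone punctuation tokens can be illustrated locally. The following is a minimal sketch in plain Python, not Milvus's implementation: `is_punctuation` and `drop_punctuation_tokens` are hypothetical helpers that use Unicode character categories to emulate what a punctuation-removing filter does to the token list from the example.

```python
import unicodedata

def is_punctuation(token: str) -> bool:
    """Return True when every character in the token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

def drop_punctuation_tokens(tokens):
    """Hypothetical helper: emulate a punctuation-removing filter locally."""
    return [t for t in tokens if not is_punctuation(t)]

# Token list taken from the example in the text.
tokens = ["你好", "！", "世界", "。"]
print(drop_punctuation_tokens(tokens))
# → ['你好', '世界']
```

In Milvus the filtering runs as part of the analyzer itself; this sketch only shows the intended effect on the token stream.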

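Configuration

Milvus supports two configuration approaches for the jieba tokenizer: a simple configuration and a custom configuration.

Simple configuration

With the simple configuration, you only need to set the tokenizer to "jieba". For example: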

Python Java NodeJS Go cURL

# Simple configuration: only specifying the tokenizer name
analyzer_params = {
    "tokenizer": "jieba",  # Use the default settings: dict=["_default_"], mode="search", hmm=True
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");

const analyzer_params = {
    "tokenizer": "jieba",
};

analyzerParams := map[string]any{"tokenizer": "jieba"}

# restful
analyzerParams='{
  "tokenizer": "jieba"
}'

This simple configuration is equivalent to the following custom configuration:

Python Java NodeJS Go cURL

# Custom configuration equivalent to the simple configuration above
analyzer_params = {
    "tokenizer": {
        "type": "jieba",          # Tokenizer type, fixed as "jieba"
        "dict": ["_default_"],    # Use the default dictionary
        "mode": "search",         # Use search mode for improved recall (see mode details below)
        "hmm": True               # Enable HMM for probabilistic segmentation
    }
}

Map<String, Object> tokenizerParams = new HashMap<>();
tokenizerParams.put("type", "jieba");
tokenizerParams.put("dict", Collections.singletonList("_default_"));
tokenizerParams.put("mode", "search");
tokenizerParams.put("hmm", true);

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", tokenizerParams);

// javascript

analyzerParams := map[string]any{
    "tokenizer": map[string]any{"type": "jieba", "dict": []any{"_default_"}, "mode": "search", "hmm": true},
}

# restful

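For details on the parameters, refer to Custom configuration.

Custom configuration

For more control, you can provide a custom configuration that lets you specify custom dictionaries, select the segmentation mode, and enable or disable the Hidden Markov Model (HMM). For example: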

Python Java NodeJS Go cURL

# Custom configuration with user-defined settings
analyzer_params = {
    "tokenizer": {
        "type": "jieba",           # Fixed tokenizer type
        "dict": ["customDictionary"],  # Custom dictionary list; replace with your own terms
        "mode": "exact",           # Use exact mode (non-overlapping tokens)
        "hmm": False               # Disable HMM; unmatched text will be split into individual characters
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
  put("type", "jieba");
  put("dict", Arrays.asList("customDictionary"));
  put("mode", "exact");
  put("hmm", false);
}});

// javascript

analyzerParams := map[string]interface{}{
  "tokenizer": map[string]interface{}{
      "type": "jieba",
      "dict": []string{"customDictionary"},
      "mode": "exact",
      "hmm":  false,
  },
}

# restful

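Parameter	Description	Default Value
`type`	The tokenizer type. Fixed to `"jieba"`.	`"jieba"`
`dict`	A list of dictionaries that the analyzer loads as its vocabulary source. Built-in options: `"_default_"`: loads the engine's built-in Simplified Chinese dictionary. For details, refer to dict.txt. `"_extend_default_"`: loads everything in `"_default_"` plus an additional Traditional Chinese supplement. For details, refer to dict.txt.big. You can also mix a built-in dictionary with any number of custom dictionaries. Example: `["_default_", "结巴分词器"]`.	`["_default_"]`
`mode`	The segmentation mode. Possible values: `"exact"`: tries to segment the sentence in the most precise manner, which makes it ideal for text analysis. `"search"`: builds on exact mode by further breaking down long words to improve recall, which makes it suitable for search engine tokenization. For more information, refer to the Jieba GitHub project.	`"search"`
`hmm`	A Boolean flag indicating whether to enable the Hidden Markov Model (HMM) for probabilistic segmentation of words not found in the dictionary.	`true`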
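
The difference between the two modes can be sketched in a few lines. The following toy model assumes the behavior of the reference Python jieba library's search mode, in which each word produced by exact mode is additionally scanned for shorter two- and three-character dictionary words. The function name `search_mode_tokens` and the tiny vocabulary are hypothetical, and Milvus's internal implementation may differ in detail.

```python
def search_mode_tokens(exact_tokens, vocabulary):
    """Toy model of search mode: emit shorter dictionary words found inside
    each exact-mode token, then the token itself."""
    result = []
    for word in exact_tokens:
        for size in (2, 3):
            # Only look for sub-words strictly shorter than the word.
            if len(word) > size:
                for start in range(len(word) - size + 1):
                    gram = word[start:start + size]
                    if gram in vocabulary:
                        result.append(gram)
        result.append(word)
    return result

# Hypothetical vocabulary; a real dictionary holds hundreds of thousands of entries.
vocabulary = {"向量", "数据", "数据库", "向量数据库"}
exact_tokens = ["向量数据库"]  # what exact mode would emit for this input
print(search_mode_tokens(exact_tokens, vocabulary))
# → ['向量', '数据', '数据库', '向量数据库']
```

The extra overlapping tokens are what improve recall: a query for 数据库 can match a document that only contains the longer word.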

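To load a large custom vocabulary from an external file instead of inlining it through dict, refer to Custom configuration with a dictionary file below.

After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. This lets Milvus process the text in that field with the specified analyzer for efficient tokenization and filtering. For details, refer to Examples.

Custom configuration with a dictionary file (Compatible with Milvus 3.0.x)

For large custom vocabularies (domain vocabulary, product terms, or lists of proper nouns), store the words in a file, register that file as a remote file resource, and then reference it from the tokenizer through the extra_dict_file parameter. The analyzer loads these words into its vocabulary alongside the built-in dictionary.

The file is plain UTF-8 text with one word per line. For example: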

结巴分词器
向量数据库
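
Such a file can be produced with a few lines of Python. This is a minimal sketch: the helper name `write_dictionary_file` is hypothetical, and it writes to a temporary directory rather than to object storage, so uploading remains a separate step.

```python
import tempfile
from pathlib import Path

def write_dictionary_file(path, terms):
    """Write one term per line as UTF-8, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for term in terms:
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            lines.append(term)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

# Write to a throwaway directory; the file name mirrors the example in this section.
target = Path(tempfile.mkdtemp()) / "zh_terms.txt"
written = write_dictionary_file(target, ["结巴分词器", "向量数据库", "", "向量数据库"])
print(written)
# → ['结巴分词器', '向量数据库']
```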

Upload the file to the object storage that your Milvus cluster is configured to use, then register it:

Python Java NodeJS Go cURL

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Register the uploaded file under a name you'll reference from analyzer configs.
client.add_file_resource(
    name="zh_terms",
    path="file/zh_terms.txt",    # full S3 object key, including the rootPath prefix
)

// java

// nodejs

// go

# restful

Reference the registered resource from the tokenizer through extra_dict_file:

Python Java NodeJS Go cURL

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_"],             # keep the built-in dictionary
        "mode": "exact",
        "hmm": False,
        "extra_dict_file": {
            "type": "remote",
            "resource_name": "zh_terms",
            "file_name": "zh_terms.txt",
        },
    },
}

client.run_analyzer(["milvus结巴分词器中文测试"], analyzer_params)
# → [['milvus', '结巴', '分词器', '中文', '测试']]

// java

// nodejs

// go

# restful

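The extra_dict_file parameter accepts an object with the following fields:

Field	Description
`type`	The resource type. Use `"remote"` for files registered through `add_file_resource`. For the `"local"` variant used in self-hosted deployments, refer to Manage File Resources.
`resource_name`	The name under which the file was registered with `add_file_resource`.
`file_name`	The file name portion of the registered resource's object storage path (for example, `"zh_terms.txt"` if the resource was registered with `path="file/zh_terms.txt"`).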
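
Because `file_name` has to match the final component of the registered path, it can be derived rather than typed by hand. A minimal sketch, assuming the field names from the table above; `build_extra_dict_file` is a hypothetical helper and not part of pymilvus.

```python
import posixpath

def build_extra_dict_file(resource_name, object_path):
    """Build the extra_dict_file object for a remote file resource.

    file_name is taken from the last component of the object storage path.
    """
    file_name = posixpath.basename(object_path)
    if not file_name:
        raise ValueError(f"path has no file name component: {object_path!r}")
    return {
        "type": "remote",
        "resource_name": resource_name,
        "file_name": file_name,
    }

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_"],
        "extra_dict_file": build_extra_dict_file("zh_terms", "file/zh_terms.txt"),
    }
}
print(analyzer_params["tokenizer"]["extra_dict_file"])
# → {'type': 'remote', 'resource_name': 'zh_terms', 'file_name': 'zh_terms.txt'}
```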

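Words added through extra_dict_file are merged with the built-in dictionary, so jieba's segmentation algorithm weighs them alongside the existing entries. Whether a particular term emerges as a standalone token depends on jieba's probability-weighted DAG selection: if shorter terms occur more frequently in the built-in dictionary, a longer custom term such as 向量数据库 may still be split into 向量 + 数据库.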
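
This trade-off can be reproduced with a toy model. The sketch below is a simplified stand-in for probability-weighted segmentation, not Milvus's or jieba's actual code: it scores every way of covering the text with dictionary words by summing log relative frequencies and keeps the highest-scoring path. The function name `segment` and all frequency values are made up for illustration.

```python
import math

def segment(text, freq):
    """Pick the segmentation with the highest total log-probability."""
    log_total = math.log(sum(freq.values()))
    n = len(text)
    # best[i] = (best score for text[i:], end index of the first word on that path)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Dictionary words are candidates; single characters are always allowed.
            if word in freq or j == i + 1:
                score = math.log(freq.get(word, 1)) - log_total + best[j][0]
                candidates.append((score, j))
        best[i] = max(candidates)
    tokens, i = [], 0
    while i < n:
        j = best[i][1]
        tokens.append(text[i:j])
        i = j
    return tokens

# Hypothetical frequencies: the custom term is rare, its parts are common.
rare_custom = {"向量": 5000, "数据库": 5000, "向量数据库": 1}
print(segment("向量数据库", rare_custom))
# → ['向量', '数据库']

# With a comparable frequency, the longer term wins.
common_custom = {"向量": 5000, "数据库": 5000, "向量数据库": 5000}
print(segment("向量数据库", common_custom))
# → ['向量数据库']
```

In this model a single rare word loses to two common words because the product of two high probabilities exceeds one very low probability.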

Examples

Before applying an analyzer configuration to your collection schema, verify its behavior with the run_analyzer method.

Analyzer configuration

Python Java NodeJS Go cURL

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": False
    }
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", new HashMap<String, Object>() {{
  put("type", "jieba");
  put("dict", Arrays.asList("结巴分词器"));
  put("mode", "exact");
  put("hmm", false);
}});

// javascript

analyzerParams := map[string]interface{}{
  "tokenizer": map[string]interface{}{
      "type": "jieba",
      "dict": []string{"结巴分词器"},
      "mode": "exact",
      "hmm":  false,
  },
}

# restful

Verification using `run_analyzer`

Python Java NodeJS Go cURL

from pymilvus import (
    MilvusClient,
)

client = MilvusClient(
    uri="http://localhost:19530",
    token="root:Milvus"
)

# Sample text to analyze
sample_text = "milvus结巴分词器中文测试"

# Run the jieba analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print("Jieba analyzer output:", result)

import io.milvus.v2.client.ConnectConfig;
import io.milvus.v2.client.MilvusClientV2;
import io.milvus.v2.service.vector.request.RunAnalyzerReq;
import io.milvus.v2.service.vector.response.RunAnalyzerResp;

import java.util.ArrayList;
import java.util.List;

ConnectConfig config = ConnectConfig.builder()
        .uri("http://localhost:19530")
        .token("root:Milvus")
        .build();
MilvusClientV2 client = new MilvusClientV2(config);

List<String> texts = new ArrayList<>();
texts.add("milvus结巴分词器中文测试");

RunAnalyzerResp resp = client.runAnalyzer(RunAnalyzerReq.builder()
        .texts(texts)
        .analyzerParams(analyzerParams)
        .build());
List<RunAnalyzerResp.AnalyzerResult> results = resp.getResults();

// javascript

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/milvus-io/milvus/client/v2/milvusclient"
)

ctx := context.Background()

client, err := milvusclient.New(ctx, &milvusclient.ClientConfig{
    Address: "localhost:19530",
    APIKey:  "root:Milvus",
})
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

bs, _ := json.Marshal(analyzerParams)
texts := []string{"milvus结巴分词器中文测试"}
option := milvusclient.NewRunAnalyzerOption(texts).
    WithAnalyzerParams(string(bs))

result, err := client.RunAnalyzer(ctx, option)
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

# restful

Expected output

['milvus', '结巴分词器', '中', '文', '测', '试']
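
The shape of this output follows from the configuration: the dictionary holds a single term and HMM is disabled, so every Chinese character outside that term becomes its own token. The sketch below reproduces the result locally with greedy longest-match segmentation, which is a deliberately simpler technique than jieba's DAG-based selection. The function name `dictionary_only_segment` is hypothetical, and keeping ASCII letter and digit runs together is an assumption based on the `milvus` token in the output.

```python
import re

# Runs of ASCII letters and digits are kept together as one token.
ASCII_RUN = re.compile(r"[A-Za-z0-9]+")

def dictionary_only_segment(text, vocabulary):
    """Greedy longest-match segmentation with no HMM fallback."""
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        ascii_match = ASCII_RUN.match(text, i)
        if ascii_match:
            tokens.append(ascii_match.group())
            i = ascii_match.end()
            continue
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

print(dictionary_only_segment("milvus结巴分词器中文测试", {"结巴分词器"}))
# → ['milvus', '结巴分词器', '中', '文', '测', '试']
```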
