Tokenize Smarter, Retrieve Better: A Deep Dive into Milvus Analyzer for Full-Text Search
Modern AI applications are complex and rarely one-dimensional. In many cases, a single search method can't solve real-world problems on its own. Take a recommendation system, for example. It requires vector search to comprehend the meaning behind text or images, metadata filtering to refine results by price, category, or location, and full-text search to handle direct queries like "Nike Air Max." Each method solves a different part of the puzzle, and practical systems depend on all of them working together seamlessly.
Milvus excels at vector search and metadata filtering, and starting with version 2.5, it introduced full-text search based on the optimized BM25 algorithm. This upgrade makes AI search both smarter and more accurate, combining semantic understanding with precise keyword intent. With Milvus 2.6, full-text search becomes even faster, delivering up to 4× the performance of Elasticsearch.
At the heart of this capability is the Milvus Analyzer, the component that transforms raw text into searchable tokens. It's what enables Milvus to interpret language efficiently and perform keyword matching at scale. In the rest of this post, we'll dive into how the Milvus Analyzer works and why it's key to unlocking the full potential of hybrid search in Milvus.
What is the Milvus Analyzer?
To power efficient full-text search, whether for keyword matching or semantic retrieval, the first step is always the same: turning raw text into tokens that the system can understand, index, and compare.
The Milvus Analyzer handles this step. It's a built-in text preprocessing and tokenization component that breaks input text into discrete tokens, then normalizes, cleans, and standardizes them to ensure consistent matching across queries and documents. This process lays the foundation for accurate, high-performance full-text search and hybrid retrieval.
Here's an overview of the Milvus Analyzer architecture:
As the diagram shows, the Analyzer has two core components: the Tokenizer and the Filter. Together, they convert input text into tokens and optimize them for efficient indexing and retrieval.
Tokenizer: Splits text into basic tokens using methods like whitespace splitting (Whitespace), Chinese word segmentation (Jieba), or multilingual segmentation (ICU).
Filter: Processes tokens through specific transformations. Milvus includes a rich set of built-in filters for operations like case normalization (Lowercase), punctuation removal (Removepunct), stop word filtering (Stop), stemming (Stemmer), and pattern matching (Regex). You can chain multiple filters to handle complex processing needs.
Milvus offers several Analyzer types: three built-in options (Standard, English, and Chinese), Custom Analyzers where you define your own Tokenizer and Filter combinations, and the Multi-language Analyzer for handling multilingual documents. The processing flow is straightforward: Raw text → Tokenizer → Filter → Tokens.
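To make this flow concrete, here is a minimal sketch using the run_analyzer helper from PyMilvus (the same helper used in the hands-on section later in this post). It assumes a Milvus instance at localhost:19530 and simply pushes one sentence through a Tokenizer and a Filter:
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Raw text -> Tokenizer -> Filter -> Tokens
analyzer_params = {
    "tokenizer": "standard",   # Tokenizer: split on word boundaries
    "filter": ["lowercase"],   # Filter: normalize case
}
raw_text = "The Milvus Analyzer turns RAW text into searchable tokens!"
tokens = client.run_analyzer(texts=raw_text, analyzer_params=analyzer_params)
print(tokens)  # lowercase word tokens produced by the standard tokenizer + lowercase filter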
Tokenizer
The Tokenizer is the first processing step. It splits raw text into smaller tokens (words or subwords), and the right choice depends on your language and use case.
Milvus currently supports the following types of tokenizers:
Tokenizer | Use Case | Description |
---|---|---|
Standard | English and space-delimited languages | The most common general-purpose tokenizer; detects word boundaries and splits accordingly. |
Whitespace | Simple text with minimal preprocessing | Splits only by spaces; does not handle punctuation or casing. |
Jieba (Chinese) | Chinese text | Dictionary- and probability-based tokenizer that splits continuous Chinese characters into meaningful words. |
Lindera (JP/KR) | Japanese and Korean text | Uses Lindera morphological analysis for effective segmentation. |
ICU (Multi-language) | Complex languages like Arabic, and multilingual scenarios | Based on the ICU library with support for multilingual tokenization across Unicode. |
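A quick way to feel the difference between tokenizers is to run the same string through two of them. This is a hedged sketch, assuming the same local Milvus setup and run_analyzer helper shown elsewhere in this post:
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

text = "Nike Air Max, size 42!"

# Standard tokenizer: detects word boundaries (see the table above)
print(client.run_analyzer(texts=text, analyzer_params={"tokenizer": "standard"}))

# Whitespace tokenizer: splits only on spaces, so punctuation stays attached to tokens
print(client.run_analyzer(texts=text, analyzer_params={"tokenizer": "whitespace"}))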
You can configure the Tokenizer when creating your Collection's schema, specifically through the analyzer_params parameter when defining VARCHAR fields. In other words, the Tokenizer is not a standalone object but a field-level configuration. Milvus automatically performs tokenization and preprocessing when inserting data.
from pymilvus import FieldSchema, DataType

FieldSchema(
    name="text",
    dtype=DataType.VARCHAR,
    max_length=512,
    enable_analyzer=True,  # Enable text analysis so analyzer_params takes effect
    analyzer_params={
        "tokenizer": "standard"  # Configure Tokenizer here
    }
)
Filter
If the Tokenizer cuts text apart, the Filter refines what's left. Filters standardize, clean, or transform your tokens to make them search-ready.
Common Filter operations include normalizing case, removing stop words (like "the" and "and"), stripping punctuation, and applying stemming (reducing "running" to "run").
Milvus includes many built-in Filters for most language processing needs:
Filter Name | Function | Use Case |
---|---|---|
Lowercase | Converts all tokens to lowercase | Essential for English search to avoid case mismatches |
Asciifolding | Converts accented characters to ASCII | Multilingual scenarios (e.g., "café" → "cafe") |
Alphanumonly | Keeps only letters and numbers | Strips mixed symbols from text like logs |
Cncharonly | Keeps only Chinese characters | Chinese corpus cleaning |
Cnalphanumonly | Keeps only Chinese, English, and numbers | Mixed Chinese-English text |
Length | Filters tokens by length | Removes excessively short or long tokens |
Stop | Stop word filtering | Removes high-frequency, low-content words like "is" and "the" |
Decompounder | Splits compound words | Languages with frequent compounds like German and Dutch |
Stemmer | Word stemming | English scenarios (e.g., "studies" and "studying" → "study") |
Removepunct | Removes punctuation | General text cleaning |
Regex | Filters or replaces with a regex pattern | Custom needs, like extracting only email addresses |
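Filters that take parameters are written as dictionaries inside the filter list, while parameter-free filters are plain strings. Here is a hedged sketch that chains both forms with run_analyzer; the length filter's max key is an assumption based on its documented behavior of dropping overly long tokens, so adjust it if your version expects a different parameter name:
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",                    # plain string form: no parameters needed
        "alphanumonly",                 # keep only alphanumeric tokens
        {"type": "length", "max": 12},  # dict form: drop tokens longer than 12 characters ("max" assumed)
    ],
}
print(client.run_analyzer(
    texts="Retrying request 42 after AVeryVeryLongDiagnosticToken...",
    analyzer_params=analyzer_params,
))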
The power of Filters is in their flexibility: you can mix and match cleaning rules based on your needs. For English search, a typical combination is Lowercase + Stop + Stemmer, ensuring case uniformity, removing filler words, and normalizing word forms to their stems.
For Chinese search, you'll usually combine Cncharonly + Stop for cleaner, more precise results. Configure Filters the same way as Tokenizers, through analyzer_params in your FieldSchema:
from pymilvus import FieldSchema, DataType

FieldSchema(
    name="text",
    dtype=DataType.VARCHAR,
    max_length=512,
    enable_analyzer=True,  # Enable text analysis so analyzer_params takes effect
    analyzer_params={
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            {
                "type": "stop",  # Specifies the filter type as stop
                "stop_words": ["of", "to", "_english_"],  # Custom stop words plus the built-in English stop word list
            },
            {
                "type": "stemmer",  # Specifies the filter type as stemmer
                "language": "english"
            }
        ],
    }
)
Analyzer Types
The right Analyzer makes your search both faster and more cost-effective. To fit different needs, Milvus provides three types: Built-in, Multi-language, and Custom Analyzers.
Built-in Analyzer
Built-in Analyzers are ready to use out of the box: standard configurations that work for most common scenarios. They come with predefined Tokenizer and Filter combinations:
Name | Components (Tokenizer + Filters) | Use Case |
---|---|---|
Standard | Standard Tokenizer + Lowercase | General use for English or space-delimited languages |
English | Standard Tokenizer + Lowercase + Stop + Stemmer | English search with higher precision |
Chinese | Jieba Tokenizer + Cnalphanumonly | Chinese text search with natural word segmentation |
For straightforward English or Chinese search, these built-in Analyzers work without any extra setup.
One important note: the Standard Analyzer is designed for English by default. If applied to Chinese text, full-text search may return no results.
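Built-in Analyzers are selected with a single type key rather than spelling out the Tokenizer and Filters yourself. A minimal sketch, assuming the same field-level configuration pattern shown earlier:
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,
    analyzer_params={"type": "english"},  # built-in English Analyzer: standard tokenizer + lowercase + stop + stemmer
)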
Multi-language Analyzer
When you deal with multiple languages, a single tokenizer often can't handle everything. That's where the Multi-language Analyzer comes in: it automatically picks the right tokenizer based on each text's language. Here's how languages map to Tokenizers:
Language Code | Tokenizer Used |
---|---|
en | English Analyzer |
zh | Jieba |
ja / ko | Lindera |
ar | ICU |
If your dataset mixes English, Chinese, Japanese, Korean, and even Arabic, Milvus can handle them all in the same field. This cuts down dramatically on manual preprocessing.
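As a compact sketch of the configuration shape (the full runnable example appears in the hands-on section below), the mapping is expressed through multi_analyzer_params, where by_field names the column that carries each row's language code:
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="language", datatype=DataType.VARCHAR, max_length=50)

multi_analyzer_params = {
    "by_field": "language",               # pick the analyzer from this field's value
    "analyzers": {
        "en": {"type": "english"},        # built-in English analyzer
        "zh": {"type": "chinese"},        # built-in Chinese analyzer
        "default": {"tokenizer": "icu"},  # fallback for any other language code
    },
}

schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=2000,
    enable_analyzer=True,
    multi_analyzer_params=multi_analyzer_params,
)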
Custom Analyzer
When Built-in or Multi-language Analyzers don't quite fit, Milvus lets you build Custom Analyzers. Mix and match Tokenizers and Filters to create something tailored to your needs. Here's an example:
from pymilvus import FieldSchema, DataType

FieldSchema(
    name="text",
    dtype=DataType.VARCHAR,
    max_length=512,
    enable_analyzer=True,  # Enable text analysis so analyzer_params takes effect
    analyzer_params={
        "tokenizer": "jieba",
        "filter": ["cncharonly", "stop"]  # Custom combination: keep Chinese characters, then remove stop words
    }
)
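Before wiring a custom Analyzer into a collection, it can be handy to sanity-check it with run_analyzer. This sketch reuses the tokenizer and filter list from the snippet above; note that the bare "stop" entry is taken from that snippet, and if your Milvus version expects an explicit stop-word list, use the dict form shown in the Filter section:
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

custom_params = {
    "tokenizer": "jieba",              # Chinese word segmentation
    "filter": ["cncharonly", "stop"],  # keep Chinese characters, then drop stop words
}
# Mixed Chinese-English input: cncharonly should keep only the Chinese tokens
print(client.run_analyzer(texts="Milvus是一个开源的向量数据库", analyzer_params=custom_params))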
Hands-on Coding with Milvus Analyzer
Theory helps, but nothing beats a full code example. Let's walk through how to use Analyzers in Milvus with the Python SDK, covering both built-in Analyzers and multi-language Analyzers. These examples use Milvus v2.6.1 and PyMilvus v2.6.1.
How to Use Built-in Analyzer
Say you want to build a Collection for English text search that automatically handles tokenization and preprocessing during data insertion. We'll use the built-in English Analyzer (equivalent to standard + lowercase + stop + stemmer).
from pymilvus import MilvusClient, DataType, Function, FunctionType
client = MilvusClient(
uri="http://localhost:19530",
)
schema = client.create_schema()
schema.add_field(
field_name="id", # Field name
datatype=DataType.INT64, # Integer data type
is_primary=True, # Designate as primary key
auto_id=True # Auto-generate IDs (recommended)
)
schema.add_field(
field_name='text',
datatype=DataType.VARCHAR,
max_length=1000,
enable_analyzer=True,
analyzer_params={
"tokenizer": "standard",
"filter": [
"lowercase",
{
"type": "stop", # Specifies the filter type as stop
"stop_words": ["of", "to", "_english_"], # Defines custom stop words and includes the English stop word list
},
{
"type": "stemmer", # Specifies the filter type as stemmer
"language": "english"
}],
},
enable_match=True,
)
schema.add_field(
field_name="sparse", # Field name
datatype=DataType.SPARSE_FLOAT_VECTOR # Sparse vector data type
)
bm25_function = Function(
name="text_to_vector", # Descriptive function name
function_type=FunctionType.BM25, # Use BM25 algorithm
input_field_names=["text"], # Process text from this field
output_field_names=["sparse"] # Store vectors in this field
)
schema.add_function(bm25_function)
index_params = client.prepare_index_params()
index_params.add_index(
field_name="sparse", # Field to index (our vector field)
index_type="AUTOINDEX", # Let Milvus choose optimal index type
metric_type="BM25" # Must be BM25 for this feature
)
COLLECTION_NAME = "english_demo"
if client.has_collection(COLLECTION_NAME):
client.drop_collection(COLLECTION_NAME)
print(f"Dropped existing collection: {COLLECTION_NAME}")
client.create_collection(
collection_name=COLLECTION_NAME, # Collection name
schema=schema, # Our schema
index_params=index_params # Our search index configuration
)
print(f"Successfully created collection: {COLLECTION_NAME}")
# Prepare sample data
sample_texts = [
"The quick brown fox jumps over the lazy dog",
"Machine learning algorithms are revolutionizing artificial intelligence",
"Python programming language is widely used for data science projects",
"Natural language processing helps computers understand human languages",
"Deep learning models require large amounts of training data",
"Search engines use complex algorithms to rank web pages",
"Text analysis and information retrieval are important NLP tasks",
"Vector databases enable efficient similarity searches",
"Stemming reduces words to their root forms for better searching",
"Stop words like 'the', 'and', 'of' are often filtered out"
]
# Insert data
print("\nInserting data...")
data = [{"text": text} for text in sample_texts]
client.insert(
collection_name=COLLECTION_NAME,
data=data
)
print(f"Successfully inserted {len(sample_texts)} records")
# Demonstrate tokenizer effect
print("\n" + "="*60)
print("Tokenizer Analysis Demo")
print("="*60)
test_text = "The running dogs are jumping over the lazy cats"
print(f"\nOriginal text: '{test_text}'")
# Use run_analyzer to show tokenization results
analyzer_result = client.run_analyzer(
texts=test_text,
collection_name=COLLECTION_NAME,
field_name="text"
)
print(f"Tokenization result: {analyzer_result}")
print("\nBreakdown:")
print("- lowercase: Converts all letters to lowercase")
print("- stop words: Filtered out ['of', 'to'] and common English stop words")
print("- stemmer: Reduced words to stem form (running -> run, jumping -> jump)")
# Full-text search demo
print("\n" + "="*60)
print("Full-Text Search Demo")
print("="*60)
# Wait for indexing to complete
import time
time.sleep(2)
# Search query examples
search_queries = [
"jump", # Test stem matching (should match "jumps")
"algorithm", # Test exact matching
"python program", # Test multi-word query
"learn" # Test stem matching (should match "learning")
]
for i, query in enumerate(search_queries, 1):
print(f"\nQuery {i}: '{query}'")
print("-" * 40)
# Execute full-text search
search_results = client.search(
collection_name=COLLECTION_NAME,
data=[query], # Query text
search_params={"metric_type": "BM25"},
output_fields=["text"], # Return original text
limit=3 # Return top 3 results
)
if search_results and len(search_results[0]) > 0:
for j, result in enumerate(search_results[0], 1):
score = result["distance"]
text = result["entity"]["text"]
print(f" Result {j} (relevance: {score:.4f}): {text}")
else:
print(" No relevant results found")
print("\n" + "="*60)
print("Search complete!")
print("="*60)
Output:
Dropped existing collection: english_demo
Successfully created collection: english_demo
Inserting data...
Successfully inserted 10 records
============================================================
Tokenizer Analysis Demo
============================================================
Original text: 'The running dogs are jumping over the lazy cats'
Tokenization result: ['run', 'dog', 'jump', 'over', 'lazi', 'cat']
Breakdown:
- lowercase: Converts all letters to lowercase
- stop words: Filtered out ['of', 'to'] and common English stop words
- stemmer: Reduced words to stem form (running -> run, jumping -> jump)
============================================================
Full-Text Search Demo
============================================================
Query 1: 'jump'
----------------------------------------
Result 1 (relevance: 2.0040): The quick brown fox jumps over the lazy dog
Query 2: 'algorithm'
----------------------------------------
Result 1 (relevance: 1.5819): Machine learning algorithms are revolutionizing artificial intelligence
Result 2 (relevance: 1.4086): Search engines use complex algorithms to rank web pages
Query 3: 'python program'
----------------------------------------
Result 1 (relevance: 3.7884): Python programming language is widely used for data science projects
Query 4: 'learn'
----------------------------------------
Result 1 (relevance: 1.5819): Machine learning algorithms are revolutionizing artificial intelligence
Result 2 (relevance: 1.4086): Deep learning models require large amounts of training data
============================================================
Search complete!
============================================================
How to Use Multi-language Analyzer
When your dataset contains multiple languages (English, Chinese, and Japanese, for example), you can enable the Multi-language Analyzer. Milvus will automatically pick the right tokenizer based on each text's language.
from pymilvus import MilvusClient, DataType, Function, FunctionType
import time
# Configure connection
client = MilvusClient(
uri="http://localhost:19530",
)
COLLECTION_NAME = "multilingual_demo"
# Drop existing collection if present
if client.has_collection(COLLECTION_NAME):
client.drop_collection(COLLECTION_NAME)
# Create schema
schema = client.create_schema()
# Add primary key field
schema.add_field(
field_name="id",
datatype=DataType.INT64,
is_primary=True,
auto_id=True
)
# Add language identifier field
schema.add_field(
field_name="language",
datatype=DataType.VARCHAR,
max_length=50
)
# Add text field with multi-language analyzer configuration
multi_analyzer_params = {
"by_field": "language", # Select analyzer based on language field
"analyzers": {
"en": {
"type": "english" # English analyzer
},
"zh": {
"type": "chinese" # Chinese analyzer
},
"jp": {
"tokenizer": "icu", # Use ICU tokenizer for Japanese
"filter": [
"lowercase",
{
"type": "stop",
"stop_words": ["は", "が", "の", "に", "を", "で", "と"]
}
]
},
"default": {
"tokenizer": "icu" # Default to ICU general tokenizer
}
},
"alias": {
"english": "en",
"chinese": "zh",
"japanese": "jp",
"中文": "zh",
"英文": "en",
"日文": "jp"
}
}
schema.add_field(
field_name="text",
datatype=DataType.VARCHAR,
max_length=2000,
enable_analyzer=True,
multi_analyzer_params=multi_analyzer_params
)
# Add sparse vector field for BM25
schema.add_field(
field_name="sparse_vector",
datatype=DataType.SPARSE_FLOAT_VECTOR
)
# Define BM25 function
bm25_function = Function(
name="text_bm25",
function_type=FunctionType.BM25,
input_field_names=["text"],
output_field_names=["sparse_vector"]
)
schema.add_function(bm25_function)
# Prepare index parameters
index_params = client.prepare_index_params()
index_params.add_index(
field_name="sparse_vector",
index_type="AUTOINDEX",
metric_type="BM25"
)
# Create collection
client.create_collection(
collection_name=COLLECTION_NAME,
schema=schema,
index_params=index_params
)
# Prepare multilingual test data
multilingual_data = [
# English data
{"language": "en", "text": "Artificial intelligence is revolutionizing technology industries worldwide"},
{"language": "en", "text": "Machine learning algorithms process large datasets efficiently"},
{"language": "en", "text": "Vector databases provide fast similarity search capabilities"},
# Chinese data
{"language": "zh", "text": "人工智能正在改变世界各行各业"},
{"language": "zh", "text": "机器学习算法能够高效处理大规模数据集"},
{"language": "zh", "text": "向量数据库提供快速的相似性搜索功能"},
# Japanese data
{"language": "jp", "text": "人工知能は世界中の技術産業に革命をもたらしています"},
{"language": "jp", "text": "機械学習アルゴリズムは大量のデータセットを効率的に処理します"},
{"language": "jp", "text": "ベクトルデータベースは高速な類似性検索機能を提供します"},
]
client.insert(
collection_name=COLLECTION_NAME,
data=multilingual_data
)
# Wait for BM25 function to generate vectors
print("Waiting for BM25 vector generation...")
client.flush(COLLECTION_NAME)
time.sleep(5)
client.load_collection(COLLECTION_NAME)
# Demonstrate tokenizer effect
print("\nTokenizer Analysis:")
test_texts = {
"en": "The running algorithms are processing data efficiently",
"zh": "这些运行中的算法正在高效地处理数据",
"jp": "これらの実行中のアルゴリズムは効率的にデータを処理しています"
}
for lang, text in test_texts.items():
print(f"{lang}: {text}")
try:
analyzer_result = client.run_analyzer(
texts=text,
collection_name=COLLECTION_NAME,
field_name="text",
analyzer_names=[lang]
)
print(f" â {analyzer_result}")
except Exception as e:
print(f" â Analysis failed: {e}")
# Multi-language search demo
print("\nSearch Test:")
search_cases = [
("zh", "人工智能"),
("jp", "機械学習"),
("en", "algorithm"),
]
for lang, query in search_cases:
print(f"\n{lang} '{query}':")
try:
search_results = client.search(
collection_name=COLLECTION_NAME,
data=[query],
search_params={"metric_type": "BM25"},
output_fields=["language", "text"],
limit=3,
filter=f'language == "{lang}"'
)
if search_results and len(search_results[0]) > 0:
for result in search_results[0]:
score = result["distance"]
text = result["entity"]["text"]
print(f" {score:.3f}: {text}")
else:
print(" No results")
except Exception as e:
print(f" Error: {e}")
print("\nComplete")
Output:
Waiting for BM25 vector generation...
Tokenizer Analysis:
en: The running algorithms are processing data efficiently
→ ['run', 'algorithm', 'process', 'data', 'effici']
zh: 这些运行中的算法正在高效地处理数据
→ ['这些', '运行', '中', '的', '算法', '正在', '高效', '地', '处理', '数据']
jp: これらの実行中のアルゴリズムは効率的にデータを処理しています
→ ['これらの', '実行', '中の', 'アルゴリズム', '効率', '的', 'データ', '処理', 'し', 'てい', 'ます']
Search Test:
zh '人工智能':
3.300: 人工智能正在改变世界各行各业
jp '機械学習':
3.649: 機械学習アルゴリズムは大量のデータセットを効率的に処理します
en 'algorithm':
2.096: Machine learning algorithms process large datasets efficiently
Complete
Also, Milvus supports the language_identifier tokenizer for search. It automatically detects the language of a given text, which makes the explicit language field optional. For more details, check out How Milvus 2.6 Upgrades Multilingual Full-Text Search at Scale.
Conclusion
The Milvus Analyzer turns what used to be a simple preprocessing step into a well-defined, modular system for handling text. Its design, built around tokenization and filtering, gives developers fine-grained control over how language is interpreted, cleaned, and indexed. Whether you're building a single-language application or a global RAG system that spans multiple languages, the Analyzer provides a consistent foundation for full-text search. It's the part of Milvus that quietly makes everything else work better.
Have questions or want a deep dive on any feature? Join our Discord channel or file issues on GitHub. You can also book a 20-minute one-on-one session to get insights, guidance, and answers to your questions through Milvus Office Hours.