Home
Blog
Will Gemini Embedding 2 Kill Multi-Vector Search in Vector Databases?

Will Gemini Embedding 2 Kill Multi-Vector Search in Vector Databases?

Engineering

March 13, 2026

Jack Li

Google released Gemini Embedding 2 — the first multimodal embedding model that maps text, images, video, audio, and documents into a single vector space.

You embed a video clip, a product photo, and a paragraph of text with one API call, and they all land in the same semantic neighborhood.

Before models like this, you had no choice but to run each modality through its own specialist model and store each output in a separate vector column. Multi-vector columns in vector databases like Milvus were built precisely for that world.

So the question now is: how much of that complexity can Gemini Embedding 2 replace — and where does it fall short? This post walks through where each approach fits and how they work together.

What’s Different About Gemini Embedding 2 When Compared to CLIP/CLAP

Embedding models convert unstructured data into dense vectors so that semantically similar items cluster together in vector space. What makes Gemini Embedding 2 different is that it does this natively across modalities, with no separate models and no stitching pipelines.

Until now, multimodal embeddings meant dual-encoder models trained with contrastive learning: CLIP for image-text, CLAP for audio-text, each handling exactly two modalities. If you needed all three, you ran multiple models and coordinated their embedding spaces yourself.

For example, indexing a podcast with cover art meant running CLIP for the image, CLAP for the audio, and a text encoder for the transcript — three models, three vector spaces, and custom fusion logic to make their scores comparable at query time.

In contrast, according to Google’s official announcement, here is what Gemini Embedding 2 supports:

Text up to 8,192 tokens per request
Images up to 6 per request (PNG, JPEG)
Video up to 120 seconds (MP4, MOV)
Audio up to 80 seconds, embedded natively without ASR transcription
Documents PDF input, up to 6 pages

Mixed input image + text together in a single embedding call

Gemini Embedding 2 vs. CLIP/CLAP One Model vs. Many for Multimodal Embeddings

	Dual-encoder (CLIP, CLAP)	Gemini Embedding 2
Modalities per model	2 (e.g., image + text)	5 (text, image, video, audio, PDF)
Adding a new modality	You bring another model and align spaces manually	Already included — one API call
Cross-modal input	Separate encoders, separate calls	Interleaved input (e.g., image + text in one request)
Architecture	Separate vision and text encoders aligned via contrastive loss	Single model inheriting multimodal understanding from Gemini

Gemini Embedding 2’s Advantage: Pipeline Simplification

Take a common scenario: building a semantic search engine over a short video library. Each clip has visual frames, spoken audio, and subtitle text — all describing the same content.

Before Gemini Embedding 2, you’d need three separate embedding models (image, audio, text), three vector columns, and a retrieval pipeline that does multi-way recall, result fusion, and deduplication. That’s a lot of moving parts to build and maintain.

Now, you feed the video’s frames, audio, and subtitles into a single API call and get one unified vector that captures the full semantic picture.

Naturally, it’s tempting to conclude that multi-vector columns are dead. But that conclusion confuses “multimodal unified representation” with “multi-dimensional vector retrieval.” They solve different problems, and understanding the difference matters for picking the right approach.

What is Multi-Vector Search in Milvus?

In Milvus, multi-vector search means searching the same item through multiple vector fields at once and then combining those results with reranking.

The core idea: a single object often carries more than one kind of meaning. A product has a title and a description. A social media post has a caption and an image. Each angle tells you something different, so each one gets its own vector field.

Milvus searches every vector field independently, then merges the candidate sets using a reranker. In the API, each request maps to a different field and search configuration, and hybrid_search() returns the combined result.

Two common patterns depend on this:

Sparse+Dense Vector Search. You have a product catalog where users type queries like “red Nike Air Max size 10.” Dense vectors catch the semantic intent (“running shoes, red, Nike”), but miss the exact size. Sparse vectors via BM25 or models like BGE-M3 nail the keyword match. You need both running in parallel, then reranked — because neither alone returns good results for queries that mix natural language with specific identifiers like SKUs, file names, or error codes.
Multimodal Vector Search. A user uploads a photo of a dress and types “something like this but in blue.” You search the image embedding column for visual similarity and the text embedding column for the color constraint simultaneously. Each column has its own index and model — CLIP for the image, a text encoder for the description — and results get merged.

Milvus runs both patterns as parallel ANN searches with native reranking via RRFRanker. Schema definition, multi-index configuration, built-in BM25 — all handled in one system.

To make this concrete, consider a product catalog where each item includes a text description and an image. You can run three searches against that data in parallel:

Semantic text search. Query the text description with dense vectors generated by models like BERT, Transformers, or the OpenAI embeddings API.
Full-text search. Query the text description with sparse vectors using BM25 or sparse embedding models like BGE-M3 or SPLADE.
Cross-modal image search. Query over product images using a text query, with dense vectors from a model like CLIP.

Now, with Gemini Embedding 2, Does Multi-Vector Search Still Matter?

Gemini Embedding 2 handles more modalities in one call, which simplifies pipelines considerably. But a unified multimodal embedding is not the same thing as multi-vector retrieval. In other words, multi-vector search will not become obsolete.

Gemini Embedding 2 maps text, images, video, audio, and documents into one shared vector space. Google positions it for multimodal semantic search, document retrieval, and recommendation — scenarios where all modalities describe the same content and high cross-modal overlap makes a single vector viable.

Milvus multi-vector search solves a different problem. It is a way to search the same object through multiple vector fields— for instance, a title plus description, or text plus image—and then combine those signals during retrieval. In other words, it is about preserving and querying multiple semantic views of the same item, not just compressing everything into one representation.

But real-world data rarely fits into a single embedding. Biometric systems, agentic tool retrieval, and mixed-intent e-commerce all depend on vectors that live in completely different semantic spaces. That’s exactly where a unified embedding stops working.

Why One Embedding Isn’t Enough: Multi-Vector Retrieval in Practice

Gemini Embedding 2 handles the case where all your modalities describe the same thing. Multi-vector search handles everything else — and “everything else” covers most production retrieval systems.

Biometrics. A single user has face, voiceprint, fingerprint, and iris vectors. These describe completely independent biological features with zero semantic overlap. You cannot collapse them into one vector — each needs its own column, index, and similarity metric.

Agentic tools. A coding assistant like OpenClaw stores dense semantic vectors for conversation history (“that deployment issue from last week”) alongside sparse BM25 vectors for exact matching on file names, CLI commands, and config parameters. Different retrieval goals, different vector types, independent search paths, then reranked.

E-commerce with mixed intent. A product’s promo video and detail images work well as a unified Gemini embedding. But when a user wants “dresses that look like this” and “same fabric, size M,” you need a visual similarity column and a structured attribute column with separate indexes and a hybrid retrieval layer.

When to Use Gemini Embedding 2 vs. Multi-vector Columns

Scenario	What to use	Why
All modalities describe the same content (video frames + audio + subtitles)	Gemini Embedding 2 unified vector	High semantic overlap means one vector captures the full picture — no fusion needed
You need keyword precision alongside semantic recall (BM25 + dense)	Multi-vector columns with hybrid_search()	Sparse and dense vectors serve different retrieval goals that can’t collapse into one embedding
Cross-modal search is the primary use case (text query → image results)	Gemini Embedding 2 unified vector	Single shared space makes cross-modal similarity native
Vectors live in fundamentally different semantic spaces (biometrics, structured attributes)	Multi-vector columns with per-field indexes	Independent similarity metrics and index types per vector field
You want pipeline simplicity and fine-grained retrieval	Both — unified Gemini vector + additional sparse or attribute columns in the same collection	Gemini handles the multimodal column; Milvus handles the hybrid retrieval layer around it

These two approaches are not mutually exclusive. You can use Gemini Embedding 2 for the unified multimodal column and still store additional sparse or attribute-specific vectors in separate columns within the same Milvus collection.

Quick Start: Set Up Gemini Embedding 2 + Milvus

Here’s a working demo. You need a running Milvus or Zilliz Cloud instance and a GOOGLE_API_KEY.

Setup

pip install google-genai pymilvus
export GOOGLE_API_KEY="your-api-key"

Full Example

"""
Prerequisites:
    pip install google-genai pymilvus

Set environment variable:
    export GOOGLE_API_KEY="your-api-key"
"""

import os
import struct
import numpy as np
from google import genai
from google.genai import types
from pymilvus import MilvusClient, DataType

# ── Config ───────────────────────────────────────────────────────────────
COLLECTION_NAME = "gemini_multimodal_demo"
MILVUS_URI = "http://localhost:19530"  # Change to your Milvus address
DIM = 3072  # gemini-embedding-2-preview output dimension
GEMINI_MODEL = "gemini-embedding-2-preview"

# ── Initialize clients ──────────────────────────────────────────────────
gemini_client = genai.Client()  # Uses GOOGLE_API_KEY env var
milvus_client = MilvusClient(MILVUS_URI)

# ── Helper: generate embedding ──────────────────────────────────────────
def embed_texts(texts: list[str], task_type: str = "SEMANTIC_SIMILARITY") -> list[list[float]]:
    """Embed a list of text strings."""
    result = gemini_client.models.embed_content(
        model=GEMINI_MODEL,
        contents=texts,
        config=types.EmbedContentConfig(task_type=task_type),
    )
    return [e.values for e in result.embeddings]

def embed_image(image_path: str) -> list[float]:
    """Embed an image file."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    mime = "image/png" if image_path.endswith(".png") else "image/jpeg"
    result = gemini_client.models.embed_content(
        model=GEMINI_MODEL,
        contents=types.Part.from_bytes(data=image_bytes, mime_type=mime),
    )
    return result.embeddings[0].values

def embed_audio(audio_path: str) -> list[float]:
    """Embed an audio file."""
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    mime_map = {".mp3": "audio/mpeg", ".wav": "audio/wav", ".flac": "audio/flac"}
    ext = os.path.splitext(audio_path)[1].lower()
    mime = mime_map.get(ext, "audio/mpeg")
    result = gemini_client.models.embed_content(
        model=GEMINI_MODEL,
        contents=types.Part.from_bytes(data=audio_bytes, mime_type=mime),
    )
    return result.embeddings[0].values

# ── 1. Create Milvus collection ─────────────────────────────────────────
print("=== Creating collection ===")
if milvus_client.has_collection(COLLECTION_NAME):
    milvus_client.drop_collection(COLLECTION_NAME)

schema = milvus_client.create_schema()
schema.add_field("id", DataType.INT64, is_primary=True, auto_id=True)
schema.add_field("content", DataType.VARCHAR, max_length=2000)   # description of the content
schema.add_field("modality", DataType.VARCHAR, max_length=20)    # "text", "image", "audio"
schema.add_field("vector", DataType.FLOAT_VECTOR, dim=DIM)

index_params = milvus_client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="AUTOINDEX",
    metric_type="COSINE",
)

milvus_client.create_collection(
    COLLECTION_NAME,
    schema=schema,
    index_params=index_params,
    consistency_level="Strong",
)
print(f"Collection '{COLLECTION_NAME}' created (dim={DIM}, metric=COSINE)")

# ── 2. Insert text embeddings ───────────────────────────────────────────
print("\n=== Inserting text embeddings ===")
documents = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "The Mona Lisa is a half-length portrait painting by Leonardo da Vinci.",
    "Beethoven's Symphony No. 9 premiered in Vienna on May 7, 1824.",
    "The Great Wall of China stretches over 13,000 miles across northern China.",
    "Jazz music originated in the African-American communities of New Orleans.",
    "The Hubble Space Telescope was launched into orbit on April 24, 1990.",
    "Vincent van Gogh painted The Starry Night while in an asylum in Saint-Rémy.",
    "Machine learning is a subset of AI focused on learning from data.",
]

text_vectors = embed_texts(documents)
text_rows = [
    {"content": doc, "modality": "text", "vector": vec}
    for doc, vec in zip(documents, text_vectors)
]
milvus_client.insert(COLLECTION_NAME, text_rows)
print(f"Inserted {len(text_rows)} text documents")

# ── 3. (Optional) Insert image embeddings ───────────────────────────────
# Uncomment and provide real image paths to test multimodal search
#
# image_files = [
#     ("photo of the Mona Lisa painting", "mona_lisa.jpg"),
#     ("satellite photo of the Great Wall of China", "great_wall.png"),
# ]
# for desc, path in image_files:
#     if os.path.exists(path):
#         vec = embed_image(path)
#         milvus_client.insert(COLLECTION_NAME, [
#             {"content": desc, "modality": "image", "vector": vec}
#         ])
#         print(f"Inserted image: {desc}")

# ── 4. (Optional) Insert audio embeddings ───────────────────────────────
# Uncomment and provide real audio paths to test multimodal search
#
# audio_files = [
#     ("Beethoven Symphony No.9 excerpt", "beethoven_9.mp3"),
#     ("jazz piano improvisation", "jazz_piano.mp3"),
# ]
# for desc, path in audio_files:
#     if os.path.exists(path):
#         vec = embed_audio(path)
#         milvus_client.insert(COLLECTION_NAME, [
#             {"content": desc, "modality": "audio", "vector": vec}
#         ])
#         print(f"Inserted audio: {desc}")

# ── 5. Search ────────────────────────────────────────────────────────────
print("\n=== Searching ===")

queries = [
    "history of artificial intelligence",
    "famous Renaissance paintings",
    "classical music concerts",
]

query_vectors = embed_texts(queries, task_type="SEMANTIC_SIMILARITY")

for query_text, query_vec in zip(queries, query_vectors):
    results = milvus_client.search(
        COLLECTION_NAME,
        data=[query_vec],
        limit=3,
        output_fields=["content", "modality"],
        search_params={"metric_type": "COSINE"},
    )
    print(f"\nQuery: '{query_text}'")
    for hits in results:
        for rank, hit in enumerate(hits, 1):
            print(f"  [{rank}] (score={hit['distance']:.4f}, modality={hit['entity']['modality']}) "
                  f"{hit['entity']['content'][:80]}")

# ── 6. Cross-modal search example (image query -> text results) ─────────
# Uncomment to search text collection using an image as query
#
# print("\n=== Cross-modal search: image -> text ===")
# query_image_vec = embed_image("query_image.jpg")
# results = milvus_client.search(
#     COLLECTION_NAME,
#     data=[query_image_vec],
#     limit=3,
#     output_fields=["content", "modality"],
#     search_params={"metric_type": "COSINE"},
# )
# for hits in results:
#     for rank, hit in enumerate(hits, 1):
#         print(f"  [{rank}] (score={hit['distance']:.4f}) {hit['entity']['content'][:80]}")

# ── Cleanup ──────────────────────────────────────────────────────────────
# milvus_client.drop_collection(COLLECTION_NAME)
# print(f"\nCollection '{COLLECTION_NAME}' dropped")

print("\nDone!")

For image and audio embeddings, use embed_image() and embed_audio() the same way — the vectors land in the same collection and same vector space, enabling true cross-modal search.

Gemini Embedding 2 Will be Available in Milvus/Zilliz Cloud Soon

Milvus is shipping deep integration with Gemini Embedding 2 through its Embedding Function feature. Once live, you won’t need to call embedding APIs manually. Milvus will auto-invoke the model (supporting OpenAI, AWS Bedrock, Google Vertex AI, and more) to vectorize raw data on insert and queries on search.

That means you get unified multimodal embedding from Gemini where it fits, and Milvus’s full multi-vector toolkit — sparse-dense hybrid search, multi-index schemas, reranking — where you need fine-grained control.

Want to try it? Start with the Milvus quickstart and run the demo above, or check out the hybrid search guide for the full multi-vector setup with BGE-M3. Bring your questions to Discord or Milvus Office Hours.