To implement semantic search in Python, you need to focus on understanding the meaning of text rather than just matching keywords. This typically involves three core steps: converting text into numerical representations (embeddings), storing those embeddings efficiently, and comparing them to find semantically similar content. Modern libraries like `sentence-transformers` and vector databases (e.g., FAISS) simplify this process. Here’s a practical approach using freely available tools.
First, use a pre-trained language model to generate embeddings. For example, the `sentence-transformers` library provides models like `all-MiniLM-L6-v2`, which convert sentences into 384-dimensional vectors. Install the library with `pip install sentence-transformers`, then load the model and encode your documents:
```python
from sentence_transformers import SentenceTransformer

# Load a compact pre-trained model that maps sentences to 384-dimensional vectors
model = SentenceTransformer('all-MiniLM-L6-v2')

documents = ["A dog chasing a ball", "Cats sleeping in the sun"]  # ... add your own documents
document_embeddings = model.encode(documents)  # NumPy array of shape (num_documents, 384)
```
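As a quick sanity check (a minimal sketch, assuming the snippet above has run), you can confirm that `encode` returns a NumPy array with one 384-dimensional row per document:

```python
print(type(document_embeddings))  # <class 'numpy.ndarray'> (the default return type)
print(document_embeddings.shape)  # (2, 384) for the two example documents
```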
Next, store embeddings for efficient search. For small datasets, compute cosine similarity between a query embedding and all document embeddings using `scikit-learn`:
```python
from sklearn.metrics.pairwise import cosine_similarity

query = "Playful pets running around"
query_embedding = model.encode([query])  # shape: (1, 384)

# Similarity of the query against every document; [0] unwraps the single query row
similarities = cosine_similarity(query_embedding, document_embeddings)[0]
top_match_index = similarities.argmax()
```
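To turn those scores into results, index back into the original list; a minimal sketch (the choice of `k = 2` is illustrative) ranks documents with `argsort`:

```python
print(documents[top_match_index])  # best semantic match for the query

# Rank the top-k documents, highest similarity first (k = 2 is an arbitrary example)
k = 2
for i in similarities.argsort()[::-1][:k]:
    print(f"{similarities[i]:.3f}  {documents[i]}")
```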
For larger datasets, use FAISS (Facebook AI Similarity Search) to speed up retrieval. Install it with `pip install faiss-cpu`, then build an index:
```python
import faiss

index = faiss.IndexFlatIP(384)  # inner product; equals cosine similarity on unit vectors
faiss.normalize_L2(document_embeddings)  # normalize in place so inner product = cosine
index.add(document_embeddings)

faiss.normalize_L2(query_embedding)  # the query must be normalized the same way
distances, indices = index.search(query_embedding, k=3)  # top 3 matches
```
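`indices` holds row positions in the order the vectors were added, so you map them back to your own texts. A short follow-up sketch, assuming the `documents` list from earlier (the index filename is an arbitrary example), also shows persisting the index with `faiss.write_index`:

```python
# Map FAISS row indices back to the original texts
for rank, (score, idx) in enumerate(zip(distances[0], indices[0]), start=1):
    if idx == -1:  # FAISS pads with -1 when k exceeds the number of stored vectors
        continue
    print(f"{rank}. {documents[idx]} (cosine similarity: {score:.3f})")

# Persist and reload the index (path is illustrative)
faiss.write_index(index, "semantic_index.faiss")
index = faiss.read_index("semantic_index.faiss")
```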
Finally, consider practical adjustments. Choose a model that balances speed and accuracy based on your use case: larger models like `all-mpnet-base-v2` perform better but are slower. Preprocess text by removing irrelevant noise (e.g., HTML tags) and standardizing formats, as sketched below. If handling multilingual data, use models like `paraphrase-multilingual-MiniLM-L12-v2`. For production, deploy the index using dedicated vector databases like Qdrant or Pinecone, which offer scalability and real-time updates. This approach ensures you retrieve results based on contextual relevance, not just keyword overlap.
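As one illustration of that preprocessing step, here is a minimal sketch using only the standard library (the regex and whitespace rules are assumptions; adapt them to your data):

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative cleanup: strip HTML tags and collapse whitespace before encoding."""
    text = re.sub(r"<[^>]+>", " ", raw)  # drop HTML tags (assumed noise in this sketch)
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()

print(clean_text("<p>A dog   chasing a <b>ball</b></p>"))  # -> "A dog chasing a ball"
```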