How can I use Haystack with a non-relational database?

To use Haystack with a non-relational database, you’ll need to bridge Haystack’s document processing and retrieval components with your database’s data storage model. Haystack ships with document stores for backends such as Elasticsearch and SQL databases like PostgreSQL, and it can work with other non-relational databases (e.g., MongoDB, Cassandra) through a custom document store or intermediate tools. The core idea is to map your database records to Haystack’s Document objects, which are used for indexing, querying, and retrieval. This involves writing connectors that fetch data from your database, convert it into the required format, and pass it through Haystack’s pipelines.

Start by implementing a custom DocumentStore class that interacts with your non-relational database. For example, if using MongoDB, you could create a MongoDocumentStore that uses PyMongo to read and write documents. This class must handle operations like saving documents, fetching them by ID, and performing basic filtering. Next, ensure your data is converted into Haystack’s Document format, which includes fields like content, meta, and embedding. If your database stores nested or unstructured data (e.g., JSON blobs), you’ll need to flatten or extract relevant text fields. For instance, a MongoDB document with a text field and metadata tags could be mapped to a Haystack Document where content is the text and meta includes the tags. Use Haystack’s PreProcessor to split large texts into smaller chunks if needed.
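The mapping and store operations described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Haystack's actual base class: the `Document` dataclass is a stand-in for Haystack's own Document, the record layout (`_id`, `text`, plus arbitrary metadata) is assumed, and `MongoDocumentStoreSketch` and its method names are hypothetical. It keeps documents in a dictionary so the example runs without a database; the comments note where PyMongo collection calls would go in a real implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Stand-in for Haystack's Document, with the fields named in the text."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None
    embedding: Optional[List[float]] = None


def record_to_document(record: Dict[str, Any]) -> Document:
    """Map a MongoDB-style record to a Document.

    Assumes the record has `_id` and `text` keys; every other key becomes metadata.
    """
    meta = {k: v for k, v in record.items() if k not in ("_id", "text")}
    return Document(content=record["text"], meta=meta, id=str(record["_id"]))


class MongoDocumentStoreSketch:
    """Hypothetical store; a dict stands in for a PyMongo collection."""

    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}

    def write_documents(self, documents: List[Document]) -> None:
        # With PyMongo this could be collection.replace_one(..., upsert=True).
        for doc in documents:
            if doc.id is None:
                raise ValueError("each document needs an id")
            self._docs[doc.id] = doc

    def get_document_by_id(self, doc_id: str) -> Optional[Document]:
        # With PyMongo this could be collection.find_one({"_id": doc_id}).
        return self._docs.get(doc_id)

    def filter_documents(self, filters: Dict[str, Any]) -> List[Document]:
        """Return documents whose metadata matches every filter.

        A filter matches on equality, or on membership when the stored
        metadata value is a list (e.g. tags).
        """
        def matches(doc: Document) -> bool:
            for key, wanted in filters.items():
                value = doc.meta.get(key)
                if isinstance(value, list):
                    if wanted not in value:
                        return False
                elif value != wanted:
                    return False
            return True

        return [doc for doc in self._docs.values() if matches(doc)]
```

A record such as `{"_id": 1, "text": "...", "tags": ["db"]}` would pass through `record_to_document` before `write_documents`, and `filter_documents({"tags": "db"})` would then return it. Nested records would need an extra flattening step before this mapping.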

Once your data is in Haystack’s format, build a pipeline that connects your custom DocumentStore to retrievers (e.g., BM25Retriever, EmbeddingRetriever) and readers. For example, you could create a pipeline that first retrieves candidate documents from MongoDB using a keyword search, then reranks those candidates by embedding similarity to the query. If your non-relational database lacks native search capabilities, consider exporting data to a secondary index in a supported tool like Elasticsearch and querying that index instead. Alternatively, use Haystack’s FAISSDocumentStore for vector-based retrieval alongside your primary database. Either approach keeps a second copy of the data, so be mindful of latency and consistency trade-offs, especially if your non-relational database is distributed or optimized for high write throughput. Test with real queries to ensure the pipeline handles your data’s structure and scale.
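The two-stage flow (keyword retrieval, then reranking) can be illustrated without any library. The sketch below is not Haystack's pipeline API: the function names are hypothetical, the first stage uses simple term overlap rather than BM25, and documents are assumed to be dictionaries that already carry a precomputed `embedding`. In practice the query embedding would come from the same model that embedded the documents.

```python
import math
from typing import Any, Dict, List, Sequence


def keyword_retrieve(docs: List[Dict[str, Any]], query: str, top_k: int = 10) -> List[Dict[str, Any]]:
    """First stage: keep documents sharing at least one term with the query,
    ordered by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc in docs:
        overlap = len(terms & set(doc["content"].lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(candidates: List[Dict[str, Any]], query_embedding: Sequence[float], top_k: int = 3) -> List[Dict[str, Any]]:
    """Second stage: order candidates by embedding similarity to the query."""
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(doc["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


def run_pipeline(docs: List[Dict[str, Any]], query: str, query_embedding: Sequence[float], top_k: int = 3) -> List[Dict[str, Any]]:
    """Chain the two stages: cheap keyword filter first, similarity rerank second."""
    return rerank(keyword_retrieve(docs, query), query_embedding, top_k)
```

Running the cheap keyword stage first limits how many embeddings the second stage has to compare, which matters when the candidate set comes from a database that is not optimized for vector search.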
