To use Haystack with a non-relational database, you’ll need to bridge Haystack’s document processing and retrieval components with your database’s data storage model. Haystack is designed to work with document stores like Elasticsearch or PostgreSQL, but it can integrate with non-relational databases (e.g., MongoDB, Cassandra) by creating a custom document store or using intermediate tools. The core idea is to map your database records to Haystack’s Document objects, which are used for indexing, querying, and retrieval. This involves writing connectors to fetch data from your database, convert it into the required format, and pass it through Haystack’s pipelines.
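The record-to-Document mapping described above can be sketched in a few lines. This is a minimal illustration that uses only the standard library: the Document dataclass below is a stand-in with the same three fields the article names, not Haystack's real class, and the record shape, the "_id" key, and the record_to_document helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Stand-in for Haystack's Document (assumption: not the real class)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None
    id: Optional[str] = None


def record_to_document(record: Dict[str, Any], text_field: str = "text") -> Document:
    """Map one database record to a Document.

    The text field becomes `content`; every other top-level field except
    the primary key is copied into `meta`.
    """
    meta = {k: v for k, v in record.items() if k not in (text_field, "_id")}
    return Document(
        content=str(record.get(text_field, "")),
        meta=meta,
        id=str(record["_id"]) if "_id" in record else None,
    )


# Hypothetical record, shaped like something a MongoDB collection might return.
record = {"_id": 42, "text": "Haystack builds search pipelines.", "tags": ["nlp", "search"]}
doc = record_to_document(record)
print(doc.content)
print(doc.meta)
```

In a real integration, the same function would return Haystack's own Document type and the records would come from your database driver rather than a literal dictionary.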
Start by implementing a custom DocumentStore class that interacts with your non-relational database. For example, if using MongoDB, you could create a MongoDocumentStore that uses PyMongo to read and write documents. This class must handle operations like saving documents, fetching them by ID, and performing basic filtering. Next, ensure your data is converted into Haystack’s Document format, which includes fields like content, meta, and embedding. If your database stores nested or unstructured data (e.g., JSON blobs), you’ll need to flatten or extract relevant text fields. For instance, a MongoDB document with a text field and metadata tags could be mapped to a Haystack Document where content is the text and meta includes the tags. Use Haystack’s PreProcessor to split large texts into smaller chunks if needed.
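As a rough sketch of the three operations named above (saving, fetching by ID, and basic filtering), the example below wraps a toy in-memory collection. Everything here is an assumption for illustration: InMemoryCollection stands in for a real database collection, Document is a stand-in dataclass rather than Haystack's class, the method names on SketchDocumentStore are chosen for readability and would need to match the interface of the Haystack version you use, and split_into_chunks is a simple word-window splitter, not Haystack's PreProcessor.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Stand-in for Haystack's Document (assumption: not the real class)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


class InMemoryCollection:
    """Hypothetical stand-in for a database collection, keyed by `_id`."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}

    def upsert(self, row: Dict[str, Any]) -> None:
        self.rows[row["_id"]] = row

    def find_one(self, _id: str) -> Optional[Dict[str, Any]]:
        return self.rows.get(_id)

    def find_all(self) -> List[Dict[str, Any]]:
        return list(self.rows.values())


class SketchDocumentStore:
    """Illustrative store: save, fetch by ID, and filter on exact meta values."""

    def __init__(self, collection: InMemoryCollection) -> None:
        self.collection = collection

    def write_documents(self, documents: List[Document]) -> None:
        for doc in documents:
            self.collection.upsert({"_id": doc.id, "text": doc.content, "meta": doc.meta})

    def get_document_by_id(self, doc_id: str) -> Optional[Document]:
        row = self.collection.find_one(doc_id)
        return self._to_document(row) if row else None

    def get_all_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Document]:
        filters = filters or {}
        return [
            self._to_document(row)
            for row in self.collection.find_all()
            if all(row["meta"].get(k) == v for k, v in filters.items())
        ]

    @staticmethod
    def _to_document(row: Dict[str, Any]) -> Document:
        return Document(content=row["text"], meta=row["meta"], id=row["_id"])


def split_into_chunks(doc: Document, words_per_chunk: int = 100) -> List[Document]:
    """Split a document into fixed-size word windows, keeping the parent's meta."""
    words = doc.content.split()
    chunks = []
    for n, start in enumerate(range(0, len(words), words_per_chunk)):
        chunks.append(Document(
            content=" ".join(words[start:start + words_per_chunk]),
            meta={**doc.meta, "parent_id": doc.id, "chunk": n},
            id=f"{doc.id}-{n}",
        ))
    return chunks


store = SketchDocumentStore(InMemoryCollection())
long_doc = Document(content="one two three four five", meta={"lang": "en"}, id="a")
store.write_documents(split_into_chunks(long_doc, words_per_chunk=2))
store.write_documents([Document(content="uno dos", meta={"lang": "es"}, id="b")])

english = store.get_all_documents(filters={"lang": "en"})
print([d.id for d in english])
```

Keeping the database access behind a small collection object makes it straightforward to swap the in-memory stand-in for a real driver later without touching the store's public methods.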
Once your data is in Haystack’s format, build a pipeline that connects your custom DocumentStore to retrievers (e.g., BM25Retriever, EmbeddingRetriever) and readers. For example, you could create a pipeline that first retrieves candidate documents from MongoDB using a keyword search, then reranks those candidates with an embedding-based model. If your non-relational database lacks native search capabilities, consider exporting data to a temporary index in a supported tool like Elasticsearch for hybrid workflows. Alternatively, use Haystack’s FAISSDocumentStore for vector-based retrieval alongside your primary database. Be mindful of latency and consistency trade-offs, especially if your non-relational database is distributed or optimized for high write throughput: a secondary index can lag behind recent writes. Test with real queries to ensure the pipeline handles your data’s structure and scale.
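The two-stage pattern described above (keyword retrieval first, then reranking by embedding similarity) can be illustrated without any external services. The sketch below is not Haystack's pipeline API: keyword_retrieve uses plain term overlap as a simplified stand-in for BM25, rerank uses cosine similarity over precomputed vectors, and the documents, embeddings, and query vector are made-up values chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """Stand-in for Haystack's Document (assumption: not the real class)."""
    content: str
    embedding: List[float]


def keyword_retrieve(query: str, docs: List[Document], top_k: int) -> List[Document]:
    """Stage 1: rank documents by how many query terms they contain."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.content.lower().split())), d) for d in docs]
    scored = [(score, d) for score, d in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_embedding: List[float], candidates: List[Document], top_k: int) -> List[Document]:
    """Stage 2: reorder the candidates by similarity to the query embedding."""
    ordered = sorted(candidates, key=lambda d: cosine(query_embedding, d.embedding), reverse=True)
    return ordered[:top_k]


# Made-up corpus and two-dimensional embeddings, for illustration only.
docs = [
    Document("mongodb stores json documents", [0.9, 0.1]),
    Document("cassandra handles heavy write loads", [0.1, 0.9]),
    Document("mongodb sharding spreads write loads", [0.4, 0.6]),
]
query = "mongodb write loads"
query_embedding = [0.2, 0.8]  # assumed output of some embedding model

candidates = keyword_retrieve(query, docs, top_k=2)
final = rerank(query_embedding, candidates, top_k=2)
print([d.content for d in candidates])
print([d.content for d in final])
```

The cheap keyword stage narrows the set so that the more expensive similarity stage only scores a handful of candidates, which is the main reason to order the stages this way.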