To set up Haystack in your Python environment, begin by installing the package and verifying dependencies. Haystack requires Python 3.7+ and can be installed via pip with `pip install farm-haystack`. For specific use cases, such as working with databases or machine learning models, you may need extras; for example, `pip install "farm-haystack[all]"` installs all optional dependencies, including support for Elasticsearch, Hugging Face models, and cloud services. If you plan to use Elasticsearch as a document store, ensure it is running locally, e.g., via Docker with `docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.9.0` (Elasticsearch 8.x enables authentication by default, so for local testing either disable security as shown or configure credentials), or configure connection details for a remote instance.
Next, configure your document store and processing pipeline. Haystack uses document stores (like `InMemoryDocumentStore`, `ElasticsearchDocumentStore`, or `SQLDocumentStore`, which supports PostgreSQL) to manage data. For example, initialize an Elasticsearch-backed store:
```python
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", index="my_docs")
```
To ingest files, convert them into Haystack `Document` objects using `FileTypeClassifier`, `TextConverter`, or `PDFToTextConverter`. Create a preprocessing pipeline to clean and split text:
```python
from haystack import Document
from haystack.nodes import PreProcessor

processor = PreProcessor(split_by="word", split_length=200, split_overlap=20)
docs = processor.process([Document(content="...")])
document_store.write_documents(docs)
```
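To see what `split_length` and `split_overlap` control, here is a plain-Python sketch of word-based splitting with overlap (a simplified stand-in for illustration, not Haystack's actual `PreProcessor` logic):

```python
def split_words(text, split_length, split_overlap):
    """Split text into chunks of `split_length` words, where each chunk
    repeats the last `split_overlap` words of the previous one."""
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break
    return chunks

# 10 words, chunks of 4 with overlap 2 -> chunks start at words 0, 2, 4, 6
chunks = split_words("one two three four five six seven eight nine ten", 4, 2)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.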
Finally, set up a retrieval or question-answering pipeline. For semantic search, use a Retriever (e.g., `BM25Retriever` for keyword-based search or `EmbeddingRetriever` with a model like `sentence-transformers/all-MiniLM-L6-v2`). Add a Reader (like `TransformersReader`) for extractive QA:
```python
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import BM25Retriever, TransformersReader

retriever = BM25Retriever(document_store=document_store)
reader = TransformersReader(model_name_or_path="deepset/bert-base-cased-squad2")
pipeline = ExtractiveQAPipeline(reader, retriever)  # note: reader first, then retriever
results = pipeline.run(query="What is Haystack?", params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 1}})
```
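To build intuition for what the keyword-based `BM25Retriever` ranks on, here is a minimal, self-contained Okapi BM25 scorer in plain Python (an illustrative sketch only; Haystack delegates BM25 scoring to the underlying document store):

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (lists of lower-cased tokens) against a tokenized query
    using the classic Okapi BM25 formula; returns doc indices, best first."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()                      # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)                 # term frequency within this doc
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "haystack is a framework for building search pipelines".split(),
    "elasticsearch stores documents for retrieval".split(),
    "bananas are yellow".split(),
]
ranking = bm25_rank("what is haystack".split(), docs)  # first doc wins
```

Because BM25 matches on exact tokens, it is fast and needs no model, but it cannot match paraphrases; that is the gap `EmbeddingRetriever` fills.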
Test your setup by running queries and validating outputs. For scalability, consider using Haystack’s REST API or cloud integrations for distributed workloads.
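For the embedding-based route, `EmbeddingRetriever` ultimately ranks documents by vector similarity between the query embedding and stored document embeddings. A toy cosine-similarity sketch (with made-up 3-d vectors standing in for real sentence-transformer embeddings, which for `all-MiniLM-L6-v2` are 384-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# hypothetical embeddings: doc_a points roughly the same way as the query
query_vec = [1.0, 0.2, 0.0]
doc_vecs = {"doc_a": [0.9, 0.1, 0.0], "doc_b": [0.0, 0.1, 1.0]}
best = max(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]))
```

Unlike BM25, this ranking works even when the query and document share no exact words, as long as the embedding model maps them to nearby vectors.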