Using Different Vector Stores

This example demonstrates how to use the VectorStoreFactory in OnPrem.LLM to easily create and experiment with different types of vector stores for your RAG (Retrieval-Augmented Generation) and semantic search applications.

The VectorStoreFactory provides a unified interface for creating three different types of vector stores, each optimized for different use cases:

- ChromaStore (dense): embedding-based semantic similarity search
- WhooshStore (sparse): fast keyword and full-text search
- ElasticsearchStore (dual): supports both semantic and keyword search

This makes it easy to experiment with different search strategies and find the best approach for your specific data and use case.
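For example, every store is created through the same factory call; only the store type name and persist location change. Below is a minimal sketch: the 'whoosh' identifier is used later in this example, while the 'chroma' and 'elasticsearch' identifiers and the Elasticsearch connection value are illustrative assumptions (check the VectorStoreFactory documentation for the exact names and arguments).

from onprem.ingest.stores import VectorStoreFactory

# One factory call per backend; only the type name and location differ.
# ('whoosh' is used later in this example; the 'chroma' and 'elasticsearch'
# identifiers and the connection URL below are illustrative assumptions.)
dense_store = VectorStoreFactory.create('chroma', persist_location='/tmp/chroma_index')
sparse_store = VectorStoreFactory.create('whoosh', persist_location='/tmp/whoosh_index')
dual_store = VectorStoreFactory.create('elasticsearch', persist_location='http://localhost:9200')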

Setup

First, let’s create some sample documents that we’ll use throughout our examples:

import tempfile
import os
from langchain_core.documents import Document
from onprem.ingest.stores import VectorStoreFactory

# Create some sample documents for our examples
sample_docs = [
    Document(
        page_content="Machine learning is a subset of artificial intelligence that enables computers to learn without explicit programming.",
        metadata={"source": "ml_intro.txt", "topic": "AI", "difficulty": "beginner"}
    ),
    Document(
        page_content="Deep learning uses neural networks with multiple layers to model and understand complex patterns in data.",
        metadata={"source": "dl_guide.txt", "topic": "AI", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Natural language processing (NLP) enables computers to understand and process human language.",
        metadata={"source": "nlp_basics.txt", "topic": "AI", "difficulty": "beginner"}
    ),
    Document(
        page_content="Vector databases store high-dimensional vectors and enable similarity search for AI applications.",
        metadata={"source": "vector_db.txt", "topic": "databases", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Retrieval-augmented generation (RAG) combines information retrieval with language generation for better AI responses.",
        metadata={"source": "rag_overview.txt", "topic": "AI", "difficulty": "advanced"}
    ),
    Document(
    page_content="Cats have five toes on their front paws, four on their back paws, and zero interest in your personal space..",
    metadata={"source": "cat_facts.txt", "topic": "cats", "difficulty": "advanced"}
    )
]

print(f"Created {len(sample_docs)} sample documents for testing")
Created 6 sample documents for testing
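Although the full walkthrough below ingests files from disk, these in-memory sample documents can also be indexed and searched directly through a factory-created store. A minimal sketch, assuming the store exposes add_documents and query methods (verify the exact method names and return format against your store's API):

# Minimal sketch: index the in-memory sample documents and run a keyword search.
# The add_documents/query method names are assumptions; check the store's API.
demo_store = VectorStoreFactory.create('whoosh', persist_location=tempfile.mkdtemp())
demo_store.add_documents(sample_docs)
results = demo_store.query('neural networks')
print(results)
demo_store.erase(confirm=False)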

Integration with LLM

The VectorStoreFactory works seamlessly with OnPrem.LLM for complete RAG workflows.

By default, supplying store_type="dense" to LLM will use ChromaStore and supplying store_type="sparse" will use WhooshStore. To use ElasticsearchStore, you can supply it to load_vectorstore as a custom vector store:

llm = LLM(...)
llm.load_vectorstore(custom_vectorstore=elasticsearch_store)
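For comparison, the equivalent default behavior uses the store_type argument described above:

# Default behavior: store_type selects the built-in store.
llm_dense = LLM(..., store_type="dense")    # uses ChromaStore
llm_sparse = LLM(..., store_type="sparse")  # uses WhooshStore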

You can also implement and use your own custom VectorStore instances (by subclassing DenseStore, SparseStore, or DualStore) using whatever vector database backend you like.
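To give a feel for what that involves, here is a skeletal sketch of a custom sparse store backed by a plain Python list. The import path and method names below are assumptions for illustration only; consult the SparseStore base class for the actual abstract interface you must implement.

# Skeletal sketch only: the import path and required methods are assumptions;
# check the SparseStore base class for the real contract.
from onprem.ingest.stores import SparseStore

class InMemoryStore(SparseStore):
    """Toy store backed by a Python list (illustrative, not production-ready)."""

    def __init__(self, persist_location=None, **kwargs):
        self.docs = []

    def add_documents(self, documents, **kwargs):
        # A real implementation would index into your chosen backend.
        self.docs.extend(documents)

    def query(self, q, limit=4, **kwargs):
        # Naive substring match standing in for a real search engine.
        return [d for d in self.docs if q.lower() in d.page_content.lower()][:limit]

    def exists(self):
        return bool(self.docs)

    def erase(self, confirm=True):
        self.docs = []
        return True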

For illustration purposes, in the example below we explicitly tell LLM to use WhooshStore as a custom vector store. (This is equivalent to supplying store_type="sparse" to LLM, but it shows how you would use LLM with Elasticsearch or your own custom vector store.)

# Example: Using VectorStoreFactory with LLM for RAG
print("🤖 Integration with OnPrem.LLM:")

# Create a simple document corpus
documents_dir = tempfile.mkdtemp()
doc_files = {
    "ai_overview.txt": "Artificial intelligence is transforming how we work and live. Machine learning enables computers to learn from data without explicit programming.",
    "ml_types.txt": "There are three main types of machine learning: supervised learning uses labeled data, unsupervised learning finds patterns in unlabeled data, and reinforcement learning learns through trial and error.",
    "applications.txt": "AI applications include natural language processing for text analysis, computer vision for image recognition, and recommendation systems for personalized content."
}

# Write documents to files
for filename, content in doc_files.items():
    with open(os.path.join(documents_dir, filename), 'w') as f:
        f.write(content)

print(f"✓ Created {len(doc_files)} documents in {documents_dir}")

# Show how to use custom vector store with LLM
from onprem import LLM
from onprem.ingest.stores import VectorStoreFactory

# Create custom vector store
store = VectorStoreFactory.create('whoosh', persist_location='/tmp/my_search_index')

# Create LLM and use custom vector store
vectordb_path = tempfile.mkdtemp()
llm = LLM('openai/gpt-4o-mini', vectordb_path=vectordb_path)
llm.load_vectorstore(custom_vectorstore=store)

# Ingest documents
llm.ingest(documents_dir)

print('\n\n----RAG EXAMPLE----')
# Ask questions
question = 'What are the types of machine learning?'
print(f'QUESTION: {question}')
print()
result = llm.ask(question)

print('\n\nSOURCES:')
for i, d in enumerate(result['source_documents']):
    print(f"source #{i+1}: {d.metadata['source']}")
store.erase(confirm=False)
🤖 Integration with OnPrem.LLM:
✓ Created 3 documents in /tmp/tmpjekc6pkt
Creating new vectorstore at /tmp/my_search_index
Loading documents from /tmp/tmpjekc6pkt
Loading new documents: 100%|█████████████████████| 3/3 [00:00<00:00, 175.48it/s]
Processing and chunking 3 new documents: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 248.67it/s]
Split into 3 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 983.81it/s]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods


----RAG EXAMPLE----
QUESTION: What are the types of machine learning?

The types of machine learning are:

1. Supervised learning - uses labeled data.
2. Unsupervised learning - finds patterns in unlabeled data.
3. Reinforcement learning - learns through trial and error.

SOURCES:
source #1: /tmp/tmpjekc6pkt/ml_types.txt
source #2: /tmp/tmpjekc6pkt/ai_overview.txt
True
# Clean up temporary directories
import shutil

temp_dirs = [vectordb_path, documents_dir, '/tmp/my_search_index']
for temp_dir in temp_dirs:
    try:
        shutil.rmtree(temp_dir)
    except OSError:
        pass
        
print("🧹 Cleaned up temporary directories")
🧹 Cleaned up temporary directories