Semantic Similarity
The underlying vector database in OnPrem.LLM can be used for detecting semantic similarity among pieces of text.
You can access the default vector store from an LLM object:
import tempfile
from onprem import LLM

vectordb_path = tempfile.mkdtemp()
llm = LLM(
    embedding_model_name="sentence-transformers/nli-mpnet-base-v2",
    embedding_encode_kwargs={"normalize_embeddings": True},
    vectordb_path=vectordb_path,
    store_type='dense',
    verbose=False,
)
store = llm.load_vectorstore()
However, in this example we will create a VectorStore instance explicitly, since loading a full LLM is not needed here. The VectorStoreFactory is useful for instantiating different backend vectorstores (e.g., Chroma, Whoosh, Elasticsearch):

import os, tempfile
from onprem.ingest.stores import VectorStoreFactory

store = VectorStoreFactory.create(
    kind='chroma',
    persist_location='/tmp/my_vectordb',
    embedding_model_name="sentence-transformers/nli-mpnet-base-v2",
    embedding_encode_kwargs={"normalize_embeddings": True},
)
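As a side note, the same factory call can target the other backends mentioned above. The following is only a minimal sketch that assumes 'whoosh' is an accepted kind value for the sparse, keyword-based Whoosh backend; consult the VectorStoreFactory documentation for the exact options supported by your version.

# Illustrative sketch (assumed kind value): a sparse, keyword-based Whoosh store.
# Keyword indexes do not require an embedding model.
sparse_store = VectorStoreFactory.create(
    kind='whoosh',
    persist_location=tempfile.mkdtemp(),  # temporary index location for this example
)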
data = [  # from txtai
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]

source_folder = tempfile.mkdtemp()
for i, d in enumerate(data):
    filename = os.path.join(source_folder, f"doc{i}.txt")
    with open(filename, "w") as f:
        f.write(d)

store.ingest(source_folder, chunk_size=500, chunk_overlap=0)
Creating new vectorstore at /tmp/my_vectordb
Loading documents from /tmp/tmpeg2wt1z7
Loading new documents: 100%|████████████████████| 6/6 [00:00<00:00, 1540.98it/s]
Processing and chunking 6 new documents: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 1940.01it/s]
Split into 6 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.13it/s]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
We can now query the vector store directly to find the best semantic match for each query.
for query in (
    "feel good story",
    "climate change",
    "public health story",
    "war",
    "wildlife",
    "asia",
    "lucky",
    "dishonest junk",
):
    docs = store.semantic_search(query)
    print(f"{query} : {docs[0].page_content}")
feel good story : Maine man wins $1M from $25 lottery ticket
climate change : Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story : US tops 5 million confirmed virus cases
war : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife : The National Park Service warns against sacrificing slower friends in a bear attack
asia : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky : Maine man wins $1M from $25 lottery ticket
dishonest junk : Make huge profits without work, earn up to $100,000 a day
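To make the underlying mechanism concrete, the sketch below reproduces this kind of matching directly with the sentence-transformers library, using the same embedding model and normalization setting as above: each text is encoded into a normalized vector and ranked by cosine similarity against the query. This is only an illustration of the general idea (it reuses the data list defined earlier); the exact scoring used internally by the vector store backend may differ.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/nli-mpnet-base-v2")

# Encode the query and the candidate texts (the `data` list from above)
# into normalized embeddings, mirroring normalize_embeddings=True.
query = "feel good story"
query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(data, normalize_embeddings=True)

# Cosine similarity between the query and every document;
# the highest-scoring document is the best semantic match.
scores = util.cos_sim(query_emb, doc_embs)[0]
best = int(scores.argmax())
print(f"{query} : {data[best]} (score={float(scores[best]):.3f})")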