Semantic Similarity

The underlying vector database in OnPrem.LLM can be used for detecting semantic similarity among pieces of text.

You can access the default vector store from an LLM object:

import tempfile

from onprem import LLM

vectordb_path = tempfile.mkdtemp()
llm = LLM(
    embedding_model_name="sentence-transformers/nli-mpnet-base-v2",
    embedding_encode_kwargs={"normalize_embeddings": True},
    vectordb_path=vectordb_path,
    store_type='dense',
    verbose=False
)
store = llm.load_vectorstore()

However, this example creates a VectorStore instance explicitly, which avoids loading an LLM that is not needed here.

The VectorStoreFactory is useful for instantiating vector stores with different backends (e.g., Chroma, Whoosh, Elasticsearch).

import os, tempfile

from onprem.ingest.stores import VectorStoreFactory

store = VectorStoreFactory.create(
    kind='chroma',
    persist_location='/tmp/my_vectordb',
    embedding_model_name="sentence-transformers/nli-mpnet-base-v2",
    embedding_encode_kwargs={"normalize_embeddings": True},
)

data = [  # from txtai
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]
source_folder = tempfile.mkdtemp()
for i, d in enumerate(data):
    filename = os.path.join(source_folder, f"doc{i}.txt")
    with open(filename, "w") as f:
        f.write(d)

store.ingest(source_folder, chunk_size=500, chunk_overlap=0)

Creating new vectorstore at /tmp/my_vectordb
Loading documents from /tmp/tmpeg2wt1z7

Loading new documents: 100%|████████████████████| 6/6 [00:00<00:00, 1540.98it/s]
Processing and chunking 6 new documents: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 1940.01it/s]

Split into 6 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...

100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.13it/s]

Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
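
The chunk_size and chunk_overlap arguments passed to ingest above control how documents are split before embedding, and the log reports a 500-character limit per text chunk. The sketch below shows the general idea with a plain sliding window over characters. It is an illustration only and is not the splitter OnPrem.LLM actually uses; the function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration; not part of OnPrem.LLM.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


headline = "Maine man wins $1M from $25 lottery ticket"
print(len(chunk_text(headline)))       # a short headline fits in one chunk
print(chunk_text("abcdefghij", 4, 2))  # overlapping windows on a toy string
```

Each headline in this example is far shorter than 500 characters, which is consistent with the six documents above producing six chunks.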

Next, we query the vector store directly to find the best semantic match for each query.

for query in (
    "feel good story",
    "climate change",
    "public health story",
    "war",
    "wildlife",
    "asia",
    "lucky",
    "dishonest junk",
):
    docs = store.semantic_search(query)
    print(f"{query} : {docs[0].page_content}")

feel good story : Maine man wins $1M from $25 lottery ticket
climate change : Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story : US tops 5 million confirmed virus cases
war : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife : The National Park Service warns against sacrificing slower friends in a bear attack
asia : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky : Maine man wins $1M from $25 lottery ticket
dishonest junk : Make huge profits without work, earn up to $100,000 a day
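
The store above was created with normalize_embeddings set to True, so every embedding has unit length. For unit-length vectors the dot product equals the cosine similarity, and ranking by Euclidean distance gives the same order. The following is a minimal sketch of that relationship in pure Python, using made-up three-dimensional vectors in place of real embeddings. The helper names are hypothetical, and the sketch makes no claim about which distance metric the backend store uses internally.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (no normalization assumed)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy "embeddings", made up for illustration. Real sentence-embedding
# models produce vectors with hundreds of dimensions.
query = [1.0, 2.0, 0.5]
docs = {
    "doc_a": [2.0, 4.0, 1.0],   # points in the same direction as the query
    "doc_b": [-1.0, 0.5, 0.0],  # orthogonal to the query
}

# After normalization, a plain dot product is the cosine similarity.
q = normalize(query)
scores = {name: dot(q, normalize(vec)) for name, vec in docs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Taking the highest-scoring document here plays the same role as reading docs[0] from the search results above.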