Semantic Similarity

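The vector database underlying OnPrem.LLM can also be used directly to detect semantic similarity between pieces of text. The example below ingests six short headlines and then finds the closest headline for each of several queries.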

import os, tempfile
from onprem import LLM

# Store the vector database in a throwaway temporary directory
vectordb_path = tempfile.mkdtemp()
llm = LLM(
    embedding_model_name="sentence-transformers/nli-mpnet-base-v2",
    # Scale each embedding to unit length
    embedding_encode_kwargs={"normalize_embeddings": True},
    vectordb_path=vectordb_path,
)
data = [  # from txtai
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]
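# Write each headline to its own text file so it is ingested as a separate document
source_folder = tempfile.mkdtemp()
for i, d in enumerate(data):
    filename = os.path.join(source_folder, f"doc{i}.txt")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(d)
# Each headline is far shorter than 500 characters, so every file becomes exactly one chunk
llm.ingest(source_folder, chunk_size=500, chunk_overlap=0)
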
Creating new vectorstore at /tmp/tmpqbwhmx3v
Loading documents from /tmp/tmpocpf9fe4
Loaded 6 new documents from /tmp/tmpocpf9fe4
Split into 6 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
Loading new documents: 100%|█████████████████████| 6/6 [00:00<00:00, 931.07it/s]
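
The `normalize_embeddings` setting used above asks the embedding model for unit-length vectors. The following is a minimal sketch of why that is convenient, not OnPrem.LLM code: it uses pure Python and two made-up 3-dimensional vectors (real embeddings have far more dimensions). For unit vectors, the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine, so ranking by either measure gives the same order.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy vectors, chosen only for illustration
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos_raw = cosine(a, b)          # cosine similarity of the raw vectors
dot_unit = dot(a_unit, b_unit)  # plain dot product of the unit vectors
sq_dist = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(cos_raw, dot_unit)          # the two values agree
print(sq_dist, 2 - 2 * dot_unit)  # so do these two
```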

Next, we get a reference to the underlying vector store and query it directly. The first document returned for each query is its best semantic match.

# Get the underlying vector store from the ingester
db = llm.load_ingester().get_db()
for query in (
    "feel good story",
    "climate change",
    "public health story",
    "war",
    "wildlife",
    "asia",
    "lucky",
    "dishonest junk",
):
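    # Retrieve the stored documents closest to the query
    docs = db.similarity_search(query)
    # Print the top-ranked document for this query
    print(f"{query} : {docs[0].page_content}")
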
feel good story : Maine man wins $1M from $25 lottery ticket
climate change : Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story : US tops 5 million confirmed virus cases
war : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife : The National Park Service warns against sacrificing slower friends in a bear attack
asia : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky : Maine man wins $1M from $25 lottery ticket
dishonest junk : Make huge profits without work, earn up to $100,000 a day
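
Conceptually, a similarity search embeds the query and returns the stored texts whose embeddings lie closest to it. The sketch below illustrates that ranking step with a brute-force scan. It is an illustration under stated assumptions, not the vector store's actual implementation: the `top_k` helper, the short labels, and the 2-dimensional "embeddings" are all hypothetical and hand-made.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=1):
    """Return the k (label, score) pairs most similar to query_vec, best first."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical hand-made vectors: axis 0 loosely "money", axis 1 loosely "nature"
toy_index = {
    "lottery win": [0.9, 0.1],
    "ice shelf collapse": [0.1, 0.9],
    "bear attack warning": [0.3, 0.8],
}

query_vec = [0.2, 1.0]  # a made-up, nature-leaning query vector
results = top_k(query_vec, toy_index, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

A real vector store typically replaces the linear scan with an index built for fast nearest-neighbor lookup, but the idea of ranking by closeness to the query is the same.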