Stuff & Nonsense

AI and Semantic Search

This post is going to be a bit different from my usual fare, in that some of it was written by an AI (so, much like 90% of the web these days!). I started out this morning wanting to learn more about semantic search and the use of large language models to create embeddings for search, expecting to spend much of the week looking into it. Then I had an idea: why not ask a large language model how to do it? It turned out not to be quite that easy, but within a couple of hours I had a working POC, which I'll share later.

So AI is all the rage at the moment, mostly due to advances pioneered by the likes of Google and OpenAI, building on work done by researchers over decades. I won't go too deeply into large language models here or how they work (mostly because I don't have a clue!), but suffice it to say that with a bit of tinkering it's now possible to run pretty decent models on consumer-level hardware. I have a 'server' that I built to run this stuff (a Threadripper 2920 with an NVIDIA 3070 Ti for inference) and run Ollama on it for local large language models. So, I started off by asking a model (mistral-nemo) a simple question:

I want to investigate the use of embeddings for semantic search, can you help with that?

And it replied:

We had a bit of back and forth to explore this a bit further, with the model taking me through what embeddings actually are:

And then eventually, after a bit more information about different types of embedding and how to use them, I got to implementation:

“Okay, I’m interested in a practical implementation. I have a WordPress database, with posts stored in MariaDB. I have access to Ollama, which runs various models and can be accessed via an OpenAI-compatible API. This includes embedding models such as mxbai-embed-large. I also have access to ChromaDB for storing vector data. Could you walk me through creating a proof of concept of a semantic search engine?”

The code the model produced was… not great. It used out-of-date methods for getting data, didn't even suggest using the existing Python libraries for Chroma or Ollama, and had some weird issues with adding too many closing brackets to statements. Maybe if I'd used a more code-focussed model I'd have had better results.

For example, here’s the code I was given for generating embeddings:

import requests
import json

def get_ollama_embedding(text):
    response = requests.post(
        "http://localhost:7868/api/embeddings",
        headers={"Content-Type": "application/json"},
        data=json.dumps({"text": text, "model": "mxbai-embed-large"}),
    )
    return response.json()["data"][0]["vector"]

# Assuming posts are loaded from posts.json
posts = json.load(open("posts.json"))

embeddings = []
for post in posts:
    title_embedding = get_ollama_embedding(post["title"]["rendered"]).tolist()
    content_embedding = get_ollama_embedding(post["content"]["rendered"]).tolist()
    embeddings.append({
        "post_id": post["id"],
        "title_embedding": title_embedding,
        "content_embedding": content_embedding,
    })

And here’s what I had after a bit of googling, adding the right libraries, and so on:

import json
import chromadb
from ollama import Client

# Load the exported WordPress posts and connect to the local Ollama server
posts = json.load(open("posts.json"))
client = Client(host='http://localhost:11434')

embeddings = []
for post in posts:
    print(post['title'])
    # Embed the title and content separately with mxbai-embed-large
    title_embedding = client.embeddings(
        model='mxbai-embed-large',
        prompt=post['title']
    )
    content_embedding = client.embeddings(
        model='mxbai-embed-large',
        prompt=post['content']
    )
    embeddings.append({
        "post_id": post["id"],
        "post_title": post["title"],
        "content_embedding": content_embedding['embedding'],
        "title_embedding": title_embedding['embedding']
    })

# Store the vectors in ChromaDB, using cosine distance for similarity
chroma_client = chromadb.HttpClient(host="localhost", port=8100)
collection = chroma_client.get_or_create_collection(
    name="wordpress_posts",
    metadata={'hnsw:space': 'cosine'}
)

for embedding in embeddings:
    collection.upsert(
        documents=[embedding['post_title']],
        ids=[str(embedding['post_id'])],  # Chroma IDs must be strings
        embeddings=[embedding["content_embedding"]],
    )

And my search code:

import argparse
import chromadb
from ollama import Client

chroma_client = chromadb.HttpClient(host="localhost", port=8100)
ollama_client = Client(host='http://localhost:11434')
collection = chroma_client.get_collection("wordpress_posts")


def search(query, n_results=10):
    # Embed the query with the same model used for the documents
    query_embedding = ollama_client.embeddings(
        model='mxbai-embed-large',
        prompt=query
    )
    return collection.query(
        query_embeddings=[query_embedding["embedding"]],
        n_results=n_results,
    )


# Example usage:
parser = argparse.ArgumentParser(description="Semantic search over WordPress posts.")
parser.add_argument("querystring", help="A query string", type=str)
args = parser.parse_args()

results = search(args.querystring)
# Only print results within a cosine distance of 0.5 of the query
for document, distance in zip(results["documents"][0], results["distances"][0]):
    if distance <= 0.5:
        print(document)

And here’s the output from that code, with the posts on this site as a dataset. Note that none of the posts found by the search contain the search term itself:

This was just a simple POC, but it works pretty well, and going from knowing very little about embeddings or how any of this works to something that works at all in a couple of hours has been pretty eye-opening.


