Custom LLM with a RAG.

The challenge

Recently, I’ve been deeply involved in tech diligence and strategy projects, which are all about quickly accessing the right information. The challenge lies in the fact that this information is scattered across multiple sources, such as pre-due diligence reports, post-buy due diligence documents, and various detailed reports within different parts of the system.

And it’s not just about technology. Understanding what products companies are selling, their product roadmaps, and current customer segments is equally crucial. This results in a massive pile of information that can be overwhelming to navigate.
Consolidating this information into a single document or wiki is a good starting point, but often a simple search function isn’t sufficient.

This is where large language models (LLMs) come into play as a fantastic solution. Instead of tediously searching for terms like “products,” you can simply ask, “What products does Company A sell?”

After some nice conversations with AI/ML expert Robin, I implemented a pipeline that ingests potentially sensitive data and can answer questions about it in a comprehensive way. This post is very much based on what Hervé Ishimye from Timescale presented (check it out!).

Goals of My Custom RAG LLM Experiment

  • Running my own LLM on my own machine because we are potentially dealing with sensitive data (defense industry).
  • Get the pipeline up and running as a proof of concept. No web service, no fine-tuning.

The basic flow

The basic flow is simple and consists of two steps:

  1. Preparation of our custom data so that we can query it using an LLM
  2. Retrieval - aka using an LLM to get nice answers based on our sensitive data

Indexing

To efficiently retrieve matching documents and pieces of information, it’s important to index your data. This indexing needs to be done only once, or whenever your data changes. We use an embedding model to generate the embeddings.

Embeddings are mathematical representations of text in a multidimensional vector space; texts with similar meaning end up close to each other, which is what lets us find relevant information in raw text.

This process allows for quick retrieval of similar documents, much like a traditional search engine but with a deeper understanding of the indexed content. The vectors are stored in special database columns, and while specialized vector databases like Qdrant are available, we’ll focus on using PostgreSQL with pgai from Timescale for simplicity.

We’ll use PostgreSQL + pgai together with nomic-embed-text to generate the vector representations.
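
To make the idea concrete, here is a rough sketch of what a single embedding call with pgai looks like once everything is installed (the full setup follows below; the example text and connection details are just placeholders matching that setup):

import psycopg2

# Connect to the PostgreSQL instance that has the pgai extension installed (setup below)
conn = psycopg2.connect(host='localhost', database='postgres',
                        user='postgres', password='password', port='5432')
cur = conn.cursor()

# Ask pgai to embed a placeholder text with nomic-embed-text via the Ollama container
cur.execute("""
    SELECT ollama_embed('nomic-embed-text', %s, _host=>'http://ollama:11434');
""", ("What products does Company A sell?",))

embedding = cur.fetchone()[0]
print(embedding[:80], '...')  # the vector comes back as text; nomic-embed-text produces 768 dimensions

cur.close()
conn.close()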

Retrieval Flow

The retrieval flow consists of two parts:

  1. Get relevant content using the indexed sensitive data in the vector database
  2. Supply the query (e.g. “List all products of Company B”) together with a context (the relevant content that we retrieved in step 1) to an LLM.

This will allow the LLM to combine general knowledge with the specialized knowledge provided in the context. That way, the LLM will be able to answer questions whose answers do not exist in its “general knowledge”.

We’ll use Ollama running the model llama3.2 to get the results.

The whole RAG (Retrieval-Augmented Generation) magic is supplying relevant data as context along with your model query.
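
In other words, the final prompt sent to llama3.2 is just the question plus the retrieved documents glued together. A minimal sketch of that prompt assembly, with placeholder data (the real code is further down):

# Sketch: combine the user's question with the retrieved documents
query = "List all products of Company B"
retrieved_docs = [
    {"title": "Company B product overview", "content": "..."},  # placeholder result of the vector search
]

context = "\n\n".join(f"Title: {d['title']}\nContent: {d['content']}" for d in retrieved_docs)
prompt = f"Query: {query}\nContext: {context}"
# 'prompt' is what later gets passed to ollama_generate('llama3.2', ...)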

How large can the context we supply along with the query be?

This depends on the model we are using. For a model that supports a context of 128k tokens (like llama3.2), we can use this rule of thumb:

  • 128,000 tokens / 300 tokens per page = approximately 427 book pages
  • 128,000 tokens / 400 tokens per page = approximately 320 book pages

So it is not infinite, but you can supply a lot of information.
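
If you want to play with these numbers yourself, here is the same estimate as a tiny helper (the tokens-per-page figures are rough assumptions, not properties of the model):

# Rough estimate: how many book pages fit into a context window?
def pages_in_context(context_tokens, tokens_per_page):
    return round(context_tokens / tokens_per_page)

print(pages_in_context(128_000, 300))  # ~427 pages
print(pages_in_context(128_000, 400))  # 320 pages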

Step-by-step guide to get your own private RAG LLM pipeline up and running

Prerequisites

Note: Make sure you have Docker installed and allow Docker containers to use around 5 GB of RAM. Otherwise you’ll get “model requires more system memory (3.5 GiB) than is available”. More: https://stackoverflow.com/questions/44533319/how-to-assign-more-memory-to-docker-container

Install Ollama and Meta’s Model llama3.2

Ollama allows you to easily run LLMs locally. It does all the heavy lifting for you, provides easy ways to try out different models, and runs them encapsulated and ready to use behind a web server.

## Create a network so that all systems can talk to each other
docker network create rag-net

## Start Ollama - this manages and runs your models
docker run -d --network rag-net -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

## Download model llama 3.2 using ollama (around 2GB)
docker exec -it ollama ollama pull llama3.2

Install the Embedding Model from Nomic

## Download the Nomic model (can translate your content to vectors)
docker exec -it ollama ollama pull nomic-embed-text

## List models in Ollama
docker exec -it ollama ollama list

NAME                       ID              SIZE      MODIFIED       
nomic-embed-text:latest    0a109f422b47    274 MB    38 seconds ago    
llama3.2:latest            a80c4f17acd5    2.0 GB    3 minutes ago   

## Check containers that are running
docker ps -a

70dcc4e8535c   ollama/ollama   "/bin/ollama serve"      9 minutes ago   Up 9 minutes              0.0.0.0:11434->11434/tcp   ollama

Install the Vector Database

## Install the vector database (PostgreSQL + Timescale image with the pgai extension)
docker run -d --network rag-net -p 5432:5432 --name timescaledb -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg16

## Run psql (to manage postgresql) inside the container
docker exec -it timescaledb psql -d postgres 

## Install the pgai extension in the postgres database (happens to be the default database)
CREATE EXTENSION IF NOT EXISTS ai CASCADE;

NOTICE:  installing required extension "vector"
NOTICE:  installing required extension "plpython3u"
CREATE EXTENSION

## Then we can verify that the extension got installed
postgres=# \dx
                                                    List of installed extensions
        Name         | Version |   Schema   |                                      Description                                      
---------------------+---------+------------+---------------------------------------------------------------------------------------
ai                  | 0.3.0   | public     | helper functions for ai workflows
plpgsql             | 1.0     | pg_catalog | PL/pgSQL procedural language
plpython3u          | 1.0     | pg_catalog | PL/Python3U untrusted procedural language
timescaledb         | 2.17.0  | public     | Enables scalable inserts and complex queries for time-series data (Community Edition)
timescaledb_toolkit | 1.18.0  | public     | Library of analytical hyperfunctions, time-series pipelining, and other SQL utilities
vector              | 0.7.4   | public     | vector data type and ivfflat and hnsw access methods
(6 rows)

Install Jupyter Lab in a Virtual Environment

Jupyter Lab is our Python IDE for running the pipeline. To be honest, I had been away from the Python ecosystem for some years. It turned out to be way harder to install and run Python in a clean way than anticipated. I finally got it up and running using virtual environments.

# Let's create a virtual environment to encapsulate all libraries from the global installation
python3 -m venv llm-pipeline
source llm-pipeline/bin/activate

# To connect with our database
pip install psycopg2
# To parse our hugo markdown files
pip install markdown python-frontmatter
# Our IDE
pip install jupyterlab

# This starts the IDE and you can access it in your browser
jupyter lab

The code

Simply copy and paste this code into your Jupyter IDE and run it. Important: This code is very much based on Hervé Ishimye’s presentation over here. He deserves all the praise!

Parse our markdown files

I am using the content of my blog as the source of “sensitive” data for the LLM.


import sys
import psycopg2
import os
import frontmatter

def parse_markdown_files(directory):
    markdown_data = []

    # Use os.walk to traverse the directory tree
    for root, _, files in os.walk(directory):
        for filename in files:
            if filename.endswith('.md'):
                # Construct the full file path
                filepath = os.path.join(root, filename)
                
                # Open and read the markdown file
                with open(filepath, 'r', encoding='utf-8') as file:
                    # Parse front matter and content using frontmatter library
                    post = frontmatter.load(file)
                    
                    # Extract title from front matter
                    title = post.get('title', 'No Title')
                    
                    # Extract content (the markdown content itself)
                    content = post.content
                    
                    # Append to markdown_data list as a dictionary
                    markdown_data.append({
                        "title": title,
                        "content": content
                    })

    return markdown_data

directory_path = '/Users/I/workspace/raphaelbauer.com/content'

markdown_data = parse_markdown_files(directory_path)
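
A quick sanity check that the parsing step worked (a small addition of mine, not part of the original pipeline):

# How many posts did we pick up, and what does the first one look like?
print(f"Parsed {len(markdown_data)} markdown files")
if markdown_data:
    print(markdown_data[0]["title"])
    print(markdown_data[0]["content"][:200])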

Create table to store data

def connect_db():
    return psycopg2.connect( # use the credentials of your postgresql database 
        host = 'localhost',
        database = 'postgres',
        user = 'postgres',
        password = 'password',
        port = '5432'
    )

conn = connect_db()
cur = conn.cursor()
cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id SERIAL PRIMARY KEY,
            title TEXT,
            content TEXT,
            embedding VECTOR(768)
        );
    """)
conn.commit()
cur.close()
conn.close()
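
Optionally, for a larger corpus you could add a vector index so that similarity search stays fast; the pgvector extension we installed above supports ivfflat and hnsw access methods (see the \dx output earlier). This is an optional tweak, not something a handful of blog posts strictly needs, and the index name below is arbitrary:

# Optional: an HNSW index speeds up cosine-distance lookups on larger datasets
conn = connect_db()
cur = conn.cursor()
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
""")
conn.commit()
cur.close()
conn.close()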

Translate Data into Vectors and Store Them

conn = connect_db()
cur = conn.cursor()

# use the port at which your ollama service is running.
for doc in markdown_data:
    cur.execute("""
        INSERT INTO documents (title, content, embedding)
        VALUES (
            %(title)s,
            %(content)s,
            ollama_embed('nomic-embed-text', concat(%(title)s, ' - ', %(content)s), _host=>'http://ollama:11434')
        )
    """, doc)

conn.commit()
cur.close()
conn.close()

Verify that the Data and Embeddings Were Stored

conn = connect_db()
cur = conn.cursor()
    
cur.execute("""
    SELECT title, content, vector_dims(embedding) 
    FROM documents LIMIT 10;
""")

rows = cur.fetchall()
for row in rows:
    print(f"Title: {row[0]}, Content: {row[1]}, Embedding Dimensions: {row[2]}")

cur.close()
conn.close()

Define query…

query = "Can you describe how modern QA should look like?"

Get Custom Data from Vector Database Based on Query

conn = connect_db()
cur = conn.cursor()
    
# Embed the query using the ollama_embed function
cur.execute("""
    SELECT ollama_embed('nomic-embed-text', %s, _host=>'http://ollama:11434');
""", (query,))
query_embedding = cur.fetchone()[0]

# Retrieve relevant documents based on cosine distance
cur.execute("""
    SELECT title, content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY similarity DESC
    LIMIT 1;
""", (query_embedding,))

rows = cur.fetchall()
    
# Prepare the context for generating the response
context = "\n\n".join([f"Title: {row[0]}\nContent: {row[1]}" for row in rows])
print(context)

cur.close()
conn.close()

Execute the query + context against Meta’s llama 3.2

conn = connect_db()
cur = conn.cursor()

# Generate the response using the ollama_generate function
cur.execute("""
    SELECT ollama_generate('llama3.2', %s, _host=>'http://ollama:11434');
""", (f"Query: {query}\nContext: {context}",))
    
model_response = cur.fetchone()[0]
print(model_response['response'])
    
cur.close()
conn.close()
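
If you want to ask several questions in a row, the retrieval and generation steps can be wrapped into a single helper. This is just a convenience sketch that repackages the code above; the helper name and structure are my own, not from the original presentation:

def answer(question, limit=1):
    """Retrieve the most similar documents and ask llama3.2, reusing the steps above."""
    conn = connect_db()
    cur = conn.cursor()

    # Step 1: embed the question with nomic-embed-text
    cur.execute("""
        SELECT ollama_embed('nomic-embed-text', %s, _host=>'http://ollama:11434');
    """, (question,))
    query_embedding = cur.fetchone()[0]

    # Step 2: fetch the most similar documents (smallest cosine distance first)
    cur.execute("""
        SELECT title, content
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT %s;
    """, (query_embedding, limit))
    context = "\n\n".join(f"Title: {t}\nContent: {c}" for t, c in cur.fetchall())

    # Step 3: pass question + context to llama3.2
    cur.execute("""
        SELECT ollama_generate('llama3.2', %s, _host=>'http://ollama:11434');
    """, (f"Query: {question}\nContext: {context}",))
    response = cur.fetchone()[0]['response']

    cur.close()
    conn.close()
    return response

print(answer("Can you describe how modern QA should look like?"))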

Discussion

Finding relevant documents based on similarity generally works well, but fine-tuning the context input is still necessary. Simply adding relevant documents sometimes leads to unexpected results, or the context isn’t utilized effectively, which has been somewhat unsatisfactory.

Running your own LLM feels powerful, but it is slow. For instance, storing around 2 MB of text information using Nomic takes minutes on my M1 Mac, and retrieving an answer, while straightforward, also takes minutes with more context. This process could be faster with dedicated hardware or by running it in the cloud.

Returning to the Python ecosystem after ten years has been a strange experience. Python remains great and easy to use, but the many small details, like installing libraries, using pip, managing virtual environments, and utilizing Jupyter, add a surprising level of complexity. Nonetheless, the ability to reference diverse sources makes the journey worthwhile.

Next Steps

  • Make this a web service to serve my thoughts from my homepage as an LLM. I imagine a “Talk to Raphael” chatbot. This’d also mean making this a real application and not just a Python hack.
  • Fine tuning the context and what to supply as data. It’s not yet good enough.
  • Improving credibility by adding references and reducing hallucination.
  • Improving speed. Try out other models like OpenAI’s GPT-4o to make it faster. The downside is of course that it is then no longer confined to my local machine.
