The challenge
Recently, I’ve been deeply involved in tech diligence and strategy projects, which are all about quickly accessing the right information. The challenge is that this information is scattered across multiple sources, such as pre-due diligence reports, post-buy due diligence documents, and various detailed reports in different parts of the system.
And it’s not just about technology. Understanding what products companies are selling,
their product roadmaps, and current customer segments is equally crucial.
This results in a massive pile of information that can be overwhelming to navigate.
Consolidating this information into a single document or wiki is a good starting point,
but often a simple search function isn’t sufficient.
This is where large language models (LLMs) come into play as a fantastic solution. Instead of tediously searching for terms like “products,” you can simply ask, “What products does Company A sell?”
After some nice conversations with AI/ML expert Robin, I implemented a pipeline that ingests potentially sensitive data and can answer questions about it in a comprehensive way. This post is very much based on what Hervé Ishimye from Timescale presented (check it out!).
This post is part of a series of posts on LLMs and RAG. Check out the other articles as well.
Goals Of My Custom RAG LLM experiment
- Run my own LLM on my own machine, because we are potentially dealing with sensitive data (defense industry).
- Get the pipeline up and running as a proof of concept. No web service, no fine-tuning.
The basic flow
The basic flow is simple and consists of two steps:
- Preparation of our custom data so that we can query it using an LLM
- Retrieval - i.e. using an LLM to get good answers based on our sensitive data
Indexing
To efficiently retrieve matching documents and pieces of information, it’s important to index your data. This indexing needs to be done only once, or whenever your data changes. We use an embedding model to generate embeddings.
Embeddings are mathematical representations of text in a multidimensional vector space; texts with similar meaning end up close to each other, which lets us find relevant information in raw text.
This process allows for quick retrieval of similar documents, much like a traditional search engine but with a deeper understanding of the indexed content. The vectors are stored in special database columns, and while specialized vector databases like Qdrant are available, we’ll focus on PostgreSQL with pgai from Timescale for simplicity. The embedding model nomic-embed-text generates the vector representations.
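To make “vector representation” concrete, here is a minimal sketch (my own illustration, not part of the pipeline below) that asks the locally running Ollama instance for an embedding over its HTTP API. It assumes the Ollama container from the setup guide below is already reachable on localhost:11434 and that the requests library is installed.

import requests

# Ask Ollama's embedding endpoint to turn a sentence into a vector
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "What products does Company A sell?"},
)
vector = resp.json()["embedding"]

print(len(vector))  # 768 - nomic-embed-text produces 768-dimensional vectors
print(vector[:5])   # the first few components of the vector

Texts with similar meaning produce vectors that are close to each other (e.g. measured by cosine distance), which is exactly what the database will exploit during retrieval.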
Retrieval Flow
The retrieval flow consists of two parts:
- Get relevant content using the indexed sensitive data in the vector database
- Supply query (e.g. “List all products of Company B”) together with a context (relevant content that we retrieved in step 1) to a LLM.
This allows the LLM to combine its general knowledge with the specialized knowledge provided in the context. That way, the LLM can answer questions whose answers do not exist in its “general knowledge”.
We’ll use Ollama running the model llama3.2 to generate the answers.
The whole RAG (Retrieval-Augmented Generation) magic is supplying relevant data in the context of your model query.
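As a minimal sketch (with placeholder values; the full, runnable version follows later in this post), the prompt that ends up at the model is simply the retrieved content and the question stitched together:

# Sketch only: retrieved_context stands in for the result of the vector search
query = "List all products of Company B"
retrieved_context = "Title: Product overview\nContent: Company B sells ..."

prompt = f"""
DOCUMENT:
{retrieved_context}

QUESTION:
{query}

INSTRUCTIONS:
Answer the QUESTION using the DOCUMENT text above.
If the DOCUMENT doesn't contain the facts to answer the QUESTION then please say so.
"""
# This prompt is then sent to llama3.2 via Ollama (see the full pipeline below).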
How large can the context we supply along with the query be?
That depends on the model you are using. For a model with a 128k-token context window (like llama3.2), we can use this rule of thumb:
- 128,000 tokens / 300 tokens per page = approximately 427 book pages
- 128,000 tokens / 400 tokens per page = approximately 320 book pages
So it is not infinite, but you can supply a lot of information.
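Or, as a quick back-of-the-envelope calculation (the tokens-per-page figures are rough assumptions, not properties of the model):

# Rough estimate of how many book pages fit into llama3.2's 128k-token context window
context_window = 128_000
for tokens_per_page in (300, 400):
    pages = round(context_window / tokens_per_page)
    print(f"~{pages} pages at {tokens_per_page} tokens/page")
# ~427 pages at 300 tokens/page
# ~320 pages at 400 tokens/page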
Step by step guide to get your own private RAG LLM pipeline up and running
Prerequisites
Note: Make sure you have Docker installed and allow Docker containers to use around 5 GiB of RAM. Otherwise you’ll get “model requires more system memory (3.5 GiB) than is available”. More: https://stackoverflow.com/questions/44533319/how-to-assign-more-memory-to-docker-container
Install Ollama and Meta’s model llama3.2
Ollama makes it easy to run LLMs locally. It does all the heavy lifting for you, provides an easy way to try out different models, and runs them encapsulated and ready to use behind a web server.
## Create a network so that all systems can talk to each other
docker network create rag-net
## Start Ollama - this manages and runs your models
docker run -d --network rag-net -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
## Download model llama 3.2 using ollama (around 2GB)
docker exec -it ollama ollama pull llama3.2
Install the Embedding Model from Nomic
## Download the Nomic model (can translate your content to vectors)
docker exec -it ollama ollama pull nomic-embed-text
## List models in Ollama
docker exec -it ollama ollama list
NAME ID SIZE MODIFIED
nomic-embed-text:latest 0a109f422b47 274 MB 38 seconds ago
llama3.2:latest a80c4f17acd5 2.0 GB 3 minutes ago
## Check containers that are running
docker ps -a
70dcc4e8535c ollama/ollama "/bin/ollama serve" 9 minutes ago Up 9 minutes 0.0.0.0:11434->11434/tcp ollama
Install the Vector Database
## Install the vector database PgAI (postgres + timescale extension)
docker run -d --network rag-net -p 5432:5432 --name timescaledb -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg16
## Run psql (to manage postgresql) inside the container
docker exec -it timescaledb psql -d postgres
## Install the pgai extension in the postgres database (happens to be the default database)
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
NOTICE: installing required extension "vector"
NOTICE: installing required extension "plpython3u"
CREATE EXTENSION
## Then we can verify that the extension got installed
postgres=# \dx
List of installed extensions
Name | Version | Schema | Description
---------------------+---------+------------+---------------------------------------------------------------------------------------
ai | 0.3.0 | public | helper functions for ai workflows
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
plpython3u | 1.0 | pg_catalog | PL/Python3U untrusted procedural language
timescaledb | 2.17.0 | public | Enables scalable inserts and complex queries for time-series data (Community Edition)
timescaledb_toolkit | 1.18.0 | public | Library of analytical hyperfunctions, time-series pipelining, and other SQL utilities
vector | 0.7.4 | public | vector data type and ivfflat and hnsw access methods
(6 rows)
Install Jupyter Lab in a Virtual Environment
Jupyter Lab is our Python IDE for running the pipeline. To be honest, I had been away from the Python ecosystem for some years, and it turned out to be way harder to install and run Python cleanly than anticipated. I finally got it up and running using virtual environments.
# Let's create a virtual environment to encapsulate all libraries from the global installation
python3 -m venv llm-pipeline
source llm-pipeline/bin/activate
# To connect with our database
pip install psycopg2
# To parse our hugo markdown files
pip install markdown python-frontmatter
# Our IDE
pip install jupyterlab
# This starts the IDE and you can access it in your browser
jupyter lab
The code
Simply copy and paste this code into your Jupyter IDE and run it. Important: This code is very much based on Hervé Ishimye’s presentation over here. He deserves all the praise!
Parse our markdown files
… I am using the content of my blog as the source of “sensitive” data for the LLM.
import sys
import psycopg2
import os
import frontmatter
def parse_markdown_files(directory):
    markdown_data = []
    # Use os.walk to traverse the directory tree
    for root, _, files in os.walk(directory):
        for filename in files:
            if filename.endswith('.md'):
                # Construct the full file path
                filepath = os.path.join(root, filename)
                # Open and read the markdown file
                with open(filepath, 'r', encoding='utf-8') as file:
                    # Parse front matter and content using frontmatter library
                    post = frontmatter.load(file)
                    # Extract title from front matter
                    title = post.get('title', 'No Title')
                    # Extract content (the markdown content itself)
                    content = post.content
                    # Append to markdown_data list as a dictionary
                    markdown_data.append({
                        "title": title,
                        "content": content
                    })
    return markdown_data
directory_path = '/Users/I/workspace/raphaelbauer.com/content'
markdown_data = parse_markdown_files(directory_path)
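A quick, optional sanity check that the parser actually found something (index 0 simply picks the first parsed post):

print(f"Parsed {len(markdown_data)} markdown files")
print(markdown_data[0]["title"])  # title of the first parsed post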
Create table to store data
def connect_db():
    return psycopg2.connect(  # use the credentials of your postgresql database
        host='localhost',
        database='postgres',
        user='postgres',
        password='password',
        port='5432'
    )
conn = connect_db()
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT,
embedding VECTOR(768) -- 768 dimensions, matching the output of nomic-embed-text
);
""")
conn.commit()
cur.close()
conn.close()
Translate Data into Vectors and Store Them
conn = connect_db()
cur = conn.cursor()
# _host points at the Ollama container on the rag-net Docker network; adjust host/port if your setup differs.
for doc in markdown_data:
    cur.execute("""
        INSERT INTO documents (title, content, embedding)
        VALUES (
            %(title)s,
            %(content)s,
            ollama_embed('nomic-embed-text', concat(%(title)s, ' - ', %(content)s), _host=>'http://ollama:11434')
        )
    """, doc)
conn.commit()
cur.close()
conn.close()
Verify that Retrieval Works
conn = connect_db()
cur = conn.cursor()
cur.execute("""
SELECT title, content, vector_dims(embedding)
FROM documents LIMIT 10;
""")
rows = cur.fetchall()
for row in rows:
    print(f"Title: {row[0]}, Content: {row[1]}, Embedding Dimensions: {row[2]}")
cur.close()
conn.close()
Define query…
query = "Can you describe how modern QA should look like?"
Get Custom Data from Vector Database Based on Query
conn = connect_db()
cur = conn.cursor()
# Embed the query using the ollama_embed function
cur.execute("""
SELECT ollama_embed('nomic-embed-text', %s, _host=>'http://ollama:11434');
""", (query,))
query_embedding = cur.fetchone()[0]
# Retrieve relevant documents based on cosine distance
cur.execute("""
SELECT title, content, 1 - (embedding <=> %s) AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 1;
""", (query_embedding,))
rows = cur.fetchall()
# Prepare the context for generating the response
context = "\n\n".join([f"Title: {row[0]}\nContent: {row[1]}" for row in rows])
print(context)
cur.close()
conn.close()
Execute the query + context against Meta’s llama 3.2
conn = connect_db()
cur = conn.cursor()
# Generate the response using the ollama_generate function
cur.execute("""
SELECT ollama_generate('llama3.2', %s, _host=>'http://ollama:11434');
""", (f"""
DOCUMENT:
{context}
QUESTION:
{query}
INSTRUCTIONS:
Answer the user's QUESTION using the DOCUMENT text above.
Keep your answer grounded in the facts of the DOCUMENT.
If the DOCUMENT doesn’t contain the facts to answer the QUESTION then please say so.
""",))
model_response = cur.fetchone()[0]
print(model_response['response'])
cur.close()
conn.close()
Discussion
Finding relevant documents based on similarity generally works well, but fine-tuning the context input is still necessary. Simply adding relevant documents sometimes leads to unexpected results, or the context isn’t utilized effectively, which has been somewhat unsatisfactory.
Running your own LLM feels powerful, but it is slow. For instance, storing around 2MB of text information using Nomic takes minutes on my M1 Mac, and retrieving an answer, while straightforward, also takes minutes with a larger context. This process could be faster with dedicated hardware or by running it in the cloud.
Returning to the Python ecosystem after ten years has been a strange experience. Python remains great and easy to use, but the many small details, like installing libraries, using pip, managing virtual environments, and utilizing Jupyter, add a surprising level of complexity. Nonetheless, the ability to reference diverse sources makes the journey worthwhile.
Next Steps
- Make this a web service that serves my thoughts from my homepage via an LLM. I imagine a “Talk to Raphael” chatbot. This would also mean turning it into a real application and not just a Python hack.
- Fine-tuning the context and what to supply as data. It’s not yet good enough.
- Improving credibility by adding references and reducing hallucination.
- Improving speed. Trying out other models like OpenAI’s GPT-4o to make it faster. The downside, of course, is that the data would no longer be confined to my local machine.
More
- Source code of the jupyter notebook: https://gist.github.com/raphaelbauer/c9e6cc2c95d218cf5fe5e576ff5fa69e
- About nomic embedding model: https://www.nomic.ai/blog/posts/nomic-embed-text-v1
- https://help.openai.com/en/articles/8868588-retrieval-augmented-generation-rag-and-semantic-search-for-gpts
- The amazing presentation which is the foundation for this article: https://www.youtube.com/watch?v=-ikCYKcPoqU