
Building a RAG System with ChromaDB: Why My Embeddings Failed (And How I Fixed It)

Why my RAG system failed using ChromaDB and local embeddings — and how better context and data structuring fixed retrieval accuracy.


Self-taught Software Engineer from Singapore (that eventually goes to school for it). Passionate about building new stuff that helps people save time!


Author’s Note

I’ve been gone for long.

I’ve been serving my nation (National Service) and have been using the time to relax, learn different skills, and upgrade myself in different ways.

Unexpectedly, I also went full-on into my other hobby… Playing the guitar and the bass.

Hence, catch me playing bass and guitar on:


“AI IS SO COOL!”

Not long ago, I was hellbent on learning AI.

Well, to be fair, I caved in to the whole marketing thingamajig.

Other than the whole “AI bubble” thing, I’ve always wanted to learn AI. Intelligence in computers has always sparked something in me.

Moreover, I was convinced (at the time I graduated Poly) that to succeed in the IT sector, one had to choose between Cybersecurity and AI. Being a Software Developer in the big ‘23 felt outdated. It seemed like everybody could build a Node.js server, set up some SaaS, and just piece it together…

After I graduated, I took this unofficial course on YouTube that kick-started my ML learning journey.

It was a “Learn PyTorch in 24 hours” course where I only made it halfway (12 hours in) before giving up because of how technical it was for me.

Time skip to 2026: I decided to jump on it again.

I’ve always been inspired by how people use AI to do insane things but I never understood how.

So I did some research and came across a new term: “AI Engineer”.

My initial thought was “What is that?” I assumed an AI Engineer was the same as an ML Engineer, except AI Engineers build LLMs.

But that was when I discovered that the meaning of “AI Engineer” keeps changing, and the current one is just a Software Engineer who uses the capabilities of LLMs and other technologies to streamline mundane tasks and processes.

During my research about AI Engineers, I came across this course on Datacamp titled “Associate AI Engineer for Developers” and thought, “Great! I’m a Developer, and this course teaches me about being an associate AI engineer! At least the course treats me like a developer, so I don’t need to learn everything from scratch.”

So I took the course (well, it was a free 3-month trial).

Time jump to 3 weeks later: after learning the majority of the course, except for the LangChain stuff, I decided to put my skills to use.

So far, here's the summarised version of what I've learned that I think is important:

  • Prompt engineering

  • RAG

  • Vector databases

  • Embeddings

  • LLMOps (testing, model selection, RAG vs fine-tuning, model costs, etc.)

Applying my skills

Let me introduce you to the problem: I'm serving NS and I take care of the Ops Room.

That's my vocation: Ops Room.

Our job, among many other things, is to take care of the keys to the whole training center. We issue a lot of keys daily.

The problem lies in the loads of important key information that somehow gets left out and is hard to keep track of.

Here’s what I mean by that: throughout the months that I've been serving, keys and information tend to change. Sometimes they're no longer stored in the Ops Room. Sometimes keys go by different names. Sometimes you need to ask permission to issue a certain key. And this information is relayed once, and sometimes gets lost in the sauce.

Sure, a seasoned, experienced person like me knows all this information and can blurt it out at lightning speed, but what about the newcomers? (I mean, to be fair, they could just store this knowledge in their notes.)

Hence the solution: building a RAG system using the information about the keys and other important contextual information.

To be fair, the problem isn’t even a real issue to begin with, but you know me… I like to hone my skills by solving non-existent problems.

It’s like the saying goes, “I take 10 hours to build a solution that takes me 2 minutes by hand every day.”


Disclaimer: Privacy

Before working on this project, I understood the importance of privacy, especially of the information I’ve been working with. Which is why I made an effort to:

  • Scramble data when interacting with online services

  • Censor sensitive information in this blog post (i.e. make it as vague as possible)

  • Make sure all of this runs locally on my machine!


Data processing

I sat down in my chair, back arched over my desk, furiously typing and looking for relevant files to feed the vector database while thinking about how to process the data for the embeddings.

Yes, there's the problem of what MEANINGFUL data to embed, but before that I was already stuck on converting the master-list Excel file into structured data.

FYI: all records of keys and what rooms they're for are kept in an Excel master-list file with multiple sheets, one for each section of the training center. It is not as simple as your typical CSV file – some sheets have subheadings with merged cells and whatnot. It's not a simple "read Excel file from Python" problem.

Introducing... ChatGPT! Yep – I used the skills I learned in Prompt Engineering to get my Excel file converted into the JSON format I wanted.

In the end, I wrote a quick process_keys.py script that took the multiple JSON files and combined them into one JSON file called merged_keys.json, which looks somewhat like this:

[
    ...,
    {
        "bunch_no": "930D",
        "no_of_keys": 8,
        "location": "DORM 93",
        "remarks": "",
        "key_section": "'D' - DORMS",
        "building": "DORM BLOCK",
        "id": "D_93"
    },
    ...
]

Note that the information above has been scrambled with false information. The structure remains true.
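The merge step in process_keys.py was nothing fancy. A minimal sketch, assuming each input file holds a plain list of key records (the function name and file names here are my own, for illustration):

```python
import json
from pathlib import Path


def merge_key_files(paths, out_path="merged_keys.json"):
    """Combine several per-section JSON files into one flat list of key records."""
    merged = []
    for path in paths:
        # Each input file is assumed to contain a JSON array of key records.
        merged.extend(json.loads(Path(path).read_text()))
    Path(out_path).write_text(json.dumps(merged, indent=4))
    return merged
```

Each section's file just gets appended in order, so the "id" fields (like "D_93") stay unique across sections.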

Embedding data

With the help of the lectures from the Datacamp course, I used ChromaDB as my vector database and asked ChatGPT for the best embedding model. ChromaDB was my choice because it has a persistent store and is useful for quick prototypes – no need to run a Docker container and whatnot.

At first, when embedding data, I embedded all the information I got from the keys.

The original JSON structure looked like this:

{
    "bunch_no": "930D",
    "no_of_keys": 8,
    "location": "DORM 93",
    "remarks": "",
    "key_section": "'D' - DORMS",
    "building": "DORM BLOCK",
    "id": "D_93"
}

Each key was flattened down to a string that looked like this:

Bunch No: 930D
Number of keys: 8 key(s)
Location: DORM 93
Remarks: 
Key Section: 'D' - DORMS
Building: DORM BLOCK
...
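That flattening was plain string formatting, roughly like this (the helper name is made up; the field labels match the strings above):

```python
def key_to_text(record: dict) -> str:
    """Flatten one key record into the multi-line string that gets embedded."""
    return "\n".join([
        f"Bunch No: {record['bunch_no']}",
        f"Number of keys: {record['no_of_keys']} key(s)",
        f"Location: {record['location']}",
        f"Remarks: {record['remarks']}",
        f"Key Section: {record['key_section']}",
        f"Building: {record['building']}",
    ])
```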

I was ready to test out my big masterpiece next-gen AI app that will crown me the best tech adopter in the whole of Singapore!!!

When life gives you lemons...

You squeeze it into your eyes! Just kidding.

Did you really think my first test would pass?

It failed miserably. Queries were returning unexpected documents.

For example, when asked "What is the keys to bunk 930D", it would return god-knows-what key that has nothing to do with bunk 930D.

I was “debugging” furiously.

To me, this whole “embedding model” thing is a black box. You can't really debug the semantics that the embedding model captured.

Sure, you can visualise the embeddings on a graph, but what for? (Correct me if I'm mistaken.)

I went back to ChatGPT and asked "What is the best embedding model that can be run locally?"

I was so sure that it was the embedding model that didn't capture my search query's semantics – maybe it was dumb enough not to understand "bunk 930D" as "dormitory 930D". But well, I'm not any smarter either.

I did a quick switch of embedding models and ran it again, and…

Do you not understand?

It worked! I’m kidding!

Nope, it didn’t.

Yeah… I was baffled.

I thought a better embedding model would understand the nuances of the word “dorm” being interchangeable with "dormitory" or "bunk".

I went back to the drawing board.

I thought hard about how to make these models understand these nuances.

Would I have to train my own model to do so? Wouldn't that take too much of my time?

I was convinced it was somehow the embedding model UNTIL I asked ChatGPT during my research process.

I asked whether my semantics were being captured, why it did not return what I expected, and whether the issue was with the dimensionality of the embeddings.

ChatGPT, being the supportive friend it is (/s), told me that there wasn't ever an issue about embeddings but could be a myriad of other issues:

  • Metric for search (dot product vs cosine similarity)

  • Vector normalisation

  • Embedded text data

  • ... and many more
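The metric point is easy to see with toy numbers: a dot product grows with vector length, while cosine similarity normalises length away, so two vectors pointing the same direction score identically. A quick illustration (toy 2-D vectors, not real embeddings):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Divide by both magnitudes, so only the direction matters.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u = [1.0, 2.0]
v = [2.0, 4.0]  # same direction as u, twice the length

print(dot(u, u), dot(u, v))        # dot product rewards length: 5.0 vs 10.0
print(cosine(u, u), cosine(u, v))  # cosine treats u and v the same: 1.0 and 1.0
```

This is why mixing unnormalised vectors with a dot-product index can rank the "wrong" document first even when its direction (i.e. its meaning) is a worse match.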

Embed or not to embed?

Okay, now I’m convinced it’s the data that I’m embedding.

I went back to my code, to the embed_keys.py file.

I looked at the initial data that was sent for embedding – ChatGPT told me to embed text that’s concise, but store all the other important information in the metadata of the document.

So I did just that.

From each key being given embedding text of:

Bunch No: 930D
Number of keys: 8 key(s)
Location: DORM 93
Remarks: 
Key Section: 'D' - DORMS
Building: DORM BLOCK
...

I reduced it to:

DORM 93. Key number/bunch number: 930D. 8 key(s). DORM BLOCK Building. Remarks:  'D' - DORMS.

✅ Concise and understandable.

Maybe that's what the embedding model wants: A concise sentence that captures the information!

I excitedly re-embedded all the documents and tested it out, querying for the keys to bunk 930D...

It didn't work again.

I went through all that nail-biting, toe-curling research using ChatGPT and Google just to find out that the changes I made didn't work.

Don't even get me started about how hard I had to research how to delete a ChromaDB collection.

You might think what in the world is going on...

Here's the rundown

  • First, I wanted to build a RAG system to practice the skills I've learned

  • So I decided to solve a (non-existent) issue – the nuances and context around the keys in my building. How easy it would be to just ask a chatbot "What keys are bunk 930?" and have it return the relevant information.

  • Hence, I built a RAG system using ChromaDB and a local embedding model

  • After embedding, I tried asking the vector DB "What keys are bunk 930" but it didn't work

  • I was baffled, because I thought embedding models were supposed to understand the nuances of ‘bunk, dorm, dormitory’ – but they didn't

  • I went through a nail-biting research session using ChatGPT and Google, because embedding models are a black box and these types of issues are not easily debuggable.

  • ChatGPT suggested (among other potential issues) that maybe the data I'm embedding for each key might be too verbose.

  • I went back and reduced the embedding data for each key, making it concise – a one-liner like "DORM 930. KEY/BUNCH NUMBER 930D. DORM BLOCK"

  • I was excited to test again by asking "What keys are bunk 930?"

  • Did NOT work

Yeah, that sums it up.

I was close to giving up. Maybe it was built to answer YouTube questions (it’s a joke pointing to how one of the lectures required us to code out a RAG system querying a dataset of YouTube video titles, descriptions, and links) and not answer questions about the keys.

So, I finally took a break. I went about my day. Every single thing I do, the problem haunts me, nudging me every second.

I caved in to the hauntings.

I thought about it...

How do I make them understand that when I ask "What keys are bunk 930?", I actually want the program to understand that I'm asking for the Dorm 930 keys?

And something clicked!

“If RAG is a system to retrieve relevant documents (all it returns is just documents and context for the LLM to understand), then that means I can add my own context to be embedded!”

I opened up my laptop.

I created another file called info.json that contains all the important context for the keys.

[
    ...,
    {
        "remarks": "dorm 930 is also known as dormitory 930 or bunk 930 which is bunch 930d at dorm block",
        "id": "dorm_930_info"
    },
    ...
]
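These context documents then get embedded right alongside the key documents, so a query about "bunk 930" has something to latch onto. A sketch of how the two files could be combined into one list for embedding (the function name and the concise-text format are illustrative):

```python
import json
from pathlib import Path


def load_documents(keys_path="merged_keys.json", info_path="info.json"):
    """Build (id, text) pairs for embedding: key records plus context notes."""
    docs = []
    for record in json.loads(Path(keys_path).read_text()):
        # Concise one-liner per key, as in the reduced embedding text.
        docs.append((record["id"],
                     f"{record['location']}. Key number/bunch number: {record['bunch_no']}."))
    for note in json.loads(Path(info_path).read_text()):
        # Context notes embed as-is -- they carry the 'bunk == dorm' nuance.
        docs.append((note["id"], note["remarks"]))
    return docs
```

The retrieval then surfaces the alias note together with (or instead of) the raw key record, and both land in the LLM's context.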

and finally...

The Win

I cleared all the database items, re-embedded all the information, and asked the vector database "What keys are bunk 930?" and LO AND BEHOLD! It returned the information correctly!

So, what's next?

Well, now that I've got the prototype RAG built, I will:

  • build a more robust system using other vector DBs like Pinecone.

  • use LangChain with LLMs so answers are nicely structured when asked (obviously, there's more to make of it, and it can expand to other domains – not just asking about keys in the place)

Yeah, that’s all I could think about as of writing.

Well. That's about it for now. Catch y'all soon!