Enhancing Generative AI with RAG Models for Accurate and Relevant Content

Over the course of our business journey spanning more than 10 years, we have been involved in some very bespoke and interesting software development projects. I’d like to take a moment to reflect on a few recent projects that involved cutting-edge generative AI technology.

What is RAG and Why Is It Important?

As AI technology continues to evolve, the need for accuracy and relevance in generated content becomes increasingly important. This is where Retrieval-Augmented Generation (RAG) models come into play. RAG combines the strength of retrieval systems with the generative power of language models to create a robust framework that grounds AI-generated content in factual, pre-existing documents. This significantly reduces the risk of misinformation, ensuring that the output is both accurate and reliable.

I’m not talking about a simple ChatGPT-like system alone. Rather, these projects involve a RAG model that provides ground truth to a Large Language Model (LLM). This means written content is based on relevant and factual information, rather than made-up or “hallucinated” information. Unfortunately, this is the current reality of LLMs: there is a chance (even for commercial models) that the generated content is hallucinated and not based on real-world facts.

Our Approach

One of our projects required a large number of fact-based articles. Another involved historical books in various versions, and yet another involved a huge dataset of documents.

Below is a diagram to illustrate the approach.

For the RAG and LLM infrastructure needed for such a project, the majority of it was based on a brilliant open-source project called PrivateGPT. This provided the pipeline needed, including ingesting embeddings, storing the data, processing prompts, retrieving embeddings, API calls to various services, and a front-end web GUI. It is a truly remarkable project and a developer-centric product which greatly cuts down the cost and time involved in developing such a project from scratch.

If you are interested in learning more about this project, the repository can be found here: PrivateGPT Repository

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://github.com/zylon-ai/private-gpt

Handling Large Datasets – Important Considerations for Article Content

Exporting the Data

We created a system to extract the content from the online source into a format that a RAG and LLM pipeline could readily understand. After some research, we found that converting and stripping the content down to simple HTML preserved the structural aspects of the articles while keeping the markup understandable to the RAG model.

Why Simple HTML is a Good Choice for RAG and LLM

Simple HTML ensures that the content retains its structure without introducing extraneous elements that could confuse the model. HTML also standardizes the format, making it easier for the model to parse and understand.

Export Structure

The data is arranged into a folder and file structure that preserves the publish date, with folders by year and month. Each article is then stored in its own file, named with the publish date and the title. This structure gives us a useful way to keep track of the ingestion process, and a reference for observing which articles the RAG model had chosen to provide ground-truth context to the LLM.
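
To make the layout concrete, here is a minimal Python sketch of how such a path might be built. The folder root, slug rule, and function name are illustrative assumptions, not the actual export code.

```python
from datetime import date
from pathlib import Path
import re

def article_path(export_root: Path, published: date, title: str) -> Path:
    """Build a year/month folder path and a date-plus-title filename for one article."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")  # illustrative slug rule
    return export_root / f"{published:%Y}" / f"{published:%m}" / f"{published:%Y-%m-%d}-{slug}.html"

# e.g. articles/2023/07/2023-07-14-our-article-title.html
print(article_path(Path("articles"), date(2023, 7, 14), "Our Article Title"))
```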

Why is Data Structure Important?

Data structure is crucial as it ensures that the data can be efficiently processed and retrieved. It also helps maintain organization and traceability, which is essential when dealing with large datasets.

Cleaning the Data

This involves cleaning out many irrelevant aspects of the HTML: elements, classes, and other markup that provided no real meaning to the content itself. This is an important step as it reduces the noise in the data. The less noise, the better, as it helps the RAG and LLM models focus on the important details within the data.

Why is Cleaning the Data and Removing Noise Important for RAG and LLM?

Cleaning the data ensures that the model is not distracted by irrelevant information. This makes the training process more efficient and helps produce more accurate results.
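
As a rough illustration of this kind of cleaning, the sketch below uses BeautifulSoup. The tags it strips and the decision to drop all attributes are assumptions for the example; the real cleaning rules were tailored to the source content.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Tags that rarely carry article meaning (an illustrative list, not the project's actual rules).
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "iframe"]

def clean_html(raw_html: str) -> str:
    """Strip noisy elements and presentational attributes while keeping the structural markup."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(NOISE_TAGS):
        tag.decompose()      # remove the element and everything inside it
    for tag in soup.find_all(True):
        tag.attrs = {}       # drop classes, ids, inline styles and other attributes
    return str(soup)
```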

Ingesting the Data

The next phase involves ingesting the data. From our research, we concluded that OpenAI’s large embedding model (text-embedding-ada-002) was the best fit for extracting the most “meaning” from the articles in order to create and store what are known as “embeddings”. You could think of this as creating maps of meaning that can later be searched to retrieve the articles most relevant to the topic being discussed.

What is Ingesting Data in Relation to Creating Embeddings?

Ingesting data involves processing raw data to create embeddings. Embeddings are a type of numerical representation of text that captures its semantic meaning.

What are Embeddings and How Do They Work?

Embeddings are vector representations of text. They work by converting words, sentences, or even entire documents into dense vectors in a high-dimensional space. These vectors capture the semantic meaning of the text, allowing the model to understand and process it more effectively.
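
A toy example helps show the idea. The vectors below are made up and only four dimensions long (real embedding models produce hundreds or thousands of dimensions), and cosine similarity is just one common way to compare them.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means the vectors point the same way (similar meaning); values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional "embeddings" purely for illustration.
dog   = np.array([0.9, 0.1, 0.0, 0.2])
puppy = np.array([0.8, 0.2, 0.1, 0.3])
stock = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(dog, puppy))  # high score: related meanings
print(cosine_similarity(dog, stock))  # lower score: unrelated meanings
```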

Considerations for Generating the Embeddings

There were several careful considerations we needed to make in order to generate the most useful embeddings:

  • Selection of Embeddings Model: We selected OpenAI’s large embedding model as it provided a large token window and was the most accurate in creating meaningful embeddings.
  • Embedding Dimensions: We wanted to maximize the dimensions of the embeddings according to the maximum size allowed by the embedding model. This would therefore maximize the detail captured.
  • Service Provider: Because OpenAI has not provided this model as open source, we were dependent on using their services to process the embeddings. We considered other service providers, such as Google, but based on benchmarks, OpenAI was the clear winner.
  • Compatibility with LLM: It is important to select an embedding model that will work with the LLM that has been selected.
  • Processing Method: We considered both local processing and via API. However, the large model was not available for download, so we used OpenAI’s API to process embeddings.
  • Ingestion Parallel Mode: This was simply for speed. By processing multiple articles in parallel, we could effectively speed up the process of creating embeddings (a brief sketch of this follows below).

Naturally, there were other considerations but the above are the ones that come to mind as important.
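
For the parallel ingestion mentioned above, a simple sketch might look like the following. It uses the OpenAI Python client with the embedding model named earlier; the worker count, helper names, and lack of batching or retry logic are simplifications (the real pipeline is handled by PrivateGPT).

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_article(text: str) -> list[float]:
    """Create one embedding vector via the OpenAI embeddings API."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding

def embed_articles_in_parallel(articles: list[str], workers: int = 8) -> list[list[float]]:
    """Embed several articles at once purely for speed; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed_article, articles))
```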

Storing the Ingested Data

The next important step is to select a method for storing the embeddings. It was important to select a database structure that would be efficient, reduce the risk of data corruption, be a proven method for storing embeddings, and be fast for retrieval.

We first tried a simple file-based vector database. However, writing embeddings became very slow as the database grew, so we decided to go with a database method that felt more natural. We chose Postgres.

Why is Postgres a Good Choice for Embeddings?

Postgres provides a robust, scalable, and reliable database solution. It supports complex queries and indexing, which are essential for efficiently retrieving embeddings. Additionally, its open-source nature and widespread adoption make it a trusted choice.
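
PrivateGPT’s Postgres support builds on the pgvector extension, so the storage roughly amounts to a table with a vector column. The schema below is an illustrative guess (the table and column names are invented, and the dimension must match the embedding model); in practice PrivateGPT manages its own tables.

```python
import psycopg  # pip install "psycopg[binary]"; requires the pgvector extension on the server

# Illustrative schema only; not the actual tables PrivateGPT creates.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS article_embeddings (
    id        bigserial PRIMARY KEY,
    source    text NOT NULL,          -- e.g. the exported HTML file path
    chunk     text NOT NULL,          -- the text that was embedded
    embedding vector(1536) NOT NULL   -- dimension must match the embedding model
)
"""

with psycopg.connect("dbname=rag user=rag") as conn:  # connection string is a placeholder
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(CREATE_TABLE)
```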

Retrieval-Augmented Generation (RAG)

With the embeddings effectively stored within the Postgres database, we could move on to the methodologies of RAG. Naturally, there were various elements to consider:

  • How Many Top-Rated Relevant Articles Should Be Retrieved? To answer this, we needed a sense of how many articles existed for the general categories of topics. Based on the number of articles available across a wide spread of topic categories, it became a relatively simple decision to match the number of retrieved articles to the relevancy of the topic. We also allowed a relatively large window, as we planned a multi-step approach to both vet and sort for relevancy.
  • Similarity Score: After the first step, the retrieved list of embeddings (articles) was further vetted by a similarity score. Articles that did not score above a certain similarity threshold to the topic (prompt) were discarded from the list.
  • Reranking: The last method is reranking: sorting all articles remaining in the list into order of relevancy, so that the most relevant articles are treated as the most important to the topic. This also helps protect against the LLM prompt being seeded with information that is somewhat related but not particularly important to the topic.

The rerank process used a local model: a cross-encoder called ms-marco-MiniLM-L-2-v2. More details on this model can be found on Hugging Face: ms-marco-MiniLM-L-2-v2

  • Final Step: The final step cuts the total number of articles by removing those ranked least relevant, keeping the context within the LLM’s maximum token window. By this stage, the least relevant articles would not have much bearing on the prompt anyway, so this serves mainly as a safeguard against exceeding limits.

How Does RAG Work?

RAG works by using a combination of retrieval and generation. The retrieval component searches for relevant documents based on a query, while the generation component creates a coherent response grounded in those documents.

How Does a Rerank Work and What is a Cross-Encoder Model?

A rerank works by reordering the retrieved documents based on their relevance to the query. A cross-encoder model, like ms-marco-MiniLM-L-2-v2, evaluates the relevance of each document by considering the query and document together, providing a more accurate relevance score.
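
A minimal rerank sketch with the same cross-encoder, using the sentence-transformers library (assuming the documents have already been retrieved and filtered):

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

def rerank(query: str, documents: list[str]) -> list[str]:
    """Score each (query, document) pair together and return documents sorted by relevance."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2")
    scores = model.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]
```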

How is Relevant Data Retrieved from the Embeddings Database and Added to the Prompt?

  1. Convert the Prompt: The user’s prompt is converted into an embedding vector using the same model that created the document embeddings.
  2. Search for Similar Embeddings: The prompt embedding is used to search the database for the most similar document embeddings based on similarity measures.
  3. Retrieve Relevant Documents: The top relevant documents, which have the highest similarity scores to the prompt embedding, are retrieved from the database.
  4. Augment the Prompt: The retrieved relevant information is incorporated into the original prompt to provide additional context.
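
Tying the four steps together, a sketch might look like this. It assumes the illustrative article_embeddings table from earlier, the pgvector cosine-distance operator, and a simple prompt template; the actual retrieval and prompt assembly are handled inside PrivateGPT.

```python
import psycopg              # pip install "psycopg[binary]"
from openai import OpenAI   # pip install openai

client = OpenAI()

def retrieve_context(prompt: str, top_k: int = 5) -> list[str]:
    """Steps 1-3: embed the prompt, then fetch the closest stored article chunks."""
    embedding = client.embeddings.create(
        model="text-embedding-ada-002", input=prompt
    ).data[0].embedding
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    with psycopg.connect("dbname=rag user=rag") as conn:
        rows = conn.execute(
            "SELECT chunk FROM article_embeddings "
            "ORDER BY embedding <=> %s::vector LIMIT %s",  # <=> is pgvector's cosine distance
            (vector_literal, top_k),
        ).fetchall()
    return [row[0] for row in rows]

def augment_prompt(prompt: str, context_chunks: list[str]) -> str:
    """Step 4: prepend the retrieved context so the LLM answers from ground truth."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {prompt}"
```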

The LLM

By now it is probably at least somewhat obvious which LLM provider we selected: OpenAI.

Why Was OpenAI a Better Choice Compared to Google and Others?

We wanted to select a model that was the most up-to-date and had the best benchmarks. That was the core reason for selecting OpenAI’s GPT-4-turbo model. This was the best model available at the time. Naturally, even as of the time of writing this article, we now have GPT-4o, which excels beyond GPT-4-turbo, and by the time you read this, there may be another model that outdoes this – either provided by OpenAI or another provider.

With OpenAI as a reliable choice for both embeddings and the LLM, the embeddings coupled naturally with the model. This also streamlined development, as we only needed one service provider for both the embedding and LLM API calls.

Considerations for the LLM

Here are some additional considerations:

  • Context Window: Ensure the LLM can handle the context size required.
  • System Prompt: Customize the system prompt for optimal performance.
  • Temperature: Adjust the temperature for more creative or focused outputs.
  • Control of a Random Seed: Ensure reproducibility if needed.
  • Ability to Switch Between RAG with LLM and LLM Only: Flexibility to use different modes as required.

The Web GUI

We used Gradio for the web GUI.

Why is Gradio Such a Great Choice for This Project?

Gradio allows for quick and easy creation of interactive web interfaces. It’s user-friendly and integrates seamlessly with Python, making it ideal for rapid prototyping and deployment.
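
In PrivateGPT the Gradio interface is already wired up, but to give a feel for why it is so quick to work with, a chat UI can be stood up in a few lines (the handler below is a placeholder):

```python
import gradio as gr  # pip install gradio

def answer(message, history):
    """Placeholder handler: the real app would run the RAG retrieval and LLM call here."""
    return f"(RAG-grounded answer for: {message})"

gr.ChatInterface(fn=answer, title="Chat with your documents").launch()
```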

Server-side and Hosting

We wanted to make this project simple to work with and easily maintainable, so we wrapped it into Docker containers. This streamlined both the development environment and the production deployment.

Why is Docker Great for Both Development and Production Environments?

Docker provides a consistent environment for development, testing, and deployment. It ensures that the application runs the same way, regardless of where it is deployed. This reduces the “it works on my machine” problem and enhances scalability and maintainability.

Periodic Data Ingestion

The nature of some projects requires a need to keep a RAG database up-to-date with the latest and newest data. In one case, this involved periodically exporting new articles and ingesting them into the RAG model to keep the database up to date and relevant to the ground-truth-based generative AI tasks. So our team wrote scripts to handle the ingestion process and web-based software to handle the recurring export process of the articles.
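
As a rough sketch of such a recurring job (the export and ingestion functions below are placeholders for the project-specific scripts, and cron or a task queue would work just as well as a sleep loop):

```python
import time

def export_new_articles() -> list[str]:
    """Placeholder: fetch and clean articles published since the last run."""
    return []

def ingest(articles: list[str]) -> None:
    """Placeholder: embed each article and store it in the vector database, as described above."""
    ...

if __name__ == "__main__":
    while True:
        ingest(export_new_articles())
        time.sleep(24 * 60 * 60)  # run once a day
```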

Potential Use-cases

You could imagine how useful such a system could be for a large variety of use-cases.

Business:

  • Content Generation: Automated generation of articles, reports, and summaries based on factual data.
  • Customer Support: Enhanced chatbot systems that provide accurate, context-based responses.
  • Market Research: Analyzing trends and generating insights from large datasets.

Community:

  • Education: Creating interactive and informative platforms for students and educators.
  • Research: Assisting researchers in finding and summarizing relevant papers and articles.
  • Public Health: Generating informative content for public awareness campaigns.

Can you think of a use-case that your business could implement? Or perhaps a community-run project? One community project I have thought of is a service for researchers to search the massive arXiv for papers relevant to their industry or current research. It would allow a meaningful search of papers based on semantics rather than keywords alone. Furthermore, a user could have a “chat” with the papers to better understand a paper, or have it explained within a given context.

If you have a use-case you can think of that could benefit the wider community or an idea for your business, I’d be keen to have a chat with you about it. There might just be an opportunity waiting to be discovered.
