Mastering RAG Systems: From Fundamentals to Advanced, with Strategic Component Evaluation

Elevating your RAG System: A step-by-step guide to advanced enhancements via LLM evaluation, with a real-world data use case

A basic RAG system consists of four main steps. We will examine each of these steps and illustrate how it applies to our project.

Illustration of the basic RAG system applied to legal codes. Image generated by the author using the Diagrams: Show Me GPT

1- Data ingestion

This phase involves the collection and preprocessing of relevant data from a variety of sources, such as PDF files, databases, APIs, websites, and more. This stage is closely tied to two key concepts: documents and nodes.

In the terminology of LlamaIndex, a Document refers to a container that encapsulates any data source, like a PDF, an API output, or data retrieved from a database. A Node, on the other hand, is the basic unit of data in LlamaIndex, representing a “chunk” of a source Document. Nodes carry metadata linking them to the document they belong to and to other nodes.

In our project, a document is the full Civil Code text and a node is an article of this code. However, since the API already returns the data parsed into individual articles, we don't need to chunk the documents ourselves, which avoids the errors this process can introduce.

The code to create the nodes is the following:

from dataclasses import dataclass
from typing import List
import os
from uuid import uuid4

from llama_index.core.schema import TextNode

from loguru import logger

from utils import load_json
from .preprocess_legifrance_data import get_code_articles
from .constants import path_dir_data

@dataclass
class CodeNodes:
    code_name: str
    max_words_per_node: int = 4000
    _n_truncated_articles: int = 0
    ...

    def __post_init__(self):
        self.articles = self.try_load_data()
        self.nodes = self.create_nodes(self.articles)
        ...

    def create_nodes(self, list_articles: List[dict]):
        nodes = [
            TextNode(
                text=article["content"],
                id_=str(uuid4()),
                metadata=self._parse_metadata(article),
            )
            for article in list_articles
        ]
        return nodes

    def try_load_data(self) -> List[dict]:
        path = os.path.join(path_dir_data, f"{self.code_name}.json")
        try:
            code_articles = load_json(path=path)
        except FileNotFoundError:
            logger.warning(f"File not found at path {path}. Fetching data from Legifrance.")
            code_articles = get_code_articles(code_name=self.code_name)
        truncated_articles = self._chunk_long_articles(code_articles)
        return truncated_articles

    def _parse_metadata(self, article: dict) -> dict:
        metadata = {k: v for k, v in article.items() if k not in ["content", "num"]}
        metadata = {
            "Nom du code": self.code_name,
            **metadata,
            "Article numero": article["num"],
        }
        return metadata

    def _chunk_long_articles(self, articles: List[dict]) -> List[dict]:
        ...

  • First, we load the articles in try_load_data, as mentioned in the previous section.
  • We chunk the long articles into “sub-articles” so that we can embed them without truncation. This is done in _chunk_long_articles. The full code for the chunking can be found in data_ingestion/nodes_processing.py; a hedged sketch of this step is shown right after the example node below.
  • We create nodes using the llama-index TextNode class. At the end of this step, our data consists of a list of nodes, where each node has a text, an id and its metadata. For example:
TextNode(
    id_='c0ee164f-ae58-47c8-8e5b-e82f55ac98a9',
    embedding=None,
    metadata={
        'Nom du code': 'Code civil',
        'livre': 'Des différentes manières dont on acquiert la propriété',
        'titre': 'Des libéralités',
        'chapitre': 'Des dispositions testamentaires.',
        'section': 'Du legs universel.',
        'id': '1003',
        'Article numero': '1003'
    },
    excluded_embed_metadata_keys=[],
    excluded_llm_metadata_keys=[],
    relationships={},
    text="Le legs universel est la disposition testamentaire par laquelle le testateur donne à une ou plusieurs personnes l'universalité des biens qu'il laissera à son décès.",
    start_char_idx=None,
    end_char_idx=None,
    text_template='{metadata_str}\n\n{content}',
    metadata_template='{key}: {value}',
    metadata_seperator='\n'
)
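For reference, here is a minimal sketch of how a long article could be split by word count. The helper name chunk_long_article and the exact splitting strategy are assumptions for illustration; the project's actual implementation lives in data_ingestion/nodes_processing.py.

from typing import List

def chunk_long_article(article: dict, max_words_per_node: int = 4000) -> List[dict]:
    # Hypothetical helper: split one article into "sub-articles" of at most
    # max_words_per_node words, duplicating its metadata on each chunk.
    words = article["content"].split()
    if len(words) <= max_words_per_node:
        return [article]
    return [
        {**article, "content": " ".join(words[i : i + max_words_per_node])}
        for i in range(0, len(words), max_words_per_node)
    ]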

The metadata will be used for both the embedding and the LLM context. Specifically, we will embed the metadata as key-value pairs concatenated with the content, following the text_template format. This format is also used to represent the node when forming the LLM context. If desired, you can exclude some or all of the metadata from the embedding or the LLM input by changing excluded_embed_metadata_keys and/or excluded_llm_metadata_keys.
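As an illustration of these knobs, the sketch below excludes the raw 'id' metadata key (an arbitrary choice for the example) and prints how a node is rendered for the embedding model versus the LLM. It assumes nodes is the list of TextNode objects created earlier.

from llama_index.core.schema import MetadataMode

node = nodes[0]  # any TextNode created during data ingestion

# Exclude the raw 'id' key from what the embedding model and the LLM see
node.excluded_embed_metadata_keys = ["id"]
node.excluded_llm_metadata_keys = ["id"]

# Inspect how the node is rendered in each case (metadata + content, per text_template)
print(node.get_content(metadata_mode=MetadataMode.EMBED))
print(node.get_content(metadata_mode=MetadataMode.LLM))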

2- Indexing and storage

Indexing is the process of creating a data structure that enables data queries. This typically involves generating vector embeddings, which are numerical representations that capture the core essence of the data. Additionally, various metadata strategies can be adopted to efficiently retrieve information that is contextually relevant. The embedding model is used not only to embed the documents during the index construction phase, but also to embed any queries we make using the query engine in the future.
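To make this dual role concrete, llama-index embedding models expose separate methods for embedding documents and queries. A minimal sketch with FastEmbedEmbedding (the model name below is only an illustrative choice):

from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Illustrative local embedding model; swap in MistralAIEmbedding or OpenAIEmbedding as needed
embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Used when building the index
doc_vector = embed_model.get_text_embedding("Le legs universel est la disposition testamentaire ...")

# Used at query time
query_vector = embed_model.get_query_embedding("Qu'est-ce qu'un legs universel ?")

print(len(doc_vector), len(query_vector))  # both vectors have the same dimensionality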

Storing is the subsequent phase where the created index and other metadata are saved to prevent the need for re-indexing. This ensures that once data is organized and indexed, it remains accessible and retrievable without undergoing the indexing process again.

The vector embeddings can be stored and persisted in a vector database. In the next section we will see how to create and run this database locally.

a- Setting up the vector database

Qdrant will be used as a vector database to store and index the article embeddings along with the metadata. We will run the server locally using Qdrant's official Docker image.

First, you can pull the image from Docker Hub:

docker pull qdrant/qdrant

Following this, run the Qdrant service with the command below. This will also map the necessary ports and designate a local directory (./qdrant_storage) for data storage:

docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

With the setup complete, we can interact with the Qdrant service using its Python client:

from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)
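As an optional sanity check (not part of the original pipeline), you can list the collections currently stored on the server; on a fresh instance this should be empty:

# Lists the collections known to the local Qdrant server
print(client.get_collections())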

b- Indexing and storing the data

Below is the code to index and store the data.

from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.embeddings.mistralai import MistralAIEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.exceptions import UnexpectedResponse
from loguru import logger

from data_ingestion.nodes_processing import CodeNodes

def index_given_nodes(
    code_nodes: CodeNodes,
    embedding_model: MistralAIEmbedding | OpenAIEmbedding | FastEmbedEmbedding,
    hybrid_search: bool,
    recreate_collection: bool = False,
) -> VectorStoreIndex:
    """Given a list of nodes, create a new index or use an existing one.

    Parameters
    ----------
    code_nodes : CodeNodes
        The nodes to index.
    embedding_model : MistralAIEmbedding | OpenAIEmbedding | FastEmbedEmbedding
        The embedding model to use.
    hybrid_search : bool
        Whether to enable hybrid search.
    recreate_collection : bool, optional
        Whether to recreate the collection, by default False.
    """
    collection_name = code_nodes.nodes_config
    client = QdrantClient("localhost", port=6333)
    if recreate_collection:
        client.delete_collection(collection_name)

    try:
        count = client.count(collection_name).count
    except UnexpectedResponse:
        count = 0

    if count == len(code_nodes.nodes):
        logger.info(f"Found {count} existing nodes. Using the existing collection.")
        vector_store = QdrantVectorStore(
            collection_name=collection_name,
            client=client,
            enable_hybrid=hybrid_search,
        )
        return VectorStoreIndex.from_vector_store(
            vector_store,
            embed_model=embedding_model,
        )

    logger.info(f"Found {count} existing nodes. Creating a new index with {len(code_nodes.nodes)} nodes. This may take a while.")
    if count > 0:
        client.delete_collection(collection_name)

    vector_store = QdrantVectorStore(
        collection_name=collection_name,
        client=client,
        enable_hybrid=hybrid_search,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(
        code_nodes.nodes,
        storage_context=storage_context,
        embed_model=embedding_model,
    )
    return index

The logic of the indexing function is quite simple:

  • First, we create the Qdrant client to interact with the vector database.
  • We check the number of nodes in the desired collection. The collection name is provided by code_nodes.nodes_config and corresponds to the current experiment (a concatenation of the code name, the embedding method and, optionally, the advanced RAG technique(s) such as base, window-nodes, hybrid search… more on these experiments later). If the number of stored nodes differs from the number of nodes to index, we create a new index:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    code_nodes.nodes,
    storage_context=storage_context,
    embed_model=embedding_model,
)

Otherwise, we return the existing index:

index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embedding_model,
)

The full code to create or retrieve indexes can be found in the ./retriever/get_retriever.py module. You can also find more information about this phase in the llama-index documentation.

3- Querying

Given a user query, we compute the similarity between the query embedding and the indexed node embeddings to find the most similar data. The content of these similar nodes is then used to generate a context. This context enables the Language Model to synthesize a response for the user.
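Conceptually, this retrieval step is a nearest-neighbour search over embeddings. The stripped-down sketch below illustrates the idea with plain cosine similarity, independently of Qdrant and llama-index (the function names are illustrative only):

from typing import List

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_top_k(query_emb: np.ndarray, node_embs: List[np.ndarray], k: int = 5) -> List[int]:
    # Return the indices of the k nodes whose embeddings are most similar to the query
    scores = [cosine_similarity(query_emb, emb) for emb in node_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]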

We define a query engine based on the index created in the previous section. This engine is an end-to-end pipeline that takes a natural language query and returns a response, as well as the context retrieved and passed to the LLM.

from llama_index.core.query_engine import BaseQueryEngine
from llama_index.core import VectorStoreIndex
from llama_index.core import PromptTemplate

def get_query_engine_based_on_index(
    index: VectorStoreIndex,
    similarity_top_k: int = 5,
) -> BaseQueryEngine:
    query_engine = index.as_query_engine(similarity_top_k=similarity_top_k)
    query_engine = update_prompts_for_query_engine(query_engine)
    return query_engine

update_prompts_for_query_engine is a function that allows us to change the prompt of the response synthesizer, i.e. the LLM responsible for taking the context as input and generating an answer for the user.

def update_prompts_for_query_engine(query_engine: BaseQueryEngine) -> BaseQueryEngine:
    new_tmpl_str = (
        "Below is the specific context required to answer the upcoming query. You must base your response solely on this context, strictly avoiding the use of external knowledge or assumptions.\n"
        "---------------------\n"
        "{context_str}\n"
        "---------------------\n"
        "Given this context, please formulate your response to the following query. It is imperative that your answer explicitly mentions any relevant code name and article number by using the format 'according to code X and article Y,...'. Ensure your response adheres to these instructions to maintain accuracy and relevance. "
        "Furthermore, it is crucial to respond in the same language in which the query is presented. This requirement is to ensure the response is directly applicable and understandable in the context of the query provided.\n"
        "Query: {query_str}\n"
        "Answer: "
    )
    new_tmpl = PromptTemplate(new_tmpl_str)
    query_engine.update_prompts({"response_synthesizer:text_qa_template": new_tmpl})
    return query_engine

In this new template, we emphasize the importance of preventing hallucinations. We also instruct the LLM to always reference the code name, such as the civil code, and the article number before providing a response. Moreover, we instruct it to reply in the language of the query to support multilingual queries.

The create_query_engine function in the query_engine module is the entrypoint for creating the RAG pipeline. The arguments for this function define the RAG configuration parameters.

The code below creates a basic query engine and generates a response based on a specific query:

from query.query_engine import create_query_engine

query_engine = create_query_engine()

response = query_engine.query("What are the conditions required for a marriage to be considered valid ?")

print(response)

You can find the full code to create the query engine in ./query/query_engine.py. More information about query engines can be found in the llama-index documentation.
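To see which articles were actually retrieved and passed to the LLM, you can inspect the source nodes attached to the response object returned above; a minimal sketch:

# Each source node carries a similarity score plus the retrieved article and its metadata
for node_with_score in response.source_nodes:
    print(node_with_score.score)
    print(node_with_score.node.metadata.get("Article numero"))
    print(node_with_score.node.get_content()[:200])  # beginning of the retrieved article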

4- Evaluation

Evaluation is crucial to assess the performance of the RAG pipeline and validate the decisions made regarding its components.

In this project, we will employ an LLM-based evaluation method to assess the quality of the results. This evaluation will involve a golden LLM analyzing a set of responses against certain metrics to ensure the effectiveness and accuracy of the RAG pipeline.

Typically, a RAG system can be evaluated in two ways:

  • Response Evaluation: Does the response align with the retrieved context and the query?
  • Retrieval Evaluation: Are the data sources retrieved relevant to the query?

llama-index provides several evaluators to compute various metrics, such as faithfulness, context relevancy, and answer relevancy. You can find more details about these metrics in the evaluation section.
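As a small illustration of these built-in evaluators, a single-response faithfulness check can be set up in a few lines. This sketch reuses the query_engine from the previous section; it is not the project's evaluation code, which is detailed below.

from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.llms.openai import OpenAI

# The golden LLM acting as judge
judge_llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
evaluator = FaithfulnessEvaluator(llm=judge_llm)

question = "What are the conditions required for a marriage to be considered valid?"
response = query_engine.query(question)

# Checks whether the answer is supported by the retrieved context
eval_result = evaluator.evaluate_response(query=question, response=response)
print(eval_result.passing, eval_result.score, eval_result.feedback)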

Here’s the process for creating evaluators in this project:

  • Initially, we generate n questions using an LLM (in our case, gpt-4) about the legal code we want to evaluate. For cost reasons, we've set n=50, but a larger number could provide a more confident assessment of the system's performance. An example of an LLM-generated question is “What are the legal consequences of bigamy in France?”, or in French: “Quelles sont les conséquences juridiques de la bigamie en France?”. Note that this approach has limitations, as the questions generated by the LLM may not mirror the actual distribution of questions from real users. You can find the module for question generation in evaluation/generate_questions.py.
  • Using the list of generated questions, an evaluation metric, and a query engine, we generate a list of responses. Each response is scored between 0 and 1 by the evaluator, which also provides a feedback text to explain its scoring.
  • llama-index provides evaluators for different metrics, such as the ContextRelevancyEvaluator to compute the context relevancy. In addition, we override the evaluate_response method of these evaluators to take into account the metadata when embedding and creating context.
from typing import Any, Optional, Sequence

from llama_index.core.evaluation import ContextRelevancyEvaluator, EvaluationResult
from llama_index.core.response import Response
from llama_index.core.schema import MetadataMode

class CustomContextRelevancyEvaluator(ContextRelevancyEvaluator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def evaluate_response(
        self,
        query: Optional[str] = None,
        response: Optional[Response] = None,
        metadata_mode: MetadataMode = MetadataMode.ALL,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string and generated Response object.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        response_str: Optional[str] = None
        contexts: Optional[Sequence[str]] = None
        if response is not None:
            response_str = response.response
            contexts = [
                node.get_content(metadata_mode=metadata_mode)
                for node in response.source_nodes
            ]

        return self.evaluate(query=query, response=response_str, contexts=contexts, **kwargs)

  • Below is the code to obtain evaluation results from a specific evaluator, such as the context relevancy evaluator. It’s worth noting that you can generate these evaluations in an asynchronous mode for quicker results. However, due to OpenAI rate limits, we performed these evaluations sequentially.
from typing import Literal, List

from llama_index.core.query_engine import BaseQueryEngine
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import EvaluationResult
from llama_index.core.schema import MetadataMode
from tqdm import tqdm

from utils import load_json
from evaluation.custom_evaluators import CustomContextRelevancyEvaluator

PATH_EVAL_CODE_CIVIL = "./data/questions_code_civil.json"
PATH_EVAL_CODE_DE_LA_ROUTE = "./data/questions_code_de_la_route.json"

def evaluate_one_metric(
    query_engine: BaseQueryEngine,
    code_name: Literal["code_civil", "code_de_la_route"],
    metadata_mode: MetadataMode,
    llm_for_eval: str = "gpt-3.5-turbo",
) -> List[EvaluationResult]:

    evaluator = CustomContextRelevancyEvaluator(llm=OpenAI(temperature=0, model=llm_for_eval))

    eval_data: List[str] = load_json(globals()[f"PATH_EVAL_{code_name.upper()}"])

    results = []
    for question in tqdm(eval_data):
        response = query_engine.query(question)
        eval_result = evaluator.evaluate_response(
            query=question, response=response, metadata_mode=metadata_mode
        )
        results.append(eval_result)  # collect the per-question evaluation result

    return results

  • Finally, to get a score for a given pipeline (defined by its query engine), we average the scores assigned by the evaluator to each query, as in the short sketch below.
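A minimal sketch of this aggregation step, assuming results is the list of EvaluationResult objects returned by evaluate_one_metric:

# Average the per-question scores, skipping any missing values
scores = [r.score for r in results if r.score is not None]
mean_score = sum(scores) / len(scores) if scores else 0.0
print(f"Average score over {len(scores)} questions: {mean_score:.3f}")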

The complete code for evaluation is located in evaluation/eval_with_llamaindex.py.

For the upcoming RAG pipeline evaluations, we will use 50 questions generated by GPT-4 based on the French civil code. These questions can be found in ./data/questions_code_civil.json. The golden LLM used for evaluation is gpt-3.5-turbo.
