Query Translation for RAG (Retrieval-Augmented Generation) Applications
Enhance RAG’s Insights: Optimizing Data Context for Better Results
Building a chatbot that truly understands your users with OpenAI and Gemini? There’s more to it than meets the eye. One major hurdle with Retrieval-Augmented Generation (RAG) chatbots is “hallucination.” Imagine a data scientist creating a tax-focused chatbot for both experts and novices in an organization. Unless the chatbot is built with sufficient context and guidance from tax specialists (such as tax jargon and legal nuances), its responses will vary wildly based on the user’s knowledge level. This can lead to confusion and, ultimately, a chatbot that fails to serve its purpose. So, how do we ensure our RAG chatbots provide accurate information regardless of the user’s expertise? This can be achieved by implementing advanced RAG pipelines, and query translation is the first stage of that pipeline. This article is the first of my four-part series “Analyzing RAG for beginners”, aiming to help you make an informed RAG implementation.
Implementing a successful retrieval mechanism makes up the major chunk of a RAG implementation, so utilizing all relevant context is key to building a successful retriever. Consider the same example of a tax chatbot; user queries can vary from “what are the jurisdictions that received federal tax relief for natural disasters in February 2024 by the IRS” to “states with disaster tax relief by the IRS this year.” While we cannot always eliminate user query ambiguity, we can refine queries to make them as relevant as possible. Query translation is a series of steps that improve the likelihood of a match between the query embeddings and the document embeddings, ensuring the best possible context for the LLM’s answer generation. In this article, we will cover the basics of how context is fetched based on the user input, how query translation helps, popular techniques for query translation, and how to code the different approaches using LangChain in Python.
Computing Distance Between Query and Document
Similarity searches can range from a naive distance calculation between two vectors to a complex semantic similarity calculation based on embeddings. Some broad approaches to sentence similarity are word-to-word, structure-based, and vector-based similarities. A separate article will be dedicated to the various types of embeddings and similarity calculations. Here, let us implement a simple search using the popular all-MiniLM-L6-v2 sentence embedding model (an uncased, i.e., case-insensitive, BERT-style model) with cosine similarity to identify the relevant answer.
# Importing the necessary libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load BERT uncased sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Define the query and documents
query = "Where is XYZ having its head office?"
documents = [
"XYZ's head office is located in New York City.",
"There is an office for XYZ in London.",
"head of XYZ is in San Francisco."
]
# Encode query and documents into BERT embeddings
query_embedding = model.encode(query)
document_embeddings = model.encode(documents)
# Calculate cosine similarity between query and documents
similarities = cosine_similarity([query_embedding], document_embeddings)[0]
# Find the index of the most similar document
most_similar_index = similarities.argmax()
# Print the most similar document and its similarity score
print(f"""Answer : {documents[most_similar_index]}\nscore : {similarities[most_similar_index]}""")
In the above example, we are using all-MiniLM-L6-v2 embeddings, which map each sentence or paragraph fed into the model to a single dense vector of 384 numerical values representing its information. The general process flow for this embedding is as follows:
- This is an uncased (BERT-style) model, so the input text is converted to lowercase by default.
- The converted string is then tokenized (dividing the string into words or sub-words). Examples of sub-words are “play” and “ing” for the word “playing”. This is done so the model can fall back on root words it knows from the training corpus, as not every word is exposed to the model. It also aids in morphological analysis.
- The tokens are mapped into a 384-dimensional embedding space; the model’s transformer layers capture the semantics, including position and the kind of word (like POS and NER tags). A pooling layer at the end of the transformer layers reduces the dimensionality of the representation, focusing on the most relevant features within the sentence. The resulting sentence embedding is a dense vector, meaning most features of the original representation contribute to its final form.
- The same process occurs for the list of documents to search for the answer.
- Cosine similarity is computed between the query embedding and the three document embeddings to identify the most relevant representation. As a one-liner, cosine similarity is a metric based on the angle between two vectors (cos(θ)), obtained by normalizing their dot product by their magnitudes; the result always lies between -1 and 1, with 1 being a perfect match (a minimal sketch of this calculation follows the list).
- The document with the maximum value for cosine similarity is returned as the response.
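To make the cosine similarity step concrete, here is a minimal sketch that recomputes the scores by hand with NumPy, reusing the query_embedding and document_embeddings from the earlier snippet; it should agree with the sklearn result above.
import numpy as np
def cosine_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); the result lies between -1 and 1
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# scores computed manually for each document embedding
manual_scores = [cosine_sim(query_embedding, d) for d in document_embeddings]
print(manual_scores)  # should match the sklearn cosine_similarity values above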
Similar distance metrics are used to identify the relevant documents from vector stores. There are also bidirectional-LSTM-based embeddings like ELMo that are popular but do not capture the entire context as well as a transformer architecture does. To summarize, words are translated into machine-understandable vector representations and compared using a mathematical similarity calculation, which returns a number that theoretically represents the distance between the two embeddings. The results are then sorted appropriately for the distance metric adopted and returned to the user. Therefore, these embeddings, along with the appropriate distance metrics, help retrieve accurate contexts, which in turn generate an effective response. We will now look at implementing this with a vector store and the various query translation mechanisms.
Vector Store Implementation for Similarity Search
Let us implement a basic similarity search using the Chroma vector store and all-MiniLM-L6-v2 embeddings. Instead of storing the embeddings in a persistent DB, we are going to use an in-memory approach without persisting them (an in-memory vector database is a database system optimized for storing and querying high-dimensional vector data entirely in memory rather than on disk). Let us begin with 5 documents and extend later as needed. To ease the data extraction process, there are many document loaders provided by LangChain and LlamaHub. Let us look at one such example that loads Wikipedia content and uses the Chroma in-memory vector store to perform a similarity search.
# installing the necessary packages using the pip manager
!pip install unstructured langchain langchain-chroma sentence-transformers wikipedia
# importing all the necessary libraries
# wikipedialoader is part of the langchain document loaders
from langchain_community.document_loaders import WikipediaLoader
# chroma is the vector database we are using
from langchain_chroma import Chroma
from langchain_text_splitters import CharacterTextSplitter
# Using sentence embeddings
from langchain_community.embeddings.sentence_transformer import (
SentenceTransformerEmbeddings,
)
# loading the documents using wikipedialoader
docs = WikipediaLoader(query="family guy tv series", load_max_docs=5).load()
# checking the number of articles returned
len(docs)
# Now we need to split the text into chunks as the LLMs have an input token limit (maximum word limit for user input)
text_splitter = CharacterTextSplitter(chunk_size=3500, chunk_overlap=0) # every split will have at most 3500 characters with an overlap of 0 characters
docs = text_splitter.split_documents(docs)
# loading the embedding function for the documents
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# load it into Chroma - this is in-memory (will be available only for the duration of the program)
db = Chroma.from_documents(docs, embedding_function)
# performing a similarity search to fetch the most relevant context
db.similarity_search('who is the voice for Meg')
Relevant chunks of information are returned by the Chroma similarity search. You can find all available options for the Chroma store here; the search returns the most relevant chunks for the user query.
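If you also want to see how close each chunk is, the LangChain Chroma wrapper exposes similarity_search_with_score along with a k parameter to control how many chunks come back. A minimal sketch, assuming the db store built above (for Chroma's default distance metric, a lower score means a closer match):
# fetching the top-2 chunks along with their distance scores
results_with_scores = db.similarity_search_with_score('who is the voice for Meg', k=2)
for doc, score in results_with_scores:
    # lower score = closer match under the default distance metric
    print(round(score, 4), doc.page_content[:100])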
In order to retrieve the most relevant chunk, we need to provide an accurate user prompt. This is where we use query translation to design the most accurate prompt to pass to the vector store.
Query Translation and Its Types
With a basic understanding of how the distance is computed, we can proceed to craft the most contextual query achievable. Doing so will enable our end-users to flexibly query our model for results. Let us explore the popular query translation (query rewriting) mechanisms supported by LangChain.
Multi-Query
Intuition — We are taking a question and breaking it down into a few differently worded questions to offer multiple perspectives on the same question. This is done in the belief that the initial wording of the user query, when embedded, might not always align with the document we want to retrieve. By rewriting the query in different ways, we are trying to increase the likelihood of identifying the right document by having embeddings with nuances similar to the document embedding nuances.
Implementation — We are going to utilize the MultiQueryRetriever provided by LangChain and log the execution to monitor the different queries generated. In this approach, the LLM generates different versions of the user query based on a predefined prompt template. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. This broadens the scope of retrieval, but it is a double-edged sword: we have to carefully determine how many documents fit within the LLM context size and prevent unwanted information from leaking into our context. Let us look at a quick implementation, reusing the vector store and embeddings above to test all the approaches.
# We are using the Azure OpenAI model for this approach ( This default retriever is not supported for the older versions)
# that is, langchain.llms.AzureOpenAI does not work and langchain_openai.AzureChatOpenAI works
import os
import logging
logging.basicConfig()
# setting up the AzureOpenAI configuration
os.environ["AZURE_OPENAI_API_KEY"] = ""
os.environ["AZURE_OPENAI_ENDPOINT"] = ""
os.environ["AZURE_OPENAI_API_VERSION"] = ""
os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"] = ""
# importing the relevant libraries for multiqueryretrieval and LLM
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import AzureChatOpenAI
# reusing the same question as before
question = 'who is the voice for Meg'
# initializing the LLM model
llm = AzureChatOpenAI(openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"])
# setting up the retriever
retriever_from_llm = MultiQueryRetriever.from_llm(
retriever=db.as_retriever(), llm=llm
)
# setting the logger to listen to specific tasks
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
# Generating multiple contexts for the same question and retrieving the relevant contexts
unique_docs = retriever_from_llm.invoke(question)
len(unique_docs)
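The retriever stops at the document union; to actually answer the user, the retrieved chunks still need to be handed to the LLM. A minimal sketch of that final step, reusing the llm and unique_docs above (the prompt wording here is an illustrative assumption, not part of MultiQueryRetriever):
from langchain.prompts import ChatPromptTemplate
# stuffing the union of retrieved chunks into a simple answer prompt
answer_prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on this context:\n{context}\n\nQuestion: {question}"
)
context = "\n\n".join(doc.page_content for doc in unique_docs)
response = llm.invoke(answer_prompt.format(context=context, question=question))
print(response.content)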
Drawbacks — Even though this approach aims to provide the best possible answer, it doesn’t always work out that way. To make this process efficient, human evaluation is required to determine the maximum number of chunks to use and the maximum number of queries to generate, and a response synthesizer function has to be designed, among other activities, to make the most of the returned contexts. This approach, as seen here, requires the user to design a clear prompt, which in turn drives the sub-query creation, making the process more reliant on a general-purpose pre-trained LLM that might not understand the context of niche user queries. Additionally, the cost and latency of this approach are higher, as we have to make more retrievals and generations. Having a broader contextual window and giving equal weight to all the contexts will degrade the model’s performance if not all the returned information is relevant to the question.
RAG-Fusion
Intuition — This approach is an enhanced multi-query approach with a ranker at the end of the multi-query layer that ranks the contexts by relevance and consumes them in that order instead of taking a naive union like before. When multiple queries are passed through the retriever, it generates a list of relevant documents for each query. By ranking the contexts by relevance, we focus on the most relevant pieces of information first and the less relevant information later. The calculation is designed so that even lower-ranked documents are taken into consideration, which maximizes the contextual information passed on to the LLM.
Implementation — We will pass the results generated from the multi-query approach to a reciprocal rank fusion (RRF) function that sorts all the retrieved contexts in descending order of their fusion score. Each appearance of a document contributes a score of 1 / (rank + k), where rank is the document's position in a given ranked list and k is a smoothing constant that keeps the score non-zero even for contexts at the bottom of a list. The score is inversely proportional to the document's rank, so documents appearing higher in a list receive a higher score, and the final RRF score of a document is the sum of its scores across all the ranked lists.
from langchain.load import dumps, loads
def reciprocal_rank_fusion(results: list[list], k=60):
    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}
    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Update the score of the document using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)
    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    # Return the reranked results as a list of tuples, each containing the document and its fused score
    return reranked_results
# reciprocal_rank_fusion expects one ranked list per query; unique_docs from the
# MultiQueryRetriever is already a flattened union, so we wrap it in a single list
# here just to demonstrate the scoring
final_docs = reciprocal_rank_fusion([unique_docs])
len(final_docs)
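The RRF math pays off when each rewritten query contributes its own ranked list rather than a flattened union. A minimal sketch of that usage, reusing the db store and the reciprocal_rank_fusion function above; the second query variant here is hand-written purely for illustration (in practice it would come from the multi-query LLM step):
# build one ranked list per query variant, then fuse them
retriever = db.as_retriever()
query_variants = [
    "who is the voice for Meg",                        # the original user query
    "which actress voices Meg Griffin in Family Guy",  # hand-written rewrite, purely illustrative
]
per_query_results = [retriever.invoke(q) for q in query_variants]  # list of ranked lists
fused = reciprocal_rank_fusion(per_query_results, k=60)
# each entry is (Document, fused_score), already sorted in descending order of score
for doc, score in fused[:3]:
    print(round(score, 4), doc.page_content[:80])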
Query Decomposition
Intuition — Understanding a complex user query in one attempt is difficult, as it may require looking up multiple contexts, assimilating them, and returning a nuanced output. To achieve this, we break down a complex user query into multiple parts and process them either in parallel or sequentially. This simplifies the prompts and increases the context available to the retrieval process.
Implementation — As seen in the intuition, this approach tries to maximize the relevant context for the question asked by decomposing it instead of rephrasing it multiple times, thereby maximizing the context retrieved. Let us try implementing both the recursive and the parallel answering approaches.
- Recursive Answering Approach — Here we pass the questions one by one, along with the previous Q-and-A responses and the context fetched for the current question. This retains the old perspective and synchronizes the solution with the new perspective, making the answer more nuanced. This approach has proven to be effective against really complex queries.
# let us first break down the complex query into simpler parts
from langchain_openai import AzureChatOpenAI
from langchain.prompts import ChatPromptTemplate
# Decomposition logic explained to the model each time a new base query comes in
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)
# the prompt template only has one argument - question (which is the user query)
question = "who is the voice for Meg"
prompt = prompt_decomposition.format(question=question)
# let us apply the llm to this prompt template and extract the 3 queries
llm = AzureChatOpenAI(openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"])
questions = llm.invoke(prompt)
questions = questions.content
questions = questions.split("\n")
questions
# the generated questions are below
"""['1. Who provides the voice for Meg in the TV show Family Guy?',
'2. What is the name of the actress who voices Meg on Family Guy?',
'3. Has the voice actor for Meg on Family Guy ever changed throughout the show's run?']
"""
# Preparing a sequential Q-and-A prompt that drives the answer formulation
sequential_prompt = """Here is the question you need to answer:
\n --- \n {question} \n --- \n
Here is any available background question + answer pairs:
\n --- \n {q_a_pairs} \n --- \n
Here is additional context relevant to the question:
\n --- \n {context} \n --- \n
Use the above context and any background question + answer pairs to answer the question: \n {question}
"""
seq_decomposition_prompt = ChatPromptTemplate.from_template(sequential_prompt)
# now we have to format the question answer pairs generated to make it interpretable by the LLM
# function to structure each question and answer pair and feed it in the prompt for the next question thereby improving the context
def format_qa_pair(question, answer):
    """Format Q and A pair"""
    formatted_string = ""
    formatted_string += f"Question: {question}\nAnswer: {answer}\n\n"
    return formatted_string.strip()
# setting up the vectorstore with the prepared chroma vectors
vectorstore = db.as_retriever()
q_a_pairs = "" # initializing the string to blank
# iterating over each question
for q in questions:
    # combining the retrieved contexts for the current sub-question
    # (you can fetch the first context alone or the first 2 alone, etc.)
    context = ',.,'.join([i.page_content for i in vectorstore.invoke(q)])
    print(context)  # optional: inspect the retrieved context
    # passing the question, context and q_a_pairs to build the prompt and hit the LLM
    # (initially the q_a_pairs history will be an empty string)
    answer = llm.invoke(seq_decomposition_prompt.format(question=q, q_a_pairs=q_a_pairs, context=context))
    # pairing the generated answer with its question to extend the Q-and-A history
    q_a_pair = format_qa_pair(q, answer.content)
    # combining all the q_a_pairs
    q_a_pairs = q_a_pairs + "\n---\n" + q_a_pair
    # printing the result for the current sub-question
    print(f'Q: {q}\nA: {answer.content}')
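To finish, you could hand the accumulated Q-and-A history back to the LLM together with the original question to produce one consolidated answer. A minimal sketch of that optional closing step, under the assumption that q_a_pairs, question, and llm are as built above (the prompt wording is illustrative):
# synthesizing a final answer from the accumulated question/answer history
final_prompt = ChatPromptTemplate.from_template(
    """Here is a set of background question + answer pairs:\n{q_a_pairs}\n
Use them to answer the original question: {question}"""
)
final_answer = llm.invoke(final_prompt.format(q_a_pairs=q_a_pairs, question=question))
print(final_answer.content)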
- Parallel Answering Approach — In the parallel answering approach, we are decomposing the user prompt into nuanced slices as before. The difference is that we are attempting to solve them in parallel. Here, we answer each question individually and then combine them together for a much more nuanced context, which is then used for answering the user query. Depending on the quality of the sub-queries, this approach is an efficient solution for most use cases.
# we will reuse the code until question generation (until we populate the "questions" variable)
# the prepared template is used for answering each sub-query
chat_template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""
prompt_decomposition = ChatPromptTemplate.from_template(chat_template)
# function to answer each sub-query and append the answers to a list, which is then returned as the response
from langchain.chains import RetrievalQA  # needed for the per-question retrieval QA chain

def parallel_retrieve_and_throw(questions, llm, vectorstore, prompt_decomposition):
    rag_results = []
    for q in questions:
        # build a retrieval QA chain that stuffs the retrieved context into the prompt
        qa_chain = RetrievalQA.from_chain_type(
            llm,
            retriever=vectorstore,
            chain_type_kwargs={"prompt": prompt_decomposition}
        )
        result = qa_chain.invoke({"query": q})
        context = result['result']
        rag_results.append(context)
    return rag_results
# setting up the retriever
vectorstore = db.as_retriever()
# storing the generated answers as a list
primary_answers = parallel_retrieve_and_throw(questions, llm, vectorstore, prompt_decomposition)
# function to convert the Q-A pairs into a single string
def format_qa_pairs(questions, answers):
    """Format Q and A pairs"""
    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"
    return formatted_string.strip()
context = format_qa_pairs(questions, primary_answers)
# the returned context is below
"""
Question 1: 1. Who is the voice actor for Meg in the Family Guy TV show?
Answer 1: Lacey Chabert is the voice actor for Meg in the first season of Family Guy.
Question 2: 2. Who provided the voice for Meg in the Family Guy movie?
Answer 2: Lacey Chabert provided the voice for Meg in the first season of Family Guy, but was replaced by Mila Kunis for the rest of the series' run.
Question 3: 3. Who is the actress behind the voice of Meg in the animated series Family Guy?
Answer 3: Lacey Chabert voiced Meg in the animated series Family Guy in Episodes 1-9 of the first season. However, Mila Kunis replaced her from Episode 10 onwards for the rest of the series.
"""
# Prompt for the final refined answer
template = """Here is a set of Q+A pairs:
{context}
Use these to synthesize an answer to the question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
# passing the user query (base query) and the condensed context to get the final result
answer = llm.invoke(prompt.invoke({'question':question,'context':context}))
answer.content
"""
The voice for Meg in the animated series Family Guy has changed over time.
Lacey Chabert was the voice actress for Meg in the first season, but was
replaced by Mila Kunis for the rest of the series' run.
"""
- Sub Question Query Engine (a parallel prompting tool in llama-index) — Another popular parallel approach is the query engine called “SubQuestionQueryEngine”. The concept is to break a given query into sub-queries, each answering a particular piece of information supporting the base query. You can find the implementation here. It makes the most of the given context by using a shotgun approach, targeting each piece of information separately and finally assimilating the results into a single comprehensive text. What is fascinating about this method is how effectively it designs the sub-questions to extract a specific piece of information from the data source each is directed to. For example, the user input “What happens to rains during and after summer” gets split into “What happens to rains during summer” and “What happens to rain after summer”, and the answer to each is returned in a thought flow that the user can verify. A condensed answer is then returned using these intermediate answers; a minimal sketch of this engine follows.
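Below is a minimal llama-index sketch of the SubQuestionQueryEngine idea. The import paths follow the llama-index 0.10.x package layout and may differ in other versions; the ./weather_docs folder, the tool name, and its description are illustrative assumptions, and an LLM (for example an OpenAI key) is assumed to be configured for llama-index.
# a sketch only: package layout follows llama-index 0.10.x and may vary by version
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
# build a simple vector index over local documents (the folder path is illustrative)
documents = SimpleDirectoryReader("./weather_docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# expose the index as a tool that the sub-question engine can direct sub-queries to
tools = [
    QueryEngineTool(
        query_engine=index.as_query_engine(),
        metadata=ToolMetadata(
            name="weather_docs",
            description="Notes about seasonal rainfall patterns",
        ),
    )
]
# the engine breaks the query into sub-questions, answers each against the tool,
# and condenses the intermediate answers into a single response
sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query("What happens to rains during and after summer?")
print(response)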
Step Back Prompting
Intuition — The intuition behind step-back prompting is to encourage the LLM to think deeply and engage in a form of meta-reasoning before tackling a task. We generate a more abstract question from the user query to get a broader perspective of the task and retrieve more contexts. Then, we retrieve the direct context based on the user query and combine the two contexts to get a more relevant answer. This is considered the best approach for reasoning questions, as the model gets to explore a broader view of the user query and compare it with the specific context from the direct user query to produce a more nuanced response. To summarize, this method involves abstraction and causal reasoning, which makes the model more generalized because it focuses on the underlying principles as well.
Implementation — This is a simple two-step process: first, understand the underlying governing principles; second, combine the base response with the broad reasoning to generate the output response. Let us implement this using LangChain, reusing the LLM and Chroma store constructed earlier.
from langchain.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
# examples are provided to make the model understand how a generic question has to be phrased
examples = [
{
"input": "Could the members of The Police perform lawful arrests?",
"output": "what can the members of The Police do?",
},
{
"input": "Jan Sindel’s was born in what country?",
"output": "what is Jan Sindel’s personal history?",
},
]
# We now transform these to example messages
example_prompt = ChatPromptTemplate.from_messages(
[
("human", "{input}"),
("ai", "{output}"),
]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
example_prompt=example_prompt,
examples=examples,
)
# prompt to initiate few-shot prompting
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"""You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:""",
),
# Few shot examples
few_shot_prompt,
# New question
("user", "{question}"),
]
)
# generating the step back question
question = "who is the voice for Meg"
broad_qn = llm.invoke(prompt.format(question=question))
# returned question
"""
AI: who provided the voice for a character named Meg?
"""
# Preparing the final response prompt
response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.
# {normal_context}
# {step_back_context}
# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)
vectorstore = db.as_retriever()
broad_context = vectorstore.invoke(broad_qn.content)
short_context = vectorstore.invoke(question)
output = llm.invoke(response_prompt.invoke({'normal_context':short_context, 'step_back_context':broad_context, 'question':question}))
print(output.content)
# returned response is below
"""
The voice for Meg in Family Guy is provided by Mila Kunis, who won the role
after auditions and a slight rewrite of the character, in part due to her
performance on That '70s Show. MacFarlane called Kunis back after her first
audition, instructing her to speak slower, and then told her to come back
another time and enunciate more. Once she claimed that she had it under
control, MacFarlane gave her the role.
"""
HyDE — Hypothetical Document Embeddings
Intuition — The idea behind this approach is that documents are large chunks of well-phrased, dense, informed text, whereas the user query is usually not as well constructed. Therefore, we construct a hypothetical document or answer from the user query (based on the LLM's innate knowledge), which, when embedded, is expected to lie closer to the document chunks than the user query itself in the high-dimensional embedding space. This approach performs best in certain domains, depending on the base model: if the model is pre-trained on a corpus in or similar to your domain, it might work best; otherwise it isn't the right approach, as the level of generalization increases. (If you are trying to reduce the level of generalization using prompt engineering, this approach really doesn't make sense for your use case.)
Implementation — Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details (relative to the required context, mostly). Then an unsupervised contrastively learned encoder (e.g., Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, from which similar real documents are retrieved based on vector similarity. This second step grounds the generated document in the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Let's code this approach using LangChain for the same question and context as earlier. This method can be combined with other approaches, for example by generating two or more different kinds of hypothetical documents based on the kind of user or user query and fetching context for each; we can then rank and assimilate the information into the final context used to answer the user query. As with the previous methods, we can also use a combination of compatible LLMs (local and enterprise) to maximize efficiency and reduce costs, and we should design the approach to scale if needed.
# reusing the question and vectorstore as earlier
from langchain.prompts import ChatPromptTemplate
# HyDE document generation prompt
template = """Please write a brief history about the family guy cartoon from the question
Question: {question}
Passage:"""
prompt_hyde = ChatPromptTemplate.from_template(template)
# generating a hypothetical document based on the user input
hypothetical_document = llm.invoke(prompt_hyde.format(question=question))
# retrieving relevant document based on hyde document instead of user query
retriever = db.as_retriever()
retrieved_docs = retriever.invoke(hypothetical_document.content)
# implementing the final answer generation using the retrieved_docs and user query
template = """Answer the following question based on this context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
answer = llm.invoke(prompt.format(context=retrieved_docs,question=question))
print(answer.content)
# returned answer is below
"""
Lacey Chabert voiced Meg in Episodes 1–9, and Mila Kunis has voiced Meg from
Episode 10 onwards.
"""
Conclusion
In conclusion, the initial step of query translation stands out as a crucial element in the advanced RAG pipeline, demonstrating its effectiveness in enhancing the contextual understanding of the language model. However, it is important to acknowledge that this process involves multiple calls to the language model and relies on both human and machine collaboration to refine queries and generate the necessary context. This iterative nature impacts factors such as consistency, cost, and latency.
To optimize this process, it’s essential to implement a strategic blend of approaches, leveraging both in-house and deployed language models. While the code presented may serve as a starting point, it requires customization with pertinent information and procedures, such as guardrails, routing, and appropriate indexing, to ensure its readiness for action.
For everyday tasks like email, news, or media summarization, these initial implementations may suffice. Nonetheless, there is always room for refinement and improvement. Starting with a foundation and progressively enhancing it is a prudent approach; through continuous iteration and adaptation, one can maximize the efficiency and effectiveness of the retrieval and generation process.
Resources
- Exploring Retrieval-Augmented Generation (RAG) and Its Alternatives, Medium (2024) (my previous post giving an overview of RAG)
- Lance Martin (LangChain), Learn RAG From Scratch — Python AI Tutorial from a LangChain Engineer, YouTube (2024)
- Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan, Precise Zero-Shot Dense Retrieval without Relevance Labels, arXiv:2212.10496 (2022)
- Andrei-Laurentiu Bornea, Fadhel Ayed, Antonio De Domenico, Nicola Piovesan (Paris Research Center, Huawei Technologies), Ali Maatouk (Yale University), Telco-RAG: Navigating the Challenges of Retrieval-Augmented Language Models for Telecommunications, arXiv:2404.15939 (2024) (an easy-to-understand domain implementation of RAG with a generic GPT model)
- Shay Palachy Affek, Document Embedding Techniques, Medium (2019)