Exploring Retrieval-Augmented Generation (RAG) and Its Alternatives
Diving into the Helpfulness of Retrieval-Augmented Generation and Beyond in NLP Evolution
Retrieval-Augmented Generation (RAG) is the process of making a large language model (LLM) reference a custom knowledge base outside its training data before generating a response. LLMs are trained on vast volumes of data with billions of parameters, giving them powerful natural language capabilities. RAG extends these capabilities to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective way of keeping LLM output relevant, accurate, and useful in various contexts.
The aim of this article is to craft a comprehensive guide enabling you to make informed decisions about when to leverage — and when to sidestep — the power of RAG.
Overview of the RAG architecture
As illustrated in the image above, RAG's effectiveness depends on the quality of the context we feed the LLM. To understand this, let us look at the components in the illustration.
Knowledge Base — This is the organization's internal knowledge repository. The knowledge base comprises the source files, the transformed text, the embeddings, and the vector store. The process of creating a knowledge base is as follows:-
- Upload all the relevant files to the appropriate storage location.
- The files may be in different formats, so we use data loaders to convert them into a uniform representation such as plain text (with images extracted separately where needed).
- This text is then transformed into meaningful text. Common transformations include case conversion, text normalization, tokenization, stemming or lemmatization, and stop-word removal. The choice of tokenization technique is influenced by the specific characteristics of the LLM being used, as well as the requirements of the NLP task at hand.
- The tokenized content is then chunked based on the embedding model and the LLM context window. The chunked content is then embedded.
- Finally the embeddings are indexed and stored in the vector store. This database becomes the foundation for the retrieval stage.
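To make these steps concrete, here is a minimal sketch of the chunk, embed, and index stages. The sentence-transformers and faiss packages, the all-MiniLM-L6-v2 embedding model, and the naive character-based chunker are illustrative assumptions; any equivalent components would work.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assume the source files have already been loaded and converted to plain text.
documents = ["...plain-text content of file 1...", "...plain-text content of file 2..."]

def chunk(text, size=200, overlap=50):
    """Naive fixed-size character chunking; real pipelines chunk by tokens or sentences."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = [c for doc in documents for c in chunk(doc)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")           # assumed embedding model
embeddings = embedder.encode(chunks, normalize_embeddings=True)

# Index the embeddings; inner product on normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

# At query time: embed the user query and pull the nearest chunks.
query_vec = embedder.encode(["price of Aquafina in 2022"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k=2)
```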
Retriever — The retrieval stage in the RAG approach acts like a librarian searching for relevant information to answer a question. The aim of this step is to embed the user query and identify the most similar chunks of information in the vector store.
The working of this retriever is as follows:-
- Retrieval Mechanisms
- The retriever utilizes sophisticated retrieval mechanisms to access the knowledge base efficiently. These mechanisms are designed to retrieve relevant information based on the input query or context.
- Sparse retrieval methods employ traditional information retrieval techniques such as inverted index-based search or TF-IDF ranking. They construct an index of the external knowledge sources and use it to retrieve candidate documents or passages.
- Dense retrieval methods, on the other hand, leverage dense vector representations of documents and queries. They encode the documents and queries into dense embeddings using techniques like neural networks or pre-trained language models. Dense retrieval methods are often more computationally expensive but can capture semantic similarity more effectively compared to sparse retrieval approaches.
- Contextual retrieval methods take into account the context of the input query or document when performing retrieval. They leverage techniques such as contextual embeddings or attention mechanisms to capture the context and retrieve relevant information accordingly. These methods consider the surrounding text or previous interactions to retrieve information that is contextually relevant to the current task or conversation. This contextual understanding helps improve the quality and relevance of retrieved information, especially in conversational or dialogue-based applications.
- Scoring and Ranking
- Once candidate documents or passages are retrieved, the retriever computes similarity scores between the query/context and the retrieved items.
- Sparse retrieval methods typically use ranking algorithms such as BM25 or cosine similarity to score and rank the candidate documents based on their relevance to the query.
- Dense retrieval methods compute similarity scores using techniques like dot product or cosine similarity between the embeddings of the query/context and the retrieved documents. The retrieved items are ranked based on their similarity scores.
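For instance, a sparse retrieval pass with BM25 ranking can be sketched as follows; the rank_bm25 package and the toy corpus are illustrative assumptions:

```python
from rank_bm25 import BM25Okapi  # assumed sparse-retrieval package

corpus = [
    "Price of 1 Aquafina bottle is $1.20 in 2022.",
    "Price of 1 Bisleri bottle is $0.90 in 2022.",
    "Aquafina launched a new 2L pack in 2023.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query_tokens = "price of aquafina water bottle in 2022".split()

scores = bm25.get_scores(query_tokens)                 # BM25 relevance score per document
top_docs = bm25.get_top_n(query_tokens, corpus, n=2)   # highest-ranked candidate passages
```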
But simply passing the user query does not always provide the full context. As mentioned earlier, RAG thrives on the quality of the context. Consider the query "what is the price of Aquafina water bottle in 2022". If there is one document per brand and each document contains the string "Price of 1 bottle is XX", the similarity score between that text and the user query will be nearly identical across all documents. The top-ranked context may therefore point to the price of Bisleri instead of Aquafina. We would need the brand name to appear in the document context for the right result, but we cannot always control how documents are structured. To resolve this, we attach metadata to the chunks when storing them in the vector store. If we save the brand name and year as metadata, and use an NER model or an LLM to extract the same fields from the user input, we can filter down to the relevant chunks before computing distances. We can also extend this approach with a topic classifier for query reformulation. Prompt rewriting not only improves the nuance of the query but also reduces computational effort.
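A minimal sketch of this metadata-first retrieval is shown below; the sentence-transformers package is assumed, and the extracted brand and year stand in for the output of an NER model or LLM call:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Each chunk carries metadata captured at ingestion time.
chunks = [
    {"text": "Price of 1 bottle is $1.20.", "brand": "Aquafina", "year": 2022},
    {"text": "Price of 1 bottle is $0.90.", "brand": "Bisleri", "year": 2022},
]

query = "what is the price of Aquafina water bottle in 2022"
extracted = {"brand": "Aquafina", "year": 2022}  # assumed NER/LLM extraction from the query

# Filter on metadata first, then rank the survivors by embedding similarity.
candidates = [c for c in chunks
              if c["brand"] == extracted["brand"] and c["year"] == extracted["year"]]
query_emb = embedder.encode(query, convert_to_tensor=True)
ranked = sorted(
    candidates,
    key=lambda c: util.cos_sim(query_emb, embedder.encode(c["text"], convert_to_tensor=True)).item(),
    reverse=True,
)
print(ranked[0]["text"])  # Aquafina's price, not Bisleri's
```

Most vector stores expose the same idea natively through metadata filters on their query calls.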
Reader — This is the final functionality that passes the output response to the user. The various methods in the Reader function are as follows:-
- Prompt Compression
- Prompt compression refers to techniques used to shorten the instructions given to LLMs.
- It is like the KonMari method for tidying up the user prompt. Here are some common techniques for prompt compression:
Summarization: Identify the key points of your original prompt and condense them into a shorter version.
Selective Context: Focus on providing only the context that’s absolutely necessary for the LLM to understand your request.
Specialized Tools: There are AI tools designed specifically for prompt compression, like Selective Context or Microsoft's LLMLingua, which can help automate this process.
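As one illustration of the selective-context idea, the sketch below keeps only the sentences most similar to the user question; it assumes the sentence-transformers package and is far simpler than what dedicated tools like LLMLingua actually do:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def compress_context(context, question, keep=3):
    """Keep only the `keep` sentences most relevant to the question."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    q_emb = embedder.encode(question, convert_to_tensor=True)
    s_emb = embedder.encode(sentences, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, s_emb)[0]
    top = sims.argsort(descending=True)[:keep]
    # Preserve the original sentence order so the compressed prompt stays readable.
    kept = [sentences[i] for i in sorted(top.tolist())]
    return ". ".join(kept) + "."
```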
- Reranking the retriever results
- The context retriever returns a list of candidate passages based on the user query.
- These candidates are then re-ranked by relevance to the query, typically with a cross-encoder that applies cross-attention over the query and each passage jointly (a minimal cross-encoder sketch appears at the end of this subsection).
Here’s an analogy: Imagine searching a library for a book.
- The initial ranking is like browsing the library catalog by title or keyword. It gets you to a shelf containing relevant books.
- Reranking is like scanning the shelf itself, considering factors like author reputation, publication date, or blurbs on the cover to pick the most fitting book for your needs.
- The reranker serves several important purposes:
Refinement of Relevance
- The initial retrieval process retrieves a set of candidate documents or passages based on their relevance to the input query or context. However, these retrieved candidates may still contain irrelevant or noisy information.
- The reranker evaluates the relevance of these candidates to the input query or context more comprehensively. It considers factors beyond simple similarity scores, such as coherence, informativeness, or domain-specific criteria.
Improvement of Quality
- While the generation model aims to produce coherent and contextually relevant outputs, it may occasionally generate responses that are grammatically correct but semantically inconsistent or nonsensical.
- The reranker helps improve the overall quality of the generated outputs by filtering out low-quality responses or adjusting their rankings based on more sophisticated criteria.
Fine-tuning of Rankings
- The reranker fine-tunes the rankings of the generated outputs to better match the preferences or requirements of the specific application or task.
- It may prioritize certain characteristics of the responses, such as factual accuracy, diversity, or novelty, which may not be adequately captured by the initial retrieval and generation steps alone.
Adaptation to User Preferences
- In interactive settings or applications where user feedback is available, the reranker can adapt the rankings of the generated outputs based on user preferences or feedback.
- By incorporating user feedback into the reranking process, the system can learn and improve over time, providing more personalized and relevant responses to users.
Flexibility and Customization
- The reranker provides flexibility to customize the ranking criteria and optimize the generated outputs according to specific requirements or constraints of the application.
- It allows developers to tailor the reranking process to suit different use cases, domains, or user preferences, ensuring optimal performance and user satisfaction.
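Picking up the cross-encoder mentioned above, a minimal reranking sketch could look like the following; the sentence-transformers CrossEncoder class and the ms-marco-MiniLM-L-6-v2 checkpoint are assumptions, and any comparable reranker would do:

```python
from sentence_transformers import CrossEncoder

# Assumed reranker checkpoint; any cross-encoder trained on MS MARCO-style data works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is the price of Aquafina water bottle in 2022"
candidates = [
    "Price of 1 Bisleri bottle is $0.90 in 2022.",
    "Price of 1 Aquafina bottle is $1.20 in 2022.",
    "Aquafina sponsors several sporting events.",
]

# The cross-encoder attends over the query and each passage jointly, giving a
# sharper relevance score than the embedding distance used for initial retrieval.
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
```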
- Prompt Generation/ Augmentation
- This is effectively the last stop before asking the LLM to do the AG part of RAG, and it is also one of the most controllable and impactful components.
- This is the recent AI industry buzz — prompt engineering. It involves curating prompt templates: a parent template plus topic-specific ones that can be injected into the parent prompt to make the final prompt as contextual as possible.
- Generation using LLM
- The augmented input prompt, along with the encoded passages, is fed into the generation model, typically a large language model like GPT (Generative Pre-trained Transformer).
- The generation model processes the augmented input and generates a sequence of tokens representing the output response. This sequence is generated autoregressively, with each token conditioned on the preceding tokens.
- During generation, the model leverages both the input prompt and the retrieved passages to produce contextually relevant and coherent responses.
- After generation, the generated outputs are reranked based on their relevance and quality. Reranking criteria may include factors such as semantic similarity to the input query/context, coherence, informativeness, or adherence to specific task requirements.
- Reranking may involve scoring the generated outputs using techniques like cosine similarity, perplexity, or other domain-specific metrics, and adjusting their rankings accordingly.
- The highest-ranked generated output is selected as the final response and presented to the user or application.
Passing the appropriate data to the LLM when generating a RAG response provides an innate guardrail. But additional guardrails can be added to moderate the prompts and the retrieved contexts. Specialized open-source Python libraries such as Guardrails AI and NVIDIA's NeMo Guardrails can even perform semantic validation, such as checking for bias in generated text or bugs in generated code. This article provides more information on LLM guardrails.
Types of RAG
RAG architecture is broadly classified into three categories:- Naive, Advanced, and Modular. The primary difference between the naive and advanced approaches lies in the pre- and post-retrieval processes. Since the naive approach does not involve refinement steps such as query rewriting or candidate reranking, the resulting response is of poorer quality than with the advanced RAG approach. The naive approach essentially consists of indexing, retrieval, augmentation, and generation of the response, whereas the advanced approach adds steps to improve the quality of the data being indexed, rewrite the user query for the best context, add guardrails, rerank candidates, and compress prompts to add more nuance to the content fed to the LLM. Modular RAG differs from the naive approach by introducing enhanced functionalities such as a search module for similarity retrieval and fine-tuning of the retriever. One popular inference-time retrieval technique is the StepBack approach, where the model is prompted to explore a broader concept and collect a wider range of potentially relevant information, which is then used to refine the prompt and obtain a precise response.
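A minimal sketch of StepBack-style query reformulation is shown below; the llm and retrieve callables are stand-ins for whatever chat-completion and retrieval functions your stack provides, not a specific API:

```python
# A minimal StepBack-style reformulation sketch; `llm` and `retrieve` are
# hypothetical stand-ins for your chat-completion and retrieval functions.
STEP_BACK_TEMPLATE = (
    "Given the user question below, write a more generic 'step-back' question "
    "about the broader concept it involves.\n"
    "User question: {question}\nStep-back question:"
)

ANSWER_TEMPLATE = (
    "Background on the broader concept:\n{background}\n\n"
    "Using the background above, answer the original question precisely.\n"
    "Question: {question}\nAnswer:"
)

def step_back_answer(question, llm, retrieve):
    step_back_q = llm(STEP_BACK_TEMPLATE.format(question=question))
    background = retrieve(step_back_q)  # wider retrieval on the broader question
    return llm(ANSWER_TEMPLATE.format(background=background, question=question))
```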
If interested, try building your own RAG without LangChain by following this Hugging Face cookbook, or try building one using LangChain from here — replace the components as you see fit.
Benefits of RAG like Approach
- Increased Transparency and Explainability
- Enhanced Accuracy and Factual Grounding
- Improved Contextual Understanding
- Personalized Recommendations and Tailored Outputs
- Reduced Reliance on Manual Training and Updates
- Broader Applicability and Expanding Use Cases
Fine-tuning large language models
In the context of LLMs such as GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers), fine-tuning involves updating the model’s parameters by continuing training on a task-specific dataset with a lower learning rate. The general steps involved in fine-tuning LLMs are as follows:-
- Choosing a pre-trained model
- Preparing the dataset in the format specific to the pre-trained model
- Using this dataset for supervised learning and modifying the weights accordingly
- Using task-specific evaluation metrics to assess model performance
- Model refinement or replacement
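To make these steps concrete, here is a minimal supervised fine-tuning sketch using the Hugging Face transformers Trainer; the two-example dataset, the distilbert-base-uncased checkpoint, and the hyperparameters are purely illustrative assumptions:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny illustrative dataset; in practice this is your task-specific corpus.
data = Dataset.from_dict({
    "text": ["great product, works as described", "terrible support, item arrived broken"],
    "label": [1, 0],
})

model_name = "distilbert-base-uncased"  # any pre-trained checkpoint works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=data)
trainer.train()  # continues training the pre-trained weights on the task data
```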
Essentially, we are updating the model weights until a desired checkpoint gets created. Some common approaches to fine-tuning LLMs are as follows:-
- Full Fine-tuning
- In this approach, all the weights and parameters of the pre-trained LLM are adjusted during the fine-tuning process. It’s like completely retraining the model on your specific task data.
When to use
- The task-specific data is large and significantly different from the pre-trained data.
Pitfalls
- Computationally expensive, requiring significant processing power and time.
- If the model overfits, the effort needed to retrain the model is significant.
- Parameter-Efficient Fine-tuning (PEFT)
- This approach focuses on modifying only a small subset of the LLM's parameters during fine-tuning. This can be achieved with simpler techniques like gradient masking and saliency scores, or with more advanced techniques like the Lottery Ticket Hypothesis (LTH), Low-Rank Adaptation (LoRA), or Quantized LoRA (QLoRA); a minimal LoRA configuration sketch follows this list.
When to use
- Always preferred if there is no significant difference between the pre-trained and current training data.
- When the cost of utilizing enterprise-maintained LLMs like OpenAI's GPT is higher than running your own hosted LLM.
Pitfalls
- Selection of parameters might not always be perfect, leading to degradation in the model performance.
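Here is the minimal LoRA configuration sketch mentioned above, using the Hugging Face peft library; the gpt2 checkpoint and the chosen hyperparameters are illustrative assumptions, and the target modules differ per architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for any causal LM

lora_cfg = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2; varies by architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the full parameter count
# `model` can now be trained as usual; only the LoRA adapter weights are updated.
```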
- Transfer Learning
- In this approach we freeze the initial layers of the model, which have captured the general natural language features, and retrain the later layers, adding new layers where needed to adapt the model to the task at hand (see the layer-freezing sketch after this list).
When to use
- When you want to retain the model's existing knowledge and add additional functionality on top of it.
Pitfalls
- Probability of catastrophic forgetting (it can occur in any learning system without lossless memory). Techniques like Elastic Weight Consolidation (EWC) are being explored to mitigate this issue.
- The model's inherent biases might affect your final output as they get passed on. The worst-case scenario is having to replace the base model, retrain with a new model, and restructure the training data accordingly.
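Here is the layer-freezing sketch referenced above, assuming a BERT-style checkpoint from transformers; how many layers to freeze is an illustrative choice:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Freeze the embeddings and the first 8 encoder blocks, which hold general
# language knowledge; only the later blocks and the new classification head train.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False
```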
- Reinforcement Learning from Human Feedback (RLHF)
- In RLHF, the model learns to generate outputs by interacting with a human evaluator who provides feedback on the quality of the generated responses. The model's parameters are updated based on this feedback to improve its performance over time.
- While RLHF involves updating the model’s parameters based on task-specific feedback, it differs from traditional fine-tuning approaches used with LLMs in several key aspects:
— Trained on feedback that is typically task-specific and focuses on assessing the quality or utility of the model's outputs for a particular task.
— RLHF involves an online learning paradigm where the model interacts with the human evaluator in real time, receiving feedback on generated responses and updating its parameters accordingly. The other approaches are offline fine-tuning on a static dataset rather than real-time learning.
— RLHF often employs adaptive learning strategies where the model dynamically adjusts its behavior based on the received feedback.
When to Use
- RLHF is particularly useful when the evaluation criteria for the task are subjective or difficult to quantify using traditional objective metrics.
- RLHF is well-suited for tasks that are dynamic or evolve over time, where the evaluation criteria or requirements may change frequently.
- In scenarios where labeled training data is sparse or costly to obtain, RLHF offers a data-efficient approach to model training. By leveraging human feedback to guide learning, RLHF can achieve effective performance with relatively few labeled examples compared to traditional supervised learning approaches.
- When multiple evaluation criteria or objectives are involved in the learning process.
- RLHF is well-suited for human-in-the-loop systems where human expertise or judgment is integral to the decision-making process.
Pitfalls
- The entire fine-tuning process depends on the quality of the humans involved, leaving a margin for human error.
- RLHF typically requires a large number of interactions with human evaluators to achieve effective learning outcomes. Collecting sufficient and diverse feedback data can be challenging, especially in tasks where human expertise is scarce or costly to obtain.
- It often suffers from reward sparsity, where meaningful feedback signals are rare or delayed. Therefore human evaluators are expected to provide feedback in a timely manner.
- Designing appropriate reward functions for RLHF can be challenging and may require careful consideration of task objectives and evaluation criteria.
- Exploration-exploitation trade-off: the model must balance exploring new response strategies against exploiting behaviors already known to earn good feedback, and this balance is hard to tune.
Additionally, there is RLAIF, where an AI model substitutes for the human evaluator. It is a double-edged sword, capable of both enhancing and undermining the model's functionality. There is also the risk of privacy concerns, and the lack of human interaction narrows its usage.
To summarize, any kind of LLM fine-tuning is expensive, laborious, and time-consuming; uncertainty always exists, and it is mostly a continual training process. Other options are to take a lower-precision (quantized) version of an LLM or even to build a task-specific SLM, depending on how the model is going to be used. You can find the list of fine-tuning approaches here. Fine-tuning large language models (LLMs) can be a daunting journey, and many may never have the opportunity to explore it. However, there are now frameworks available that streamline this process, making fine-tuning more accessible with minimal effort.
RAG vs Fine-tuning
As we have seen, RAG focuses on connecting the LLM to an external knowledge base and relies heavily on the retrieval mechanism, whereas fine-tuning retains the model's natural language capabilities while adapting it to perform a specific task. So the principal questions are: how personalized is the task, and how much supporting data do we have? Let us take a look at a few considerations:-
- Data Availability — If we have a sizeable amount of training or source data and enough resources to turn it into training data, we can look at developing a smaller language model specific to the task or fine-tuning a medium-sized language model. One key thing to remember is that we don't always require a large language model. Beyond fine-tuning and RAG, a task-specific model will often perform better in static environments. The data world may be dynamic, but is your data reflecting that, and if so, how dynamic is it? In case of smaller data availability or rapidly evolving data formats and requirements, using LLM inference with effective prompt engineering might solve your zero-shot learning use cases.
- Training Infrastructure — With the rapid rise of GPU-poor companies, we can see that not everyone can afford expensive GPU-backed architectures. So it is prudent for organizations to understand the problem scenario before deciding to fine-tune a model. For example, consider a chatbot for customer support (pretty common nowadays, and often annoying at the user end). These chatbots need to leverage the versatility of LLMs and perform activities similar to the model's training data. Feeding the right context and leveraging the model's natural language abilities (a RAG-like approach) is the right way to handle this, rather than fine-tuning a similar model with your training data. Minimal training to personalize the chatbot to your organization's tone is understandable, but today's chatbots can mostly achieve this through prompt engineering with effective guardrails; the only cost is the addition of more non-knowledge-base tokens. This in turn frees up GPUs for an actual pain point, thereby opening new doors for improvement.
- Model Versatility — As discussed under data availability, dynamic data scenarios have been gradually increasing and will generally keep doing so. Having an enterprise-hosted and maintained, versatile LLM lets us focus on the insights and action points coming from the LLM rather than spending effort constantly fine-tuning a snapshot with newly arrived data. With the post-COVID surge in dataset pricing, fine-tuning models has become more expensive from the start. Another thing to consider is that the price of making API calls is not that high when you combine it with effective contextualization and efficient caching of results.
- Interpretability — Neural networks are still not 100% predictable. While tools exist to monitor model gradients and parameters (profilers, debuggers, and XAI packages like SHAP and DALEX), they do not provide enough nuance to be considered truly comprehensive, and fine-tuned networks face the same issue. There are methods like seeding and deterministic cuDNN (reusing computational algorithms) to reproduce results, but we still don't fully understand the inner workings. In the case of RAG, we have better control over the model's behavior, and the final artifact is an effective prompt that is passed to an LLM. It is easy to backtrack the RAG process (up to generation) because every stage performs a computation that a human can replicate. This better explainability makes RAG an ideal tool for strictly controlled user output or for intermediate steps that require certainty.
- Technical Expertise — Is sufficient technical expertise available to build or fine-tune a language model? Implementing RAG typically requires a moderate level of technical expertise and, in rare cases, a deep understanding of RAG internals. Fine-tuning, especially with large language models, demands high technical expertise. For RAG, configuring retrieval mechanisms, incorporating external data sources, and keeping data current can present challenges, but various pre-built RAG frameworks and tools simplify the process to some extent. On the other hand, crafting and curating top-tier training datasets, establishing fine-tuning goals, and orchestrating the fine-tuning workflow demand precision. Moreover, fine-tuning typically requires significant computational resources, underscoring the need for adept management of that infrastructure, along with a grasp of domain-specific intricacies and the development of tailored evaluation criteria. In essence, RAG demands less technical expertise and, coupled with its transparency, is easier to work with.
- Inference Speed — In RAG, the inference speed can vary depending on the complexity of the retrieval mechanism and the size of the knowledge base, whereas fine-tuning can offer fast inference speed. Consider a scenario where a company wants to improve its customer support service by implementing an AI-powered chatbot to handle customer inquiries and provide timely assistance. The company contemplates employing RAG in its chatbot to furnish contextually relevant responses by fetching data from its knowledge base (e.g., FAQs, product manuals). Inference speed is pivotal as the chatbot must promptly address customer queries. Alternatively, fine-tuning a pre-trained language model tailored for customer support tasks is an option. Though resource-intensive during training, this approach may yield swift inference speed in practice. Depending on the trade-off both are viable options.
Hybrid Approach
A better approach is to employ hybrids (RAG + fine-tuning). Some viable options for implementing hybrids are as follows:-
- RAG-guided Fine-tuning
- Fine-tune the LLM using examples generated by RAG as additional training data.
- Use the retrieved information from RAG to augment the training dataset for fine-tuning the LLM.
- This method leverages the contextually relevant responses generated by RAG to improve the fine-tuned LLM’s performance.
- Here, RAG is used for generating additional training data for fine-tuning the LLM.
- RAG with Attention-based Fusion
- This approach incorporates attention mechanisms to improve how the retrieved information is integrated with the prompt during response generation.
- The retrieval model finds relevant passages, and the LLM processes both the prompt and retrieved passages.
- An attention mechanism is then used to assign weights to different parts of the combined input, allowing the LLM to focus on the most relevant information from the retrieved passages while generating the response.
- T-RAG (Tree RAG)
- A pre-trained LLM is first fine-tuned on a specific task. Then, both the retrieval model and the LLM within the RAG system are fine-tuned jointly on task-specific data that includes the retrieved passages and desired outputs.
- When the query references any organizational entities, data concerning those entities is extracted from the entity tree and incorporated into the context.
- This joint fine-tuning allows the entire system to learn how to better leverage retrieved information for improved response generation on the target task.
- Retrieval-Augmented Dual Instruction Tuning (RA-DIT)
- The RA-DIT approach separately fine-tunes the LLM and the retriever. The LLM is updated to maximize the probability of the correct answer given the retrieval-augmented instructions, while the retriever is updated so that the documents it returns are more semantically relevant to the query and more useful to the LLM. Three kinds of fine-tuning are involved, each improving the model output.
Fine-tuning the dataset — Generating Q/A pairs, summarizing data, and incorporating chain-of-thought reasoning can lead to improved results when integrated with the models.
Language Model Fine-tuning — With the fine-tuned dataset in-hand, we can refine our LLM to yield two key benefits: optimize the LLM’s utilization of pertinent background knowledge and train it to generate accurate predictions even in cases of erroneously retrieved information, thereby empowering the model to draw upon its own knowledge base.
Retriever Fine-tuning — The retriever is fine-tuned using the LM-Supervised Retrieval (LSR) method. The LLM assesses the information fetched by the retriever, if it finds the information misaligned with the given query, it sends feedback to the retriever which is used for refining the search ensuring it fetches data that the LLM can effectively use.
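To illustrate the LSR idea, the sketch below scores each retrieved chunk by how much it helps a causal LM assign probability to the known answer, and turns those scores into soft targets a retriever could be trained against; the gpt2 checkpoint and prompt format are illustrative assumptions, not the exact RA-DIT setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # illustrative small causal LM
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_log_likelihood(query, chunk, answer):
    """Log-likelihood the LM assigns to the answer given the query plus one retrieved chunk."""
    prompt = f"Context: {chunk}\nQuestion: {query}\nAnswer:"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = lm(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # each position predicts the next token
    targets = input_ids[:, 1:]
    n = answer_ids.shape[1]
    answer_lp = log_probs[0, -n:].gather(1, targets[0, -n:].unsqueeze(-1))
    return answer_lp.sum().item()

chunks = ["Aquafina 1L price in 2022 was $1.20.", "Bisleri 1L price in 2022 was $0.90."]
scores = torch.tensor([answer_log_likelihood("price of Aquafina in 2022", c, "$1.20")
                       for c in chunks])
soft_targets = torch.softmax(scores, dim=0)  # distribution the retriever is trained to match
```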
Evaluation Metrics
Leveraging standardized benchmarks is a cornerstone of effective large language model evaluation. It aids comprehension of individual model performance and enables objective comparisons to identify the most suitable model for a given task. Constructing a performance evaluator for your LLM can be a challenging endeavor: it demands a thorough understanding of various facets of the LLM and a robust contextual grasp to develop a reliable evaluator. For example, if you are building a summarizer, you'll need a score that considers factors like whether there is sufficient context and whether there are contradictions with the original information. One key thing to remember is that evaluation is a multi-step, iterative process, not just eyeballing. Some key considerations for evaluation metrics are as follows:-
1) factual consistency
2) answer relevancy
3) coherence
4) toxicity
5) bias
Based on these factors and the LLM use-case, an appropriate scorer is used. Some popular ones are as follows (a small usage sketch follows the list):-
1) BLEU (BiLingual Evaluation Understudy)
2) ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
3) METEOR (Metric for Evaluation of Translation with Explicit Ordering)
4) BLEURT (Bilingual Evaluation Understudy with Representations from Transformers)
5) Prometheus and G-Eval
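As a small usage sketch, BLEU and ROUGE can be computed with the Hugging Face evaluate library (assuming it and its metric dependencies are installed); the prediction and reference strings are illustrative:

```python
import evaluate  # Hugging Face `evaluate` library; metric backends assumed installed

predictions = ["The Aquafina bottle cost 1.20 dollars in 2022."]
references = ["In 2022 the Aquafina bottle was priced at 1.20 dollars."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```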
The general process for model evaluation involves creating an evaluation dataset, identifying relevant metrics and scorers, applying those metrics to the eval dataset, and finally setting up an evaluation pipeline for continuous evaluation. Open-source frameworks like DeepEval facilitate this with minimal overhead. In addition to these metrics, human evaluation is necessary for a more comprehensive assessment. You can find a detailed note about LLM evaluation here.
Conclusion
In conclusion, Retrieval-Augmented Generation (RAG) combines the power of language models with external knowledge sources to produce contextually relevant responses. When used alongside fine-tuning techniques, RAG enhances the adaptability and accuracy of language models for various natural language tasks. Evaluation metrics such as perplexity, accuracy, and BLEU score provide valuable insights into the performance of these models. I hope this article has equipped you with the fundamentals to begin leveraging Large Language Models effectively.
References
- Aleksandra Piktus, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Douwe Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Meta Research (2020)
- Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Scott Yih, RA-DIT: Retrieval-Augmented Dual Instruction Tuning — arXiv.2310.01352 (2023)
- Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, Retrieval-Augmented Generation for Large Language Models: A Survey — arXiv.2312.10997 (2023)
- Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, Jiliang Tang, The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) — arXiv.2402.16893 (2024)
- Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui, Can Large Multimodal Models Uncover Deep Semantics Behind Images? — arXiv.2402.11281 (2024)
- Sebastian Raschka, Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) — Ahead of AI (2023)
- Philipp Schmid, How to Fine-Tune LLMs in 2024 with Hugging Face — philschmid blog (2024)
- Jeffrey Ip, LLM Evaluation Metrics: Everything You Need for LLM Evaluation — Confident AI on Medium (2024)