Monitor and evaluate RAG pipeline performance on AWS using Ragas.

10 September 2025 24 min read Tech Community

Large language models (LLMs) have emerged as powerful AI assistants, with expertise spanning various domains from question answering and text generation to code generation and beyond. One of the focus areas of LLMs is in-context learning, which enables them to generate contextually relevant responses to the user queries. 

As AI assistants powered by LLMs (like question answering bots) become more prevalent, properly assessing their performance is crucial for companies relying on them to provide quality customer service. Without rigorous evaluation frameworks, the risks of bots providing incorrect information increases. 

Retrieval Augmented Generation (RAG) applications that use external data to enrich LLMs' responses have great potential, but also great complexity. RAG-powered bots must accurately retrieve relevant information and coherently incorporate it into generated text. Companies considering implementing RAG for customer-facing applications need assurances these systems will perform reliably. 

This article demonstrates a framework for monitoring and evaluating RAG applications using metric-based evaluation as introduced in Ragas, encompassing both retrieval and generation components. We present a generative AI-powered assistant augmented with documents using Amazon Bedrock and assess its performance on key metrics. Continuous evaluation provides insights on metrics defined for retrieval or generation that are crucial for maintaining quality over time. 

By showcasing this RAG evaluation solution centred on Amazon Bedrock and Ragas, organisations can better understand how to deploy, monitor and evaluate RAG applications using their own data. Monitoring ensures that bots provide accurate information to customers and uphold brand integrity through coherent dialogue, and it identifies areas needing improvement, giving confidence in progressing RAG usage. 

Metric-based evaluation of a RAG pipeline 

To put LLM applications into production, you need continuous monitoring and tuning. Such tuning is only possible by tracking certain metrics over time to understand how well the application responds to user queries. The Ragas framework introduces the idea of metric-driven development (MDD) for continuous improvement of RAG applications. MDD is a product development approach that makes well-informed decisions based on data: key metrics are tracked over time to produce valuable insights into the application's performance. 

RAG applications require evaluation at both the retrieval and generation stages. The retrieval stage must be evaluated to ensure it fetches relevant and accurate information, and the generation stage must be evaluated to determine whether the model produces accurate, relevant and coherent responses based on the retrieved information. Ragas facilitates this by evaluating each of these components individually. The overall Ragas score is determined by averaging the individual scores obtained for each metric. Let's delve into the important metrics defined by Ragas: 

  • Context precision – Evaluates whether the ground-truth relevant items present in the context are ranked appropriately high. It's computed from the question and the retrieved context, with values ranging from 0 to 1, where a higher score indicates better precision. This metric is useful for assessing the precision of the retrieved top-k chunks for a given question. 
  • Context recall – Measures how well the retrieved context aligns with the ground truth answer. It's calculated from the retrieved context and the ground truth information, with scores ranging from 0 to 1; higher is better. 
  • Answer relevancy – Evaluates how relevant the generated answer is to the given prompt. Incomplete answers or answers containing redundant information receive lower scores. It's calculated from the question and the answer, with scores from 0 to 1; higher scores indicate greater relevance. This metric is useful for detecting whether the model generates irrelevant answers from the in-context information provided to it. 
  • Answer semantic similarity – Measures how closely the generated response aligns with the ground truth answer. It's calculated by taking the cosine similarity between the embedded vectors of the generated answer and the ground truth, with values from 0 to 1; a higher score means closer alignment in terms of response quality. 
  • Faithfulness – Measures the factual consistency of the generated answer against the retrieved context. It's calculated from the answer and the retrieved context and scaled to the 0 to 1 range; higher is better. This metric is useful for judging whether an answer can be inferred from the given context. 
  • Correctness – One of the aspect critique metrics, which judges whether the generated answer is factually correct and free from errors by performing majority voting with the judge LLM. It's a binary metric indicating whether the answer is correct or not. 
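As a concrete illustration of the semantic-similarity calculation above, the cosine similarity over two embedding vectors can be computed as follows. The vectors here are toy values; in the pipeline they would come from an embedding model such as Amazon Titan Embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for a generated answer and a ground truth
answer_vec = [0.9, 0.1, 0.3]
ground_truth_vec = [0.8, 0.2, 0.4]

score = cosine_similarity(answer_vec, ground_truth_vec)
print(round(score, 3))  # → 0.984
```

Real embedding vectors have hundreds or thousands of dimensions, but the calculation is identical.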

This article examines metrics that don't rely on ground truth. The following section demonstrates how to use defined metrics to build a metric-based evaluation for a RAG application using AWS infrastructure. 

Solution overview 

The following diagram illustrates the solution architecture and RAG evaluation methodology. 

The QA bot facilitates the ingestion of PDF documents into a vector store using Amazon Aurora PostgreSQL-Compatible Edition, which is selected because of its ease of use and straightforward integration into the AWS ecosystem. PostgreSQL is a versatile database system that supports a wide array of extensions and modules. The pgvector extension allows you to store embeddings derived from embedding models like Amazon Titan Embeddings from Amazon Bedrock. This serves as a knowledge base for in-context data retrieval. 

The workflow of the RAG inference and evaluation methodology of a QA bot deployed within an Amazon Elastic Container Service (Amazon ECS) container using AWS Fargate is as follows:  

1. The user logs in to the application using the username and password registered in the Amazon Cognito user pool. 

2. The application initiates authentication using the provided credentials together with the client ID and client secret of the user pool client. 

3. If step 2 succeeds, an authentication token is granted and the user is redirected to the chatbot home page. 

4. A user posts a question to the chatbot and waits for a response.  

5. The question is searched against the vector database to retrieve relevant information from the knowledge base, using maximal marginal relevance (MMR) as the similarity measure to return the k documents most relevant to the question. 
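The MMR selection in step 5 can be sketched in simplified form as follows. This is a toy illustration with precomputed relevance and pairwise similarity scores, not the retriever's actual implementation, which operates on embedding vectors in the vector store:

```python
def mmr_select(relevance, similarity, k, lambda_mult=0.5):
    """Pick k documents balancing query relevance against redundancy.

    relevance:  dict doc_id -> similarity of the document to the query
    similarity: dict (doc_id, doc_id) -> similarity between two documents
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr_score(doc):
            # Penalise documents similar to ones already selected
            redundancy = max(
                (similarity[(doc, chosen)] for chosen in selected), default=0.0
            )
            return lambda_mult * relevance[doc] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: docs "a" and "b" are near-duplicates, so after picking "a",
# MMR prefers the less redundant "c" over the slightly more relevant "b".
rel = {"a": 0.9, "b": 0.85, "c": 0.7}
sim = {("a", "b"): 0.95, ("b", "a"): 0.95,
       ("a", "c"): 0.1, ("c", "a"): 0.1,
       ("b", "c"): 0.1, ("c", "b"): 0.1}
print(mmr_select(rel, sim, k=2))  # → ['a', 'c']
```

This is why MMR is attractive for RAG retrieval: it fills the top-k slots with diverse context rather than near-duplicate chunks.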

6. This information is used to augment the context passed to an LLM along with the question. The LLM uses the prompt, question and enhanced context to generate an answer. The custom-designed prompt uses an instruction and system prompt as follows: 

prompt_template = (
    "[INST]<<SYS>> You are an assistant for question-answering tasks and you will only answer as much as "
    "possible by strictly looking into the context. If you don't know the answer, just say that "
    "you are trained on the uploaded document information and do not make up answers by looking into context. "
    "Use three sentences maximum and keep the answer concise. If and only if the question is about yourself, "
    "like \"who are you?\" or \"what is your name\", then ignore the given context and answer exactly "
    "with \"I am QA Bot\".<</SYS>> \n"
    "        Question: {question}\n"
    "        Context: {context} \n"
    "        Answer: \n[/INST]"
    "        "
)

7. The meta.llama3-70b-instruct-v1:0 model (don’t forget to request model access) is used to generate the answer through Amazon Bedrock, integrated with the LangChain framework via the RetrievalQAChain functionality. The following inference parameters are set for the model: 

def get_bedrock_llm():
    """
    Perform inference using the LLM through AWS via Bedrock and use vectordb
    """
    llm = Bedrock(
        client=bedrock_client,  # Created by using boto3
        model_id="meta.llama3-70b-instruct-v1:0",
        model_kwargs={"temperature": 0.7, "top_p": 0.9, "max_gen_len": 512},
    )
    return llm

8. The RetrievalQAChain lets you attach callback functionality at each step of the chain up to the inference from the LLM, so all the information required for an evaluation can be recorded. A callback handler can override the on_llm_start(...), on_llm_end(...), on_chain_start(...) and on_chain_end(...) methods. The newly introduced RagasEvaluationAndDbLoggingCallbackHandler uses these base methods to store the chain results in an Aurora PostgreSQL database (see RagScore for more details on the data model). It also evaluates context precision, faithfulness, answer relevancy and correctness within the callback using the Ragas framework, so RAG evaluation and response generation are carried out concurrently. This approach to evaluating and monitoring the RAG pipeline adds a unique dimension to the RAG application. 
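The recording logic of step 8 can be illustrated with a minimal stand-in. The class below only mimics the hook methods with simplified signatures; it is not the actual LangChain BaseCallbackHandler API, and the repository's RagasEvaluationAndDbLoggingCallbackHandler overrides the real ones:

```python
class EvaluationLoggingCallbackHandler:
    """Records the question, retrieved contexts and answer as a chain runs,
    so they can be scored and persisted once generation finishes."""

    def __init__(self):
        self.run_data = {}

    def on_chain_start(self, inputs):
        # Capture the user question and the retrieved context documents.
        self.run_data["question"] = inputs.get("question")
        self.run_data["contexts"] = inputs.get("contexts", [])

    def on_llm_end(self, output_text):
        # Capture the generated answer; at this point everything needed
        # for a Ragas-style evaluation is available.
        self.run_data["output_text"] = output_text

    def on_chain_end(self):
        # Here the real handler would run the Ragas evaluation and write
        # the scores plus chain results to the Aurora PostgreSQL database.
        return self.run_data

handler = EvaluationLoggingCallbackHandler()
handler.on_chain_start({"question": "What is Amazon Kinesis?",
                        "contexts": ["Kinesis ingests streaming data..."]})
handler.on_llm_end("Amazon Kinesis is a streaming data service.")
record = handler.on_chain_end()
print(sorted(record))  # → ['contexts', 'output_text', 'question']
```

Because the handler accumulates everything the evaluation needs, scoring can happen inside on_chain_end without a separate offline pass.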

9. The quality of the RAG pipeline is evaluated using Ragas and a judge language model. A judge model is a more capable model used to evaluate the responses of another base language model; it can be configured to be any model, preferably one distinct from the generation model. One of the advantages of Amazon Bedrock is that it provides access to a set of high-performing language models. In this evaluation, we use the Llama 3 model (meta.llama3-70b-instruct-v1:0) on Amazon Bedrock as the base model for question answering. Its responses are evaluated by Anthropic Claude 3 Sonnet (anthropic.claude-3-sonnet-20240229-v1:0) on Amazon Bedrock as the judge model, which performs complex reasoning tasks well. The evaluation methodology requires a dataset with questions, context and generated answers, defined as follows (refer to RagasEvaluator for more details): 

def create_dataset(run_data: dict):
    """Create a dataset in the Ragas format from a single run.

    Args:
        run_data (dict): Dictionary information of the run data consisting of question, contexts and answer

    Returns:
        Dataset: dataset in ragas format
    """
    data_dict = {
        "question": [run_data["question"]],
        "contexts": [run_data["contexts"]],
        "answer": [run_data["output_text"]],
    }
    return Dataset.from_dict(data_dict)


def evaluate(run_data: dict):
    """Performs evaluation using Anthropic Claude 3 Sonnet as the judge for the answer generated
    from an LLM family like Llama 2 or Llama 3. Note: the judge can be any model; here we use
    Anthropic Claude 3 Sonnet.

    Args:
        run_data (dict): The dataset to be evaluated

    Returns:
        dict: Scores
    """
    from ragas.metrics import (
        answer_relevancy,
        faithfulness,
        context_utilization,
    )
    from ragas.metrics.critique import correctness
    from ragas import evaluate

    data = create_dataset(run_data)
    embeddings = BedrockEmbeddings(
        client=bedrock_client,  # Created by using boto3
        model_id="amazon.titan-embed-text-v1",
    )
    judge_model = BedrockChat(
        client=bedrock_client,  # Created by using boto3
        model_id="anthropic.claude-3-sonnet-20240229-v1:0",
        model_kwargs={"temperature": 0.2, "top_p": 1, "max_tokens_to_sample": 4000, "top_k": 250},
    )
    result = evaluate(
        dataset=data,
        metrics=[
            context_utilization,  # context_precision without ground truth is called context_utilization
            faithfulness,
            answer_relevancy,
            correctness,
        ],
        llm=judge_model,
        embeddings=embeddings,
    )
    return result

These functions are used within the callback handler to enable continuous logging of the evaluation metrics in the database. 
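Such continuous logging can be sketched as follows, using an in-memory SQLite table as a stand-in for the Aurora PostgreSQL table. The actual data model is defined by RagScore in the repository; the column names below are illustrative:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE rag_scores (
           question TEXT,
           context_utilization REAL,
           faithfulness REAL,
           answer_relevancy REAL,
           correctness REAL,
           recorded_at TEXT
       )"""
)

def log_scores(question, scores):
    """Persist one evaluation result with a timestamp for later monitoring."""
    conn.execute(
        "INSERT INTO rag_scores VALUES (?, ?, ?, ?, ?, ?)",
        (
            question,
            scores["context_utilization"],
            scores["faithfulness"],
            scores["answer_relevancy"],
            scores["correctness"],
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()

log_scores("What is Amazon EMR?",
           {"context_utilization": 1.0, "faithfulness": 1.0,
            "answer_relevancy": 0.82, "correctness": 1.0})
row_count = conn.execute("SELECT COUNT(*) FROM rag_scores").fetchone()[0]
print(row_count)  # → 1
```

Storing the timestamp alongside each score is what later enables the time-series monitoring described in the Quality monitoring section.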

Although many tools are available for LLM observability, there is a noticeable gap in tools and functionality designed to track and monitor the evaluation of the RAG pipeline. Our evaluation framework fills this void by monitoring the RAG pipeline over time on the metrics introduced above. This empowers companies to deploy RAG applications with confidence while continuously monitoring for degradation in metrics and making improvements as needed. 

The deployment stack comprises the capabilities to deploy foundational models (both within AWS and from Hugging Face), and our application stack is constructed through the AWS Cloud Development Kit (AWS CDK). Users can experiment by working on different models and approaches, and deploying them using the repository (for more information, refer to the GitHub repo). 

The following sections detail the steps involved in building and deploying the application. 

Building and deploying the application 

Prerequisites 

Before proceeding, make sure you have cloned the repository to a desired location and configured the AWS Command Line Interface (AWS CLI), including setting up your profile and credentials. If you haven't configured AWS yet, refer to Configuration and credential file settings. The steps to clone the repository are as follows: 

$ mkdir genai-bot-application
$ cd genai-bot-application
$ git clone https://github.com/sprakash21/aws-genai-rageval-bot.git
$ cd aws-genai-rageval-bot # This is the directory where the application details are present.

Develop the application locally 

The chatbot application framework is designed in such a way that it can facilitate an extension to add new features or modules. To understand the project structure, refer to the README, which details the necessary steps. For local development, follow these steps: 

1. Open the cloned repository on your preferred IDE. 

2. Navigate to nc-bot/environment_templates and copy the .env.local.template file to the nc-bot directory, renaming it to .env. Update the following environment variables in the .env file according to your environment: 

  • AWS_PROFILE – The profile to use for accessing AWS services using Boto3. If the default profile is used, set AWS_PROFILE to default
  • AWS_REGION – The AWS Region where the profile is set up. For example, if the AWS_PROFILE is created in eu-central-1, then the AWS_REGION environment variable should be set to eu-central-1
  • BUCKET_NAME – We require an Amazon Simple Storage Service (Amazon S3) bucket to store the PDF document data. You can create the bucket on the Amazon S3 console or by using the AWS CLI: 
$ aws s3api create-bucket --bucket <my-bucket> --profile <my-profile> 
  • INFERENCE_ENGINE and BEDROCK_EVALUATION_ENGINE – Both the inference engine (INFERENCE_ENGINE) and the evaluation engine (BEDROCK_EVALUATION_ENGINE) are set to bedrock by default. 
  • BEDROCK_EMBEDDINGS_REGION – The Region where the Amazon Titan Embeddings G1 - Text model has access. For example, if the model access is granted in eu-west-1, then that will be the Region. 
  • BEDROCK_INFERENCE_REGION – The Region where the Llama 3 70B Instruct model has access. For example, if model access is granted in eu-west-1, then that will be the Region. 
  • BEDROCK_EVALUATION_REGION – The Region where Anthropic Claude v3 Sonnet model has access. For example, if the model access is granted in eu-west-1, then that will be the Region. 
  • BEDROCK_EVALUATION_MODEL_ID and BEDROCK_INFERENCE_MODEL_ID – For the evaluation, anthropic.claude-3-sonnet-20240229-v1:0 model is used, and for the inference meta.llama3-70b-instruct-v1:0 model is used. 
  • DB_LOCAL – Indicates whether to use a local database, or use Amazon Aurora if deployed using CDK. 
  • AUTH_LOCAL – Indicates whether to use local environment variables for authentication or to take the Amazon Cognito details from Secrets Manager. Always set this to false and provide the COGNITO_SECRET_ID value from the CDK stack when testing the application locally. 
  • COGNITO_SECRET_ID – The secret ID to provide when AUTH_LOCAL is set to false. 
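Putting these variables together, a filled-in nc-bot/.env might look like the following. All values are illustrative; substitute your own profile, Regions, bucket name and secret ID:

```shell
AWS_PROFILE=default
AWS_REGION=eu-central-1
BUCKET_NAME=my-rageval-bot-bucket
INFERENCE_ENGINE=bedrock
BEDROCK_EVALUATION_ENGINE=bedrock
BEDROCK_EMBEDDINGS_REGION=eu-west-1
BEDROCK_INFERENCE_REGION=eu-west-1
BEDROCK_EVALUATION_REGION=eu-west-1
BEDROCK_INFERENCE_MODEL_ID=meta.llama3-70b-instruct-v1:0
BEDROCK_EVALUATION_MODEL_ID=anthropic.claude-3-sonnet-20240229-v1:0
DB_LOCAL=true
AUTH_LOCAL=false
COGNITO_SECRET_ID="<secret-id-from-cdk-stack>"
```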

3. Now that the nc-bot/.env file is set up, you can install the required dependencies, set up the PostgreSQL database with the pg_vector extension within Docker and run the application: 

    • Set up a Python 3.11 virtual environment with python3 -m venv .venv and activate it with source .venv/bin/activate in your IDE's terminal. 
    • After the virtual environment is activated, run pip install -r requirements_aws.txt to install the required packages. 
    • To build the Docker image for the PostgreSQL database with the pg_vector extension, run the following commands. The POSTGRES_PASSWORD used here must match the one used in the .env file. For example, if POSTGRES_PASSWORD is test, the .env file must use test as well: 
    $ cd aws-genai-rageval-bot/nc-bot
    $ docker build -t pgvector_local -f pg_vector/Dockerfile .
    # Run the Docker container
    $ docker run \
        --name postgresql-container \
        -p 5432:5432 \
        -e POSTGRES_PASSWORD=test \
        -d pgvector_local
    • If everything is correctly set up, it should be possible to connect to the database through the pgAdmin tool. 
    • To start the Streamlit application, run the following command. The application should be accessible at http://localhost:8501: 
    $ streamlit run chatbot.py 

    Build and push the application 

    The deployment stack references the Amazon Elastic Container Registry (Amazon ECR) repository to use the image for running the container on Fargate for Amazon ECS. To build the application as a Docker image, tag it and push it to Amazon ECR, a shell script is provided, which takes a few arguments. 

    Make sure Docker Desktop or the Docker daemon is running: 

    $ export AWS_PROFILE=<your_profile>
    $ cd nc-bot/scripts
    $ ./build_and_push_docker.sh ../../nc-bot/Dockerfile <ecr_repository> <tag> <region>

    The script arguments are as follows: 

    • ecr_repository – The ECR repository. 
    • tag – The image tag to use. 
    • region – The Region where the repository and image will be placed. 

    The repo and tag of the image will be used with the cdk.json file during the deployment of the stack. 

    Configure and deploy the application using the AWS CDK 

    Before you deploy the infrastructure components, you must set up the AWS CDK for Python. For more information, see Getting started with the AWS CDK and Working with the AWS CDK in Python. Then complete the following steps: 

    1. Change the directory using $ cd aws-genai-rageval-bot/deploy

    2. Set up a Python 3.11 virtual environment with python3 -m venv .venv and activate it with source .venv/bin/activate in your IDE's terminal. After the virtual environment is activated, run pip install -r requirements.txt to install the required packages. 

    3. Copy the aws-genai-rageval-bot/deploy/cdk.template.json file as aws-genai-rageval-bot/deploy/cdk.json and modify the configuration parameters. The following are the important parameters to update in the cdk.json file to deploy the application and access the Amazon Bedrock models: 

    • project_prefix – The prefix to use for the stack. The prefix cannot contain underscores. 
    • deploy_stage – A placeholder to indicate the deployment stage, like dev or lab.  
    • deploy_region – The Region for deployment of the stack. For example, eu-central-1
    • ecr_repo – The ECR repository used in the previous section. 
    • ecr_image_tag – The image tag used in the previous section. 
    • ecr_url – The Amazon ECR URL in the format <account_id>.dkr.ecr.<region>.amazonaws.com. This can be obtained on the Amazon ECR console. 
    • app_params – The environment variables from the previous section. These can be left as defaults, or the model_id and the Regions for inference and evaluation can be experimented with based on your Amazon Bedrock model access. 

    4. After the cdk.json values are configured, run the AWS CDK commands to synthesize and deploy the stacks. Refer to the README for more information. 

    • Synthesize with the following code: 
    $ export AWS_PROFILE=<profile>
    $ cdk synth --all
    • Deploy with the following code: 
    $ export AWS_PROFILE=<profile>
    $ cdk deploy --all
    • Post operational activities: 

    Initially, a temporary password is created for the value provided for the COGNITO_EMAIL key in cdk.json. To set a permanent password for the user created through the AWS CDK stack, use: 

    $ cd aws-genai-rageval-bot/deploy/operations 
    
    $ export AWS_PROFILE=<profile> 
    
    $ python3 set_permanent_password.py --username <username> --pool_id <pool_id> --region <region> 

    The user_pool_id can be obtained from the Amazon Cognito console. The username defaults to admin. 

    AWS CDK application best practices are maintained by using cdk-nag.  

    After you complete these steps, the application should be deployed, and you can obtain the application URL by navigating to the Outputs tab of the AWS CloudFormation deployment stack details on the AWS CloudFormation console.  

    Furthermore, if you have added values for domain_name, hosted_zone_id in the cdk.json, then the application will be available under https://chat.<domain_name>

    Clean up 

    If the stack is left running for a long period of time, it will incur costs. Therefore, when you don’t need the stack to be running, it’s important to clean up the resources. 

    Complete the following steps: 

    1. Disable the termination protection on the database: 
    • On the Amazon RDS console, choose Clusters in the navigation pane. 
    • Select the cluster deployed from the stack and choose Modify. 
    • Deselect Enable deletion protection so that destroying the stack also deletes the RDS cluster. 
    2. On the Amazon S3 console, empty the bucket and delete it. 
    3. Clean up the application stack with $ cdk destroy --all. 
    4. On the Amazon ECR console, delete the repository. 

    For a more detailed explanation of the deployment and clean-up process, refer to this tutorial. 

    Test the RAG evaluation pipeline against a knowledge base 

    To evaluate the QA bot application with the RAG evaluation pipeline against the knowledge base, we use PDF data. Big Data Analytics Options is a 76-page document that provides an overview of AWS analytics services, along with their usage patterns, costs and performance, helping you choose a service and gain a better understanding of them. The following screenshots show some question scenarios that the Llama 3 70B Instruct model answers based on the provided in-context data or generalisations. The retrieved in-context data and generated answers are then evaluated using Anthropic Claude 3 Sonnet on Amazon Bedrock as the judge model. The scores for each generation are recorded and visualised, enabling continuous monitoring of the metrics. 

    Our first scenario demonstrates answering a question with full relevance to the knowledge base. 

    Metrics: 
    • Context precision: 1.0 
    • Answer relevancy: 0.2 
    • Faithfulness: 1.0 
    • Correctness: 1.0 

    As seen from the metrics, the question asked has a direct reference to the in-context data obtained through the similarity between the question and the knowledge base, resulting in very high scores for context precision, faithfulness and correctness of the answer. 

    Next, we look at a scenario-based question without any direct relevance to the knowledge base. 

    Metrics: 
    • Context precision: 1.0 
    • Answer relevancy: 0.81 
    • Faithfulness: 0.86 
    • Correctness: 1.0 

    The question constructed is very similar to one of the example scenarios explained in the PDF data (refer to Example 2: Capturing and analysing sensor data, Big Data Analytics Options). From the metrics, it can be seen that the model fares well in generating an answer based on the in-context information. It was also observed that the model sometimes generates extra text when asked the same question multiple times within a session. 

    Our final example uses a scenario that’s similar to the underlying knowledge base. 

    Metrics: 
    • Context precision: 1.0 
    • Answer relevancy: 0.76 
    • Faithfulness: 1.0 
    • Correctness: 1.0 

    This example is constructed in the same way as the PDF data (refer to Example 2: Capturing and analysing sensor data, Big Data Analytics Options), and a difference in results is noticeable. The context precision and faithfulness of the answer score higher because the obtained context is ranked higher, which makes the generated answer lean more towards the obtained in-context information. 

    Furthermore, we add a sentence within the question instructing the model to use only the obtained context data when answering. This addition steers the answer to be generated more from the in-context information. 

    Quality monitoring 

    This section illustrates a custom approach to monitoring the evaluation scores acquired from the RAG pipeline.  

    All the information is stored in the database, including the metric values and the time at which they were recorded. Using this data and a filter to drill down into past weeks, the information is visualised in the application using Plotly, as shown in the following figure. This approach enables continuous assessment of the model's responses to questions over time, offering valuable insights, enhancing confidence in RAG usage and identifying opportunities for refinement and improvement. 
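The weekly drill-down can be sketched with a plain-Python aggregation like the following. The application itself reads the rows from the database and renders them with Plotly; the records below are illustrative:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Illustrative evaluation records: (date recorded, faithfulness score)
records = [
    (date(2025, 8, 18), 1.00),
    (date(2025, 8, 20), 0.86),
    (date(2025, 8, 27), 0.90),
    (date(2025, 8, 29), 1.00),
]

# Group the scores by ISO calendar week and average each group,
# producing one point per week for a time-series plot of the metric.
weekly = defaultdict(list)
for day, score in records:
    weekly[day.isocalendar().week].append(score)

trend = {week: round(mean(scores), 2) for week, scores in sorted(weekly.items())}
print(trend)
```

The resulting week-to-average mapping is exactly the shape of data a line chart needs, and a drop between consecutive weeks flags a degradation worth investigating.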

    evaluation scores from RAG pipeline

    Conclusion 

    This article introduced the concept of metric-driven development, highlighting various metrics provided by the Ragas framework. We explored an evaluation framework applied to a RAG application powered by Amazon Bedrock, covering embeddings, inference, judgment and overall RAG pipeline performance using Ragas metrics.  

    The use cases we presented offered insights into how these metrics can inform the judgment of the model's responses or the relevance of retrieved context information. Such metrics enable organisations to continuously monitor and enhance their RAG application's performance and in-context data quality. This was demonstrated by integrating a callback handler within RetrievalQAChain for robust quality monitoring, including storing and visualising results. Notably, testing showed Amazon Bedrock effectively filters out harmful content, providing safe model outputs. It was also noted that the performance scores could vary slightly each time they were tested against the judge model. Furthermore, adding a user feedback loop to refine or reframe queries when metrics indicate lower performance can enhance the RAG pipeline's interactivity and effectiveness. 

    To learn more about RAG applications and how to implement them using AWS services, refer to What is RAG (Retrieval-Augmented Generation)? Additionally, check how AWS Partner Nordcloud is focusing on generative AI use cases in multiple areas, and learn how customers are expanding and streamlining digital capabilities by enabling greenfield generative AI and using generative AI to optimize HR operations. 

    For further assistance with the development and deployment of generative AI use cases, including this article, please contact us. 
