RAG Best Practices

Query vs. Data Ingestion

Learn which method - data ingestion with vector search or querying with function tools - works best out-of-the-box for RAG

This is the first article in a series our team is writing on Retrieval Augmented Generation (RAG) best practices. From helping AI SaaS companies build RAG ingestion with our workflows and equip their agents with 3rd-party API tools through ActionKit, we’ve seen implementations of RAG vary tremendously. When it comes to incorporating RAG with 3rd-party integrations - where an AI agent can retrieve data and perform actions in their users’ Google Drive, Slack, Salesforce, Gmail, etc. - there are many unanswered questions about best practices for optimizing performance and response quality.

That brings us to the topic and overarching question of this article: Should we query directly from the 3rd-party provider or ingest our users’ data for vector searching for our RAG use case?

Or more simply: Should we query or ingest data for RAG?

Background

If this question isn’t super clear, let’s take a step back and describe what we mean by querying and data ingestion.

  • Querying data for RAG is giving our AI agent tools to be able to query our users’ 3rd-party platforms directly at prompt-time

  • Data ingestion involves ingesting and indexing data from a 3rd-party integration provider into our own vector database (ingestion), and then searching that database at prompt-time with vector search

As you can see, querying data has a simpler implementation: data is only retrieved when it’s needed, and no copy of the data is stored in a separate database. Data ingestion requires an expensive job of ingesting all the data from a 3rd-party provider, storing it in a database for later retrieval, and keeping it in sync with the data source at all times - not to mention the permissions/authorization logic needed when storing and returning 3rd-party data.
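To make the contrast concrete, here’s a minimal sketch of the two retrieval paths, with the provider API and vector database passed in as callables (the function names and interfaces are illustrative, not Parato’s actual code):

from typing import Callable, List

def retrieve_by_query(prompt: str,
                      query_tool: Callable[[str], List[str]]) -> List[str]:
    # Prompt-time querying: hand the prompt to a 3rd-party search/query tool
    # (e.g. Slack search, a SOQL query); no copy of the data lives on our side
    return query_tool(prompt)

def retrieve_by_vector_search(prompt: str,
                              embed: Callable[[str], List[float]],
                              search_index: Callable[[List[float], int], List[str]],
                              top_k: int = 10) -> List[str]:
    # Ingestion + vector search: the data was embedded and indexed ahead of time;
    # at prompt-time we embed the prompt and pull back the top-K nearest chunks
    return search_index(embed(prompt), top_k)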

That brings us back to our overarching question again: Should we query or ingest data for RAG?

To answer that question, we need to consider:

  1. In general, which method provides better responses?

  2. How does each approach compare across different data types (structured vs. unstructured)?

  3. Are there specific situations where querying is advantageous, and situations where vector search is advantageous?

Quantitative Evaluation Methodology

To answer these questions, we took a quantitative approach, boiling performance down to a few metrics across different 3rd-party platforms and data types.

First things first, we utilized Parato, a RAG-enabled AI agent we built in our RAG tutorial series. Parato has basic RAG functionality with data ingestion across multiple 3rd-party platforms, permissions enforcement on ingested data assets, and tool calling capabilities across integration providers. (If you’d like to look through the tutorial and learn about building your own AI application with 3rd-party integrations, check out our Parato Series.)

We used Parato (our sample RAG AI agent) as the basis of our evaluation, and:

  1. Built out a small knowledge base of a few data assets from Notion, Slack, and Salesforce

    1. We decided on these 3 integration providers because:

      1. Notion provides long-form unstructured data

      2. Slack provides short-form unstructured data

      3. Salesforce provides structured, tabular data

  2. Generated synthetic data from these data assets to create a larger sample size

    1. An LLM was used to generate prompts and responses from different chunks of each data asset

    2. Different problem types were also used across prompts, such as reasoning, comparative, and hypothetical problems

  3. Ran Parato through our synthetic dataset using the two different approaches across the following metrics:

    1. Answer Relevancy: How relevant is the answer to the prompt

    2. Faithfulness: How well the generated answer sticks to the retrieved context

    3. Context Relevancy: How relevant is the context retrieved to the original prompt

Throughout this exercise, we heavily utilized a Python library called DeepEval from Confident AI. If you’d like to replicate this exercise or perform a similar one with your AI application, I encourage you to try DeepEval and give them a star on their GitHub. (We’re not affiliated with them; we just really enjoyed using their framework and tools.)

Building the Evaluation Workflow

While this isn’t a tutorial, we wanted to show a few snippets for readers to follow along.

For generating the synthetic prompts out of our knowledge base from Notion, Salesforce, and Slack, we used DeepEval to generate test cases called “goldens.”

# Import paths may differ slightly across DeepEval versions
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import Evolution, EvolutionConfig, ContextConstructionConfig

# Spread the synthetic prompts evenly across seven problem types
synthesizer = Synthesizer(
    evolution_config=EvolutionConfig(
        evolutions={
            Evolution.REASONING: 1/7,
            Evolution.MULTICONTEXT: 1/7,
            Evolution.CONCRETIZING: 1/7,
            Evolution.CONSTRAINED: 1/7,
            Evolution.COMPARATIVE: 1/7,
            Evolution.HYPOTHETICAL: 1/7,
            Evolution.IN_BREADTH: 1/7,
        },
        num_evolutions=3,
    )
)

# Generate "goldens" (synthetic prompts with expected outputs) from our document base
synthesizer.generate_goldens_from_docs(
    document_paths=['../document-base/sf-csv.txt',
                    '../document-base/sf-report.pdf',
                    '../document-base/rag-tech-stack.txt',
                    '../document-base/permissions-tutorial-2.txt',
                    '../document-base/permissions-tutorial-2-5.txt',
                    '../document-base/paragon-slack-questions.txt'],
    context_construction_config=ContextConstructionConfig(chunk_size=200, chunk_overlap=20),
)

We then programmatically passed all of our synthetically generated prompts to Parato.

import requests

# Send each synthetic prompt to Parato's chat endpoint, authenticated with the user's JWT
prompt = goldens_dict['input'][i]
messages = [{"content": prompt,
             "role": "user"}]
body = {"messages": messages, "tools": {}}
response = requests.post(url + "/api/chat", json=body,
                         headers={"Authorization": "Bearer " + jwt}, stream=True)

Here are some examples of test cases generated from our “goldens” after running the synthetic inputs through our AI application, Parato (with the data ingestion and vector search method).

Lastly, we used DeepEval’s evaluation library methods to calculate our answer relevancy, faithfulness, and contextual relevancy metrics.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric

# Build a DeepEval test case for every prompt we ran through Parato
test_cases = []
for index in output_dict['input'].keys():
    test_case = LLMTestCase(
        input=output_dict['input'][index],
        actual_output=output_dict['actual_output'][index],
        expected_output=output_dict['expected_output'][index],
        context=output_dict['context'][index],
        retrieval_context=output_dict['retrieval_context'][index],
    )
    test_cases.append(test_case)

answer_relevancy = AnswerRelevancyMetric(threshold=0.5)
faithfulness = FaithfulnessMetric(threshold=0.5)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.5)

evaluation = evaluate(test_cases, [answer_relevancy, faithfulness, contextual_relevancy])

Initial Evaluation Results

Looking at the overall metrics, we saw a pretty drastic difference in answer relevancy.

Data ingestion with vector search provided more relevant responses. When slicing the data by integration provider (a proxy for data type as well - Notion: long unstructured, Slack: short unstructured, Salesforce: structured/tabular), we again saw data ingestion with vector search outperform querying.

While answer relevancy was higher across the board when vector search was used for RAG, we observed that contextual relevancy was higher when querying. A contributor to that discrepancy was the “top-K” setting of our vector search method. We set K=10, which means the top 10 chunks were retrieved for any given prompt, with the potential for many of those chunks to be irrelevant.

What this discrepancy between contextual relevancy and answer relevancy demonstrates for vector search is that Parato’s retrieval is not the best, but Parato’s generator can parse out irrelevant context and come up with a proper response to the original input. In other words, our RAG application can properly filter out the noise from the extra context retrieved by vector search.
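For reference, the vector search retrieval step with K=10 looks roughly like this, assuming a Pinecone-style index (the client setup, index name, and metadata field are illustrative; Parato’s actual stack may differ):

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("parato-knowledge-base")   # hypothetical index name

# embedded_prompt: the user prompt embedded with the same model used at ingestion time.
# The top 10 nearest chunks come back regardless of how relevant they actually are,
# which drags contextual relevancy down even when answer relevancy stays high.
results = index.query(vector=embedded_prompt, top_k=10, include_metadata=True)
retrieval_context = [match.metadata["text"] for match in results.matches]  # "text" field is an assumption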

Deeper Dive on Querying

It was clear that data ingestion with vector search outperformed querying for RAG, but we wanted to take a harder look at why that was the case.

I cherry-picked a few query-based test cases to showcase that when querying for RAG, there were many chat interactions where no context was retrieved by a query tool (such as search via Slack's API), and as a result, no relevant answer was generated. When a query tool successfully retrieved context - for example in Salesforce and Notion in the table below - responses were relevant. In contrast, in our Slack examples, Parato was never able to retrieve any context using a query function tool.

These “zero-retrieval” situations skewed the metrics for the querying method. Taking a look at the proportion of test cases where no context was retrieved, we see the following trends.

Our RAG application’s ability to query Slack was nonexistent, whereas the ability to query Notion and Salesforce was fair. In contrast, vector searching retrieved context 100% of the time.
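For those following along, the zero-retrieval rate can be computed directly from the same output_dict we built the test cases from - a retrieval counts as “zero” when the query tools returned no context at all:

# Proportion of test cases where no context was retrieved at all
total = len(output_dict['retrieval_context'])
zero_retrieval = sum(1 for ctx in output_dict['retrieval_context'].values() if not ctx)
print(f"Zero-retrieval rate: {zero_retrieval / total:.0%}")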

Limitations and Further Analysis

Before moving to our main takeaways, I want to acknowledge two main limitations of this evaluation.

  1. First, neither method (data ingestion with vector search nor query-based tool calling) was optimized

    1. We wanted to keep this exercise simple and look at how these two methods out-of-the-box stack up

    2. For vector search, we didn’t optimize retrieval or generation by optimizing hyper-parameters like chunk size, top-K, LLM temperature, embedding models, LLMs, etc.

    3. For query-based tool calling, we didn’t optimize tool descriptions, tool availability, utilize agent swarms, etc.

  2. Second, our knowledge base is extremely small

    1. We used only 5 data assets for our knowledge base (3 Notion documents, 1 Slack thread, and 1 Salesforce table with ~50 records)

These confounding factors no doubt affect the results of this exercise in many different ways. As seen from the results, “zero-retrieval” events were a common occurrence when querying, so more optimized tool calling (calling the right query tool for the job) would undoubtedly change the results. And with a small knowledge base, it’s easier for vector search to extract relevant context, since the ratio of top-K to knowledge base size is extremely favorable for vector search.

On future iterations, we will evaluate the impact of optimizing tool calling as well as stress testing a larger scale knowledge base. We’ll also be writing more about optimizing vector search and tool calling in future articles so stay tuned!

Takeaways & Implications

1.) Vector searching performs surprisingly well on structured data (at least with small datasets)

Keeping the limitations above in mind, we saw the data ingestion with vector search method clearly outperform the query tool call method in this exercise. While that isn’t too surprising, I was surprised that vector search seemed to perform well on structured data (aka the Salesforce use case). Thinking about how vector search works - where input prompts are vectorized and compared against embeddings in a vector database - I wouldn’t have expected this method to work well with structured data. Iterating on this exercise with a larger knowledge base would help confirm this observation.

2.) Query-based tool calling requires fine tuning

For query-based tool calling, it was apparent that this method is extremely sensitive, with an “all-or-nothing” retrieval result. When the right tool was called for a prompt, we were essentially guaranteed a relevant result. When a tool call was incorrect, Parato retrieved no context at all.

What’s difficult about optimizing tool use when it comes to 3rd-party APIs is that:

  1. You must have knowledge of which APIs the 3rd-party provider supports and how they work

    1. For example, Slack provides the search.messages GET endpoint, which takes in a query string

    2. This endpoint is extremely useful, but for it to be used to its full potential, we need Parato to extract information from the user prompt and feed it into the search.messages API in specific ways, according to the Slack documentation (a sketch of a tool definition that encodes this is shown at the end of this section)

      To specifically search within a channel, group, or DM, add in:channel_name, in:group_name, or in:<@UserID>. To search for messages from a specific speaker, add from:<@UserID> or from:botname.

    3. In contrast, some 3rd-party APIs are extremely easy to use with minimal knowledge such as the Salesforce SOQL API, where our LLM can take advantage of its pre-trained SQL knowledge base (Querying with Salesforce saw a retrieval rate of ~75%)

      GET /query/?q=SELECT+name,id+from+Contact
      Authorization: Bearer token
  2. Tools need a very specific set of parameters, which may not be provided in the initial prompt

    1. Even if our agent has the right query tool for the job, the tool description as well as the parameter descriptions need to be relevant to a user’s prompt for it to be called

    "function": {
    	"name": "NOTION_GET_PAGE_CONTENT",
    	"description": "Triggered when a user wants to get a page content in Notion",
    	"parameters": {
    	  "type": "object",
    	  "properties": {
    	    "blockId": {
    	      "type": "string",
    	      "description": "Page ID : Specify a Block or Page ID to " +
    								      "receive all of its block’s children in order. " + 
    								      "(example: \\"59833787-2cf9-4fdf-8782-e53db20768a5\\")"
    	    }
    	  },
    	  "required": [
    	    "blockId"
    	  ],
    	  "additionalProperties": false
    
    

Query-based tool calling has many advantages as outlined in the Background section, but in order to balance performance with the storage/permission benefits from querying, our agent’s 3rd-party query tooling must be optimized.
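As an illustration of what that tooling optimization can look like, here’s a hypothetical definition for a Slack search tool that bakes the search.messages query modifiers into its parameter description, so the LLM knows how to construct the query string (the tool name and wording are illustrative, not ActionKit’s actual schema):

# Hypothetical tool definition (illustrative; not ActionKit's actual schema)
slack_search_tool = {
    "type": "function",
    "function": {
        "name": "SLACK_SEARCH_MESSAGES",
        "description": "Search the user's Slack workspace for messages relevant to a question "
                       "about Slack conversations or threads.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search terms. To scope the search, append Slack modifiers "
                                   "such as in:channel_name, in:group_name, in:<@UserID>, "
                                   "from:<@UserID>, or from:botname."
                }
            },
            "required": ["query"],
            "additionalProperties": False
        }
    }
}

Writing the provider-specific query syntax directly into the parameter description is one way to give the LLM knowledge it otherwise wouldn’t have at prompt-time.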

Closing Thoughts

Should we query or ingest data for RAG?

From this exercise, data ingestion with vector search proved to be the better overall solution for RAG with external data from 3rd-party integration providers when no optimizations were made. RAG with vector search was able to handle a more varied range of prompts - from broader hypothetical prompts to more constrained ones. Vector search worked well out-of-the-box, with the potential for even more relevant responses after hyper-parameter tuning, making the expensive task of ingesting data worth it.

While querying for RAG with agent tools was less performant and more sensitive to user prompts (without additional tuning and agent architecture), this method introduces much less noise when it comes to context retrieval. Query-based tools also don’t require your team to keep a copy of your users’ data or go through the intricate process of correctly enforcing permissions on retrieved context. With more optimized tool calling, I believe RAG-enabled agents could reach acceptable levels of answer relevancy for most prompt types while reaping those benefits, though not to the same extent as vector search retrieval.

We hope you found this experiment interesting and useful. Our team at Paragon works with many AI SaaS companies and specializes in helping them build and scale their 3rd-party integrations for AI use cases. To learn more about Paragon, check out our blog or book a demo with our team to see if Paragon can help your team build integrations your customers care about.

Jack Mu, Developer Advocate
