RAG vs. Fine-tuning for Multi-Tenant AI SaaS Applications
Building a useful AI SaaS product requires your models to have access to your users' external data. But should you fine-tune or use retrieval augmented generation (RAG)? This article compares the two approaches.
Brian Yam, AI Strategy · 12 mins to read
Companies building AI features for their SaaS applications are using Paragon to ingest data from hundreds or thousands of their users' external files and SaaS applications, and we've seen the majority lean toward implementing RAG instead of fine-tuning LLMs. This post walks through how we see our customers navigating this decision.
If you're building a multi-tenant AI SaaS application, you've probably run into the limitations of relying on prompt engineering alone. Even if you write the perfect prompt, general-purpose models like GPT-4, Llama, or Claude lack user-specific data, and without that contextual data, they're 'cool' at best.
Even if you've fine-tuned the underlying model with your use case-specific training data set, the model won’t have access to any up-to-date, proprietary, or contextual information about your customers.
For example, say you're selling an AI customer service chatbot. The chatbot would be virtually useless and unable to answer queries sufficiently if it doesn't have access to your customers' data (such as past conversations, product information, etc.). This data exists, but it lives in other applications such as your customers' CRMs, documents, ticketing systems, etc.
So how do you retrieve this contextual data and get it into your LLMs?
The first part is ingesting this data from the 3rd party sources through integrations (which we won't get into today), and once you have the data, there are two predominant approaches to using it with your LLM — RAG and fine-tuning.
In this article, we'll detail and compare RAG and fine-tuning so you can determine which approach makes the most sense for your AI SaaS product.
Overview: RAG vs. Fine-tuning
Retrieval Augmented Generation (RAG)
RAG is a process for enhancing the accuracy of large language models by retrieving external data on demand and injecting it into the prompt as context at runtime. This data can come from various sources, such as your customers' documentation and web pages (through scraping), and data or documents from dozens, if not hundreds, of 3rd party applications like their CRM, Google Drive, Notion, etc.
Note: This system design represents how many Gen AI SaaS companies implement RAG with Paragon as their ingestion engine for extracting structured and unstructured data from their users’ 3rd party SaaS applications.
There are a few variations in how you can implement RAG for your product, but I’ll break it down into two key components - data ingestion/storage and prompt injection.
Data ingestion/Storage
Initial Ingestion Job
The first step is to do an initial pull of all the relevant data that your customers provide you access to. This could be years of sales data from their CRM, thousands of files from their Google Drive, and even tens of thousands of Slack threads.
Background jobs for ongoing updates
Context changes every second as new information is gained, decisions are made, and events and conversations occur, which is why real-time data is critical. As such, beyond the initial ingestion, you need to have background CRON jobs or webhook listeners running to pull in new data from all these 3rd party applications.
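To make the ongoing-update step concrete, here's a minimal sketch of a polling-based sync job. The endpoint, parameters, and response shape are hypothetical stand-ins for whatever 3rd party API you're pulling from; in production this would run on a scheduler, or be replaced entirely by a webhook handler, rather than an ad-hoc script.

```python
import requests


def sync_updates(base_url: str, access_token: str, last_synced_at: str) -> list[dict]:
    """Pull only the records modified since the last sync (hypothetical API shape)."""
    records, cursor = [], None
    while True:
        resp = requests.get(
            f"{base_url}/v1/records",  # hypothetical endpoint
            params={"updated_after": last_synced_at, "cursor": cursor, "limit": 100},
            headers={"Authorization": f"Bearer {access_token}"},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["results"])
        cursor = payload.get("next_cursor")
        if not cursor:  # no more pages to fetch
            return records
```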
Embeddings and storage
For both jobs above, once you ingest the data, you need to store it in a database for retrieval at runtime. Since RAG is mostly implemented for unstructured data, this requires generating embeddings from the ingested data and storing them in a vector database like Pinecone. We'll get into this in more detail further down.
Prompt injection
At Runtime
When prompts/queries are triggered from actions your users perform, you will retrieve the most relevant text chunks (contextual data) from your vector database through a vector search. Once you have those chunks, you will inject them into the initial prompt/query and pass it to your LLM to generate the final response.
This runtime orchestration can be done with frameworks like LangChain. We won't get into that today, but here's a simple diagram to help visualize the process.
Fine-tuning
Fine-tuning is the process of further training a pre-trained LLM on a domain-specific dataset (with pre-determined inputs and outputs) to make it more performant for domain-specific tasks. For example, if you are building an AI sales agent product, you can fine-tune the GPT/Anthropic/Llama models on thousands of highly performant cold emails. If you are building an AI business lawyer agent product, you would fine-tune the model with hundreds of legal contracts.
The key difference here vs. RAG is that fine-tuning modifies the parameters in the actual LLM, and naturally that means this process occurs before deploying the LLM into production.
The main challenge with fine-tuning is that getting a clean training data set can be difficult and time-consuming; the upside is that it can produce more predictable results.
TL;DR: Should you fine-tune or build RAG?
We go into much more detail on the challenges and benefits of RAG vs. fine-tuning for your multi-tenant SaaS app, but in case you’re short on time, here’s my perspective.
Unless you have domain-specific data sets readily available to fine-tune the underlying models, it’s better to prioritize RAG before fine-tuning for the following reasons:
RAG allows you to inject real-time/near real-time context (dependent on your ingestion strategy) into your prompts to a deployed LLM, whereas fine-tuning is limited to context and data available in your training data set
RAG does not require you to have a cleaned and structured training data set - you only need to ingest and store the data in a database such as Pinecone/Weaviate (for unstructured data specifically)
RAG can retrieve relevant context sourced from several data sources
LLMs are getting better every day, and are surprisingly adept at solving even domain-specific tasks out of the box
With that said, let’s dive deeper into how you can implement RAG and fine-tuning for your SaaS application.
Implementing RAG With Third-Party Data
As highlighted earlier, RAG involves a few key stages, from data ingestion to information retrieval upon execution, but there are quite a few steps to achieve this.
1. Data Ingestion
The first step is knowing where your users’ external contextual data lives. For example, product knowledge may be stored in their Notion workspaces or onboarding documents in Google Drive, common Q&A in Slack threads, sales emails in Salesforce/Salesloft, and call transcripts in Gong or Zoom.
Once you know where you need to pull data from, you need to build a mechanism for ingesting all of their existing data, as well as any updates to those data sources.
Here are a few things you'll have to build for and consider when building your ingestion engine:
Authentication: OAuth policies are different across every 3rd party API - you need to ensure you’re constantly refreshing tokens so you stay connected (we cover some of the common challenges with auth here).
Webhooks/CRON jobs: You have to spin up webhook listeners or CRON jobs (if webhooks aren’t supported) for every object type that you want to ingest from each app, for each customer, and build in monitoring mechanisms to ensure they’re active.
Horizontal scaling: The initial jobs to ingest all of your customers’ existing data will lead to spikes of millions, if not billions, of requests at once. You need to ensure that your infrastructure can auto-scale to handle these spikes, without taking down your servers.
Rate limits: Every 3rd party API you integrate with has rate limits, though the specific limits are often not transparently documented. You will inevitably run into them, so ensure you have auto-retry and queuing mechanisms in place to prevent jobs from failing (see the retry sketch after this list).
Security: The data you ingest needs to be safely stored and siloed between customers' instances, given the multi-tenant nature of SaaS applications.
Breaking changes: 3rd party APIs often release breaking changes that your team will need to stay on top of.
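As referenced in the rate limits point above, a common mitigation is to retry with exponential backoff when an API responds with HTTP 429. This is a minimal sketch; real ingestion pipelines typically push failed requests onto a queue instead of blocking in-process.

```python
import time

import requests


def get_with_retry(url: str, headers: dict, max_retries: int = 5) -> requests.Response:
    """Retry with exponential backoff when the 3rd party API returns 429 (rate limited)."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Respect a Retry-After header if the API provides one, otherwise back off exponentially.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```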
Solving these challenges will require a ton of engineering effort, but it won’t serve as a differentiator for your company. That’s why many AI SaaS companies, from startups to public companies, use Paragon as their products’ ingestion engines for their customers’ external data, so they can focus on their core competencies.
2. Chunking: Most contextual data will be unstructured by nature, which means you'll be left dealing with large volumes of raw text. Because of the limited context windows of the LLMs downstream, you can't simply inject all of your customer's data into every prompt (not to mention how inefficient that would be). Picking the right chunking strategy, such as fixed-size, recursive, or semantic chunking, is a whole separate discussion that we won't get into today.
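As a simple illustration of the fixed-size strategy mentioned above, here's a naive character-based chunker with overlap. Production systems usually chunk by tokens or by semantic boundaries instead, so treat this as a sketch of the idea rather than a recommended strategy.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap so ideas aren't cut off mid-thought."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step back by the overlap so adjacent chunks share context
    return chunks
```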
3. Generate Embeddings: To enable similarity searches across chunks (so your RAG process can retrieve the most relevant data at runtime), you need to generate embeddings, which represent each chunk as a numerical vector. There are many embedding models you can use to achieve this; Hugging Face's embedding leaderboard tracks the top models based on the latest benchmarks.
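Here's a minimal sketch of generating embeddings for those chunks with OpenAI's embeddings endpoint; the model name is just one example, and any embedding model from the leaderboard could be swapped in.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Turn each text chunk into a dense vector for similarity search."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in resp.data]
```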
4. Storing in a Vector Database: Once you've generated the embeddings from the ingested data, you need to store them in a vector database. Storing data as vectors enables you to perform similarity searches through high volumes of data quickly, which is vital for RAG given how large these knowledge bases get.
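Continuing the sketch above, storing the vectors in Pinecone might look like the following. The index name, one-namespace-per-tenant scheme, and metadata fields are assumptions for illustration, and the exact client API varies by version.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("customer-context")  # assumed index name


def store_chunks(tenant_id: str, chunks: list[str], vectors: list[list[float]]) -> None:
    """Upsert chunk vectors, using one namespace per tenant to keep customer data siloed."""
    index.upsert(
        vectors=[
            {
                "id": f"{tenant_id}-chunk-{i}",
                "values": vec,
                "metadata": {"text": chunk, "source": "google_drive"},  # illustrative metadata
            }
            for i, (chunk, vec) in enumerate(zip(chunks, vectors))
        ],
        namespace=tenant_id,
    )
```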
5. Runtime Retrieval: At runtime, when a user performs an action that sends a query, you'll vectorize the query and perform a similarity search against all the data in your vector database (that you've ingested and stored in steps 1-4). This will enable you to retrieve the most relevant chunks of context, which you can then include in your prompt to your LLM of choice (e.g. GPT, Llama, Gemini, etc.). That way, the LLM can leverage the retrieved information as context to generate a much more personalized, relevant, and comprehensive response to your user's query.
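Putting steps 3-5 together, a bare-bones retrieval-and-generation flow might look like this, reusing the client and index handles from the sketches above; the model names and the prompt template are examples, not a prescribed setup.

```python
def answer_query(tenant_id: str, query: str, top_k: int = 5) -> str:
    """Embed the query, fetch the most similar chunks, and inject them into the prompt."""
    # Reuses `client` (OpenAI) and `index` (Pinecone) defined in the earlier sketches.
    query_vec = client.embeddings.create(
        model="text-embedding-3-small", input=[query]
    ).data[0].embedding

    results = index.query(
        vector=query_vec, top_k=top_k, namespace=tenant_id, include_metadata=True
    )
    context = "\n\n".join(match.metadata["text"] for match in results.matches)

    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```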
RAG Considerations: Security and Permissions
While implementing RAG is straightforward in theory, there are some challenges with RAG today, especially around privacy, security, and user permissions, since you will be passing proprietary data to the underlying LLM.
Security
Beyond securely storing your customers' proprietary data in your vector/relational data stores, you need to ensure that their data doesn't leak to the underlying models.
For example, depending on the product and your data-sharing settings, data you pass to OpenAI via a prompt may be retained for a limited period and, in some cases, used to improve their models. If it is used for training, your customers' data could effectively leak to other users of those models.
Even if you self-host an open-source LLM like Llama 3, sharing a single instance across customers, for example by later fine-tuning it on their combined data, could introduce leakage across your various tenants.
There are a few solutions to this problem:
Use an open-source/self-hosted LLM and deploy a separate instance of your LLM to each of your customers
Use an Enterprise, closed-source LLM offering (like GPT Enterprise) that provides security guarantees; however, you would need each of your users to provide their own Enterprise LLM API keys to prevent cross-tenant leakage
Permissions
Assuming you have your customers authenticate into the 3rd party data sources for their entire organization via an admin account, individual end-user level scopes and permissions will not persist by default.
As a result, your ingestion jobs need to associate additional, more granular permissions metadata (such as object-level or document-level read permissions) with individual users. Otherwise, when a user generates a query, the RAG process will retrieve all relevant context from the vector store, even if that user didn't have permission to view the data in the original data source. Pinecone has a very detailed explanation of how this can be implemented with ACL services such as Averto.
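One way to enforce this at query time, assuming each chunk was upserted with an allowed_user_ids metadata list reflecting permissions in the source system, is a metadata filter on the vector search. The field names are illustrative, and this reuses the index handle from the earlier sketch.

```python
def retrieve_permitted_chunks(
    tenant_id: str, user_id: str, query_vec: list[float], top_k: int = 5
):
    """Only return chunks the requesting user was allowed to see in the source system."""
    return index.query(
        vector=query_vec,
        top_k=top_k,
        namespace=tenant_id,  # tenant-level isolation
        filter={"allowed_user_ids": {"$in": [user_id]}},  # per-user document permissions
        include_metadata=True,
    )
```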
End-to-End Process of Fine-tuning an LLM
Now let's look at how fine-tuning works in a multi-tenant SaaS architecture, from data ingestion and cleaning to training and deployment. The following is a breakdown of the process of fine-tuning a pre-trained model.
Data Ingestion: Similar to RAG, you need to ingest data from your users’ external applications for the purposes of building a training data set. This can look like sales emails from their sales engagement platforms, conversation histories from Intercom, and sales performance data from their CRMs.
Preparation: Unlike RAG, however, this data needs to be prepped and cleaned so that it can be used for fine-tuning. At a high level, you need training, validation, and test data sets in order to fine-tune a model. Examples of training data sets:
Training data set for an AI support chatbot would contain:
Input - all accessible data on a customer as well as the ticket they submitted
Output - the response a support rep gave
Training data set for an AI content writer would contain:
Input - a customer’s blog outline and brief
Output - the blog post they published from that outline
The challenge is ensuring that your training data sets are clean and contain only optimal examples. For example, you wouldn't want to train the model on data from your customers' worst articles or support reps. This is why some AI SaaS companies include an onboarding phase to validate the training data sets with their customers before they begin fine-tuning their models.
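For reference, a single training example for the support chatbot case above, in the chat-format JSONL that OpenAI's fine-tuning endpoint expects, could be assembled like this; the content and file name are purely illustrative.

```python
import json

# One illustrative training example: the "assistant" message is the target output
# the model should learn to produce for the given input.
example = {
    "messages": [
        {"role": "system", "content": "You are a support agent for Acme."},
        {"role": "user", "content": "Customer profile: ...\nTicket: I can't reset my password."},
        {"role": "assistant", "content": "Hi! Here's how to reset your password: ..."},
    ]
}

# Fine-tuning data is a JSONL file: one example object per line.
with open("support_examples.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```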
Fine-tune and validate: For most companies, it doesn't make sense computationally to train a model from scratch or hand-tune its parameters. Instead, transfer learning, i.e. further training the pre-trained model on examples like those above, enables you to adapt the underlying LLM even with a limited data set.
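With a JSONL file like the one above, kicking off a fine-tuning job via OpenAI's API is a short script. The base model name is just an example of a fine-tunable model; self-hosted open-source stacks have their own equivalents.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Upload the prepared training data, then start a fine-tuning job on a base model.
training_file = client.files.create(
    file=open("support_examples.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # example fine-tunable base model
)
print(job.id, job.status)  # poll the job until it succeeds, then use the resulting model
```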
Evaluation and deployment: Once you've fine-tuned the model, you need to test it against a validation set to ensure you get the desired responses. If the results aren't satisfactory, continue fine-tuning with additional data until the model is production-ready, at which point you'd deploy it.
Reinforcement learning: In production, you can also introduce fine-tuning loops via reinforcement learning from human feedback (RLHF). In its most basic form, you surface the ability for users to rate the responses they receive and use that rating as a reward signal. However, this can get significantly more sophisticated based on behavior, such as evaluating whether the user asked a clarifying question, user sentiment, or whether they used the output as-is (within the context of AI content generation).
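The simplest building block for that loop is just capturing the rating alongside the prompt and response so it can later seed a preference or reward dataset. Here's a minimal sketch, file-based for brevity, where a real system would write to a database and handle a lot more metadata.

```python
import json
import time


def record_feedback(
    user_id: str, prompt: str, response: str, rating: int, path: str = "feedback.jsonl"
) -> None:
    """Append a (prompt, response, rating) record for later preference/fine-tuning data."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "user_id": user_id,
            "prompt": prompt,
            "response": response,
            "rating": rating,        # e.g. +1 for thumbs up, -1 for thumbs down
            "timestamp": time.time(),
        }) + "\n")
```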
Conclusion
It’s clear that RAG and fine-tuning are both useful techniques for leveraging your customers’ external data to improve the outputs generated by the LLMs that power your product’s AI features.
Given the improvement of foundational LLMs and the difficulty of building a robust training data set for fine-tuning, most companies should prioritize building a RAG process before fine-tuning – the reward-to-effort ratio is much higher. However, over time, I expect that companies will implement both RAG and fine-tuning as the AI SaaS landscape gets more competitive.