Why unstructured data makes building RAG applications so hard
Context is critical for LLMs, but it lives in unstructured data that's scattered across hundreds of your users' files and apps. Extracting that valuable context comes with many challenges, which we cover in this article.
Mathew Pregasen, Developer Relations
8 mins to read
Engineers have largely recognized that fine-tuning LLMs is not sufficient in most use cases. To generate relevant and useful responses, LLMs require context, which lives in troves of unstructured data that is difficult to access and integrate into your RAG pipelines.
This data spans departments: sales, engineering, product, and more. Sales generates emails, call transcripts, and chat conversations. Engineering writes code to GitHub and internal documentation to Confluence. Product pumps out countless PDFs and Notion files. To top it off, the entire organization produces tens of thousands of Slack messages per month. This unstructured data amounts to an aggregation of a company's knowledge.
Prior to AI, this contextual data just sat around. In fact, it was rarely called “data” as it wasn’t exactly independently useful for gathering insights. It might’ve been yanked into a few integrations here and there for automation purposes (such as uploading a file to Google Drive), but generally speaking, it was stale media. That’s no longer the case today; we’re seeing an influx of AI software companies racing to productize the capabilities of LLMs, offering out-of-the-box solutions such as AI chatbots, AI sales reps, and copilots across various domains. But without access to this “everyday data”, these products will only be as useful as your favorite airline’s website chatbot (hint: not very useful).
Unfortunately, working with unstructured data is as much of a data challenge as it is an AI challenge. By definition, unstructured data does not fit into a structured schema or relational database. It's also scattered across locations, gated by various permissions and limits, and can live in different formats such as text, images, and videos, which often requires pre-processing before it's even usable.
Working with some of the leading AI SaaS companies like AI21, Pryon, and Copy.ai has kept us attuned to the challenges of siphoning unstructured data into AI / RAG pipelines.
Today, we’re going to share our learnings.
Why is unstructured data so valuable?
If you had to guess, you might wager that unstructured data is the icing, not the cake. After all, the data isn’t created with the intention of adding context to prompts to LLMs; it’s just the byproduct of general business activities. And frankly, before 2023, unstructured data was the icing; at Paragon, we scarcely got requests for integrating anything beyond structured fielded data (e.g. CRM records or marketing campaign performance).
But today, unstructured data is the golden lacquer for any AI software product. Workflow automation was good for simple IFTTT logic, but it always lacked nuance. By injecting relevant conversational data, notes from different apps, Slack conversations, and more into LLM prompts via RAG, these AI SaaS products can generate significantly more nuanced and useful responses.
The various sources of unstructured data
When products need to tap into their users' unstructured data, they usually need to access all of it. After all, contextual data is only as valuable as the context is complete (you'd never want data taken out of context). But maximizing context coverage means working with many different types and sources of unstructured data (and structured data) to build connected knowledge.
We’re seeing a lot of patterns across our customers in terms of the data they’re ingesting for their AI SaaS applications’ RAG pipelines, including:
Sales or marketing emails (e.g. Salesloft, Outreach, HubSpot, and Mailchimp)
Past support chat conversations (e.g. Intercom, Zendesk, and Freshdesk)
Ongoing and historical projects/tasks/tickets (e.g. Jira, Asana, and Linear)
Code repos (e.g. GitHub)
Internal documents and files (Notion, Confluence, and files from Google Drive)
This brings us to the classic integration challenge.
For every one of these data sources, your team will need to figure out auth, research the API, spin up webhooks and listeners for real-time ingestion, and build in rate limiting and error handling… all in a multi-tenant architecture. Multiply that by the dozens of 3rd party applications where this data lives, and you've got yourself a multi-year roadmap of integrations development work.
But the breadth of 3rd party services you need to work with isn’t the only challenge when it comes to context ingestion - the more technical challenge is building an infrastructure capable of ingesting high volumes of data.
The challenge with high-volume ingestion
Unlike structured data, which is generally designed to be streamed or transferred, unstructured data usually hails from applications with low rate-limit APIs and sometimes no webhook support at all. When this data was moved around, it was usually through one-time imports and exports.
But ingesting context from this unstructured data introduces a lot more complexity, especially when high volume is at play. High-volume data ingestion, as we discussed in a past piece, is difficult by nature: it can easily overload compute and create all sorts of scaling issues.
For example, say you wanted to ingest all the files from your users’ Google Drive directories - your infrastructure needs to be able to handle a wide range of unpredictable file sizes, anywhere from a small 2-page PDF to a 2GB webinar recording.
And if you have webhooks set up to listen for file changes and a user suddenly copies over 1000 files from another directory, you’ll need to be able to catch and queue all of those webhook payloads without fail.
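To make that concrete, here is a minimal sketch of the pattern that keeps bursty webhooks from overwhelming an ingestion pipeline: acknowledge immediately, queue the event, and process it asynchronously. The endpoint path, payload shape, and in-process queue are illustrative assumptions; in production you would typically back this with a durable queue like SQS or Kafka.

```python
# A minimal sketch of a webhook receiver that acknowledges immediately and
# defers processing to a worker, so a burst of file-change notifications
# (e.g. a user copying 1,000 files) doesn't overwhelm the ingestion pipeline.
import queue
import threading

from flask import Flask, request

app = Flask(__name__)
event_queue: "queue.Queue[dict]" = queue.Queue()  # swap for SQS/Kafka/etc. in production


def process_file_change(event: dict) -> None:
    # Placeholder for the real work: fetch the file, preprocess it, upsert embeddings.
    print(f"processing change for file {event.get('file_id')}")


def worker() -> None:
    """Drain the queue at a controlled pace; heavy work happens here, not in the handler."""
    while True:
        event = event_queue.get()
        try:
            process_file_change(event)
        finally:
            event_queue.task_done()


@app.route("/webhooks/file-changes", methods=["POST"])
def handle_file_change():
    # Acknowledge fast; never do heavy work inside the webhook handler.
    event_queue.put(request.get_json(force=True) or {})
    return "", 202


threading.Thread(target=worker, daemon=True).start()

if __name__ == "__main__":
    app.run(port=8000)
```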
In this department, we especially speak from experience. Paragon serves as the unstructured data ingestion layer for many of our customers' applications. Our integration engine is designed to withstand varying throughput, from tiny JSON blobs to massive video files. But that took years of engineering (and headaches) to build.
Note: This is why AI SaaS companies like AI21, Pryon, and Writesonic use Paragon, so their engineers can stay focused on core competencies, not integrations.
But getting the data is just the first challenge (albeit a pretty big one). Because the data arrives in various formats, you'll need different pre-processing pipelines. The unstructured data could be full of fluff or HTML tags that you don't want to inject into your prompts, which means you need to clean the data. And permissions become a huge concern because all the ingested data is stored in a central vector database.
Let’s discuss these issues in detail, starting with preprocessing.
Preprocessing—the bane of unstructured data
One thing that makes unstructured data so difficult to work with is preprocessing. Preprocessing, where data is transformed so it can be vectorized, can be split into three sub-steps—conversion, cleaning, and chunking.
Conversion
Because LLMs ingest language, a.k.a. strings, non-string inputs need to be converted into strings. For some unstructured data, this is easy and deterministic; for example, a JSON blob could be converted into a string with one-liner logic. But for others, the process is trickier.
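Here's a quick sketch of the easy case, with a made-up record purely for illustration:

```python
import json

# Illustrative record: a support ticket pulled from a CRM or helpdesk API.
record = {"ticket_id": 4821, "status": "open", "summary": "Login fails on SSO"}

# Deterministic one-liner: serialize the structured blob as-is...
as_json_string = json.dumps(record)

# ...or flatten it into a more prompt-friendly string.
as_prose = ", ".join(f"{key}: {value}" for key, value in record.items())
# "ticket_id: 4821, status: open, summary: Login fails on SSO"
```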
PDFs, a common type of unstructured data, are one of the trickier cases. PDFs are not designed for easy text extraction; for scanned or image-based documents, you cannot extract a PDF's inner text without more computationally intensive processes like Optical Character Recognition (OCR) or Intelligent Document Processing (IDP). Some of our enterprise customers have built proprietary tooling for this in-house, but the majority of AI SaaS startups are either leveraging open-source libraries like Tesseract and manually handling computational scale, or using solutions like Sensible, Mindee, or AWS Textract.
There is a reason that third-party OCR / IDP products are popular. While open-source frameworks are quite effective on individual documents, they struggle with scale. When an AI application is pulling in and processing thousands of PDFs, compute power, distribution, and network limits all need to be considered.
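For a sense of what the open-source route looks like, here's a minimal sketch using pdf2image and pytesseract (it assumes the Poppler and Tesseract binaries are installed locally, and the file name is illustrative). Scaling this across thousands of documents is where the real work begins.

```python
from pdf2image import convert_from_path  # requires the Poppler binaries
import pytesseract                        # requires the Tesseract binary


def pdf_to_text(path: str) -> str:
    """Rasterize each page and run OCR over it; fine for a demo, costly at scale."""
    pages = convert_from_path(path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


text = pdf_to_text("quarterly_report.pdf")  # illustrative file name
```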
The same applies to converting video and audio files. Obviously, video and audio files aren't immediately useful to LLMs. The audio layer needs to be transcribed to strings via speech-to-text models, and the video layer needs to be synthesized into visual descriptions. This all takes non-trivial computational bandwidth, though more and more solutions are popping up to solve these problems, including Deepgram, AssemblyAI, and even OpenAI's toolkits.
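As one example of the hosted route, a minimal transcription call against OpenAI's Whisper API might look like the sketch below; the file name is illustrative, and Deepgram and AssemblyAI expose comparable transcription endpoints.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative file name; in practice this would be a recording pulled from
# a user's Google Drive, Zoom, or similar source.
with open("webinar_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # the string you can now clean, chunk, and embed
```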
Cleaning
Regardless of whether the data requires conversion to strings, it still needs to be cleaned. LLMs don't perform as well when the data/context is filled with filler or noise, as it dilutes the most critical pieces of context. This can even impact the performance and accuracy of searches against the vector store's indexes, but we won't get into that here.
Optimally, data should be stripped down to only the context that is actually relevant to the task, no more, no less. Depending on the data, this cleaning process can be quite rudimentary, such as getting rid of emojis and stray Unicode characters or lowercasing strings. Sometimes, applications also need to remove stop words (e.g. "in", "on", "at") that aren't helpful to the model. After all, the goal is to produce a tight string that only has the information necessary for the LLM's job. But cleaning might also involve more complex parsing, like stripping HTML tags and syntactic noise or reducing words to their root form.
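A deliberately rudimentary cleaning pass, as a sketch, might look like this; dropping all non-ASCII is lossy for accented text, so the exact steps should be tuned to the data you're ingesting.

```python
import re

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def clean(raw: str) -> str:
    """Strip markup, collapse whitespace, and drop emoji/stray Unicode before chunking."""
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")  # strip HTML tags, decode entities
    text = re.sub(r"\s+", " ", text)                                   # collapse whitespace
    text = text.encode("ascii", "ignore").decode()                     # drop emoji and non-ASCII noise
    return text.strip()


clean("<p>Hey team 👋, the Q3 roadmap is <b>attached</b>.</p>")
```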
Additionally, sensitive information such as PII or company EINs needs to be redacted before it can be passed to an LLM, though this often occurs post-embedding, before the data is injected into the prompt. This is a huge concern for companies today, but thankfully tools like Skyflow provide first-class data privacy tooling for this problem.
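At its crudest, redaction can be a pair of regexes, sketched below for emails and US SSN-like patterns; real PII detection is much broader than this, which is why dedicated tooling exists.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Very rough PII masking; production systems lean on dedicated privacy tooling."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return SSN.sub("[REDACTED_SSN]", text)


redact("Reach me at jane.doe@acme.com, SSN 123-45-6789.")
```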
Chunking
Finally, there is chunking. LLMs have a limit on how large a prompt can be (measured in tokens). This constraint calls for chunking, where text is broken into smaller segments that are ingested separately.
There are various types of chunking, from bare-bones approaches like splitting text by paragraphs or sentences, to more context-aware strategies such as propositional chunking. The latter usually produces better results because data isn't cut off at confusing points, but there are always tradeoffs with each approach. Ultimately, you'll need to test different chunking strategies against the various types and sources of data you're ingesting to see what results in the most precise and useful chunks.
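As a baseline to test fancier strategies against, here is a sketch of naive fixed-size chunking with overlap; the sizes are arbitrary assumptions, and in practice you'd usually measure in tokens rather than characters.

```python
def chunk(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap; a baseline, not a recommendation."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + max_chars])
        start += max_chars - overlap
    return chunks


pieces = chunk("a long, cleaned document ... " * 500)
```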
The permissions challenge
The final issue that makes unstructured data so tricky is permissions and access control lists (ACLs).
In a regular application, this is quite easy to manage: a user requesting access to a file or page is checked against that asset's permissions through role-based access control (RBAC) or an ACL. And unlike structured data, where entire objects or tables have uniform restrictions, individual unstructured files have varying access.
In a RAG pipeline, however, an admin user generally authorizes the ingestion of an entire organization's files, and all of that data lands in a single vector store.
Without a robust permissions strategy, the retrieval mechanism in your RAG process could pull context chunks that the user initiating the query should not have access to. For example, all of an organization's data may be ingested into an AI application, but files restricted to the C-suite shouldn't leak to sales reps, even through diluted AI responses. This requires companies to craft a framework that ensures any RAG process retrieves only the data the requester is privy to when answering a request.
In essence, the challenge is less about preventing external actors from accessing the contextual data and more about ensuring the right guardrails exist within an organization.
This creates two sub-problems: (a) pulling permissions lists and (b) respecting permissions during retrieval.
Pulling permissions lists
Pulling these lists can be tricky because integrations often decouple data API routes and ACL API routes. Often, companies trust a third party to manage the permissions integration layer. For example, AI companies like Pryon ingest permissions metadata as part of the ingestion workflow for each file.
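As a rough sketch (not any particular vendor's implementation), pulling per-file permissions from the Google Drive API during ingestion might look like this; the service-account setup and field selection are assumptions, and a multi-tenant product would use each customer's own OAuth credentials instead.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Illustrative credential setup; in a multi-tenant product each customer's
# OAuth token would be used rather than a single service account.
creds = service_account.Credentials.from_service_account_file(
    "service_account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
drive = build("drive", "v3", credentials=creds)


def file_permissions(file_id: str) -> list[dict]:
    """Fetch who can access a single Drive file, to store alongside its chunks."""
    response = drive.permissions().list(
        fileId=file_id,
        fields="permissions(id,type,emailAddress,role)",
    ).execute()
    return response.get("permissions", [])
```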
This is of course just for illustrative purposes - in practice, you may want to pull the entire Google Drive account’s permissions hierarchy (think of a file tree) at a directory level first. Then upon file ingestion, only pull in user-level permissions for each file as the rest can be inferred from where the file lives within the permissions hierarchy.
That said, implementation can also vary greatly between 3rd party services. For example, Notion doesn’t provide a permissions API, which makes the permissions challenge much more difficult to navigate.
Respecting permissions during retrieval
The next piece is storing the permissions metadata alongside each chunk, so at runtime, only the chunks that the querying user has permissions to access can be retrieved.
Pinecone has a tutorial on how to achieve RAG with permissions when using their vector database, although it is simplified into attribute-based (think departments) access control vs. user-specific ACLs.
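The user-specific variant is sketched minimally below using Pinecone's metadata filtering; the index name, metadata fields, and placeholder embeddings are all illustrative assumptions, not a prescribed schema.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # illustrative; load from config in practice
index = pc.Index("docs")               # assumed index name

# Placeholder vectors; in practice these come from your embedding model
# and must match the index's dimension.
chunk_embedding = [0.1] * 1536
query_embedding = [0.1] * 1536

# At ingestion time, store the chunk's ACL in its metadata.
index.upsert(vectors=[{
    "id": "gdrive-file-123-chunk-0",
    "values": chunk_embedding,
    "metadata": {"allowed_users": ["alice@acme.com", "bob@acme.com"], "source": "gdrive"},
}])

# At query time, restrict retrieval to chunks the requesting user may see.
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"allowed_users": {"$in": ["alice@acme.com"]}},
    include_metadata=True,
)
```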
Closing Thoughts
For years, unstructured data was just the byproduct of "business-as-usual" practices. Today, with the advent of generally available LLMs, it is incredibly valuable. However, integrating unstructured data into AI / RAG applications poses some unique challenges. Data needs to be converted, cleaned, and chunked. Permissions need to be respected. And data, in all its varying sizes, needs to be ingested reliably.
If you are interested in learning how we’ve thought about this problem, please feel free to book some time with our team.