Challenges with High Volume Data Ingestion
Sharing some of the challenges that we had to overcome when building Paragon's workflow engine in order to support our customers' enterprise-scale ingestion jobs.
Mathew Pregasen, Dev Rel · 12 min read
For the last two decades, users have increasingly come to expect SaaS applications to integrate with their other SaaS applications. But today, that desire has become even more pervasive with the advent of generally-available AI, where software companies are building RAG pipelines and ingesting tons of contextual data from their customers’ various third-party sources like Salesforce, Notion, and Gong. We see this trend across the board in our experience working with both early-stage startups and enterprise customers like AI21 and Dropbox.
Unfortunately, integrations are difficult—even in the most rudimentary cases. Integrations involve lots of moving parts (which are often poorly-documented or bug-prone). Between secure API key management, keeping OAuth tokens refreshed, accessing the right permissions and scopes, request routing, and data batching, there are many sources of significant headache for engineers.
To make matters worse, when integrations process heavy volume (an almost default case for today’s AI-powered apps that need to ingest contextual data from everywhere), even harder problems emerge. These problems are the by-product of integrations not being scale-friendly by nature. Webhook receivers crash. Rate limits get tripped. Data custody gets violated. It’s like pumping a river’s supply through a cottage’s plumbing system; in theory, it’s possible, but there’s a very high likelihood of a fatal crash.
Today, we’re going to discuss the problems that high volume data integrations uniquely induce, which we’ve also had to account for when designing Paragon’s architecture. The goal isn’t to scare you away from maxing out your integrations; ingesting your users’ external data will enable you to multiply the value your product can provide customers, whether it’s through enriching data, operationalizing it, or generating better responses via RAG and fine-tuned LLMs. Instead, our goal is to make you aware of the common, and sometimes unexpected, hazards of pressure-tested integrations.
To begin, let’s define what “high volume” actually entails.
What is high volume data ingestion?
High volume data ingestion describes any scenario where the total bytes transacted between an API and a client application cross a reasonable threshold. High volume data ingestion is sometimes referred to as HVDI (that’s a complete lie, but it has a nice cable-like ring to it).
High volume data ingestion is the result of one of three scenarios:
(i) An application needs to pull lots and lots of records from a third-party platform. This might be an entire CRM of millions of records being ingested to feed a customer genie LLM chatbot (alternatively, an end-user importing two million records). Here, each record is small in size, but collectively they sum to a lot of data.
(ii) An application needs to pull a non-trivial amount of data concurrently. It might not be the entire database, but the number of requests happening simultaneously overwhelms the server’s connection capacity.
(iii) Finally, an application may need to extract very large, often unstructured records, such as massive video files, images, and PDFs. The records might be few, but fat.
Of course, with most of our AI customers, size and volume are both significant concerns given the vast amount of structured and unstructured data they need to ingest and index.
That leaves the question: what is the threshold? Like most things, it depends. An API’s static constraints and a stack’s structure both play a role in when a system goes from business-as-usual to battle-tested. But, to provide a very rough gist, we’re referring to nearly a terabyte of data being transacted daily, where bursts in volume might eclipse a machine’s available RAM. Alternatively, we might be talking about thousands of requests in short bursts, where a single machine cannot handle the concurrency.
The two types of ingestion jobs
Data can be ingested in one of two ways: API queries (polling) or webhooks (streamed updates).
Polling API Queries
The simplest way to ingest data is to ping an API to check for any changes and then pull those changes. This is typically how the initial ingestion job is run in an integration setting. The background job looks like this (a rough sketch follows the list):
Your end-user authenticates their 3rd party account (e.g. HubSpot)
This triggers a background ingestion job that gets all relevant records from their 3rd party API
You pass that data into your app’s API/relational database
(If the data is unstructured) Process and chunk it before pushing it to a vector store or knowledge graph.
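As a rough sketch, here is what that cursor-paginated initial pull might look like in TypeScript. The `/contacts` endpoint, the `nextCursor` field, and the `saveRecords` helper are hypothetical placeholders rather than any specific provider's API:

```typescript
// Sketch of a cursor-paginated initial ingestion job (Node 18+ for global fetch).
// BASE_URL, /contacts, nextCursor, and saveRecords are illustrative placeholders.
const BASE_URL = "https://api.example-crm.com";

async function saveRecords(records: unknown[]): Promise<void> {
  // In practice: upsert into your relational database, or chunk + embed
  // unstructured content and push it to a vector store or knowledge graph.
  console.log(`persisted ${records.length} records`);
}

export async function initialIngestion(accessToken: string): Promise<void> {
  let cursor: string | undefined;

  do {
    const url = new URL(`${BASE_URL}/contacts`);
    url.searchParams.set("limit", "100");
    if (cursor) url.searchParams.set("after", cursor);

    const res = await fetch(url, {
      headers: { Authorization: `Bearer ${accessToken}` },
    });
    if (!res.ok) throw new Error(`Ingestion page failed: ${res.status}`);

    const page = (await res.json()) as { results: unknown[]; nextCursor?: string };
    await saveRecords(page.results);
    cursor = page.nextCursor;
  } while (cursor); // keep paging until the API reports no more records
}
```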
If your customers have millions of records, ingesting all of those records concurrently can take down your servers. In fact, this is exactly what happened to Crystal Knows when one of their users suddenly imported a few thousand records into their CRM at once; it took their whole app down for a few hours. That’s why it’s extremely important to be able to dynamically scale workers to handle these types of spikes whenever an ingestion job is kicked off, and to have a queuing system in place to distribute the load over time (our engineering team spent years building and optimizing our workflow engine to support just that).
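To make the queue-and-workers idea concrete, here is a minimal sketch that uses an in-process array purely for illustration. A production setup would use a durable queue (e.g. SQS or a Redis-backed queue) with workers that scale on queue depth:

```typescript
// Illustrative only: an in-process queue drained by a bounded worker pool,
// so a burst of ingestion tasks is spread over time instead of fired at once.
type IngestionTask = { customerId: string; pageCursor?: string };

const queue: IngestionTask[] = [];

export function enqueue(task: IngestionTask): void {
  queue.push(task);
}

async function worker(
  id: number,
  handle: (t: IngestionTask) => Promise<void>,
): Promise<void> {
  while (queue.length > 0) {
    const task = queue.shift();
    if (!task) break;
    await handle(task); // fetch one page, persist it, enqueue follow-up work
  }
  console.log(`worker ${id} drained`);
}

// Run a fixed number of workers in parallel against the queue.
export async function runWithWorkers(
  concurrency: number,
  handle: (t: IngestionTask) => Promise<void>,
): Promise<void> {
  await Promise.all(
    Array.from({ length: concurrency }, (_, i) => worker(i, handle)),
  );
}
```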
That covers the initial ingestion job, but what about ongoing updates? This creates a decision for an application: either constantly dispatch requests for updates and risk tripping a rate limit, or reduce request frequency and miss out on real-time behavior.
3rd party API Rate Limits
Rate limits are an API provider’s safeguard against too many requests overwhelming their servers; accordingly, they protect against DoS attacks. However, rate limits often hinder legitimate uses of an API, and when not properly handled, they can cause data integrity issues.
Rate limits can particularly become an issue if the integration doesn’t support webhooks, which we'll get to. The work-around is to constantly poll for updates, which can risk tripping a rate limit.
To account for this, when the limit is approached, requests need to be quarantined and only dispatched after enough time has passed. This is possible with a basic counter if there is a single job making requests, but with multiple background jobs making requests (such as one for Contact updates, one for Company updates, etc.), more robust cross-instance logic is needed. This is because any one instance doesn’t know whether the rate limit is close to being hit due to heavy requests made by another instance.
If a rate limit is tripped, the error needs to be caught, and the failed request has to be retried after the limit resets. Otherwise, data integrity is violated.
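Here is one simplified way to approach this, assuming a fixed request budget per time window and a standard Retry-After header on 429 responses. The in-memory counter below only works for a single process, which is exactly why cross-instance logic (e.g. a shared Redis counter) becomes necessary once multiple jobs run in parallel:

```typescript
// Sketch: respect a per-provider budget before dispatching, and retry on 429.
// The window size and request budget are assumed values, not any real API's limits.
const WINDOW_MS = 10_000;
const MAX_REQUESTS_PER_WINDOW = 100;

let windowStart = Date.now();
let used = 0;

async function waitForBudget(): Promise<void> {
  if (Date.now() - windowStart >= WINDOW_MS) {
    windowStart = Date.now();
    used = 0;
  }
  if (used >= MAX_REQUESTS_PER_WINDOW) {
    const waitMs = WINDOW_MS - (Date.now() - windowStart);
    await new Promise((r) => setTimeout(r, Math.max(waitMs, 0)));
    windowStart = Date.now();
    used = 0;
  }
  used += 1;
}

export async function rateLimitedFetch(
  url: string,
  init?: RequestInit,
): Promise<Response> {
  for (let attempt = 0; attempt < 5; attempt++) {
    await waitForBudget();
    const res = await fetch(url, init);
    if (res.status !== 429) return res; // success, or a non-rate-limit error

    // Rate limit tripped: honor Retry-After if present, then try again.
    const retryAfterSeconds = Number(res.headers.get("Retry-After")) || 1;
    await new Promise((r) => setTimeout(r, retryAfterSeconds * 1000));
  }
  throw new Error("Rate limit retries exhausted; record the failure for replay");
}
```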
Overall, handling rate limits is a tough problem, and one our customers wanted us to solve for them out of the box, which led to the creation of Smart Rate Limits.
Frequency decisions aside, API queries are also made blindly: they don’t account for periods of no change, nor do they adapt to how quickly data may be flurrying in. API queries leave the guesswork to the client, which has no visibility into when new data becomes available.
In short, polling is a poor mechanism for ingesting ongoing updates. The solution is webhooks.
Webhooks
Webhooks are public endpoints that are subscribed to updates from a third-party application. That application will then hit the webhook endpoint whenever a record is created or updated.
The difference between API queries and webhooks is akin to calling a bank to learn about recent transactions versus getting a push notification. Webhooks fundamentally off-load the timing issues to the third-party application, which has visibility into when an update is available. For example, many of our customers use Paragon to listen for new Jira Issues that are created in their users’ Jira accounts in real-time.
However, webhook listeners and endpoints need to be robust. For high-volume ingestion scenarios, they are going to be receiving a lot of traffic. But they are processing units, not data stores designed for volume and security. This distinction defines the tough problems of implementing webhooks.
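A minimal sketch of that principle in an Express receiver: do as little work inline as possible, hand the event off to a queue, and acknowledge immediately. `enqueueForProcessing` is a placeholder for whatever durable queue producer you use:

```typescript
import express from "express";

// Sketch of a webhook receiver that treats itself as a processing unit,
// not a data store: accept the event, enqueue it, and ack fast.
const app = express();
app.use(express.json());

async function enqueueForProcessing(event: unknown): Promise<void> {
  // e.g. publish to SQS/Kafka/BullMQ so workers can process at their own pace
}

app.post("/webhooks/crm", async (req, res) => {
  try {
    await enqueueForProcessing(req.body);
    res.status(202).send("accepted"); // ack quickly so the provider doesn't retry
  } catch {
    // Returning 5xx prompts most providers to redeliver the event later.
    res.status(500).send("temporarily unavailable");
  }
});

app.listen(3000);
```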
The issues with webhooks
The issues with webhooks can be broken down into a few categories, which often compound with one another.
Data Integrity
One of the hallmark hazards of webhooks is breaking data integrity.
Data integrity is the expectation that data remains valid and accurate. However, that’s a bit of a tricky definition, and data integrity is better explained through an example.
Imagine that one of your customers’ Salesforce data is synced with your Postgres database through an integration. To the general system, there is an expectation that the Salesforce data aligns with the Postgres data. However, if the data sync fails, and that failure is not error-handled, then data integrity is violated because that expectation isn’t met.
When it comes to high-volume ingestion jobs, there is a very high likelihood of failures without the proper infrastructure to support them. For reasons discussed in detail in later sections, syncs might break due to network issues, application issues, or programmatic errors. Irrespective of the reason, a failure needs to be error-handled so that (a) it can be re-attempted or (b) there is an established record of the data not being synced.
Data integrity can also be violated due to duplicate requests, especially if records don’t have a primary key to safeguard against duplication. For example, if an event log dispatches a few events twice, then idempotency is violated because duplicate records are created. These are difficult to debug without combing through every event. Yet they are unfortunately common, particularly whenever a request timeout triggers another attempt even though the original request was partially successful.
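One common mitigation is to deduplicate on an idempotency key, such as the provider's event ID, before writing. The sketch below uses an in-memory Set purely for illustration; in practice the check usually lives in a unique constraint in the database or a shared cache with a TTL:

```typescript
// Sketch: drop duplicate deliveries before they reach the database.
// An in-memory Set only demonstrates the idea; it doesn't survive restarts
// or span multiple instances.
const seenEventIds = new Set<string>();

type WebhookEvent = { id: string; type: string; payload: unknown };

export async function handleEvent(
  event: WebhookEvent,
  persist: (e: WebhookEvent) => Promise<void>,
): Promise<void> {
  if (seenEventIds.has(event.id)) {
    // Same event delivered twice (e.g. after a timeout-triggered retry):
    // ignore it so we don't create duplicate records.
    return;
  }
  seenEventIds.add(event.id);
  await persist(event);
}
```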
While error-handling is the solution to data integrity issues, robust monitoring is also important. Mistakes happen (often with error-handling logic), and robust monitoring enables engineers to figure out and undo the mistake. This is particularly important with high volume data integrations because an unchecked error could lead to non-trivial data loss, and patches need to be implemented programmatically, taking advantage of a rich monitoring log. This is why we built Event Destinations, which stream integration errors to your own monitoring solution like DataDog, Sentry, etc.
The Batching Question
There are two extremes of transferring data via integrations. The first is streaming, where data is dispatched as it is made available. The second is batching, where data is collected together and dispatched in a single request.
Streaming is ideal for availability. It gets the data from one location to the next in the fastest time. This is often necessary; users expect up-to-date context and content when interfacing with chatbots or AI agents. However, streaming does have downsides; without a robust throttling strategy, streaming requests can fail after crossing a supported threshold. This is because machines can only accept a certain number of concurrent connections (typically in the low thousands). After that threshold is crossed, connections will fail until previous connections are released.
Batching, meanwhile, is ideal for efficiency. Data is collected together and then dispatched in a single request. It minimizes the number of connections at the cost of latency. Batching also makes it easier for developers to implement integrations given fewer connection pooling concerns. However, it creates a different problem: data isn’t available until after the entire collection is batched together. It also tasks intermediary webhooks with keeping abundant data in-memory, which might trip their RAM threshold, crashing the application.
Overall, most companies prefer real-time availability - at least that’s what we’ve observed across our customers. Everyone prioritizes using our integration webhook triggers, even though Paragon also supports 3rd party batch APIs and cron jobs that ping the third party service. However, for those building in-house, efficiency remains a concern given the technical limitations of physical machines.
Some third-party vendors like Salesforce have acknowledged these combined needs and have resorted to a compromise: micro-batching. Micro-batching creates an approximate maximum latency by batching requests together within a set window. Once that window is crossed, the request is made, and a new request is prepared with the next set of data. Alternatively, micro-batching may use a byte-size threshold instead of a set window of time; regardless, it breaks batching into many smaller jobs that still dramatically reduce the total number of requests.
The difficulty of deploying micro-batching is that it isn’t a perfect science. If requests are micro-batched by a time window, then a RAM threshold could still be tripped if a flurry of data is streamed in. Sometimes, that happens because of file sizes—a Google Drive payload may typically be 5MB, but it’s possible that two 5GB files are uploaded at once. The alternative is to micro-batch based on a size threshold, but that could lead to severely delayed requests if the incoming data is too small. Algorithms need to use some combination of time and size to be robust; we have fine-tuned ours to successfully process thousands of requests per second from an origin like Salesforce.
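A toy version of that combined time-and-size approach might look like the following. The thresholds are arbitrary, and this is a simplified sketch of the idea rather than Paragon's production algorithm:

```typescript
// Sketch of a micro-batcher that flushes on whichever limit is hit first:
// a latency ceiling (time window) or a memory ceiling (accumulated bytes).
type Flush = (batch: unknown[]) => Promise<void>;

export class MicroBatcher {
  private batch: unknown[] = [];
  private batchBytes = 0;
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private flush: Flush,
    private maxWaitMs = 2000,      // assumed latency ceiling
    private maxBytes = 5_000_000,  // assumed per-batch memory ceiling
  ) {}

  add(record: unknown): void {
    this.batch.push(record);
    this.batchBytes += Buffer.byteLength(JSON.stringify(record));

    if (this.batchBytes >= this.maxBytes) {
      void this.drain(); // size threshold hit: flush immediately
    } else if (!this.timer) {
      // First record in a new batch: start the time window.
      this.timer = setTimeout(() => void this.drain(), this.maxWaitMs);
    }
  }

  private async drain(): Promise<void> {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.batch.length === 0) return;
    const toSend = this.batch;
    this.batch = [];
    this.batchBytes = 0;
    await this.flush(toSend);
  }
}
```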
Scaling
Because webhooks sometimes have to handle abundant traffic, they often need to be parallelized across instances to effectively scale. Otherwise, a single machine would max out on available connections. Additionally, relying on a single deployment creates a single point of failure should it break; with high volume, this could lead to major data loss.
There are new challenges introduced with parallelization, however. Webhook instance code does change, and if those changes aren’t deployed to every instance simultaneously, different instances may process the same data differently, creating non-deterministic incongruities.
Security
Webhooks create a unique security hazard because sensitive data that is typically encrypted sits unencrypted in memory while it is processed. Plaintext data is essential for any transformations, but it is also vulnerable.
To curb the attack vector, webhooks need to restrict network connections to only trusted endpoints and be protected by proper access controls. Otherwise, unauthorized access could lead to a serious data breach. They should also implement HMAC (Hash-based Message Authentication Codes), where messages are signed using a secret key by the provider, enabling the listener to verify if the request is legitimate.
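A minimal sketch of that HMAC check using Node's built-in crypto module; the header name and hex encoding here are placeholders, since every provider documents its own signing scheme:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch: verify that a webhook payload was actually signed by the provider.
// Recompute the HMAC over the raw request body with the shared secret and
// compare it against the signature the provider sent in a header.
export function isValidSignature(
  rawBody: string,
  signatureHeader: string, // e.g. value of an assumed "x-signature" header
  sharedSecret: string,
): boolean {
  const expected = createHmac("sha256", sharedSecret)
    .update(rawBody)
    .digest("hex");

  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws if lengths differ, so check first, and compare
  // in constant time to avoid leaking information via timing attacks.
  return a.length === b.length && timingSafeEqual(a, b);
}
```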
However, it is not just external attackers to worry about. With today’s consumer data privacy regulations (e.g. GDPR and CCPA) and corporate compliance standards (e.g. SOC 2), it’s important that employees do not have unauthorized access to sensitive data. Yet engineers with the ability to access webhooks do have technical access to that data; this means that webhook maintenance needs to follow compliance standards to ensure only trusted employees working on trusted machines get limited access.
It is worth noting that there is ongoing research on applying homomorphic encryption, where data processing is done on encrypted data, to webhooks. However, homomorphic encryption is extraordinarily CPU-intensive and isn’t ready for production deployment. If CPUs or homomorphic encryption techniques make dramatic improvements in efficiency, it may eventually become a viable solution.
Velocity
Integration code often needs changes because APIs evolve and data requirements grow. However, engineers need to weigh several considerations whenever pushing a change to webhook code. Will the change impact data integrity? If there is an integrity issue, is there a way to easily address it? And does the change create any security issues?
These considerations often hamper the velocity of shipping changes to integrations, a problem that only compounds whenever multiple integrations in the same class (e.g. Salesforce, Hubspot, and Zoho) need a single change.
Enter LLMs + RAG
To add even more complexity into the mix, LLMs, RAG, and vector databases/knowledge graphs have taken over the development world. Their popularity has made it even more important for applications to pull incredible swaths of contextual data to include in LLM prompts through a RAG pipeline. For companies building these RAG pipelines, a simple sync isn’t enough; data needs to be pulled from significantly more sources and can exist across formats, including unstructured data such as videos, PDFs, and other documents.
For example, a regular sales SaaS platform generally needs to sync Contacts, Companies, and Deals with their customers’ CRMs. However, AI sales applications are now racing to not only sync those standard objects, but also ingest any emails, call transcripts, and internal notes that may be related to those records, in addition to each piece of data’s permissions metadata.
This introduces a few new challenges. PII needs to be redacted, and permissions metadata needs to be pulled and associated with every piece of data. Not only is implementing all of this complex, it also calls for a CPU-expensive process and can substantially strain a server’s RAM. That’s why many enterprise companies like AI21 and Pryon are using Paragon as their ingestion engine for RAG. It’s not worth building the complex infrastructure and dozens of integrations - instead, they’re able to focus on their core competencies (knowledge graph design and proprietary LLMs).
Closing Thoughts
High-volume data ingestion from integrations creates difficult problems for developers and for an application’s stability. While webhooks solve many of the underlying problems with raw API requests, such as rate limits and concurrency, they introduce their own reliability issues given how unpredictable inbound data can be. However, with the right real-time ingestion infrastructure, data can be ingested efficiently and reliably.