Build vs Buy: Data Sync
Learn what it's like to build data synchronization in-house versus with Paragon's platform
We started this “Build vs Buy” series because we wanted to open the curtain on what it’s like to build integrations in-house. In our previous “Build vs Buy” articles we:
Started with authentication - registering our application with each integration provider, building handlers for the OAuth process, and testing authentication with a simple API request
Moved to an actual use case where we built data ingestion pipelines - paginated across all the files in a user's drive, extracted the contents from those files, and piped the file data into our own database
In those previous articles, we took a first-hand approach to show the difficulties involved with integrating with a 3rd-party provider. We worked through different authentication implementations, learned different API behaviors, and navigated a mixed bag of documentation quality.
In this article, we built out a two-way, real-time data sync between our ParaHack application and our users' Salesforce data. We picked Salesforce because it is the most popular integration built on Paragon, and bidirectional sync is the most common use case among those Salesforce implementations. This involves:
An initial data ingestion process, pulling all current Salesforce contacts into ParaHack
Real-time updates to our application whenever a new Salesforce contact is created or updated
The ability to push contacts created in ParaHack to our users’ Salesforce
As a result, ParaHack stays continuously in sync with our users' Salesforce contacts, and vice versa.

Rather than following the same pattern as the previous two articles where I walked through difficulties of implementation, I wanted this article to be more high level. I still built out data synchronization for ParaHack from scratch, but this time I’ll walk through design decisions I had to make while building in-house versus Paragon’s out-of-the-box implementation, specifically:
Polling vs webhook pushes with the Salesforce API
Handling multiple tenants' Salesforce credentials
Working with the Salesforce API's payload limits and rate limits
Scaling concurrent jobs involved in the data synchronization use case
Push vs Poll: Finding the best solution with Salesforce’s API
The first design pattern I looked into for data synchronization with Salesforce is how to keep data in sync via their API. I found it best to do some research on which APIs Salesforce had available before architecting a solution.

Real-time data pushes
Deciding on an endpoint to push data to Salesforce from ParaHack was a no-brainer: I had ParaHack send a POST request to create a contact object in Salesforce whenever a user creates a contact. Pulling all data was also pretty straightforward. I chose to run a SELECT FIELDS(STANDARD) FROM Contact query using SOQL, Salesforce's query language, which also supports partial results (pagination) to avoid extremely large payloads.
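To make those two calls concrete, here's a minimal TypeScript sketch of the push and the paginated pull. The endpoint paths are Salesforce's standard REST routes; the helper names and contact shape are illustrative assumptions rather than ParaHack's actual code.

```typescript
type SalesforceAuth = { instanceUrl: string; accessToken: string };

// Push: create a contact in the user's Salesforce whenever one is created in ParaHack
async function pushContact(
  auth: SalesforceAuth,
  contact: { FirstName: string; LastName: string; Email: string }
) {
  const res = await fetch(`${auth.instanceUrl}/services/data/v62.0/sobjects/Contact`, {
    method: "POST",
    headers: { Authorization: `Bearer ${auth.accessToken}`, "Content-Type": "application/json" },
    body: JSON.stringify(contact),
  });
  if (!res.ok) throw new Error(`Salesforce create failed: ${res.status}`);
  return res.json(); // { id, success, errors }
}

// Pull: page through every contact using SOQL's partial results (nextRecordsUrl)
async function pullAllContacts(auth: SalesforceAuth) {
  const contacts: Record<string, unknown>[] = [];
  let url: string | undefined = `${auth.instanceUrl}/services/data/v62.0/query/?q=${encodeURIComponent(
    "SELECT FIELDS(STANDARD) FROM Contact"
  )}`;
  while (url) {
    const res = await fetch(url, { headers: { Authorization: `Bearer ${auth.accessToken}` } });
    const page = await res.json();
    contacts.push(...page.records);
    // Salesforce keeps returning nextRecordsUrl until the full result set is consumed
    url = page.nextRecordsUrl ? `${auth.instanceUrl}${page.nextRecordsUrl}` : undefined;
  }
  return contacts;
}
```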
Real-time data pulls
Where it got a bit harder to decide on a pattern was for real-time data pulls. While I personally would have liked a more event-driven pattern where an event is sent whenever a contact is created in Salesforce, I didn’t find a clean method that supported this in Salesforce's API.
Two methods I considered that used an event-driven pattern:
Building an application using Apex (a Salesforce specific Java-like language) deployed on Salesforce’s Lightning Platform
Using “Outbound Messages,” a feature in the Salesforce UI where you can configure messages to be sent to an API endpoint
I decided against Apex because I didn't feel like learning an entirely new Salesforce-specific language, and I wasn't sure Apex could support a multi-tenant application where the Apex code would need to interact with each of my users' Salesforce accounts.
The “Outbound Messages” feature seemed promising but I couldn’t find a way to programmatically configure this for each of my users. I only found documentation on how to do this via the Salesforce dashboard.

After my research, I opted for a polling pattern, querying for data updates on a cadenced schedule. I found a Salesforce endpoint that let me query for updates to my users' Salesforce contacts given start and end timestamps. Although it would have been nice if the API returned the updated data immediately, I could pass the returned contact IDs to another Salesforce API endpoint that returns the contact data.
With Paragon
Paragon has abstractions for pushing created contacts and pulling all contacts, making it easier to use the Salesforce API and see what's possible. Paragon works with many customers on their Salesforce integrations and has curated a list of popular "actions" to make the Salesforce API easier to work with.

Where Paragon's abstractions really shine for data synchronization is its webhook triggers. In my experience building without Paragon, there wasn't an easy way to use Salesforce webhooks purely through their API. With Paragon's webhook triggers, I had the option to use webhooks, architect a more event-driven system, and avoid polling, which has downsides like the many empty trips to the Salesforce API.

Multi-tenancy: what Salesforce credentials and configurations to store
My general rule for integrations is to keep it simple and store only what’s needed.
Following this rule when building ParaHack, I decided to keep Salesforce credentials at a user level as each user can request Salesforce access and refresh tokens. I also needed to keep track of their instance URL (unique Salesforce domain), as API calls to Salesforce use that unique domain to discern between organizations.
For the data sync use case, I also added a "sync" flag to signal which Salesforce accounts have data syncs enabled, since we want users to be able to toggle whether their Salesforce contacts are synced. When polling for Salesforce updates, I only poll the domains WHERE sync = true.
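As a rough illustration of that "store only what's needed" rule, here's the shape ParaHack keeps per user. The table and field names (salesforce_connections, sync) are assumptions for this sketch, not an official schema.

```typescript
// Hypothetical shape of the per-user Salesforce credentials ParaHack stores:
// just enough to call the API for that user, plus the sync toggle.
interface SalesforceConnection {
  userId: string;       // ParaHack user this connection belongs to
  instanceUrl: string;  // unique Salesforce domain, e.g. https://acme.my.salesforce.com
  accessToken: string;  // short-lived token sent on each API call
  refreshToken: string; // used to mint new access tokens when they expire
  sync: boolean;        // whether this account opted into contact syncing
}

// Example query the poller can use: only connections that opted in
// (assumes a Postgres table named salesforce_connections)
const POLL_TARGETS_SQL =
  "SELECT DISTINCT instance_url FROM salesforce_connections WHERE sync = true";
```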
What made me question if this minimalist approach to storing credentials was enough is this note from the Salesforce docs:
“Each connected app allows five unique approvals per user. After a fifth approval is made, the oldest approval is revoked. For OAuth 1.x, each issued access token counts as an approval and is listed as a separate entry in the table. For OAuth 2.0, the table lists each refresh token that counts as an approval. Other flows, such as user-agent flows, might also count as approvals. For consumers that use connected apps, avoid requesting OAuth 1.x access tokens or OAuth 2.0 refresh tokens more than once for each device. That way the limit of five unique approvals doesn’t impact your org.”
Because of this, it may be more robust for ParaHack to keep track of tokens per instance_url (essentially per organization) rather than per user, to avoid issuing too many refresh tokens and having older approvals revoked.
With Paragon
To avoid revisiting too much of our authentication “build vs buy” article, Paragon essentially handles each part of the authentication process, storing your users’ tokens so you don’t have to. When it comes to working around token limits, Paragon’s engineering team has seen many edge cases and tested with many different SaaS companies to iron out the details of token management such as preventing race conditions, handling de-authorization, and working with a variety of error codes and refresh policies. For a more detailed article on authentication challenges for a multi-tenant application, we have an article on that here.
Using Paragon, I didn't need to think about authentication and could just focus on using Paragon's Salesforce actions and the Salesforce API for data synchronization.
API behavior: Polling windows and rate limits
While I mentioned that I wanted to avoid the details of actually working with the Salesforce API, I think it's worth talking about design patterns that handle API behaviors. The first one is polling windows. For ParaHack, I decided on a 5-minute polling window, meaning every 5 minutes I polled every unique Salesforce domain that opted into data syncs.
Here's a simplified version of the serverless background function in ParaHack that used the updated/?start={datetime}&end={datetime} endpoint to poll for updates.
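A minimal sketch of that function is below; it reuses the SalesforceConnection shape from earlier, and the data-access helpers (getConnectionsToSync, upsertContact) are hypothetical stand-ins for ParaHack's persistence layer.

```typescript
// Hypothetical data-access helpers backed by ParaHack's database
declare function getConnectionsToSync(): Promise<SalesforceConnection[]>; // WHERE sync = true
declare function upsertContact(instanceUrl: string, contact: unknown): Promise<void>;

// Runs every 5 minutes. For each opted-in Salesforce org, ask the getUpdated endpoint
// which Contact IDs changed in the window, then retrieve each record's fields.
export async function pollSalesforceUpdates() {
  const windowEnd = new Date();
  const windowStart = new Date(windowEnd.getTime() - 5 * 60 * 1000);

  for (const conn of await getConnectionsToSync()) {
    const base = `${conn.instanceUrl}/services/data/v62.0/sobjects/Contact`;
    const headers = { Authorization: `Bearer ${conn.accessToken}` };

    // 1. The updated endpoint only returns the IDs of contacts changed in the window
    const updatedRes = await fetch(
      `${base}/updated/?start=${windowStart.toISOString()}&end=${windowEnd.toISOString()}`,
      { headers }
    );
    const { ids = [] } = await updatedRes.json();

    // 2. Fetch each contact's data separately and write it into ParaHack
    for (const id of ids as string[]) {
      const contactRes = await fetch(`${base}/${id}`, { headers });
      await upsertContact(conn.instanceUrl, await contactRes.json());
    }
  }
}
```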
This code works fine, but it doesn't take into account the API limits for this endpoint, specifically when too many records are returned. Unlike the SOQL endpoint, pagination isn't supported, so when the record limit is reached the best practice is to shorten the polling window until the results fit under that limit.
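One way to handle that, sketched below under the assumption that an over-limit window comes back as an error response, is to recursively halve the window until each slice fits:

```typescript
// Sketch: if a window contains too many updated records for one getUpdated call,
// split it in half and query each half, recursing until every slice fits.
async function fetchUpdatedIdsChunked(
  conn: SalesforceConnection,
  start: Date,
  end: Date
): Promise<string[]> {
  const res = await fetch(
    `${conn.instanceUrl}/services/data/v62.0/sobjects/Contact/updated/` +
      `?start=${start.toISOString()}&end=${end.toISOString()}`,
    { headers: { Authorization: `Bearer ${conn.accessToken}` } }
  );
  if (res.ok) {
    const body = await res.json();
    return body.ids ?? [];
  }

  // Guard so unrelated errors (expired token, etc.) don't cause endless splitting
  if (end.getTime() - start.getTime() < 60_000) {
    throw new Error(`getUpdated failed for a sub-minute window: ${res.status}`);
  }

  const mid = new Date((start.getTime() + end.getTime()) / 2);
  const [firstHalf, secondHalf] = await Promise.all([
    fetchUpdatedIdsChunked(conn, start, mid),
    fetchUpdatedIdsChunked(conn, mid, end),
  ]);
  return [...firstHalf, ...secondHalf];
}
```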
Another API behavior I didn't account for in ParaHack was Salesforce's API rate limits. While I considered implementing an exponential backoff, rate limiting gets even more complicated with distributed jobs and concurrent requests, since Salesforce has both concurrent API limits and daily API limits.

To work within rate limits, a job queue can ensure that jobs eventually run even when limits are hit, since messages are retained in the queue until they are processed. Queueing also enables horizontal scaling, letting us tune the number of workers per Salesforce organization to stay under Salesforce's concurrent API limits.
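As a sketch of that idea (an in-process stand-in for a real job queue, with assumed limits rather than Salesforce's actual numbers), each org gets a small worker pool and every job retries with exponential backoff:

```typescript
// Assumed limits for illustration only -- tune these against Salesforce's real quotas
const MAX_CONCURRENT_PER_ORG = 5;
const MAX_RETRIES = 5;

// Retry a job with exponential backoff (1s, 2s, 4s, ...) when it fails, e.g. on a rate limit
async function runWithBackoff<T>(job: () => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await job();
    } catch (err) {
      if (attempt >= MAX_RETRIES) throw err;
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
}

// Drain one org's queue with at most MAX_CONCURRENT_PER_ORG jobs in flight,
// so we stay under Salesforce's concurrent request limit for that org
async function drainOrgQueue(jobs: Array<() => Promise<void>>) {
  const workers = Array.from({ length: MAX_CONCURRENT_PER_ORG }, async () => {
    let job: (() => Promise<void>) | undefined;
    while ((job = jobs.shift()) !== undefined) {
      await runWithBackoff(job);
    }
  });
  await Promise.all(workers);
}
```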
For the purposes of this exercise and given my time constraints, I didn't account for the polling API limit or Salesforce's rate limits. These API behaviors would need to be addressed in a production-grade application, taking days of development time and robust testing.
With Paragon
Building with Paragon, I didn't need to worry about polling windows because Paragon enables webhooks, simplifying the process tremendously. Similarly for rate limits, Paragon has a mechanism called Smart Rate Limits, which abstracts the process of working under rate limits within their workflow engine, helping me make sure my application didn't trip Salesforce's rate limits while giving me visibility into which workflows ran, which user they ran for, and individual task statuses.

Concurrency: How to distribute jobs horizontally
Our last design decision for building ParaHack in-house is how to horizontally scale our workload. Each job type (synchronizing updated records vs. the initial data pull) should scale differently, as syncing records is a relatively lightweight process whereas initial data pulls will likely involve handling large volumes of historical data. For ParaHack's data synchronization implementation, I decided on an implementation that scaled workers by tenant.
In the code snippet below, I used a CRON job to trigger a poll for contacts every 5 minutes. In that poll, I queried our Postgres database for the Salesforce organizations that opted into data syncs and called syncSalesforceTask to trigger a task per organization.
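A sketch of that dispatcher, assuming node-postgres for the query and the salesforce_connections table from earlier; syncSalesforceTask is the per-org task named above, whose body would be the polling logic already sketched:

```typescript
import { Pool } from "pg";

// The per-organization sync task; assume its implementation is the polling logic sketched earlier
declare function syncSalesforceTask(instanceUrl: string): Promise<void>;

const pool = new Pool(); // connection settings come from the standard PG* environment variables

// Invoked by the CRON schedule (*/5 * * * *), i.e. every 5 minutes
export async function pollContactsCron() {
  // Only Salesforce organizations that opted into data syncs
  const { rows } = await pool.query(
    "SELECT DISTINCT instance_url FROM salesforce_connections WHERE sync = true"
  );

  // One task per organization, so the work scales horizontally by tenant
  await Promise.all(rows.map((row) => syncSalesforceTask(row.instance_url)));
}
```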
This pattern works well for polling in most cases, as it's unlikely that a large number of contact updates will arrive in a 5-minute window (although it's possible if a user decides to bulk upload Salesforce contacts).
While I didn't implement a different pattern for the initial data ingestion use case, as I wanted to focus on data synchronization, this pattern would not work well for an initial data pull. Imagine a user newly integrates their Salesforce with ParaHack. That first data pull may import tens of thousands of contacts, especially if the user is part of an enterprise or their Salesforce is decades old (recall the GET /services/data/v62.0/query/?q=SELECT+FIELDS(STANDARD)+from+Contact endpoint).
In this case, distributing jobs by partial result or by time window - essentially splitting a large ingestion job into chunks that can be picked up by individual workers concurrently - would be a better solution.
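A sketch of the time-window approach, assuming a hypothetical enqueueIngestionJob that hands each chunk to a worker queue: split the backfill into monthly CreatedDate windows and enqueue a SOQL query per window.

```typescript
// Hypothetical: hands one chunk of the backfill to whatever job queue the app uses
declare function enqueueIngestionJob(instanceUrl: string, soql: string): Promise<void>;

// SOQL datetime literals are unquoted ISO 8601 without milliseconds, e.g. 2015-01-01T00:00:00Z
const toSoqlDateTime = (d: Date) => d.toISOString().split(".")[0] + "Z";

// Split an initial ingestion into monthly CreatedDate windows so workers can run chunks concurrently
async function enqueueInitialIngestion(instanceUrl: string, oldestRecord: Date) {
  let windowStart = new Date(oldestRecord);
  const now = new Date();

  while (windowStart < now) {
    const windowEnd = new Date(windowStart);
    windowEnd.setMonth(windowEnd.getMonth() + 1);

    const soql =
      "SELECT FIELDS(STANDARD) FROM Contact " +
      `WHERE CreatedDate >= ${toSoqlDateTime(windowStart)} ` +
      `AND CreatedDate < ${toSoqlDateTime(windowEnd)}`;
    await enqueueIngestionJob(instanceUrl, soql);

    windowStart = windowEnd;
  }
}
```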
With Paragon
Paragon was built with multi-tenant applications in mind. Paragon's workflow engine allows workflows to be triggered by multiple users concurrently, and further allows for parallelization within workflows. In the example below, I used a "Fan Out" step to parallelize by contact, meaning each contact enters a queue and is picked up by concurrent workers that perform a data transformation step and a subsequent API call.

Instead of managing clusters, microservices, and scaling groups for large data ingestion and synchronization jobs, Paragon made it easy for me to use their infrastructure, which offers different ways to scale concurrent tasks based on the use case at hand - data sync vs. data ingestion.
Paragon: An Integration Platform and Partner
In the previous "Build vs Buy" articles, I focused on how Paragon's platform provides abstractions that make it easier for product and engineering teams to build useful integrations quickly and reliably. What I hoped to show in this article is how these abstractions are built thoughtfully and how they allow our users to design performant systems that can handle tough jobs like data ingestion and synchronization.
"Buying" Paragon means getting the Paragon platform AND working with an integration domain expert that has thought deeply about production-ready integrations and how to help our customers scale (our pricing is based on the number of users who use the integrations you've built on Paragon).

If you're considering "buying" or would just like to learn more, check out our blog, try Paragon with a free trial, or get in touch for a demo.
Jack Mu, Developer Advocate