
Build vs Buy: Data Ingestion

Learn what it's like building data ingestion for Google Drive in-house versus with Paragon

In the last chapter, we kicked off an integration build vs buy hackathon, starting with establishing authentication. We registered our external application (ParaHack) with our integration providers, built a few routes and handlers in our application to navigate the OAuth process, and retrieved our third-party tokens to perform some simple API requests.

As useful as it is to send “yello” into a Slack channel, today’s article will focus on a data ingestion use case. I chose this use case because it’s one of the most common ones built with Paragon (think RAG, search, and analytics applications) and because it’s a lot more interesting than a one-off API request.

With that said, let’s walk through the process for building data ingestion into ParaHack and compare the experience of building in-house vs with Paragon!

Building In-House: Project Plan

For this exercise, we built a data ingestion process that extracts the contents of every file in a user’s Google Drive and stores the text in our own database. We also built a simple text search to verify that all the file data was loaded correctly into ParaHack’s database.

In the last chapter, I already built the handlers and routes for OAuth, so I used that existing functionality to get the right access tokens for Google Drive’s API, along with the credentials database and token refresh code from last time. The new components we needed for data ingestion were:

  1. A database to store file contents (a sketch of the storage helpers is shown after this list)

  2. Some handlers for working with Google Drive’s list and get files API endpoints.
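
For reference, here’s a minimal sketch of what the storage side could look like with node-postgres. The initializePool and insertRecord helpers referenced later aren’t shown in full in this article, so treat the drive_files table, its columns, and the module-level pool here as illustrative assumptions rather than ParaHack’s exact schema.

import { Pool } from "pg";

// Illustrative shape of an ingested file record (mirrors the FileRecord type used below)
export interface FileRecord {
    id: string;        // Google Drive file ID
    fileName: string;
    mimeType: string;
    link: string;      // webViewLink, used to send users back to the file
    content: string;   // exported plain text / CSV contents
    email: string;     // which connected user the file belongs to
}

// Module-level pool so helpers like insertRecord can share one connection pool
let pool: Pool;

export const initializePool = (user: string, password: string, host: string, port: number, database: string): Pool => {
    pool = new Pool({ user, password, host, port, database });
    return pool;
};

// Upsert a file record into an assumed drive_files table
export const insertRecord = async (record: FileRecord): Promise<boolean> => {
    const result = await pool.query(
        `INSERT INTO drive_files (id, file_name, mime_type, link, content, email)
         VALUES ($1, $2, $3, $4, $5, $6)
         ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content, link = EXCLUDED.link`,
        [record.id, record.fileName, record.mimeType, record.link, record.content, record.email]
    );
    return (result.rowCount ?? 0) > 0;
};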

Another feature we wanted was the ability to run the data ingestion job in the background of our application. In production, ParaHack could potentially be extracting and storing data for thousands or tens of thousands of files, so a long-running background job was better suited for this use case than a synchronous job or a job that runs on our web backend.

In summary, the steps to build data ingestion for ParaHack are:

  1. Establish a connection and pull file data from Google Drive using its API

  2. Deploy the data ingestion process to background jobs

  3. Test that data ingestion was successful via a simple text search feature

The Building In-House Process

Step 1: Working with Google Drive’s list and get file APIs

Since we had all of our auth taken care of, I was excited to jump right into listing all the files and extracting the file contents with Google Drive’s API. I found the right endpoints in Google’s documentation (a bare-bones sketch of both calls follows the list):

  • files.list for listing all the files in a user’s Drive

  • files.get (and files.export) for getting a file’s metadata and exporting its contents to a text format
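
Before getting into the complications below, here’s the bare-bones shape of both calls. This is a simplified sketch with no error handling or pagination, and accessToken is assumed to come from the OAuth flow built in the last chapter.

const listFiles = async (accessToken: string) => {
    // files.list: returns { files: [...], nextPageToken?: string }
    const res = await fetch("https://www.googleapis.com/drive/v3/files?pageSize=20", {
        headers: { Authorization: `Bearer ${accessToken}` }
    });
    return res.json();
};

const exportFileAsText = async (accessToken: string, fileId: string) => {
    // files.export: converts a Google Doc/Sheet into a plain text body
    const res = await fetch(
        `https://www.googleapis.com/drive/v3/files/${fileId}/export?mimeType=text/plain`,
        { headers: { Authorization: `Bearer ${accessToken}` } }
    );
    return res.text();
};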

Step 1.1: Fix my authentication errors

OK, it wasn’t that easy. It turns out the Google Drive scope I requested in the last article wasn’t the right one. Who would have thought that “https://www.googleapis.com/auth/drive.file” only grants access to files created or opened by our app, whereas the “https://www.googleapis.com/auth/drive” scope grants full permission to view a user’s files? After debugging for around an hour, I found this Stack Overflow response that did the trick.
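
For context, the fix boiled down to the scope string passed when building the Google consent URL. The snippet below is a rough sketch of how that looks; the variable and environment variable names are illustrative, not the exact code from the last chapter.

// "https://www.googleapis.com/auth/drive.file" only covers files created or opened by
// the app; the broader "https://www.googleapis.com/auth/drive" scope is what lets
// ParaHack read every file in the user's Drive.
const GOOGLE_DRIVE_SCOPE = "https://www.googleapis.com/auth/drive";

const consentUrl = "https://accounts.google.com/o/oauth2/v2/auth?" + new URLSearchParams({
    client_id: process.env.GOOGLE_CLIENT_ID ?? "",
    redirect_uri: process.env.GOOGLE_REDIRECT_URI ?? "",
    response_type: "code",
    access_type: "offline",   // required to receive a refresh token
    prompt: "consent",
    scope: GOOGLE_DRIVE_SCOPE
}).toString();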

One thing I’ve come to learn from building integrations in-house is that you’re very much at the mercy of the quality of the integration provider’s docs and error messages, especially when it comes to authentication and scopes.

Step 1.2: Working with Google Drive’s list and get file API (For real this time)

Just as I was starting to feel good and ready to work with the API, I again ran into issues with unclear documentation. This time, while I was getting data back from the files.get endpoint, I couldn’t get the “web view link” I needed to redirect users to their file after they search for it.

It was only after finding this GitHub issue that I learned about the “fields” query parameter, which has to be set for the response to include extra properties of the “file” object. I’m not sure how anyone was supposed to know that from the docs.
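
Concretely, the difference is just the fields query parameter on the files.get request. Here’s a minimal sketch; the function name and parameters are illustrative, and the full implementation below ends up requesting fields=* instead.

const getWebViewLink = async (accessToken: string, fileId: string): Promise<string> => {
    // Without "fields", files.get only returns kind, id, name, and mimeType,
    // so webViewLink never shows up in the response.
    const params = new URLSearchParams({
        fields: "id,name,mimeType,webViewLink" // or "*" to return every file property
    }).toString();

    const res = await fetch(`https://www.googleapis.com/drive/v3/files/${fileId}?` + params, {
        headers: { Authorization: `Bearer ${accessToken}` }
    });
    const metadata = await res.json();
    return metadata.webViewLink;
};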

In this exercise, the times I was most stuck were when working with the Google Drive API, not so much the actual implementation in my own application. There’s a lot of documentation reading and searching forums for errors just for one feature within one integration. Building data ingestion for other integration providers would require familiarity with a whole new set of documentation and forums, and that work scales linearly with every integration you add.

Anyway, with that all squared away, I implemented a function that calls the files.get endpoint to get the “web view link” and the files.export endpoint to get the file contents in a plain text or CSV format, handling different file types.

export const getFileContents = async (file: DriveFile, email: string): Promise<FileRecord | null> => {
    // Skip file types we can't export to text (e.g. images, videos)
    if (!acceptedFiles.includes(file.mimeType)) return null;

    // Grab the latest access token for this user's Drive connection
    const driveCreds = await getLatestDriveCredential(email);
    const headers = new Headers();
    headers.append("Content-Type", "application/json");
    headers.append("Authorization", "Bearer " + driveCreds[0].access_token);

    // Spreadsheets export to CSV, everything else to plain text
    const contentParams = new URLSearchParams({
        mimeType: file.mimeType === "application/vnd.google-apps.spreadsheet" ? "text/csv" : "text/plain"
    }).toString();
    // Request every file property so the response includes webViewLink
    const metadataParams = new URLSearchParams({
        fields: "*"
    }).toString();

    // files.export returns the file body; files.get returns the metadata
    const fileContent = await fetch(`https://www.googleapis.com/drive/v3/files/${file.id}/export?` + contentParams, {
        method: "GET",
        headers: headers
    });
    const fileMetadata = await fetch(`https://www.googleapis.com/drive/v3/files/${file.id}?` + metadataParams, {
        method: "GET",
        headers: headers
    });

    const record: FileRecord = { content: "", fileName: file.name, id: file.id, link: "", mimeType: file.mimeType, email: email };
    record.content = await fileContent.text();

    const metadata = await fileMetadata.json();
    record.link = metadata.webViewLink;

    return record;
}

Step 1.3: Handling pagination

For Drives with a large number of files, it’s rare that a single “files.list” call will do the trick. I needed to implement methods to handle pagination, while keeping in mind that authentication and potential token refreshes may be needed along the way.

export const iteratePages = async (googleResponse: GoogleResponse, email: string): Promise<boolean> => {
    let successfulResponse = true;

    // Extract and store the contents of every file on the current page
    for (const file of googleResponse.files) {
        let content = await getFileContents(file, email);
        if (content !== null) {
            let record = await insertRecord(content);
            if (!record) {
                console.log("[DRIVE FILE INGESTION] " + file.name + " unable to be processed");
            }
        }
    }
    // If Google returned a nextPageToken, fetch the next page and recurse
    if (googleResponse.nextPageToken) {
        const driveCreds = await getLatestDriveCredential(email);
        const headers = new Headers();
        headers.append("Content-Type", "application/json");
        headers.append("Authorization", "Bearer " + driveCreds[0].access_token);

        const params = new URLSearchParams({
            pageSize: "20",
            pageToken: googleResponse.nextPageToken
        }).toString();
        const newGoogleResponse = await fetch("https://www.googleapis.com/drive/v3/files?" + params, {
            method: "GET",
            headers: headers
        });

        if (newGoogleResponse.status !== 200) {
            successfulResponse = false;
        } else {
            const body = await newGoogleResponse.json();
            successfulResponse = successfulResponse && await iteratePages(body, email);
        }
    }
    return successfulResponse;
}

Here, we called the getFileContents method from above on each file and then recursively called iteratePages whenever the response contained a “nextPageToken.” I included this section partly because I think it’s an effective way of handling pagination and partly because I love using recursion in real life.

Step 2: Put ingestion code into a background job

Data ingestion can be quite a time- and compute-intensive job, which is why it’s better suited to an asynchronous, long-running background job. A teammate at Paragon recently put me on to trigger.dev, a serverless platform for background jobs. I think of it as a Lambda function for background jobs. We aren’t partnered with them, but I really did enjoy the developer experience of writing and calling background jobs (called tasks on trigger.dev’s platform).

export const digestFiles = task({
  id: "Drive-File-Ingestion",
  // Set an optional maxDuration to prevent tasks from running indefinitely
  maxDuration: 1800,
  run: async (payload: any, { ctx }) => {
    // The task receives the user's email and Drive access token in its payload
    const headers = new Headers();
    headers.append("Content-Type", "application/json");
    headers.append("Authorization", "Bearer " + payload.token);

    // Fetch the first page of files; iteratePages handles the rest
    const params = new URLSearchParams({
      pageSize: "20",
    }).toString();
    const googleResponse = await fetch("https://www.googleapis.com/drive/v3/files?" + params, {
      method: "GET",
      headers: headers
    });
    const body = await googleResponse.json();
    console.log(body);

    // Pull database credentials from trigger.dev's environment variables
    const pgUser = await envvars.retrieve("proj_ref", "dev", "PGUSER");
    const pgPassword = await envvars.retrieve("proj_ref", "dev", "PGPASSWORD");
    const pgHost = await envvars.retrieve("proj_ref", "dev", "PGHOST");
    const pgPort = await envvars.retrieve("proj_ref", "dev", "PGPORT");
    const pgDatabase = await envvars.retrieve("proj_ref", "dev", "PGDATABASE");

    // Initialize the Postgres connection pool for storing file records
    initializePool(pgUser.value, pgPassword.value, pgHost.value, Number(pgPort.value), pgDatabase.value);

    const successful = await iteratePages(body, payload.email);

    return {
      success: successful
    }
  }
});

It was easy to write a task that called the Google Drive API recursively to list and get file contents and insert them into a Postgres database. I could even use trigger.dev’s cloud dashboard to monitor individual task runs.
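
For completeness, here’s roughly how the ingestion task gets kicked off from ParaHack’s backend once a user connects Google Drive. This is a sketch based on trigger.dev’s v3 SDK; the import path for the task file and the payload shape are my own assumptions.

import { tasks } from "@trigger.dev/sdk/v3";
import type { digestFiles } from "@/trigger/drive"; // wherever the task above is defined

export const startDriveIngestion = async (email: string, accessToken: string) => {
    // Queue a run of the "Drive-File-Ingestion" task with the user's email and token
    const handle = await tasks.trigger<typeof digestFiles>("Drive-File-Ingestion", {
        email: email,
        token: accessToken
    });
    // The run id can be stored to look up the run later in trigger.dev's dashboard
    return handle.id;
};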

Step 3: Testing

We finished building the ingestion code and deployed it to a background service. In total, it took around 24 working hours to build a simple data ingestion job: finding the correct Google Drive APIs, handling pagination, creating background jobs, and of course, debugging all the errors. I then tested that ParaHack had indeed ingested all the file data from Google Drive with a text search feature that searches across the contents of every file.

This time, it’s a bit more impressive than just listing files or saying “yello” in Slack. We were able to build text search: essentially a “Ctrl+F” for your Google Drive.
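
The search itself stays simple. Here’s a sketch of the kind of query behind it, assuming the drive_files table and module-level pool from the earlier storage sketch; a plain ILIKE match is plenty at this scale, though Postgres full-text search would also work.

// Naive "Ctrl+F across your Drive": case-insensitive substring match over the
// ingested contents, returning the file name and webViewLink for redirecting the user.
export const searchFiles = async (email: string, query: string) => {
    const result = await pool.query(
        `SELECT file_name, link
           FROM drive_files
          WHERE email = $1
            AND content ILIKE '%' || $2 || '%'
          LIMIT 20`,
        [email, query]
    );
    return result.rows;
};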

The Building with Paragon Process

Step 1: Build a workflow

With Paragon, rather than going through steps 1.1 - 1.3, I could quickly define and build a Google Drive data ingestion workflow on the Paragon platform. Inside a workflow, I used actions that Paragon abstracts, like the “Get Files” action, which handles pagination so developers don’t need to worry about recursion.

I also used “conditional” steps for branching logic, “HTTP” steps to send data back to our backend service, “fan outs” to enable concurrent workers across files, and “Google Drive Request” steps where I was able to use any Drive API endpoint with auth already managed.

Similar to trigger.dev, Paragon has built-in monitoring, so I could track workflow runs and even see step data and errors within a run.

Workflows on Paragon’s workflow engine are background jobs with built-in serverless infrastructure, parallel workers, queueing, and retries. I didn’t even need to lift a finger for step 2 of our in-house process.

Build vs Buy: Race!

Lastly, to test our Paragon-built data ingestion pipeline, we raced our built-in-house ParaHack data ingestion against our Paragon-built data ingestion. While it’s a bit unfair to compare a quickly built, hackathon-style project with a mature platform trusted by 150+ SaaS companies, my dad always said that life isn’t fair. So let’s do it.

We set up a Google Drive with 100 files and triggered ingestion jobs with both ParaHack and Paragon.

The result

Paragon ingested and loaded the files into our database in less than a minute, versus the three and a half minutes it took our built-in-house ParaHack to complete.

Looking at time to build, building data ingestion from scratch took around 24 hours all in all (spread out across 4 days). We built the skeleton of a scalable data ingestion pattern that uses Google Drive’s APIs, paginates across all the files in a user’s Drive, and offloads the ingestion work to long-running background jobs. It was fun to build in-house, and like any self-built application, I could control the details of my implementation (page size, when to parallelize tasks, how to refresh tokens).

When building with Paragon, many of those details are managed by our product. This let me build faster while knowing that Paragon’s engineering team uses best practices to handle the implementation details I mentioned above, as well as details I didn’t even think of, like rate limits and mid-process failures.

My 24-hour hackathon data ingestion feature was built in just 2-3 hours with Paragon. And that’s not even taking into account how much more robust and production-ready my Paragon data ingestion is compared to my hackathon version.

Verdict

From my experience building data ingestion from scratch without Paragon, I would say that if your product and engineering team needs to control your own implementation of data ingestion and can dedicate resources to building it across the integrations your customers want (Google Drive, OneDrive, Box, etc.), then building in-house can be the right choice for your team.

When you want to build data ingestion across integrations reliably and quickly, when you care more about building the logic that uses that data in your core SaaS product, or when you want a tried-and-true platform that has already been battle-tested in production, your team should consider Paragon.

That’s a wrap for this build vs buy article focusing on data ingestion! You may have noticed that this article didn’t mention how to stay on top of data updates (such as when your user uploads a new file or updates an existing file). That’s a whole other topic that we’ll dive into in a future chapter.

In the meantime, if you’d like to learn more about Paragon, check out a higher-level build vs buy article, learn how Paragon fits into your product’s tech stack, or go through our docs to learn more about how the platform works.

Jack Mu, Developer Advocate
