How to Optimize Oracle API Data Extraction to Avoid Gateway Timeout?

This topic was automatically generated from Slack. You can find the original thread here.

We have to extract 13 lakh (1.3 million) records from the Oracle API. Using the snippet below, we are able to extract only about 33k records before the workflow hits a gateway timeout (current timeout: 750 seconds).

Please suggest optimisations or a better approach to achieve this.

What is in the code below:

  1. We use a paginated approach, fetching only 500 records (0-499) per request.
  2. The workflow timeout is set to 750 seconds and the memory to 256 MB.
  3. The hasMore field in the API response returns true if more records are available, which is why we loop with while.

```javascript
import { axios } from "@pipedream/platform"

export default defineComponent({
  async run({ steps, $ }) {
    const baseURL = "https://oracle-api**********"
    const api = "/projectcost"
    const limit = 499
    const onlyData = true
    const totalResults = true
    const fields = "ABC,DEF,GHI,JKL"
    const url = baseURL + api + "?fields=" + fields + "&limit=" + limit + "&onlyData=" + onlyData + "&totalResults=" + totalResults

    let offset = 0
    let index = 0
    let hasMore = true
    let items = []

    // Keep requesting pages until the API reports there are no more records
    while (hasMore) {
      offset = index * limit
      const response = await axios($, {
        url: url + "&offset=" + offset,
        method: "get",
        auth: {
          username: "**********",
          password: "**********",
        },
      })
      items = items.concat(response.items)
      hasMore = response.hasMore
      index = index + 1
    }

    return items
  },
})
```

Given the challenge you're facing with extracting a large number of records from the Oracle API and encountering gateway timeouts, here are some optimizations and approaches you might consider:

1. Increase Workflow Memory and Timeout: Since you're dealing with a large dataset, the first step would be to increase the available memory for your workflow to the maximum allowed by Pipedream, which is 2 gigabytes. This is especially important for data-intensive operations. Additionally, consider increasing the timeout for your workflow if possible. This can be adjusted in the workflow settings.
2. Implement Efficient Pagination: Ensure that your pagination logic is efficient. You're already using a paginated approach, which is great. However, consider checking if the Oracle API supports server-side filtering or sorting that could help you retrieve records more efficiently. Also, ensure that you're correctly updating the offset for each request to avoid processing the same records multiple times.
3. Parallel Requests: If the Oracle API allows it and your rate limits permit, you could make parallel requests to fetch different segments of data concurrently. This approach can significantly reduce the overall time required to fetch all records. However, be mindful of the API's rate limits and concurrency limits to avoid being throttled or banned.
4. Error Handling and Retry Logic: Implement robust error handling and retry logic to deal with timeouts or other transient errors. This can include retrying failed requests with exponential backoff or skipping problematic records after a certain number of retries (see the sketch just below).
5. Use External Storage: Given the large volume of data, consider storing the fetched records in an external database or storage service periodically during the extraction process. This can help prevent data loss in case of errors and reduce memory usage within the workflow.
6. Review API Limits and Optimize Calls: Review the Oracle API documentation for any limits on the number of records returned per request or other relevant limits. Adjust your limit parameter if needed to optimize the number of API calls required.
7. Monitor and Optimize: Continuously monitor the performance of your workflow and optimize based on the bottlenecks you identify. This could involve adjusting the batch size, optimizing the processing logic, or other changes based on the specific challenges you encounter.

Here's a revised snippet with a placeholder for parallel requests and error handling (note: this is a conceptual example and may require adjustments based on the Oracle API specifics and Pipedream's capabilities):
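
As an illustration of point 4 above, here is a minimal retry-with-backoff sketch. It is not part of the original reply: the helper name and delay values are arbitrary, and it simply wraps the same @pipedream/platform axios call used in the original snippet.

```javascript
import { axios } from "@pipedream/platform"

// Illustrative helper: retry a request with exponential backoff before giving up.
async function fetchPageWithRetry($, config, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await axios($, config)
    } catch (err) {
      if (attempt === maxRetries) throw err
      // Wait 1s, 2s, 4s, ... between attempts
      const delayMs = 1000 * 2 ** attempt
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
}
```

Each page request in the while loop could then go through this helper instead of calling axios directly.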

Please share the revised snippet you mentioned above.

It seems like you’re asking for a revised code snippet, but I don’t have the original context or details about what specific changes or improvements you’re looking for in the snippet. Could you please provide more information about the original code snippet and the specific revisions or functionality you’re seeking? This will help me provide you with the most accurate and helpful response.

fyi

Check the above discussion and share the revised snippet you mentioned.

It appears that you’re asking for a revised code snippet based on a previous discussion. However, without specific details about the original code or the requested revisions, I’m unable to provide an accurate revised snippet directly. Could you please provide more details about the original code snippet and the specific changes or improvements you’re looking for? This information will help me understand your requirements better and provide a more accurate and helpful response.

We have to extract 13 lakh (1.3 million) records from the Oracle API. Using the snippet below, we are able to extract only about 33k records before the workflow hits a gateway timeout (current timeout: 750 seconds).

Please suggest optimisations or a better approach to achieve this.

What is in the code below:

  1. We use a paginated approach, fetching only 500 records (0-499) per request.
  2. The workflow timeout is set to 750 seconds and the memory to 256 MB.
  3. The hasMore field in the API response returns true if more records are available, which is why we loop with while.

```javascript
import { axios } from "@pipedream/platform"

export default defineComponent({
  async run({ steps, $ }) {
    const baseURL = "https://oracle-api**********"
    const api = "/projectcost"
    const limit = 499
    const onlyData = true
    const totalResults = true
    const fields = "ABC,DEF,GHI,JKL"
    const url = baseURL + api + "?fields=" + fields + "&limit=" + limit + "&onlyData=" + onlyData + "&totalResults=" + totalResults

    let offset = 0
    let index = 0
    let hasMore = true
    let items = []

    // Keep requesting pages until the API reports there are no more records
    while (hasMore) {
      offset = index * limit
      const response = await axios($, {
        url: url + "&offset=" + offset,
        method: "get",
        auth: {
          username: "**********",
          password: "**********",
        },
      })
      items = items.concat(response.items)
      hasMore = response.hasMore
      index = index + 1
    }

    return items
  },
})
```

To optimize the process of extracting a large number of records from the Oracle API and address the gateway timeout issue, consider the following recommendations:

1. Increase Workflow Memory and Timeout: Given the volume of data you're processing, the current memory (256 MB) and timeout (750 seconds) might be insufficient. Increase the memory to 2 gigabytes and extend the timeout as much as possible. This adjustment can be made in the workflow settings.
2. Implement Efficient Pagination: Your current approach with pagination seems appropriate, but ensure that the API calls are efficiently fetching and processing data. If possible, adjust the limit to find a balance between the number of records per request and the total number of requests needed.
3. Handle Pagination with Generators: Consider using a generator function for pagination to start processing records after the first API call. This can help in managing memory usage more effectively. Refer to the Pagination guidelines for implementing a generator for pagination.
4. Optimize Data Handling: As you concatenate items in the loop, this might increase memory usage significantly. Ensure that you're only keeping necessary data in memory and possibly offloading processed records to a storage solution periodically.
5. Parallel Processing (If Applicable): If your use case allows, you might want to fetch and process data in parallel to speed up the overall process. However, this depends on the API's rate limits and concurrency capabilities.
6. Review API Rate Limits: Ensure that your requests are within the Oracle API's rate limits. If you're hitting rate limits, this might contribute to the timeouts. Adjust your request rate accordingly.
7. Error Handling and Retries: Implement robust error handling and retry mechanisms for API requests. This can help in managing intermittent network issues or API errors without failing the entire process.

Here's an optimized version of your code snippet incorporating some of these suggestions:

```javascript
import { axios } from "@pipedream/platform"
export default defineComponent({
  async run({ steps, $ }) {
    var baseURL = "https://oracle-api"
    var api = "/projectcost"
    let offset = 0
    let index = 0
    let limit = 499 // Consider adjusting based on performance
    let onlyData = true
    let totalResults = true
    let hasMore =
```

complete your previous code

It appears that you’re asking for a completion of a code snippet or a continuation of a previous discussion, but unfortunately, I don’t have the context or the specific code you’re referring to. Could you please provide more details or share the code snippet you need help with? This will help me provide you with the most accurate and helpful response.

Check the thread message where you provided the optimized version.

I’m sorry, but it seems like your question is missing some context that would help me provide a more accurate answer. Could you please provide more details or clarify what specific information or assistance you’re looking for regarding a thread message and an optimized version?
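
For reference, the "optimized version" above was cut off and never completed in the thread. The following is only a sketch of how it could be finished, based on the user's original loop and the suggestions already listed; the endpoint, fields, and credentials are placeholders carried over from the original code.

```javascript
import { axios } from "@pipedream/platform"

export default defineComponent({
  async run({ steps, $ }) {
    const baseURL = "https://oracle-api**********" // placeholder endpoint
    const api = "/projectcost"
    const limit = 499
    const fields = "ABC,DEF,GHI,JKL"
    const url = `${baseURL}${api}?fields=${fields}&limit=${limit}&onlyData=true&totalResults=true`

    let offset = 0
    let hasMore = true
    let items = []

    while (hasMore) {
      const response = await axios($, {
        url: `${url}&offset=${offset}`,
        method: "get",
        auth: {
          username: "**********", // placeholder credentials
          password: "**********",
        },
      })
      items = items.concat(response.items)
      hasMore = response.hasMore
      offset += limit

      // To keep memory flat, each page of `response.items` could be written
      // to external storage here (a file, database, or Data Store) instead of
      // accumulating everything in `items`.
    }

    return items
  },
})
```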

Can you write a snippet to write 500 records into a CSV file and save it on Google Drive using Pipedream components?

To write 500 records into a CSV file and save it on Google Drive using Pipedream components, you'll first need to generate the CSV content from your records. Then, you can use the Google Drive API to upload the file. Below is a Pipedream component example that demonstrates how to accomplish this. Note that this example assumes you have the 500 records available in a specific format (e.g., an array of objects) and that you have connected your Google account to Pipedream.

```javascript
import { axios } from "@pipedream/platform";
import { parse } from "json2csv";

export default defineComponent({
  props: {
    google: {
      type: "app",
      app: "google",
    },
  },
  async run({ steps, $ }) {
    // Example records - replace this with your actual data source
    const records = new Array(500).fill().map((_, index) => ({
      id: index + 1,
      name: `Record ${index + 1}`,
      value: `Value ${index + 1}`,
    }));

    // Convert records to CSV
    const csv = parse(records);

    // Prepare the file for upload
    const fileMetadata = {
      name: "records.csv",
      mimeType: "text/csv",
    };
    const media = {
      mimeType: "text/csv",
      body: csv,
    };

    // Upload the file to Google Drive
    const response = await axios($, {
      method: "POST",
      url: `https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart`,
      headers: {
        Authorization: `Bearer ${this.google.$auth.oauth_access_token}`,
        "Content-Type": "multipart/related; boundary=foo_bar_baz",
      },
      data: `--foo_bar_baz\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n${JSON.stringify(fileMetadata)}\r\n--foo_bar_baz\r\nContent-Type: text/csv\r\n\r\n${csv}\r\n--foo_bar_baz--`,
    });

    return response;
  },
});
```

This code does the following:

1. Generates an example set of 500 records. Replace this with your actual data.
2. Converts the records to CSV format using the json2csv library.
3. Prepares the file metadata and the CSV content for upload.
4. Uploads the file to Google Drive using a multipart upload request.
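
Building on the example above, one way to apply the earlier "use external storage" suggestion is to write each 500-record page to its own CSV file instead of accumulating everything in memory. The following helper is illustrative only (the name uploadCsvToDrive and the per-offset file naming are not from the thread); it reuses the same multipart upload request shown in the component above.

```javascript
import { axios } from "@pipedream/platform";

// Illustrative helper: upload one CSV string to Google Drive as its own file.
// `self` is the `$` object from a workflow step, and `accessToken` is the
// connected Google account's OAuth token, as in the component above.
async function uploadCsvToDrive(self, accessToken, filename, csv) {
  const fileMetadata = { name: filename, mimeType: "text/csv" };
  return axios(self, {
    method: "POST",
    url: "https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart",
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Content-Type": "multipart/related; boundary=foo_bar_baz",
    },
    data: `--foo_bar_baz\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n${JSON.stringify(fileMetadata)}\r\n--foo_bar_baz\r\nContent-Type: text/csv\r\n\r\n${csv}\r\n--foo_bar_baz--`,
  });
}
```

Calling this once per page, for example with a file name like `projectcost-${offset}.csv`, keeps each upload small and avoids holding all 13 lakh records in workflow memory.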

For iterating over records like this in large batches, I highly recommend building the iteration within a Source. That way you can build the workflow with one record in mind, and the source will paginate in batches.

Didn't get you, can you please explain in detail?

Sure,

So triggers in Pipedream are actually separate from your workflows. They run in a separate Lambda.

You can build your own trigger (aka source) in Pipedream, and a source includes tooling to help poll an API, dedupe the records, store the page or offset record in a dedicated cache, and emit the records individually to subscribing workflows.

Here’s an example polling source for a GraphQL API:

This quickstart describes the development process: Quickstart: Source Development

This is pseudo-code, but with this approach you don't have to worry about recursion; just set the source to poll on an interval, and it will use the stored offset param from the database:

```javascript
import { axios } from "@pipedream/platform";

export default {
  name: "Source Demo",
  description: "This is a demo source",
  props: {
    timer: {
      type: "$.interface.timer",
      default: {
        intervalSeconds: 60 * 5, // poll every 5 minutes
      },
    },
    db: { type: "$.service.db" },
  },
  dedupe: "unique",
  async run() {
    const baseURL = "https://oracle-api**********";
    const api = "/projectcost";
    const limit = 499;
    const fields = "ABC,DEF,GHI,JKL";
    const url = baseURL + api + "?fields=" + fields + "&limit=" + limit + "&onlyData=true&totalResults=true";

    // Resume from the offset stored by the previous run (0 on the first run)
    const offset = this.db.get("offset") || 0;

    const response = await axios(this, {
      url: url + "&offset=" + offset,
      method: "get",
      auth: {
        username: "**********",
        password: "**********",
      },
    });

    // Emit each record individually to subscribing workflows;
    // dedupe: "unique" skips records whose id has already been emitted
    for (const item of response.items) {
      this.$emit(item, {
        id: item.id,
        summary: `Record ${item.id}`,
      });
    }

    // Advance the stored offset so the next poll fetches the next page
    this.db.set("offset", offset + limit);
  },
};
```
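
Not from the thread, but to sketch the other half: a workflow subscribed to this source receives one emitted record per event, so the workflow itself needs no pagination or recursion. A minimal code step that reads the emitted record from the trigger might look like this:

```javascript
export default defineComponent({
  async run({ steps, $ }) {
    // Each event from the source trigger is a single Oracle record
    const record = steps.trigger.event;

    // Handle just this one record: transform it, append it to a CSV,
    // write it to a database, etc. Downstream steps see the return value.
    return record;
  },
});
```

Because the source stores the offset in its own db and dedupes on id, the workflow stays small and each invocation finishes well within the timeout.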