PDF Text Extraction

mrodgers.junk · March 2, 2023, 1:01am

Does anyone have example code that uses the pdf.js-extract npm package? I think I'm using import incorrectly, or something else is wrong.

Error: SyntaxError: 'import' and 'export' may appear only with 'sourceType: module' (4:0)

I get this error when I use import in the following code:

import PDFExtract from "pdf.js-extract"

export default defineComponent({
  async run({ steps, $ }) {
    const pdfExtract = new PDFExtract();
    const fs = require('fs');
    const buffer = fs.readFileSync("https://www.zscaler.com/resources/data-sheets/zscaler-data-protection-benefits.pdf");
    const options = {}; /* see below */
    pdfExtract.extractBuffer(buffer, options, (err, data) => {
      if (err) return console.log(err);
      console.log(data);
    });

    // Reference previous step data using the steps object and return data to use it in future steps
    return steps.trigger.event
  },
})

vunguyenhung · March 2, 2023, 2:40am

Hello @mrodgers.junk,

I think the error occurs because you're using const fs = require('fs') in your action code. Would you mind changing it to import fs from 'fs' and putting it at the top? For example:

import PDFExtract from "pdf.js-extract"
import fs from "fs"

export default defineComponent({
   async run({steps, $}) {
     /// action code
   }
})

mrodgers.junk · March 2, 2023, 7:00pm

Thank you very much, @vunguyenhung. I messed around with the code a bit and was able to get this working.

// working code
import { PDFExtract } from "pdf.js-extract";
import fetch from "node-fetch";

export default defineComponent({
  async run({ steps, $ }) {
    // const url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf";
    const url = steps.trigger.event.body.text;
    const response = await fetch(url);
    const buffer = await response.buffer();

    const options = {};
    const pdfExtract = new PDFExtract();
    let data;

    try {
      data = await pdfExtract.extractBuffer(buffer, options);
      console.log(data);
    } catch (err) {
      console.log("Error extracting PDF data:", err);
    } finally {
      return data;
    }
  },
});

This code defines a function that extracts data from a PDF file available at a given URL. The function uses the pdf.js-extract library to extract the data and the node-fetch library to fetch the PDF file from the URL.

The function is the asynchronous run method of the component passed to defineComponent, which is the module's default export. It receives a single object argument, destructured into steps and $, and returns a Promise that resolves to the extracted data.

The function first reads the URL of the PDF from the trigger event (steps.trigger.event.body.text) into a url variable. It then uses node-fetch to download the file and stores its contents in a buffer variable.
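The download step has no status check, so an error page would be handed to the PDF parser as if it were a PDF. Here is a minimal sketch of a stricter version. The helper name fetchPdfBuffer and the injectable fetchImpl parameter are my own, not part of node-fetch or Pipedream, and it uses arrayBuffer() because node-fetch v3 marks response.buffer() as deprecated.

```javascript
// Hypothetical helper: downloads a URL and returns its body as a Node.js Buffer.
// fetchImpl is an assumption added for testability; it defaults to the global
// fetch (Node 18+), and node-fetch's fetch can be passed in instead.
async function fetchPdfBuffer(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);

  // Fail early on HTTP errors instead of parsing an error page as a PDF.
  if (!response.ok) {
    throw new Error(`Failed to download PDF: HTTP ${response.status}`);
  }

  // arrayBuffer() is the standard fetch API; wrap it in a Buffer for extractBuffer.
  const bytes = await response.arrayBuffer();
  return Buffer.from(bytes);
}
```

In the step above this would replace the two fetch lines, for example const buffer = await fetchPdfBuffer(url, fetch); with fetch imported from node-fetch.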

It then defines an empty options object and creates an instance of the PDFExtract class from the pdf.js-extract library. It also declares a data variable, which is initially undefined.

Next, the function tries to extract the data from the PDF file using the extractBuffer method of the pdfExtract object. If the extraction succeeds, it assigns the result to the data variable and logs it to the console. If the extraction fails, it logs the error to the console.
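The logged data is a structured object, not plain text. Here is a rough sketch of turning it into one string per page. It assumes the result shape shown in the pdf.js-extract README (a pages array whose entries have a content array of items with x, y and str) and that y grows down the page. The function names and the tolerance value are hypothetical.

```javascript
// Assumed input shape (from the pdf.js-extract README):
// { pages: [ { content: [ { x, y, str, ... }, ... ] }, ... ] }

// Groups a page's text items into lines by their y coordinate.
// tolerance is a guess for how far apart two items on the same line can sit.
function pageToLines(page, tolerance = 2) {
  const items = page.content
    .filter((item) => item.str.trim() !== "")
    .sort((a, b) => a.y - b.y || a.x - b.x);

  const lines = [];
  for (const item of items) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.items.push(item); // close enough vertically: same line
    } else {
      lines.push({ y: item.y, items: [item] }); // start a new line
    }
  }

  // Within each line, read left to right and join the fragments with spaces.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((item) => item.str)
      .join(" ")
  );
}

// Returns an array with one newline-joined string per page.
function extractText(data) {
  return data.pages.map((page) => pageToLines(page).join("\n"));
}
```

Returning extractText(data) instead of data would hand later steps plain text. Multi-column layouts and tables will likely need smarter grouping than this.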

Finally, the function returns the data variable inside a finally block, so the step returns a value whether or not the extraction threw an error. If the extraction failed, data is still undefined, so the step returns undefined. The fetch call sits outside the try block, so a download failure still throws.
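One drawback of returning from finally is that later steps can't tell a failed extraction from an empty result. Here is one possible alternative, only a sketch: wrap the extraction call and return an explicit success or failure object. The name extractOrReport and the { ok, data, error } shape are my own invention, not a Pipedream or pdf.js-extract convention.

```javascript
// Hypothetical wrapper: runs any async extraction function and reports the
// outcome explicitly instead of returning undefined on failure.
async function extractOrReport(extract) {
  try {
    const data = await extract();
    return { ok: true, data };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    console.log("Error extracting PDF data:", message);
    return { ok: false, error: message };
  }
}
```

In the step above it could be used as return await extractOrReport(() => pdfExtract.extractBuffer(buffer, options)); so a later step can branch on the ok field.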