Can a Python Step Return Both a Dictionary and a PDF File for Later Use in an API?

user-1 · October 8, 2024, 8:31pm

This topic was automatically generated from Slack. You can find the original thread here.

Hi, in my current workflow I send a PDF file to a webhook, download it to /tmp with a helper function, parse it in a Python code step, and return a dictionary of strings to send to an API endpoint later. But I’d also like to send the original PDF file to that API.
Can I have my Python step return both a dictionary and the PDF file for later use?

user-1 · October 8, 2024, 8:31pm

Yes, you can return both a dictionary and the PDF file from your Python code step in a Pipedream workflow. Here’s how you can achieve that:

1. Store the PDF file in the /tmp directory: Ensure that your helper function saves the PDF file in the /tmp directory, which is accessible within your workflow steps.
2. Return both the dictionary and the file path: In your Python code step, you can return a dictionary containing both your parsed data and the file path to the PDF. This way, downstream steps can access both the parsed data and the file itself.

Here’s an example of how you might structure your Python code step:

    def handler(pd: "pipedream"):
        # Assume your helper function has saved the PDF to /tmp/original.pdf
        pdf_file_path = "/tmp/original.pdf"

        # Your existing logic to parse the PDF and create a dictionary
        parsed_data = {
            "key1": "value1",
            "key2": "value2",
            # Add more key-value pairs as needed
        }

        # Return both the parsed data and the file path
        return {
            "parsed_data": parsed_data,
            "pdf_file_path": pdf_file_path
        }

In this example, the parsed_data dictionary contains your parsed strings, and pdf_file_path holds the path to the PDF file. Downstream Python steps can access these using pd.steps["your_step_name"]["$return_value"]["parsed_data"] and pd.steps["your_step_name"]["$return_value"]["pdf_file_path"]. You can then use the file path to send the PDF file to your API endpoint in a subsequent step.
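A returned path is only a string, so a later step still has to open the file itself. As a minimal sketch, assuming a downstream Python step and an API that accepts base64-encoded file content (the helper name and the payload field names are hypothetical and will differ per API), the path can be turned into actual file content like this:

```python
import base64
from pathlib import Path


def build_upload_payload(parsed_data: dict, pdf_file_path: str) -> dict:
    """Combine parsed fields with the PDF's content (hypothetical payload shape)."""
    # Raises FileNotFoundError if the file is no longer at that path.
    pdf_bytes = Path(pdf_file_path).read_bytes()
    return {
        "fields": parsed_data,
        "filename": Path(pdf_file_path).name,
        "content_base64": base64.b64encode(pdf_bytes).decode("ascii"),
    }
```

Whether the target API wants base64 content, a multipart upload, or a public URL depends on that API, so the payload shape above is only a placeholder.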

user-1 · October 8, 2024, 8:31pm

This approach seems similar to what I just tried, which was to add the /tmp file path to my API attachment field, but that generated an error. How is returning that same path in Python different? It feels like you are returning a path and not the file object itself.

user-1 · October 8, 2024, 8:31pm

My understanding is that after I use that PDF file from the /tmp folder to parse it in my Python step, it no longer exists in /tmp. I would have to save it there again at the end of the Python processing, using something like shutil or some other package to store the PDF file in /tmp.

user-1 · October 8, 2024, 8:31pm

You are correct that files in the /tmp directory are temporary and are not guaranteed to persist beyond the execution of your workflow. Note, though, that reading or parsing a file does not remove it from /tmp. If you need to save a file to the /tmp directory after processing it in a Python step, you can use Python’s built-in file handling capabilities or a package like shutil to write the file back to /tmp.

Here’s an example of how you might save a PDF file to the /tmp directory in a Python step:

    import shutil

    def handler(pd: "pipedream"):
        # Assume 'processed_pdf' is your processed PDF data
        processed_pdf = b"%PDF-1.4..."  # Example binary content of a PDF

        # Define the path where you want to save the file
        file_path = "/tmp/processed_file.pdf"

        # Write the processed PDF to the /tmp directory
        with open(file_path, "wb") as f:
            f.write(processed_pdf)

        # Optionally, use shutil to copy or move files if needed
        # shutil.copy(file_path, "/tmp/another_location.pdf")

        # Return the file path for use in future steps
        return {"file_path": file_path}

This code snippet demonstrates how to write binary data to a file in the /tmp directory. You can replace processed_pdf with your actual PDF data. If you need to perform additional file operations, the shutil module provides functions like copy and move to manage files.
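If the worry is that later processing might clobber the original, one option is to copy it to a second path before parsing. This is only a sketch: the helper name and both paths are hypothetical, and it assumes the original was already downloaded to /tmp by an earlier step.

```python
import os
import shutil


def keep_backup_copy(src_path: str, backup_path: str) -> str:
    """Copy the original file to a second path so later writes cannot clobber it."""
    if not os.path.isfile(src_path):
        raise FileNotFoundError(f"{src_path} does not exist")
    shutil.copy2(src_path, backup_path)  # copies the content and file metadata
    return backup_path
```

A step could call `keep_backup_copy("/tmp/original.pdf", "/tmp/original_backup.pdf")` first and return the backup path alongside the parsed data.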

user-1 · October 8, 2024, 8:31pm

it didn’t work, I tried writing the pdf file to tmp but got this error code:
a bytes-like object is required, not ‘PDF’

Traceback (most recent call last):

  File "/nano-py/pipedream/worker.py", line 137, in execute
    user_retval = handler(pd)
                  ^^^^^^^^^^^

  File "/tmp/__pdg__/dist/code/2b0925cd941265456d2b2d4f4ed47b2eaf6c428a36af3a5be261657512082195/code.py", line 127, in handler
    invoice_data = extract_data_from_invoice(pdf_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/tmp/__pdg__/dist/code/2b0925cd941265456d2b2d4f4ed47b2eaf6c428a36af3a5be261657512082195/code.py", line 28, in extract_data_from_invoice
    f.write(pdf)

TypeError: a bytes-like object is required, not 'PDF'

Timestamp: 01/10/2024, 15:45:14
Duration: 180ms

user-1 · October 8, 2024, 8:31pm

It seems like you’re encountering a TypeError because the write method expects a bytes-like object, but it appears you’re passing a ‘PDF’ object instead. This usually happens when you’re trying to write a PDF object directly to a file without converting it to bytes. To resolve this issue, you need to ensure that the PDF data is in a bytes-like format before writing it to a file. Here’s a general approach to handle this:

1. If you’re using a library to generate or manipulate PDFs, check its documentation to see how you can convert the PDF object to bytes.
2. If you’re reading a PDF file, make sure to open the file in binary mode ('rb' for reading, 'wb' for writing).

Here’s an example of how you might write a PDF file to the /tmp directory in Python:

    def handler(pd: "pipedream"):
        # Assuming `pdf_data` is a bytes-like object containing your PDF content
        pdf_data = b'%PDF-1.4...'  # Example PDF data in bytes

        # Write the PDF data to a file in the /tmp directory
        with open("/tmp/invoice.pdf", "wb") as f:
            f.write(pdf_data)

If you’re using a library like PyPDF2 or reportlab, make sure to convert the PDF object to bytes before writing. For example, if you’re using reportlab, you can use a BytesIO object to capture the PDF data:

    from io import BytesIO
    from reportlab.pdfgen import canvas

    def handler(pd: "pipedream"):
        # Create a BytesIO buffer
        buffer = BytesIO()

        # Generate a PDF using reportlab
        c = canvas.Canvas(buffer)
        c.drawString(100, 750, "Hello, World!")
        c.save()

        # Get the PDF data from the buffer
        pdf_data = buffer.getvalue()

        # Write the PDF data to a file in the /tmp directory
        with open("/tmp/invoice.pdf", "wb") as f:
            f.write(pdf_data)

Ensure that the pdf_data variable contains the PDF content in bytes before writing it to the file. If you need further assistance, please provide more details about how you’re generating or obtaining the PDF data.
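To make the bytes requirement concrete, here is a small sketch that refuses non-bytes input before the destination is opened and gets real bytes by reading the source file in binary form. The helper names `write_pdf_bytes` and `copy_pdf` are hypothetical, and the sketch assumes the source PDF is still on disk.

```python
from pathlib import Path


def write_pdf_bytes(dest_path: str, data) -> int:
    """Write data to dest_path, refusing anything that is not bytes-like."""
    # Check the type first, so a bad argument never opens (and empties) dest_path.
    if not isinstance(data, (bytes, bytearray, memoryview)):
        raise TypeError(
            f"expected raw bytes, got {type(data).__name__}; "
            "read the source file in binary mode to get its bytes"
        )
    with open(dest_path, "wb") as f:
        return f.write(data)  # number of bytes written


def copy_pdf(src_path: str, dest_path: str) -> int:
    """Copy a PDF by reading its raw bytes from src_path and writing them to dest_path."""
    return write_pdf_bytes(dest_path, Path(src_path).read_bytes())
```

A parser object such as the one in the traceback describes the document; it is not the document’s bytes, which is why the raw file content has to come from the file itself.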

user-1 · October 8, 2024, 8:31pm

Here’s a portion of my code:
    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        text = first_page.extract_text()

        invoice_data = extract_invoice_p1_data(text)
        # Extract week number and dates
        second_page = pdf.pages[1]
        date_info = extract_date_info(second_page.extract_text())

        if date_info:
            invoice_data.update(date_info)

        # Write the processed PDF to the /tmp directory
        with open(pdf_path, "wb") as f:
            f.write(pdf)

user-1 · October 8, 2024, 8:31pm

It looks like you’re trying to process a PDF file using pdfplumber and extract some data from it. However, there is an issue in the section of your snippet where you attempt to write the processed PDF back to a file. Here’s a corrected version of your code:

    import pdfplumber

    def extract_invoice_p1_data(text):
        # Your logic to extract invoice data from the text
        return {}

    def extract_date_info(text):
        # Your logic to extract date information from the text
        return {}

    pdf_path = "/path/to/your/pdf/file.pdf"

    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        text = first_page.extract_text()
        invoice_data = extract_invoice_p1_data(text)

        # Extract week number and dates
        second_page = pdf.pages[1]
        date_info = extract_date_info(second_page.extract_text())
        if date_info:
            invoice_data.update(date_info)

    # Write the processed PDF text to a file in the /tmp directory
    with open("/tmp/processed_pdf.txt", "w") as f:
        f.write(text)

### Key Changes:

1. Writing to a File: If you intend to save the extracted text or processed data, you should write the text or data to a file, not the PDF object itself. In the example above, I’m writing the extracted text from the first page to a text file in the /tmp directory.
2. File Mode: Ensure that you’re using the correct file mode: "w" for text files and "wb" for binary files.
3. Correct Path: Make sure you specify the correct path for the output file, especially if you’re writing to the /tmp directory. Use a different path from the original PDF, because opening pdf_path in "wb" mode empties the existing file before anything is written to it.

If you have any further questions or need additional help, feel free to ask!
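If the goal is simply to keep the original PDF available, the parsing step can stay read-only and return the untouched path together with the parsed fields. A minimal sketch, assuming the parsing logic is passed in as a function: `process_invoice` and `extract_fields` are hypothetical names, with `extract_fields` standing in for the pdfplumber-based extraction.

```python
import os
from typing import Callable


def process_invoice(pdf_path: str, extract_fields: Callable[[str], dict]) -> dict:
    """Parse a PDF read-only and return the parsed fields plus the untouched path."""
    size_before = os.path.getsize(pdf_path)
    parsed_data = extract_fields(pdf_path)  # should only read the file
    # Guard against accidental writes to the original during parsing.
    if os.path.getsize(pdf_path) != size_before:
        raise RuntimeError(f"{pdf_path} changed size during parsing")
    return {"parsed_data": parsed_data, "pdf_file_path": pdf_path}
```

The size check is a cheap guard rather than a full integrity check; a checksum would be stricter.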

user-1 · October 8, 2024, 8:31pm

God this is frustrating… I simply want my original PDF file to remain accessible from the /tmp directory so that I can attach it to my Airtable attachment field later. Why is it that the attachment “disappears” from /tmp after being parsed?

user-1 · October 8, 2024, 8:31pm

are you overwriting the original pdf in the /tmp path? It should still be accessible
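One way a file can end up unusable without being deleted: opening an existing path in "wb" mode empties it immediately, before anything is written. The sketch below reproduces that with a throwaway temp file and a stand-in class (`FakePDF`, hypothetical) in place of a parser object:

```python
import os
import tempfile


class FakePDF:
    """Stand-in for a parser object that is not bytes-like (hypothetical)."""


def size_after_failed_write(path: str) -> int:
    """Try to write a non-bytes object to path and report the resulting file size."""
    try:
        with open(path, "wb") as f:  # the file is truncated as soon as it is opened
            f.write(FakePDF())       # raises TypeError
    except TypeError:
        pass
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_path = os.path.join(tmp_dir, "original.pdf")
    with open(pdf_path, "wb") as f:
        f.write(b"%PDF-1.4 example")
    print(os.path.getsize(pdf_path))          # 16
    print(size_after_failed_write(pdf_path))  # 0
```

The file still exists after the failed write, but it is empty, so anything that later tries to attach or parse it will fail.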

user-1 · October 8, 2024, 8:31pm

No I’m not…

user-1 · October 8, 2024, 8:31pm

Gotcha, could you share the workflow URL so I can take a look?

user-1 · October 8, 2024, 8:31pm

user-1 · October 8, 2024, 8:31pm

Sorry timezones and I went out to eat!

user-1 · October 8, 2024, 8:31pm

Also for some reason it’s quite hard to get into edit mode for the python steps, very often I’ll see a red square outline where my cursor is and I can’t edit… are you aware of this bug?

user-1 · October 8, 2024, 8:31pm

ping me if you need me to trigger by sending a new pdf to the webhook

user-1 · October 8, 2024, 8:31pm

Here’s an example

user-1 · October 8, 2024, 8:31pm

This here is my issue:

user-1 · October 8, 2024, 8:31pm