How to fix "cannot import name" ImportError for 'open_filename' from pdfminer.utils in a pipeline for creating embeddings?

This topic was automatically generated from Slack. You can find the original thread here.

Hi, I’m trying to get a pipeline running to create embeddings from Google Drive files. I’ve worked through most of the issues, but I’ve hit one I can’t get past. Here’s the Python code I’m using in the workflow:

import openai
import unstructured
import os
import nltk

# pipedream add-package pdfminer.six
import pdfminer


from langchain.document_loaders import UnstructuredFileLoader

def handler(pd: "pipedream"):
    
    nltk_dir = "/tmp/nltk_data"
    
    if not os.path.exists(nltk_dir):
        os.mkdir(nltk_dir)
        os.chmod(nltk_dir, 0o777)
    
    os.environ.setdefault('NLTK_DATA', nltk_dir)
    #nltk.download('punkt')
    nltk.download('punkt', download_dir=nltk_dir)
    nltk.download('averaged_perceptron_tagger', download_dir=nltk_dir)
    nltk.data.path.append(nltk_dir)
    # Absolute path: downloaded files live under /tmp, not a relative tmp/ directory
    file_path = f'/tmp/{pd.steps["download_file"]["$return_value"]["name"]}'
    loader = UnstructuredFileLoader(file_path)
    document = loader.load()
    return {"foo": {"test": True}}

And here’s the error I receive:

ImportError
cannot import name 'open_filename' from 'pdfminer.utils' (/tmp/__pdg__/dist/python/pdfminer/utils.py)

DETAILS
Traceback (most recent call last):

  File "/nano-py/pipedream/worker.py", line 118, in execute
    user_retval = handler(pd)

  File "/tmp/__pdg__/dist/code/07ccd3aeb48bbf3653a754b9addf89f6a7743ac122582016f3ec75cd7b711227/code.py", line 29, in handler
    document = loader.load()

  File "/tmp/__pdg__/dist/python/langchain/document_loaders/unstructured.py", line 61, in load
    elements = self._get_elements()

  File "/tmp/__pdg__/dist/python/langchain/document_loaders/unstructured.py", line 93, in _get_elements
    from unstructured.partition.auto import partition

  File "/tmp/__pdg__/dist/python/unstructured/partition/auto.py", line 14, in <module>
    from unstructured.partition.image import partition_image

  File "/tmp/__pdg__/dist/python/unstructured/partition/image.py", line 4, in <module>
    from unstructured.partition.pdf import partition_pdf_or_image

  File "/tmp/__pdg__/dist/python/unstructured/partition/pdf.py", line 6, in <module>
    from pdfminer.utils import open_filename

ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (/tmp/__pdg__/dist/python/pdfminer/utils.py)

Stack Overflow says this error is caused by having multiple versions of pdfminer installed.
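A quick way to see which pdfminer the runtime actually picked up is to probe for the missing name directly, rather than triggering the whole unstructured import chain. The sketch below is only an illustration: `has_attribute` is a hypothetical helper name, not part of pdfminer or Pipedream.

```python
import importlib


def has_attribute(module_name, attr):
    """Return True/False if the module imports, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # the module itself is missing
    return hasattr(module, attr)


# A False result here would reproduce the ImportError above without loading unstructured.
print(has_attribute("pdfminer.utils", "open_filename"))
```

Running this as its own step, before anything imports unstructured, should show whether the worker's copy of pdfminer exposes `open_filename` at all.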

Digging through the langchain code, I can see that it depends on pdfminer.six. Looking through the packages that were downloaded locally, I see:
['pdfminer', 'pdfminer.six-20221105.dist-info']
print(pdfminer.__version__)
20221105

However, on the Pipedream worker I see:
['pdfminer', 'pdfminer-20191125-py3.9.egg-info']
print(pdfminer.__version__)
20191125

From what I can tell, the magic comment is not working. Even if I remove # pipedream add-package pdfminer.six, the worker still resolves the same pdfminer version, 20191125.
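The directory listings above can also be produced from inside the worker by asking Python which installed distributions have a name beginning with "pdfminer". This is a rough sketch using only the standard library's `importlib.metadata`; `find_distributions` is a hypothetical helper name introduced for illustration.

```python
from importlib import metadata


def find_distributions(prefix):
    """Map installed distribution names starting with `prefix` to their versions."""
    found = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name") or ""
        except Exception:
            continue  # skip distributions whose metadata cannot be read
        if name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return found


# More than one entry (e.g. both "pdfminer" and "pdfminer.six") would point to a conflict.
print(find_distributions("pdfminer"))
```

If both distributions show up, they are competing for the same top-level pdfminer package, which would match the Stack Overflow explanation.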

Hi John, thanks for the detailed report.

This sounds like an issue at the core of the Python package system. Could you please report it on our public GitHub issue tracker?

It just gives us better visibility into these errors and helps with the triage process.

Absolutely, I’d be glad to.

Good afternoon. I think the challenge has to do with the use of unstructured. That package is a dependency of an object I was using within LangChain. I had to install a few packages at the system level (since I’m on a Mac, I used Homebrew) to get unstructured to work.

@U02SX7ETBFB, are we able to do a !pip install [package] at the top of our scripts? If so, I was able to get unstructured running on Google Colab, where there are similar challenges in not having direct access to the system.

Hope this helps.

You’re definitely correct about it being some conflict with unstructured. I stripped the code down to:

# pipedream add-package pdfminer.six
import pdfminer

def handler(pd: "pipedream"):
    expected_version = pdfminer.__version__ == '20221105'
    return expected_version

and the magic comment works as expected.

From that good state, I’m adding layers back until it breaks. So far it works with import pdfminer as the first line of code. I’m going to hold off on filing a bug for the moment.
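While adding layers back, the version check from the stripped-down handler could be turned into a guard that fails loudly as soon as the older pdfminer is resolved. This is a minimal sketch, assuming `__version__` is a date-style string such as '20221105' as seen in this thread; `require_min_version` is a hypothetical helper, and the stand-in objects exist only so the example runs without pdfminer installed.

```python
import types


def require_min_version(module, minimum):
    """Raise RuntimeError unless module.__version__ is a date string >= minimum."""
    found = str(getattr(module, "__version__", ""))
    if not found.isdigit() or int(found) < int(minimum):
        raise RuntimeError(
            f"expected version >= {minimum}, found {found or 'unknown'}"
        )
    return found


# Stand-in objects so the sketch runs without pdfminer installed;
# in the workflow the real `pdfminer` module would be passed instead.
new_build = types.SimpleNamespace(__version__="20221105")
old_build = types.SimpleNamespace(__version__="20191125")

print(require_min_version(new_build, "20221105"))
try:
    require_min_version(old_build, "20221105")
except RuntimeError as exc:
    print(exc)
```

Calling the guard right after import pdfminer, before importing langchain or unstructured, would make it clearer which added layer changes the resolved version.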