This topic was automatically generated from Slack. You can find the original thread here.
Hi, I’m trying to get a pipeline running that creates embeddings from Google Drive files. I’ve worked through most of the issues, but I’ve hit one I can’t get past. Here’s the Python code I’m using in the workflow:
import openai
import unstructured
import os
import nltk
# pipedream add-package pdfminer.six
import pdfminer
from langchain.document_loaders import UnstructuredFileLoader

def handler(pd: "pipedream"):
    nltk_dir = "/tmp/nltk_data"
    if not os.path.exists(nltk_dir):
        os.mkdir(nltk_dir)
        os.chmod(nltk_dir, 0o777)
    os.environ.setdefault('NLTK_DATA', nltk_dir)
    #nltk.download('punkt')
    nltk.download('punkt', download_dir=nltk_dir)
    nltk.download('averaged_perceptron_tagger', download_dir=nltk_dir)
    nltk.data.path.append(nltk_dir)
    file_path = f'tmp/{pd.steps["download_file"]["$return_value"]["name"]}'
    loader = UnstructuredFileLoader(file_path)
    document = loader.load()
    return {"foo": {"test": True}}
And here is the error I receive:
ImportError
cannot import name 'open_filename' from 'pdfminer.utils' (/tmp/__pdg__/dist/python/pdfminer/utils.py)
DETAILS
Traceback (most recent call last):
  File "/nano-py/pipedream/worker.py", line 118, in execute
    user_retval = handler(pd)
  File "/tmp/__pdg__/dist/code/07ccd3aeb48bbf3653a754b9addf89f6a7743ac122582016f3ec75cd7b711227/code.py", line 29, in handler
    document = loader.load()
  File "/tmp/__pdg__/dist/python/langchain/document_loaders/unstructured.py", line 61, in load
    elements = self._get_elements()
  File "/tmp/__pdg__/dist/python/langchain/document_loaders/unstructured.py", line 93, in _get_elements
    from unstructured.partition.auto import partition
  File "/tmp/__pdg__/dist/python/unstructured/partition/auto.py", line 14, in <module>
    from unstructured.partition.image import partition_image
  File "/tmp/__pdg__/dist/python/unstructured/partition/image.py", line 4, in <module>
    from unstructured.partition.pdf import partition_pdf_or_image
  File "/tmp/__pdg__/dist/python/unstructured/partition/pdf.py", line 6, in <module>
    from pdfminer.utils import open_filename
ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (/tmp/__pdg__/dist/python/pdfminer/utils.py)
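The failing import at the bottom of the traceback can be checked in isolation, without going through langchain or unstructured. This is only a minimal sketch: `module_exports` is a hypothetical helper name, not part of any of the libraries involved, and it simply reports whether a module can be imported and whether it exposes a given attribute.

```python
import importlib


def module_exports(module_name, attr):
    """Return True/False if the module imports and has (or lacks) `attr`.

    Returns None when the module itself cannot be imported.
    (Hypothetical helper, named here for illustration only.)
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr)


# The name that unstructured/partition/pdf.py tries to import.
print(module_exports("pdfminer.utils", "open_filename"))
```

If this prints False in a workflow step, the pdfminer that the worker resolved does not provide `open_filename`, which would match the ImportError above.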
Stack Overflow says this error is caused by having multiple versions of pdfminer installed.
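One way to test the "multiple versions" theory is to ask Python which installed distributions claim the top-level `pdfminer` import name. The following is a rough sketch using the standard-library `importlib.metadata` (Python 3.8+). The helper name `distributions_providing` is made up for this example, and it assumes each distribution ships a `top_level.txt` file; distributions without one are skipped.

```python
from importlib import metadata


def distributions_providing(top_level_name):
    """Return sorted (name, version) pairs for installed distributions
    whose top_level.txt lists `top_level_name`.

    Hypothetical helper; distributions without top_level.txt are ignored.
    """
    found = set()
    for dist in metadata.distributions():
        top_level = dist.read_text("top_level.txt") or ""
        if top_level_name in top_level.split():
            name = dist.metadata["Name"] or "unknown"
            found.add((name, dist.version))
    return sorted(found)


# More than one entry here would suggest conflicting installs.
print(distributions_providing("pdfminer"))
```

Both the older `pdfminer` distribution and `pdfminer.six` install a package that is imported as `pdfminer`, so seeing both in the result would be consistent with the conflict described above.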
Digging through the langchain code, I can see that it depends on pdfminer.six. Looking through the packages that were downloaded locally, I see:
['pdfminer', 'pdfminer.six-20221105.dist-info']
print(pdfminer.__version__)
20221105
However, on the Pipedream worker I see:
['pdfminer', 'pdfminer-20191125-py3.9.egg-info']
print(pdfminer.__version__)
20191125
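For anyone who wants to reproduce the two listings above, here is a minimal sketch of how they could be produced. `package_entries` is a hypothetical helper, and the directory used in the example is taken from the traceback paths; whether that is the right directory on a given worker is an assumption.

```python
import os


def package_entries(site_dir, prefix):
    """Return the sorted entries in `site_dir` whose names start with `prefix`.

    Hypothetical helper, named here for illustration only.
    """
    return sorted(e for e in os.listdir(site_dir) if e.startswith(prefix))


# Directory taken from the traceback; assumed to hold the worker's packages.
packages_dir = "/tmp/__pdg__/dist/python"
if os.path.isdir(packages_dir):
    print(package_entries(packages_dir, "pdfminer"))
```

A `.dist-info` or `.egg-info` entry next to the `pdfminer` folder shows which distribution, and which version, the package on disk came from.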
From what I can tell, the magic comment is not taking effect. If I remove the # pipedream add-package pdfminer.six line, the worker still resolves the same version of pdfminer: 20191125.