How to fix Langchain error when loading "punkt" model in Pipedream's read-only temp folder using Python?

This topic was automatically generated from Slack. You can find the original thread here.

Hi!

I’m using LangChain to read documents in a Python code step, but I’m running into an error I don’t know how to solve.

Here is my code:

import langchain
import unstructured
import nltk
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.image import UnstructuredImageLoader

def handler(pd: "pipedream"):

    # Get the file path returned by the previous Download_to_tmp step
    path = pd.steps["Download_to_tmp"]["$return_value"]["tmpPath"]

    # Image file formats
    loader = UnstructuredImageLoader(path)
    docs = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=10)
    texts = text_splitter.split_documents(docs)

    page_contents = [text.page_content for text in texts]

    return page_contents

As I understand it, I need to download the "punkt" model via nltk.download('punkt'), but I can’t, since the folder NLTK writes to on the Pipedream side is read-only.

Here is the error:

Traceback (most recent call last):

  File "/tmp/__pdg__/dist/python/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")

  File "/tmp/__pdg__/dist/python/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)

LookupError: 
********************************************************************************************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt

  Searched in:
    - '/home/sbx_user1051/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/share/nltk_data'
    - '/var/lang/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
********************************************************************************************************************************************



During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/nano-py/pipedream/worker.py", line 118, in execute
    user_retval = handler(pd)

  File "/tmp/__pdg__/dist/code/7999065291c2b703f7c7dd35804cadf422cc96dd1e5c5d63046755cb9b52d2cc/code.py", line 27, in handler
    docs = loader.load()

  File "/tmp/__pdg__/dist/python/langchain/document_loaders/unstructured.py", line 61, in load
    elements = self._get_elements()

  File "/tmp/__pdg__/dist/python/langchain/document_loaders/image.py", line 11, in _get_elements
    from unstructured.partition.image import partition_image

  File "/tmp/__pdg__/dist/python/unstructured/partition/image.py", line 4, in <module>
    from unstructured.partition.pdf import partition_pdf_or_image

  File "/tmp/__pdg__/dist/python/unstructured/partition/pdf.py", line 13, in <module>
    from unstructured.partition.text import partition_text

  File "/tmp/__pdg__/dist/python/unstructured/partition/text.py", line 16, in <module>
    from unstructured.partition.text_type import (

  File "/tmp/__pdg__/dist/python/unstructured/partition/text_type.py", line 20, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize

  File "/tmp/__pdg__/dist/python/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)

  File "/tmp/__pdg__/dist/python/unstructured/nlp/tokenize.py", line 23, in _download_nltk_package_if_not_present
    nltk.download(package_name)

  File "/tmp/__pdg__/dist/python/nltk/downloader.py", line 777, in download
    for msg in self.incr_download(info_or_id, download_dir, force):

  File "/tmp/__pdg__/dist/python/nltk/downloader.py", line 642, in incr_download
    yield from self._download_package(info, download_dir, force)

  File "/tmp/__pdg__/dist/python/nltk/downloader.py", line 699, in _download_package
    os.makedirs(download_dir)

  File "/var/lang/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)

  File "/var/lang/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)

OSError: [Errno 30] Read-only file system: '/home/sbx_user1051'

Can you help me with that issue? Thanks!

Hi Anthony,

Thanks for the background. Just to clarify: the directory you’re trying to download to doesn’t have read/write access.

Have you tried using /tmp instead?

Thanks for your reply! I forgot to include the exact name of the directory:

"Read-only file system: '/home/sbx_user1051'"

I think the error occurs because I need to download the "punkt" model with nltk.download('punkt'), but since the folder that NLTK writes to on the Pipedream side is read-only, I can’t proceed.

Based on that, what can I do to solve this issue?

Thanks!

Ah, thanks for clarifying, that makes sense. Only the /tmp directory is writable in workflows.

Just a suggestion, but have you looked into the nltk.download function’s documentation? I would hope there’s a parameter you can use to specify a download location.

Then you can specify /tmp

I found the solution: nltk.download('punkt', download_dir='/tmp/nltk_data')
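Put together with the rest of the step, a minimal sketch of the fix (the /tmp/nltk_data path is just a convention, and this needs to run before loader.load(), since unstructured triggers the punkt lookup the first time it is imported):

```python
import os
import nltk

# /tmp is the only writable directory in a Pipedream workflow,
# so point NLTK's downloader at a folder under it.
NLTK_DIR = "/tmp/nltk_data"
os.makedirs(NLTK_DIR, exist_ok=True)

# Download the tokenizer data into the writable directory.
nltk.download("punkt", download_dir=NLTK_DIR, quiet=True)

# Make sure NLTK (and libraries like unstructured that call it)
# also search /tmp/nltk_data when looking up the resource.
if NLTK_DIR not in nltk.data.path:
    nltk.data.path.append(NLTK_DIR)
```

Appending to nltk.data.path matters because nltk.find only consults that list; downloading alone isn’t enough if NLTK never searches /tmp/nltk_data.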

python - Change nltk.download() path directory from default ~/ntlk_data - Stack Overflow. Thanks for your help!

Awesome, glad you were able to find the answer!

I have another question.

How can I install new Python packages?

I need detectron2, but import detectron2 does not work.

How can I install Python packages whose import names differ from the package name?

Sorry, Mario appears to be taking some time off.

But long story short, you can use a special Python magic comment to import packages with different import names than their package name:
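The example itself got cut off in the Slack export; here is a minimal sketch of the magic-comment syntax, using python-dateutil (whose PyPI name differs from its import name, dateutil) as a stand-in — check Pipedream’s Python docs for the exact details, and note that detectron2 may not be published to PyPI under that name:

```python
# In a Pipedream Python code step, a magic comment tells the platform
# which PyPI package to install when the import name doesn't match it.
# pipedream add-package python-dateutil
from dateutil import parser

def handler(pd: "pipedream"):
    # Example use: parse a timestamp from a previous step (hypothetical input).
    return parser.parse("2023-06-01").year
```

The comment is read by the Pipedream platform at deploy time; locally it’s just a comment, so the import only works where the package is actually installed.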

Thanks for posting this, you saved me a bunch of time!