Kaggle python package causes ECONNRESET error

ogoid · June 26, 2022, 2:06pm

Hi,

I’m trying to use Kaggle’s Python package to access their API, but just importing it in my Python workflow causes the node to receive a connection reset (ECONNRESET) error.

Is this a bug on your backend, or is the Kaggle package unsupported?

dylburger · June 27, 2022, 1:17am

Hi @ogoid, thanks for reaching out. From what I can tell from their docs, the Kaggle package is meant to be installed with pip install but used through the kaggle CLI rather than through a Python API. It doesn’t look like they provide a standalone kaggle script that can be downloaded directly; otherwise I would recommend using our Bash code steps to run those commands.

I actually saw a different issue installing Kaggle, which I was able to get around by setting the KAGGLE_CONFIG_DIR environment variable to /tmp.

But then I encountered this error:

sl = self._semlock = _multiprocessing.SemLock(
  OSError: [Errno 38] Function not implemented

Kaggle tries to create a ThreadPool object for parallel processing when it instantiates an API connection. We run Pipedream workflows on an AWS Lambda runtime, which doesn’t appear to implement the full multiprocessing API, so that fails.
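
To check whether a given runtime hits this limitation, a small probe like the one below can help. It is only a sketch: the helper name thread_pool_available is made up for illustration, and it assumes the failure surfaces as an OSError (as in the traceback above) or as an ImportError on platforms without semaphore support.

```python
from multiprocessing.pool import ThreadPool


def thread_pool_available() -> bool:
    """Return True if this runtime can construct a ThreadPool (hypothetical helper)."""
    try:
        # Constructing the pool sets up the synchronization primitives it needs
        pool = ThreadPool(1)
    except (OSError, ImportError):
        # e.g. OSError: [Errno 38] Function not implemented
        return False
    # Shut the pool down cleanly so no worker threads are left behind
    pool.close()
    pool.join()
    return True


print(thread_pool_available())
```

On a typical desktop or server Python this should report that the pool can be created; on a runtime without the required semaphore support it should report that it cannot.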

I would recommend reaching out on the kaggle-api GitHub and opening an issue there asking how they recommend running their package in AWS Lambda environments. It’s possible they don’t support that, but since this is an issue with the package, it would need to be fixed by the maintainers.

Let me know if that helps or if you have any other questions.

ogoid · June 27, 2022, 2:19am

What a bummer.

Their code seems to be automatically generated by Swagger Codegen. I’ll see if I can invoke their API directly instead.

Thanks, Dylan.

ogoid · June 29, 2022, 3:13pm

I managed to get it working.

It seems Kaggle’s Python package doesn’t actually use the ThreadPool by default, so we can just substitute a dummy class for it before importing kaggle:


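# Stand-in for ThreadPool, which cannot be created on this runtime
class FakePool:
  pass

# Import the submodule explicitly: "import multiprocessing" alone
# does not guarantee that multiprocessing.pool is loaded
import multiprocessing.pool
# Patch before importing kaggle so the package picks up the stand-in
multiprocessing.pool.ThreadPool = FakePool

import os
# /tmp is writable on AWS Lambda; the credentials below are placeholders
os.environ["KAGGLE_CONFIG_DIR"] = "/tmp"
os.environ["KAGGLE_USERNAME"] = "username"
os.environ["KAGGLE_KEY"] = "api key"

from kaggle import api

print(api.kernels_list(mine=True))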
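
A slightly more tolerant variant of the stub above, as a sketch rather than a tested fix. It assumes (this is not verified against Kaggle's source) that the client might pass arguments to the pool constructor or call close() and join() on it, and it restores the real ThreadPool afterwards so other code is unaffected. The name stubbed_thread_pool is hypothetical.

```python
import contextlib
import multiprocessing.pool


class FakePool:
    """No-op stand-in for ThreadPool; accepts any constructor arguments."""

    def __init__(self, *args, **kwargs):
        pass

    def close(self):
        # Nothing to shut down
        pass

    def join(self):
        # No worker threads to wait for
        pass


@contextlib.contextmanager
def stubbed_thread_pool():
    """Temporarily replace multiprocessing.pool.ThreadPool with FakePool."""
    original = multiprocessing.pool.ThreadPool
    multiprocessing.pool.ThreadPool = FakePool
    try:
        yield
    finally:
        # Always restore the real class, even if the body raises
        multiprocessing.pool.ThreadPool = original
```

The import would then go inside the block, for example `with stubbed_thread_pool(): from kaggle import api`, after the environment variables have been set. Any asynchronous calls that actually rely on the pool would still fail with this stub.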