Hi,
I’m trying to use Kaggle’s Python package to access their API, but just importing it in my Python workflow node raises a connection reset error:
Is this a bug on your backend, or is the kaggle package unsupported?
Hi @ogoid, thanks for reaching out. From what I can tell from their docs, the Kaggle package is meant to be installed via pip install but used through the kaggle CLI, rather than via a Python API. It doesn’t look like they provide a kaggle script that can be downloaded directly; otherwise I would recommend using our Bash code steps to run those commands.
I actually saw a different issue installing Kaggle, which I was able to get around by setting the KAGGLE_CONFIG_DIR environment variable to /tmp:
But then I encountered this error:
sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 38] Function not implemented
Kaggle tries to create a ThreadPool object for parallel processing when it instantiates an API connection. We run Pipedream workflows on an AWS Lambda runtime, which doesn’t appear to implement the full multiprocessing API, so that fails.
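A quick way to check whether a given runtime has this limitation is to try creating a multiprocessing lock, which goes through the same SemLock call shown in the traceback. This is only a minimal sketch: semaphores_available is a hypothetical helper name, and it assumes the failure always surfaces as an OSError or ImportError.

```python
import multiprocessing


def semaphores_available() -> bool:
    """Return True if this runtime can create multiprocessing locks.

    Hypothetical helper: multiprocessing.Lock() is backed by a SemLock,
    the same primitive that fails in the traceback above.
    """
    try:
        multiprocessing.Lock()
    except (OSError, ImportError):
        # Raised when the platform lacks a working semaphore implementation
        return False
    return True


if __name__ == "__main__":
    print("semaphores available:", semaphores_available())
```

If this prints False, anything that builds a ThreadPool or Pool will most likely fail in the same way.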
I would recommend reaching out on the kaggle-api GitHub and opening an issue there asking how they recommend running their package in AWS Lambda environments. It’s possible they don’t support that, but since this is an issue with the package, it would need to be fixed by the maintainers.
Let me know if that helps or if you have any other questions.
What a bummer.
Kaggle tries to create a ThreadPool object for parallel processing when it instantiates an API connection. We run Pipedream workflows on an AWS Lambda runtime, which doesn’t appear to implement the full multiprocessing API, so that fails.
Their code seems to be automatically generated by Swagger Codegen. I’ll see if I can invoke their API directly instead.
Thanks, Dylan.
I managed to get it working.
It seems Kaggle’s Python package doesn’t actually use the ThreadPool by default, so we can just substitute the class:
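# Empty stand-in for ThreadPool, which can't be created on this runtime
class FakePool:
    pass

# Import the submodule explicitly; "import multiprocessing" alone may not load it
import multiprocessing.pool
# Patch before importing kaggle so the client never builds a real pool
multiprocessing.pool.ThreadPool = FakePool

import os
os.environ["KAGGLE_CONFIG_DIR"] = "/tmp"
os.environ["KAGGLE_USERNAME"] = "username"
os.environ["KAGGLE_KEY"] = "api key"

from kaggle import api
print(api.kernels_list(mine=True))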
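A slightly more defensive variant of the same trick, in case some code path does touch the pool. This is a sketch rather than anything Kaggle documents: it assumes the generated client may pass arguments to ThreadPool and may call close() and join() on teardown, so those are accepted as no-ops, while any other method fails with a clear message instead of a bare AttributeError.

```python
import multiprocessing.pool


class FakePool:
    """Stand-in for ThreadPool that never creates threads or semaphores."""

    def __init__(self, *args, **kwargs):
        # Accept whatever arguments the caller passes; create nothing
        pass

    def close(self):
        # No-op: there is nothing to shut down (assumed to be called on teardown)
        pass

    def join(self):
        # No-op: there are no workers to wait for
        pass

    def __getattr__(self, name):
        # Only reached for attributes not defined above, e.g. apply_async or map
        raise AttributeError(
            f"FakePool has no '{name}': thread pools are unavailable in this runtime"
        )


# Must run before "from kaggle import api" so the client picks up the stand-in
multiprocessing.pool.ThreadPool = FakePool
```

With this in place, synchronous calls should behave as before, and an accidental asynchronous call points straight at the missing pool.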