Add Kaggle datasets #9
@allisonwang-db
pyspark_datasources/kaggle.py
Outdated
from functools import cached_property
from typing import Iterator

import pyarrow as pa
Do we have to depend on pyarrow? Can we throw a better error message if pyarrow is not installed?
Good point. Depending on pyarrow should be fine since pyspark data source itself depends on pyarrow. Let's import pyarrow later so that pyspark shows the error message if it's missing.
🚢
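The deferred import discussed in the thread above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `require_module` and `rows_to_arrow` are hypothetical names, and the explicit error message reflects the first reviewer's request, whereas the thread settled on simply importing pyarrow later and letting PySpark report a missing install.

```python
import importlib
from types import ModuleType


def require_module(name: str, feature: str) -> ModuleType:
    """Import `name` on first use; raise a descriptive ImportError if missing.

    Hypothetical helper, not part of pyspark_datasources.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the '{name}' package, which is not installed."
        ) from exc


def rows_to_arrow(rows: list):
    # pyarrow is imported here rather than at module top level, so importing
    # this module succeeds even on a machine without pyarrow installed.
    pa = require_module("pyarrow", "The Kaggle data source")
    # Table.from_pylist builds an Arrow table from a list of dicts.
    return pa.Table.from_pylist(rows)
```

With this shape, a missing pyarrow surfaces only when data is actually read, not when the package is imported.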
The Kaggle data source is simply a wrapper around kagglehub.dataset_load, which loads a dataset as a pandas DataFrame.

Addition of Kaggle Data Source:
- pyspark_datasources/kaggle.py: Added the KaggleDataSource class for reading Kaggle datasets in Spark, including methods for schema and data reading.
- pyspark_datasources/__init__.py: Imported the KaggleDataSource class to make it available in the module.

Documentation Updates:
- docs/datasources/kaggle.md: Added documentation for the KaggleDataSource, including requirements and usage examples.
- docs/index.md: Updated the index to include the KaggleDataSource in the list of available data sources.

Project Configuration:
- pyproject.toml: Added the kagglehub library as an optional dependency for the project.

Testing:
- tests/test_data_sources.py: Added a new test case for the KaggleDataSource to ensure it can read a dataset from Kaggle correctly.
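
To make the "wrapper around kagglehub.dataset_load" idea concrete, here is a minimal sketch of the core read path. It is not the KaggleDataSource implementation from this PR: `read_kaggle_rows` and its `loader` parameter are hypothetical, and the `kagglehub.dataset_load(KaggleDatasetAdapter.PANDAS, handle, path)` call follows the kagglehub README as I understand it, so check it against the installed version.

```python
from typing import Any, Callable, Iterator, Optional, Tuple


def read_kaggle_rows(
    handle: str,
    path: str,
    loader: Optional[Callable[[str, str], Any]] = None,
) -> Iterator[Tuple[Any, ...]]:
    """Yield the rows of one file in a Kaggle dataset as plain tuples.

    `loader` is a hypothetical injection point so the function can be
    exercised without network access; by default it calls kagglehub.
    """
    if loader is None:
        # Deferred import: kagglehub is an optional dependency.
        import kagglehub
        from kagglehub import KaggleDatasetAdapter

        def loader(h: str, p: str) -> Any:
            # Assumed call shape; returns a pandas DataFrame.
            return kagglehub.dataset_load(KaggleDatasetAdapter.PANDAS, h, p)

    df = loader(handle, path)
    # itertuples(index=False, name=None) yields one plain tuple per row.
    for row in df.itertuples(index=False, name=None):
        yield row
```

A Spark data source reader would additionally need to declare a schema and hand rows (or Arrow batches) back to Spark; that part is omitted here.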