Add Kaggle datasets #9

wengh · 2025-03-27T20:16:58Z

The Kaggle data source is a thin wrapper around kagglehub.dataset_load, which loads a dataset as a pandas DataFrame.
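For readers unfamiliar with kagglehub, the wrapped call might look roughly like the sketch below. The helper name load_kaggle_frame and its parameters are hypothetical and not taken from this PR. The sketch assumes a kagglehub version that exposes dataset_load and KaggleDatasetAdapter.PANDAS.

```python
def load_kaggle_frame(handle: str, path: str, **pandas_kwargs):
    """Load one file of a Kaggle dataset as a pandas DataFrame.

    handle: dataset handle in "owner/dataset-name" form (placeholder format).
    path: name of the file inside the dataset.
    pandas_kwargs: forwarded to the pandas reader that kagglehub selects.
    """
    # Imported here so that importing this module does not require kagglehub.
    import kagglehub
    from kagglehub import KaggleDatasetAdapter

    return kagglehub.dataset_load(
        KaggleDatasetAdapter.PANDAS,
        handle,
        path,
        pandas_kwargs=pandas_kwargs,
    )
```

A call such as load_kaggle_frame("owner/dataset-name", "data.csv") downloads the file, so it needs network access and, for non-public datasets, Kaggle credentials.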

Addition of Kaggle Data Source:

pyspark_datasources/kaggle.py: Added the KaggleDataSource class for reading Kaggle datasets in Spark, including methods for schema and data reading.
pyspark_datasources/__init__.py: Imported the KaggleDataSource class to make it available in the module.

Documentation Updates:

docs/datasources/kaggle.md: Added documentation for the KaggleDataSource, including requirements and usage examples.
docs/index.md: Updated the index to include the KaggleDataSource in the list of available data sources.

Project Configuration:

pyproject.toml: Added the kagglehub library as an optional dependency for the project.

Testing:

tests/test_data_sources.py: Added a new test case for the KaggleDataSource to ensure it can read a dataset from Kaggle correctly.

wengh · 2025-03-27T20:17:58Z

@allisonwang-db
oops I accidentally closed #7 when rebasing

allisonwang-db · 2025-03-27T20:38:48Z

pyspark_datasources/kaggle.py

+from functools import cached_property
+from typing import Iterator
+
+import pyarrow as pa


Do we have to depend on pyarrow? Can we throw a better error message if pyarrow is not installed?

wengh · 2025-03-27

Good point. Depending on pyarrow should be fine, since the PySpark data source API itself depends on pyarrow. Let's import pyarrow later so that PySpark shows the error message if it's missing.
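A minimal sketch of the "import pyarrow later" idea, assuming the reader yields Arrow record batches built from a pandas DataFrame. The class LazyArrowReader and its load_frame argument are hypothetical. In the actual PR the reader would subclass PySpark's DataSourceReader, which is left out here to keep the example self-contained.

```python
from typing import TYPE_CHECKING, Callable, Iterator

if TYPE_CHECKING:  # type hints only; nothing here is imported at runtime
    import pandas as pd
    import pyarrow as pa


class LazyArrowReader:
    """Hypothetical reader that converts a pandas DataFrame to Arrow batches."""

    def __init__(self, load_frame: Callable[[], "pd.DataFrame"]) -> None:
        # load_frame is any zero-argument callable returning a DataFrame.
        self._load_frame = load_frame

    def read(self, partition=None) -> Iterator["pa.RecordBatch"]:
        # Deferred import: a missing pyarrow fails here, when data is read,
        # rather than when the package is imported.
        import pyarrow as pa

        table = pa.Table.from_pandas(self._load_frame())
        yield from table.to_batches()
```

Because the module no longer imports pyarrow at the top level, importing the package succeeds without it, and any dependency check that PySpark performs gets a chance to run before this code does.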

allisonwang-db

🚢

allisonwang-db reviewed Mar 27, 2025

allisonwang-db approved these changes Mar 27, 2025

wengh added 9 commits March 28, 2025 11:37

cc52e9d Add Kaggle datasets
56e4589 fix
dea05d3 cache in temp directory
c00d40f add dependencies
af5b246 fix dependencies
bc351c3 fix dependencies
9fb9dc9 fix
bfb6646 update docs
e897d0d lock

wengh force-pushed the kaggle branch from 44f0cf8 to e897d0d on March 28, 2025 18:39
