Deprecate kedro.extras.datasets and add top-level docs for `kedro_datasets` (kedro-org#2546)

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
astrojuanlu authored Apr 28, 2023
1 parent a95ce7a commit a7d0e7d
Showing 24 changed files with 123 additions and 96 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -138,6 +138,7 @@ venv.bak/
# Additional files created by sphinx.ext.autosummary
# Some of them are actually tracked to control the output
/docs/source/kedro.*
/docs/source/kedro_datasets.*

# mypy
.mypy_cache/
2 changes: 0 additions & 2 deletions .readthedocs.yml
@@ -5,7 +5,6 @@
# Required
version: 2

# .readthedocs.yml hook to copy kedro-datasets to kedro.datasets before building the docs
build:
os: ubuntu-22.04
tools:
@@ -16,7 +15,6 @@ build:
jobs:
post_create_environment:
- npm install -g @mermaid-js/mermaid-cli
- ./docs/kedro-datasets-docs.sh
pre_build:
- python -m sphinx -WETan -j auto -D language=en -b linkcheck -d _build/doctrees docs/source _build/linkcheck

3 changes: 2 additions & 1 deletion RELEASE.md
@@ -22,11 +22,12 @@
### Documentation changes
* Improvements to the Sphinx toolchain, including an upgrade to a newer version.
* Improvements to documentation on visualising Kedro projects on Databricks, and additional documentation about the development workflow for Kedro projects on Databricks.
* Updated Technnical Steering Committee membership documentation.
* Updated Technical Steering Committee membership documentation.
* Revised documentation section about linting and formatting and extended to give details of `flake8` configuration.
* Updated table of contents for documentation to reduce scrolling.
* Expanded FAQ documentation.
* Added a 404 page to documentation.
* Added deprecation warnings about the removal of `kedro.extras.datasets`.

## Breaking changes to the API

4 changes: 0 additions & 4 deletions docs/build-docs.sh
@@ -7,10 +7,6 @@ set -o nounset

action=$1

# Reinstall kedro-datasets locally
rm -rf kedro/datasets
bash docs/kedro-datasets-docs.sh

if [ "$action" == "linkcheck" ]; then
sphinx-build -WETan -j auto -D language=en -b linkcheck -d docs/build/doctrees docs/source docs/build/linkcheck
elif [ "$action" == "docs" ]; then
13 changes: 0 additions & 13 deletions docs/kedro-datasets-docs.sh

This file was deleted.

4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -131,7 +131,7 @@
"integer -- return number of occurrences of value",
"integer -- return first index of value.",
"kedro.extras.datasets.pandas.json_dataset.JSONDataSet",
"kedro.datasets.pandas.json_dataset.JSONDataSet",
"kedro_datasets.pandas.json_dataset.JSONDataSet",
"pluggy._manager.PluginManager",
"_DI",
"_DO",
@@ -309,7 +309,7 @@
"kedro.config",
"kedro.extras.datasets",
"kedro.extras.logging",
"kedro.datasets",
"kedro_datasets",
]


6 changes: 3 additions & 3 deletions docs/source/data/data_catalog.md
@@ -2,7 +2,7 @@

This section introduces `catalog.yml`, the project-shareable Data Catalog. The file is located in `conf/base` and is a registry of all data sources available for use by a project; it manages loading and saving of data.

All supported data connectors are available in [`kedro-datasets`](/kedro.datasets).
All supported data connectors are available in [`kedro-datasets`](/kedro_datasets).
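
As an illustrative aside (not part of the diff above), the same registry can also be built from a plain dictionary with `DataCatalog.from_config`; the entry name and file path below are hypothetical:

```python
from kedro.io import DataCatalog

# Programmatic equivalent of a small catalog.yml entry (illustrative only)
catalog_config = {
    "bikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/bikes.csv",
    },
}

catalog = DataCatalog.from_config(catalog_config)
bikes = catalog.load("bikes")  # returns a pandas DataFrame
```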

## Use the Data Catalog within Kedro configuration

@@ -261,7 +261,7 @@ scooters_query:
index_col: [name]
```

When you use [`pandas.SQLTableDataSet`](/kedro.datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro.datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only).
When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only).
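
As an illustrative aside (not part of the diff), this is roughly how the `con` key is consumed when the dataset is instantiated directly in Python; the connection string, table name and index column below are hypothetical:

```python
from kedro_datasets.pandas import SQLTableDataSet

# Hypothetical credentials; `con` must be a SQLAlchemy-compatible connection string
credentials = {"con": "postgresql://user:password@localhost:5432/scooters_db"}

scooters_table = SQLTableDataSet(
    table_name="scooters",
    credentials=credentials,  # must contain the top-level `con` key
    load_args={"index_col": ["name"]},
)

scooters = scooters_table.load()  # returns a pandas DataFrame
```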


### Example 14: Loads data from an API endpoint, example US corn yield data from USDA
@@ -535,7 +535,7 @@ The code API allows you to:

### Configure a Data Catalog

In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro.datasets).
In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).

```python
from kedro.io import DataCatalog
6 changes: 3 additions & 3 deletions docs/source/data/kedro_io.md
@@ -41,8 +41,8 @@ For contributors, if you would like to submit a new dataset, you must extend the
In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also:
1. extend `kedro.io.core.AbstractVersionedDataSet` AND
2. add `version` namedtuple as an argument to its `__init__` method AND
3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro.datasets.pandas.CSVDataSet) as an example) AND
4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro.datasets.pandas.CSVDataSet) for an example implementation)
3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND
4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation)
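
As an illustrative aside (not part of this commit), a minimal sketch that satisfies the four requirements above; the class name and file handling are hypothetical and abridged:

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import pandas as pd

from kedro.io.core import (
    AbstractVersionedDataSet,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)


class MyCSVDataSet(AbstractVersionedDataSet):  # requirement 1: extend the base class
    def __init__(self, filepath: str, version: Version = None):  # requirement 2
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._fs = fsspec.filesystem(protocol)
        # Requirement 3: pass filepath, version and the filesystem's
        # exists/glob functions to the parent constructor.
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

    # Requirement 4: _load/_save resolve the versioned load/save paths.
    def _load(self) -> pd.DataFrame:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        with self._fs.open(load_path, mode="r") as f:
            return pd.read_csv(f)

    def _save(self, data: pd.DataFrame) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        with self._fs.open(save_path, mode="w") as f:
            data.to_csv(f, index=False)

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, version=self._version)
```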

```{note}
If a new version of a dataset is created mid-run, for instance by an external system adding new files, it will not interfere in the current run, i.e. the load version stays the same throughout subsequent loads.
@@ -239,7 +239,7 @@ Although HTTP(S) is a supported file system in the dataset implementations, it d

## Partitioned dataset

These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro.datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible.
These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible.

This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.PartitionedDataSet), with the following features:

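As an illustrative aside (not part of the diff), a minimal usage sketch of `PartitionedDataSet`; the directory, dataset type and suffix below are hypothetical:

```python
from kedro.io import PartitionedDataSet

shipments = PartitionedDataSet(
    path="data/01_raw/shipments/",  # directory of uniform CSV files
    dataset="pandas.CSVDataSet",    # dataset used to load each partition
    filename_suffix=".csv",
)

partitions = shipments.load()  # dict: partition id -> lazy load callable
for partition_id, load_partition in partitions.items():
    partition_df = load_partition()  # materialise a single partition
```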
4 changes: 2 additions & 2 deletions docs/source/extend_kedro/common_use_cases.md
@@ -4,15 +4,15 @@ Kedro has a few built-in mechanisms for you to extend its behaviour. This docume

## Use Case 1: How to add extra behaviour to Kedro's execution timeline

The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [DataSets](/kedro.datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext).
The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [DataSets](/kedro_datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext).

At different points in the lifecycle of these components, you might want to add extra behaviour: for example, you could add extra computation for profiling purposes _before_ and _after_ a node runs, or _before_ and _after_ the I/O actions of a dataset, namely the `load` and `save` actions.

This can now be achieved by using [Hooks](../hooks/introduction.md) to define the extra behaviour and when in the execution timeline it should be introduced.
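
As an illustrative aside (not part of the diff), a minimal sketch of a Hook that adds timing before and after each node run; the class name is hypothetical:

```python
import logging
import time

from kedro.framework.hooks import hook_impl


class NodeTimerHooks:
    """Record how long each node takes to run."""

    def __init__(self):
        self._start_times = {}

    @hook_impl
    def before_node_run(self, node):
        self._start_times[node.name] = time.perf_counter()

    @hook_impl
    def after_node_run(self, node):
        elapsed = time.perf_counter() - self._start_times.pop(node.name)
        logging.getLogger(__name__).info("Node %r took %.2f seconds", node.name, elapsed)
```

Such a class would typically be registered through the `HOOKS` tuple in the project's `settings.py`.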

## Use Case 2: How to integrate Kedro with additional data sources

You can use [DataSets](/kedro.datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md).
You can use [DataSets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md).

## Use Case 3: How to add or modify CLI commands

8 changes: 4 additions & 4 deletions docs/source/extend_kedro/custom_datasets.md
@@ -1,6 +1,6 @@
# Custom datasets

[Kedro supports many datasets](/kedro.datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data.
[Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data.
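
As an illustrative aside (not part of this commit), the bare skeleton such a tutorial dataset starts from; the class below is hypothetical and abridged:

```python
from typing import Any, Dict

import numpy as np

from kedro.io import AbstractDataSet


class ImageDataSet(AbstractDataSet):
    """Skeleton of a custom dataset for image data (illustrative only)."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> np.ndarray:
        ...  # read the image at self._filepath and return it as an array

    def _save(self, data: np.ndarray) -> None:
        ...  # write the array back to self._filepath

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath)
```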

## Scenario

@@ -504,7 +504,7 @@ You may also want to consult the [in-depth documentation about the Versioning AP

Kedro datasets should work with the [SequentialRunner](/kedro.runner.SequentialRunner) and the [ParallelRunner](/kedro.runner.ParallelRunner), so they must be fully serialisable by the [Python multiprocessing package](https://docs.python.org/3/library/multiprocessing.html). This means that your datasets should not make use of lambda functions, nested functions, closures etc. If you are using custom decorators, you need to ensure that they are using [`functools.wraps()`](https://docs.python.org/3/library/functools.html#functools.wraps).

There is one dataset that is an exception: [SparkDataSet](/kedro.datasets.spark.SparkDataSet). The explanation for this exception is that [Apache Spark](https://spark.apache.org/) uses its own parallelism and therefore doesn't work with Kedro [ParallelRunner](/kedro.runner.ParallelRunner). For parallelism within a Kedro project that leverages Spark please consider the alternative [ThreadRunner](/kedro.runner.ThreadRunner).
There is one dataset that is an exception: [SparkDataSet](/kedro_datasets.spark.SparkDataSet). The explanation for this exception is that [Apache Spark](https://spark.apache.org/) uses its own parallelism and therefore doesn't work with Kedro [ParallelRunner](/kedro.runner.ParallelRunner). For parallelism within a Kedro project that leverages Spark please consider the alternative [ThreadRunner](/kedro.runner.ThreadRunner).

To verify whether your dataset is serialisable by `multiprocessing`, use the console or an IPython session to try dumping it using `multiprocessing.reduction.ForkingPickler`:
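
(The snippet below is an illustrative sketch rather than the one from the documentation; the dataset and path are hypothetical.)

```python
from multiprocessing.reduction import ForkingPickler

from kedro_datasets.pandas import CSVDataSet

dataset = CSVDataSet(filepath="data/01_raw/cars.csv")
ForkingPickler.dumps(dataset)  # raises if the dataset cannot be pickled
```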

@@ -562,7 +562,7 @@ class ImageDataSet(AbstractVersionedDataSet):
...
```
We provide additional examples of [how to use parameters through the data catalog's YAML API](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro.datasets.spark.SparkDataSet)'s implementation.
We provide additional examples of [how to use parameters through the data catalog's YAML API](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro_datasets.spark.SparkDataSet)'s implementation.
## How to contribute a custom dataset implementation
@@ -592,7 +592,7 @@ kedro-plugins/kedro-datasets/kedro_datasets/image
```{note}
There are two special considerations when contributing a dataset:
1. Add the dataset to `kedro.datasets.rst` so it shows up in the API documentation.
1. Add the dataset to `kedro_datasets.rst` so it shows up in the API documentation.
2. Add the dataset to `static/jsonschema/kedro-catalog-X.json` for IDE validation.
```
2 changes: 1 addition & 1 deletion docs/source/get_started/kedro_concepts.md
@@ -55,7 +55,7 @@ greeting_pipeline = pipeline([return_greeting_node, join_statements_node])

The Kedro Data Catalog is the registry of all data sources that the project can use to manage loading and saving data. It maps the names of node inputs and outputs as keys in a `DataCatalog`, a Kedro class that can be specialised for different types of data storage.

[Kedro provides different built-in datasets](/kedro.datasets) for numerous file types and file systems, so you don’t have to write the logic for reading/writing data.
[Kedro provides different built-in datasets](/kedro_datasets) for numerous file types and file systems, so you don’t have to write the logic for reading/writing data.
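
As an illustrative aside (not part of the diff), a minimal sketch of how node input/output names map to dataset instances in a `DataCatalog`; the names and file path are hypothetical:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro_datasets.pandas import CSVDataSet

catalog = DataCatalog(
    data_sets={
        "companies": CSVDataSet(filepath="data/01_raw/companies.csv"),
        "greeting": MemoryDataSet(),
    }
)

companies = catalog.load("companies")  # nodes refer to datasets by these names
```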

## Kedro project directory structure

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -148,6 +148,7 @@ API documentation
:recursive:

kedro
kedro_datasets

Indices and tables
==================
8 changes: 4 additions & 4 deletions docs/source/integrations/pyspark_integration.md
@@ -66,10 +66,10 @@ HOOKS = (SparkHooks(),)

We recommend using Kedro's built-in Spark datasets to load raw data into Spark's [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html), as well as to write them back to storage. Some of our built-in Spark datasets include:

* [spark.DeltaTableDataSet](/kedro.datasets.spark.DeltaTableDataSet)
* [spark.SparkDataSet](/kedro.datasets.spark.SparkDataSet)
* [spark.SparkJDBCDataSet](/kedro.datasets.spark.SparkJDBCDataSet)
* [spark.SparkHiveDataSet](/kedro.datasets.spark.SparkHiveDataSet)
* [spark.DeltaTableDataSet](/kedro_datasets.spark.DeltaTableDataSet)
* [spark.SparkDataSet](/kedro_datasets.spark.SparkDataSet)
* [spark.SparkJDBCDataSet](/kedro_datasets.spark.SparkJDBCDataSet)
* [spark.SparkHiveDataSet](/kedro_datasets.spark.SparkHiveDataSet)

The example below illustrates how to use `spark.SparkDataSet` to read a CSV file located in S3 into a `DataFrame` in `conf/base/catalog.yml`:

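As an illustrative aside (not part of the diff), a programmatic sketch equivalent to such a `catalog.yml` entry; the bucket path and options below are hypothetical:

```python
from kedro_datasets.spark import SparkDataSet

weather = SparkDataSet(
    filepath="s3a://your_bucket/data/01_raw/weather*",
    file_format="csv",
    load_args={"header": True, "inferSchema": True},
    save_args={"sep": "|", "header": True},
)

weather_df = weather.load()  # returns a pyspark.sql.DataFrame
```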
52 changes: 0 additions & 52 deletions docs/source/kedro.datasets.rst

This file was deleted.

20 changes: 20 additions & 0 deletions docs/source/kedro.extras.rst
@@ -0,0 +1,20 @@
kedro.extras
============

.. rubric:: Description

.. automodule:: kedro.extras

.. rubric:: Modules

.. autosummary::
:toctree:
:recursive:

kedro.extras.extensions
kedro.extras.logging

.. toctree::
:hidden:

kedro.extras.datasets
52 changes: 52 additions & 0 deletions docs/source/kedro_datasets.rst
@@ -0,0 +1,52 @@
kedro_datasets
==============

.. rubric:: Description

.. automodule:: kedro_datasets

.. rubric:: Classes

.. autosummary::
:toctree:
:template: autosummary/class.rst

kedro_datasets.api.APIDataSet
kedro_datasets.biosequence.BioSequenceDataSet
kedro_datasets.dask.ParquetDataSet
kedro_datasets.email.EmailMessageDataSet
kedro_datasets.geopandas.GeoJSONDataSet
kedro_datasets.holoviews.HoloviewsWriter
kedro_datasets.json.JSONDataSet
kedro_datasets.matplotlib.MatplotlibWriter
kedro_datasets.networkx.GMLDataSet
kedro_datasets.networkx.GraphMLDataSet
kedro_datasets.networkx.JSONDataSet
kedro_datasets.pandas.CSVDataSet
kedro_datasets.pandas.ExcelDataSet
kedro_datasets.pandas.FeatherDataSet
kedro_datasets.pandas.GBQQueryDataSet
kedro_datasets.pandas.GBQTableDataSet
kedro_datasets.pandas.GenericDataSet
kedro_datasets.pandas.HDFDataSet
kedro_datasets.pandas.JSONDataSet
kedro_datasets.pandas.ParquetDataSet
kedro_datasets.pandas.SQLQueryDataSet
kedro_datasets.pandas.SQLTableDataSet
kedro_datasets.pandas.XMLDataSet
kedro_datasets.pickle.PickleDataSet
kedro_datasets.pillow.ImageDataSet
kedro_datasets.plotly.JSONDataSet
kedro_datasets.plotly.PlotlyDataSet
kedro_datasets.redis.PickleDataSet
kedro_datasets.spark.DeltaTableDataSet
kedro_datasets.spark.SparkDataSet
kedro_datasets.spark.SparkHiveDataSet
kedro_datasets.spark.SparkJDBCDataSet
kedro_datasets.svmlight.SVMLightDataSet
kedro_datasets.tensorflow.TensorFlowModelDataset
kedro_datasets.text.TextDataSet
kedro_datasets.tracking.JSONDataSet
kedro_datasets.tracking.MetricsDataSet
kedro_datasets.video.VideoDataSet
kedro_datasets.yaml.YAMLDataSet
