Deprecate kedro.extras.datasets and add top-level docs for `kedro_datasets` (kedro-org#2546)

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
astrojuanlu authored Apr 28, 2023
1 parent a95ce7a commit a7d0e7d
Showing 24 changed files with 123 additions and 96 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -138,6 +138,7 @@ venv.bak/
# Additional files created by sphinx.ext.autosummary
# Some of them are actually tracked to control the output
/docs/source/kedro.*
/docs/source/kedro_datasets.*

# mypy
.mypy_cache/
2 changes: 0 additions & 2 deletions .readthedocs.yml
@@ -5,7 +5,6 @@
# Required
version: 2

# .readthedocs.yml hook to copy kedro-datasets to kedro.datasets before building the docs
build:
os: ubuntu-22.04
tools:
@@ -16,7 +15,6 @@ build:
jobs:
post_create_environment:
- npm install -g @mermaid-js/mermaid-cli
- ./docs/kedro-datasets-docs.sh
pre_build:
- python -m sphinx -WETan -j auto -D language=en -b linkcheck -d _build/doctrees docs/source _build/linkcheck

3 changes: 2 additions & 1 deletion RELEASE.md
@@ -22,11 +22,12 @@
### Documentation changes
* Improvements to the Sphinx toolchain, including an upgrade to a newer version.
* Improvements to documentation on visualising Kedro projects on Databricks, and additional documentation about the development workflow for Kedro projects on Databricks.
* Updated Technnical Steering Committee membership documentation.
* Updated Technical Steering Committee membership documentation.
* Revised documentation section about linting and formatting and extended to give details of `flake8` configuration.
* Updated table of contents for documentation to reduce scrolling.
* Expanded FAQ documentation.
* Added a 404 page to documentation.
* Added deprecation warnings about the removal of `kedro.extras.datasets`.

## Breaking changes to the API

4 changes: 0 additions & 4 deletions docs/build-docs.sh
@@ -7,10 +7,6 @@ set -o nounset

action=$1

# Reinstall kedro-datasets locally
rm -rf kedro/datasets
bash docs/kedro-datasets-docs.sh

if [ "$action" == "linkcheck" ]; then
sphinx-build -WETan -j auto -D language=en -b linkcheck -d docs/build/doctrees docs/source docs/build/linkcheck
elif [ "$action" == "docs" ]; then
13 changes: 0 additions & 13 deletions docs/kedro-datasets-docs.sh

This file was deleted.

4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -131,7 +131,7 @@
"integer -- return number of occurrences of value",
"integer -- return first index of value.",
"kedro.extras.datasets.pandas.json_dataset.JSONDataSet",
"kedro.datasets.pandas.json_dataset.JSONDataSet",
"kedro_datasets.pandas.json_dataset.JSONDataSet",
"pluggy._manager.PluginManager",
"_DI",
"_DO",
@@ -309,7 +309,7 @@
"kedro.config",
"kedro.extras.datasets",
"kedro.extras.logging",
"kedro.datasets",
"kedro_datasets",
]


6 changes: 3 additions & 3 deletions docs/source/data/data_catalog.md
@@ -2,7 +2,7 @@

This section introduces `catalog.yml`, the project-shareable Data Catalog. The file is located in `conf/base` and is a registry of all data sources available for use by a project; it manages loading and saving of data.

All supported data connectors are available in [`kedro-datasets`](/kedro.datasets).
All supported data connectors are available in [`kedro-datasets`](/kedro_datasets).
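
As an illustrative aside (not part of the diff above), the same registry can also be built from a plain dictionary with `DataCatalog.from_config`; the entry name and file path below are hypothetical:

```python
from kedro.io import DataCatalog

# Programmatic equivalent of a small catalog.yml entry (illustrative only)
catalog_config = {
    "bikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/bikes.csv",
    },
}

catalog = DataCatalog.from_config(catalog_config)
bikes = catalog.load("bikes")  # returns a pandas DataFrame
```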

## Use the Data Catalog within Kedro configuration

@@ -261,7 +261,7 @@ scooters_query:
index_col: [name]
```

When you use [`pandas.SQLTableDataSet`](/kedro.datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro.datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only).
When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only).
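
As an illustrative aside (not part of the diff), this is roughly how the `con` key is consumed when the dataset is instantiated directly in Python; the connection string, table name and index column below are hypothetical:

```python
from kedro_datasets.pandas import SQLTableDataSet

# Hypothetical credentials; `con` must be a SQLAlchemy-compatible connection string
credentials = {"con": "postgresql://user:password@localhost:5432/scooters_db"}

scooters_table = SQLTableDataSet(
    table_name="scooters",
    credentials=credentials,  # must contain the top-level `con` key
    load_args={"index_col": ["name"]},
)

scooters = scooters_table.load()  # returns a pandas DataFrame
```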


### Example 14: Loads data from an API endpoint, example US corn yield data from USDA
@@ -535,7 +535,7 @@ The code API allows you to:

### Configure a Data Catalog

In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro.datasets).
In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).

```python
from kedro.io import DataCatalog
6 changes: 3 additions & 3 deletions docs/source/data/kedro_io.md
@@ -41,8 +41,8 @@ For contributors, if you would like to submit a new dataset, you must extend the
In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also:
1. extend `kedro.io.core.AbstractVersionedDataSet` AND
2. add `version` namedtuple as an argument to its `__init__` method AND
3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro.datasets.pandas.CSVDataSet) as an example) AND
4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro.datasets.pandas.CSVDataSet) for an example implementation)
3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND
4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation)
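
As an illustrative aside (not part of this commit), a minimal sketch that satisfies the four requirements above; the class name and file handling are hypothetical and abridged:

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import pandas as pd

from kedro.io.core import (
    AbstractVersionedDataSet,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)


class MyCSVDataSet(AbstractVersionedDataSet):  # requirement 1: extend the base class
    def __init__(self, filepath: str, version: Version = None):  # requirement 2
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._fs = fsspec.filesystem(protocol)
        # Requirement 3: pass filepath, version and the filesystem's
        # exists/glob functions to the parent constructor.
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

    # Requirement 4: _load/_save resolve the versioned load/save paths.
    def _load(self) -> pd.DataFrame:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        with self._fs.open(load_path, mode="r") as f:
            return pd.read_csv(f)

    def _save(self, data: pd.DataFrame) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        with self._fs.open(save_path, mode="w") as f:
            data.to_csv(f, index=False)

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, version=self._version)
```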

```{note}
If a new version of a dataset is created mid-run, for instance by an external system adding new files, it will not interfere in the current run, i.e. the load version stays the same throughout subsequent loads.
@@ -239,7 +239,7 @@ Although HTTP(S) is a supported file system in the dataset implementations, it d

## Partitioned dataset

These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro.datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible.
These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible.

This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.PartitionedDataSet), with the following features:

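As an illustrative aside (not part of the diff), a minimal usage sketch of `PartitionedDataSet`; the directory, dataset type and suffix below are hypothetical:

```python
from kedro.io import PartitionedDataSet

shipments = PartitionedDataSet(
    path="data/01_raw/shipments/",  # directory of uniform CSV files
    dataset="pandas.CSVDataSet",    # dataset used to load each partition
    filename_suffix=".csv",
)

partitions = shipments.load()  # dict: partition id -> lazy load callable
for partition_id, load_partition in partitions.items():
    partition_df = load_partition()  # materialise a single partition
```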
4 changes: 2 additions & 2 deletions docs/source/extend_kedro/common_use_cases.md
@@ -4,15 +4,15 @@ Kedro has a few built-in mechanisms for you to extend its behaviour. This docume

## Use Case 1: How to add extra behaviour to Kedro's execution timeline

The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [DataSets](/kedro.datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext).
The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [DataSets](/kedro_datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext).

At different points in the lifecycle of these components, you might want to add extra behaviour: for example, you could add extra computation for profiling purposes _before_ and _after_ a node runs, or _before_ and _after_ the I/O actions of a dataset, namely the `load` and `save` actions.

This can now be achieved by using [Hooks](../hooks/introduction.md) to define the extra behaviour and when in the execution timeline it should be introduced.
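
As an illustrative aside (not part of the diff), a minimal sketch of a Hook that adds timing before and after each node run; the class name is hypothetical:

```python
import logging
import time

from kedro.framework.hooks import hook_impl


class NodeTimerHooks:
    """Record how long each node takes to run."""

    def __init__(self):
        self._start_times = {}

    @hook_impl
    def before_node_run(self, node):
        self._start_times[node.name] = time.perf_counter()

    @hook_impl
    def after_node_run(self, node):
        elapsed = time.perf_counter() - self._start_times.pop(node.name)
        logging.getLogger(__name__).info("Node %r took %.2f seconds", node.name, elapsed)
```

Such a class would typically be registered through the `HOOKS` tuple in the project's `settings.py`.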

## Use Case 2: How to integrate Kedro with additional data sources

You can use [DataSets](/kedro.datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md).
You can use [DataSets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md).

## Use Case 3: How to add or modify CLI commands

8 changes: 4 additions & 4 deletions docs/source/extend_kedro/custom_datasets.md
@@ -1,6 +1,6 @@
# Custom datasets

[Kedro supports many datasets](/kedro.datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data.
[Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data.
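
As an illustrative aside (not part of this commit), the bare skeleton such a tutorial dataset starts from; the class below is hypothetical and abridged:

```python
from typing import Any, Dict

import numpy as np

from kedro.io import AbstractDataSet


class ImageDataSet(AbstractDataSet):
    """Skeleton of a custom dataset for image data (illustrative only)."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> np.ndarray:
        ...  # read the image at self._filepath and return it as an array

    def _save(self, data: np.ndarray) -> None:
        ...  # write the array back to self._filepath

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath)
```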

## Scenario

@@ -504,7 +504,7 @@ You may also want to consult the [in-depth documentation about the Versioning AP

Kedro datasets should work with the [SequentialRunner](/kedro.runner.SequentialRunner) and the [ParallelRunner](/kedro.runner.ParallelRunner), so they must be fully serialisable by the [Python multiprocessing package](https://docs.python.org/3/library/multiprocessing.html). This means that your datasets should not make use of lambda functions, nested functions, closures etc. If you are using custom decorators, you need to ensure that they are using [`functools.wraps()`](https://docs.python.org/3/library/functools.html#functools.wraps).

There is one dataset that is an exception: [SparkDataSet](/kedro.datasets.spark.SparkDataSet). The explanation for this exception is that [Apache Spark](https://spark.apache.org/) uses its own parallelism and therefore doesn't work with Kedro [ParallelRunner](/kedro.runner.ParallelRunner). For parallelism within a Kedro project that leverages Spark please consider the alternative [ThreadRunner](/kedro.runner.ThreadRunner).
There is one dataset that is an exception: [SparkDataSet](/kedro_datasets.spark.SparkDataSet). The explanation for this exception is that [Apache Spark](https://spark.apache.org/) uses its own parallelism and therefore doesn't work with Kedro [ParallelRunner](/kedro.runner.ParallelRunner). For parallelism within a Kedro project that leverages Spark please consider the alternative [ThreadRunner](/kedro.runner.ThreadRunner).

To verify whether your dataset is serialisable by `multiprocessing`, use the console or an IPython session to try dumping it using `multiprocessing.reduction.ForkingPickler`:
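
(The snippet below is an illustrative sketch rather than the one from the documentation; the dataset and path are hypothetical.)

```python
from multiprocessing.reduction import ForkingPickler

from kedro_datasets.pandas import CSVDataSet

dataset = CSVDataSet(filepath="data/01_raw/cars.csv")
ForkingPickler.dumps(dataset)  # raises if the dataset cannot be pickled
```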

@@ -562,7 +562,7 @@ class ImageDataSet(AbstractVersionedDataSet):
...
```
We provide additional examples of [how to use parameters through the data catalog's YAML API](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro.datasets.spark.SparkDataSet)'s implementation.
We provide additional examples of [how to use parameters through the data catalog's YAML API](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro_datasets.spark.SparkDataSet)'s implementation.
## How to contribute a custom dataset implementation
@@ -592,7 +592,7 @@ kedro-plugins/kedro-datasets/kedro_datasets/image
```{note}
There are two special considerations when contributing a dataset:
1. Add the dataset to `kedro.datasets.rst` so it shows up in the API documentation.
1. Add the dataset to `kedro_datasets.rst` so it shows up in the API documentation.
2. Add the dataset to `static/jsonschema/kedro-catalog-X.json` for IDE validation.
```
2 changes: 1 addition & 1 deletion docs/source/get_started/kedro_concepts.md
@@ -55,7 +55,7 @@ greeting_pipeline = pipeline([return_greeting_node, join_statements_node])

The Kedro Data Catalog is the registry of all data sources that the project can use to manage loading and saving data. It maps the names of node inputs and outputs as keys in a `DataCatalog`, a Kedro class that can be specialised for different types of data storage.

[Kedro provides different built-in datasets](/kedro.datasets) for numerous file types and file systems, so you don’t have to write the logic for reading/writing data.
[Kedro provides different built-in datasets](/kedro_datasets) for numerous file types and file systems, so you don’t have to write the logic for reading/writing data.
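
As an illustrative aside (not part of the diff), a minimal sketch of how node input/output names map to dataset instances in a `DataCatalog`; the names and file path are hypothetical:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro_datasets.pandas import CSVDataSet

catalog = DataCatalog(
    data_sets={
        "companies": CSVDataSet(filepath="data/01_raw/companies.csv"),
        "greeting": MemoryDataSet(),
    }
)

companies = catalog.load("companies")  # nodes refer to datasets by these names
```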

## Kedro project directory structure

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -148,6 +148,7 @@ API documentation
:recursive:

kedro
kedro_datasets

Indices and tables
==================
8 changes: 4 additions & 4 deletions docs/source/integrations/pyspark_integration.md
@@ -66,10 +66,10 @@ HOOKS = (SparkHooks(),)

We recommend using Kedro's built-in Spark datasets to load raw data into Spark's [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html), as well as to write them back to storage. Some of our built-in Spark datasets include:

* [spark.DeltaTableDataSet](/kedro.datasets.spark.DeltaTableDataSet)
* [spark.SparkDataSet](/kedro.datasets.spark.SparkDataSet)
* [spark.SparkJDBCDataSet](/kedro.datasets.spark.SparkJDBCDataSet)
* [spark.SparkHiveDataSet](/kedro.datasets.spark.SparkHiveDataSet)
* [spark.DeltaTableDataSet](/kedro_datasets.spark.DeltaTableDataSet)
* [spark.SparkDataSet](/kedro_datasets.spark.SparkDataSet)
* [spark.SparkJDBCDataSet](/kedro_datasets.spark.SparkJDBCDataSet)
* [spark.SparkHiveDataSet](/kedro_datasets.spark.SparkHiveDataSet)

The example below illustrates how to use `spark.SparkDataSet` to read a CSV file located in S3 into a `DataFrame` in `conf/base/catalog.yml`:

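As an illustrative aside (not part of the diff), a programmatic sketch equivalent to such a `catalog.yml` entry; the bucket path and options below are hypothetical:

```python
from kedro_datasets.spark import SparkDataSet

weather = SparkDataSet(
    filepath="s3a://your_bucket/data/01_raw/weather*",
    file_format="csv",
    load_args={"header": True, "inferSchema": True},
    save_args={"sep": "|", "header": True},
)

weather_df = weather.load()  # returns a pyspark.sql.DataFrame
```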
52 changes: 0 additions & 52 deletions docs/source/kedro.datasets.rst

This file was deleted.

20 changes: 20 additions & 0 deletions docs/source/kedro.extras.rst
@@ -0,0 +1,20 @@
kedro.extras
============

.. rubric:: Description

.. automodule:: kedro.extras

.. rubric:: Modules

.. autosummary::
:toctree:
:recursive:

kedro.extras.extensions
kedro.extras.logging

.. toctree::
:hidden:

kedro.extras.datasets
52 changes: 52 additions & 0 deletions docs/source/kedro_datasets.rst
@@ -0,0 +1,52 @@
kedro_datasets
==============

.. rubric:: Description

.. automodule:: kedro_datasets

.. rubric:: Classes

.. autosummary::
:toctree:
:template: autosummary/class.rst

kedro_datasets.api.APIDataSet
kedro_datasets.biosequence.BioSequenceDataSet
kedro_datasets.dask.ParquetDataSet
kedro_datasets.email.EmailMessageDataSet
kedro_datasets.geopandas.GeoJSONDataSet
kedro_datasets.holoviews.HoloviewsWriter
kedro_datasets.json.JSONDataSet
kedro_datasets.matplotlib.MatplotlibWriter
kedro_datasets.networkx.GMLDataSet
kedro_datasets.networkx.GraphMLDataSet
kedro_datasets.networkx.JSONDataSet
kedro_datasets.pandas.CSVDataSet
kedro_datasets.pandas.ExcelDataSet
kedro_datasets.pandas.FeatherDataSet
kedro_datasets.pandas.GBQQueryDataSet
kedro_datasets.pandas.GBQTableDataSet
kedro_datasets.pandas.GenericDataSet
kedro_datasets.pandas.HDFDataSet
kedro_datasets.pandas.JSONDataSet
kedro_datasets.pandas.ParquetDataSet
kedro_datasets.pandas.SQLQueryDataSet
kedro_datasets.pandas.SQLTableDataSet
kedro_datasets.pandas.XMLDataSet
kedro_datasets.pickle.PickleDataSet
kedro_datasets.pillow.ImageDataSet
kedro_datasets.plotly.JSONDataSet
kedro_datasets.plotly.PlotlyDataSet
kedro_datasets.redis.PickleDataSet
kedro_datasets.spark.DeltaTableDataSet
kedro_datasets.spark.SparkDataSet
kedro_datasets.spark.SparkHiveDataSet
kedro_datasets.spark.SparkJDBCDataSet
kedro_datasets.svmlight.SVMLightDataSet
kedro_datasets.tensorflow.TensorFlowModelDataset
kedro_datasets.text.TextDataSet
kedro_datasets.tracking.JSONDataSet
kedro_datasets.tracking.MetricsDataSet
kedro_datasets.video.VideoDataSet
kedro_datasets.yaml.YAMLDataSet
