Commit 2bbba19

Merge pull request #1231 from Kotlin/unsupported-data-sources-examples
IDE sample of "unsupported sources" -> DataFrame

2 parents: 8d28a31 + f0bf04a
File tree: 24 files changed, +1738 −23 lines

README.md

Lines changed: 5 additions & 3 deletions
```diff
@@ -11,14 +11,16 @@
 Kotlin DataFrame aims to reconcile Kotlin's static typing with the dynamic nature of data by utilizing both the full power of the Kotlin language and the opportunities provided by intermittent code execution in Jupyter notebooks and REPL.

 * **Hierarchical** — represents hierarchical data structures, such as JSON or a tree of JVM objects.
-* **Functional** — data processing pipeline is organized in a chain of `DataFrame` transformation operations. Every operation returns a new instance of `DataFrame` reusing underlying storage wherever it's possible.
+* **Functional** — the data processing pipeline is organized in a chain of `DataFrame` transformation operations.
+* **Immutable** — every operation returns a new instance of `DataFrame` reusing underlying storage wherever it's possible.
 * **Readable** — data transformation operations are defined in DSL close to natural language.
 * **Practical** — provides simple solutions for common problems and the ability to perform complex tasks.
 * **Minimalistic** — simple, yet powerful data model of three column kinds.
-* **Interoperable** — convertable with Kotlin data classes and collections.
+* **Interoperable** — convertable with Kotlin data classes and collections. This also means conversion to/from other libraries' data structures is usually quite straightforward!
 * **Generic** — can store objects of any type, not only numbers or strings.
-* **Typesafe** — on-the-fly generation of extension properties for type safe data access with Kotlin-style care for null safety.
+* **Typesafe** — on-the-fly [generation of extension properties](https://kotlin.github.io/dataframe/extensionpropertiesapi.html) for type safe data access with Kotlin-style care for null safety.
 * **Polymorphic** — type compatibility derives from column schema compatibility. You can define a function that requires a special subset of columns in a dataframe but doesn't care about other columns.
+  In notebooks this works out-of-the-box. In ordinary projects this requires casting (for now).

 Integrates with [Kotlin Notebook](https://kotlinlang.org/docs/kotlin-notebook-overview.html).
 Inspired by [krangl](https://github.com/holgerbrandl/krangl), Kotlin Collections and [pandas](https://pandas.pydata.org/)
```
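The "Typesafe" and "Readable" bullets in this README diff can be illustrated with a short sketch. This is a hypothetical notebook-style snippet, not part of the commit: the CSV path and column names (`name`, `age`) are illustrative, and the generated accessors assume a context (notebook or compiler plugin) where extension properties exist.

```kotlin
// A sketch, assuming a CSV with columns `name` and `age`, read in a
// Kotlin Notebook where extension properties are generated on the fly:
val df = DataFrame.readCsv("people.csv")

df.filter { age >= 18 }   // `age` is a generated, typed accessor (Int)
  .sortBy { name }        // misspelling the column name fails at compile time
```

The point of the sketch: column access goes through compile-checked properties rather than string lookups, which is what "Kotlin-style care for null safety" buys over `df["age"]`.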

build.gradle.kts

Lines changed: 1 addition & 1 deletion
```diff
@@ -196,7 +196,7 @@ allprojects {
         logger.warn("Could not set ktlint config on :${this.name}")
     }

-    // set the java toolchain version to 11 for all subprojects for CI stability
+    // set the java toolchain version to 21 for all subprojects for CI stability
     extensions.findByType<KotlinJvmProjectExtension>()?.jvmToolchain(21)

     // Attempts to configure buildConfig for each sub-project that uses it
```

docs/StardustDocs/topics/guides/Guides-And-Examples.md

Lines changed: 9 additions & 0 deletions
```diff
@@ -49,8 +49,17 @@ and make working with your data both convenient and type-safe.
 — explore the GeoDataFrame module that brings a convenient Kotlin DataFrame API to geospatial workflows,
 enhanced with beautiful Kandy-Geo visualizations (*experimental*).

 <img src="geoguide_preview.png" border-effect="rounded" width="705"/>

+* [Using Unsupported Data Sources](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples):
+  — A guide by examples. While these might one day become proper integrations of DataFrame, for now,
+  we provide them as examples for how to make such integrations yourself.
+  * [Apache Spark Interop](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/spark)
+  * [Apache Spark Interop (With Kotlin Spark API)](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/kotlinSpark)
+  * [Multik Interop](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/multik)
+  * [JetBrains Exposed Interop](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/exposed)
 * [OpenAPI Guide](https://github.com/Kotlin/dataframe/blob/master/examples/notebooks/json/KeyValueAndOpenApi.ipynb)
 — learn how to parse and explore [OpenAPI](https://swagger.io) JSON structures using Kotlin DataFrame,
 enabling structured access and intuitive analysis of complex API schemas (*experimental*, supports OpenAPI 3.0.0).
```

docs/StardustDocs/topics/overview.md

Lines changed: 20 additions & 17 deletions
```diff
@@ -36,30 +36,33 @@ The goal of data wrangling is to assure quality and useful data.

 ## Main Features and Concepts

-* [**Hierarchical**](hierarchical.md) — the Kotlin DataFrame library provides an ability to read and present data from different sources including not only plain **CSV** but also **JSON** or **[SQL databases](readSqlDatabases.md)**.
-  That’s why it has been designed hierarchical and allows nesting of columns and cells.
-
-* [**Interoperable**](collectionsInterop.md) — hierarchical data layout also opens a possibility of converting any objects
-  structure in application memory to a data frame and vice versa.
-
-* **Safe** — the Kotlin DataFrame library provides a mechanism of on-the-fly [**generation of extension properties**](extensionPropertiesApi.md)
+* [**Hierarchical**](hierarchical.md) — the Kotlin DataFrame library provides an ability to read and present data from different sources,
+  including not only plain **CSV** but also **JSON** or **[SQL databases](readSqlDatabases.md)**.
+  This is why it was designed to be hierarchical and allows nesting of columns and cells.
+* **Functional** — the data processing pipeline is organized in a chain of [`DataFrame`](DataFrame.md) transformation operations.
+* **Immutable** — every operation returns a new instance of [`DataFrame`](DataFrame.md) reusing underlying storage wherever it's possible.
+* **Readable** — data transformation operations are defined in DSL close to natural language.
+* **Practical** — provides simple solutions for common problems and the ability to perform complex tasks.
+* **Minimalistic** — simple, yet powerful data model of three [column kinds](DataColumn.md#column-kinds).
+* [**Interoperable**](collectionsInterop.md) — convertable with Kotlin data classes and collections.
+  This also means conversion to/from other libraries' data structures is usually quite straightforward!
+  See our [examples](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples)
+  for some conversions between DataFrame and [Apache Spark](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/spark), [Multik](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/multik), and [JetBrains Exposed](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/exposed).
+* **Generic** — can store objects of any type, not only numbers or strings.
+* **Typesafe** — the Kotlin DataFrame library provides a mechanism of on-the-fly [**generation of extension properties**](extensionPropertiesApi.md)
   that correspond to the columns of a data frame.
   In interactive notebooks like Jupyter or Datalore, the generation runs after each cell execution.
   In IntelliJ IDEA there's a Gradle plugin for generation properties based on CSV file or JSON file.
   Also, we’re working on a compiler plugin that infers and transforms [`DataFrame`](DataFrame.md) schema while typing.
   You can now clone this [project with many examples](https://github.com/koperagen/df-plugin-demo) showcasing how it allows you to reliably use our most convenient extension properties API.
   The generated properties ensure you’ll never misspell column name and don’t mess up with its type, and of course nullability is also preserved.
-
-* **Generic** — columns can store objects of any type, not only numbers or strings.
-
 * [**Polymorphic**](schemas.md) —
-  if all columns of [`DataFrame`](DataFrame.md) are presented in some other dataframes,
-  then the first one could be a superclass for latter.
-  Thus,
-  one can define a function on an interface with some set of columns
-  and then execute it in a safe way on any [`DataFrame`](DataFrame.md) which contains this set of columns.
-
-* **Immutable** — all operations on [`DataFrame`](DataFrame.md) produce new instance, while underlying data is reused wherever it's possible
+  if all columns of a [`DataFrame`](DataFrame.md) instance are presented in another dataframe,
+  then the first one will be seen as a superclass for the latter.
+  This means you can define a function on an interface with some set of columns
+  and then execute it safely on any [`DataFrame`](DataFrame.md) which contains this same set of columns.
+  In notebooks, this works out-of-the-box.
+  In ordinary projects, this requires casting (for now).

 ## Syntax
```
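The "Polymorphic" bullet in the overview diff lends itself to a small sketch. Everything here is hypothetical and not from the commit: the `Person` schema, column names, and the caller's `wideDf`; `cast` is the explicit conversion the text refers to with "requires casting (for now)", and the `age`/`name` accessors assume KSP-generated properties for the `@DataSchema` interface.

```kotlin
// A sketch of schema-based polymorphism, assuming generated accessors
// for the @DataSchema interface below:
@DataSchema
interface Person {
    val name: String
    val age: Int
}

// Callable on any DataFrame whose schema contains `name` and `age`,
// no matter what other columns it carries.
fun DataFrame<Person>.adults(): DataFrame<Person> = filter { age >= 18 }

// In an ordinary project, a wider frame is cast first (for now);
// in a notebook the compatibility is picked up automatically.
val result = wideDf.cast<Person>().adults()
```

The design point: compatibility is decided by the column schema, not by a nominal class hierarchy, so `adults()` works on any frame that structurally contains the `Person` columns.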
docs/StardustDocs/topics/schemasInheritance.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -18,7 +18,7 @@ New schema interface for `filtered` variable will be derived from previously gen
 interface DataFrameType1 : DataFrameType
 ```

-Extension properties for data access are generated only for new and overriden members of `DataFrameType1` interface:
+Extension properties for data access are generated only for new and overridden members of `DataFrameType1` interface:

 ```kotlin
 val ColumnsContainer<DataFrameType1>.age: DataColumn<Int> get() = this["age"] as DataColumn<Int>
````

examples/README.md

Lines changed: 12 additions & 0 deletions
```diff
@@ -9,6 +9,18 @@
 * [json](idea-examples/json) Using OpenAPI support in DataFrame's Gradle and KSP plugins to access data from [API guru](https://apis.guru/) in a type-safe manner
 * [imdb sql database](https://github.com/zaleslaw/KotlinDataFrame-SQL-Examples) This project prominently showcases how to convert data from an SQL table to a Kotlin DataFrame
   and how to transform the result of an SQL query into a DataFrame.
+* [unsupported-data-sources](idea-examples/unsupported-data-sources) Showcases of how to use DataFrame with
+  (momentarily) unsupported data libraries such as [Spark](https://spark.apache.org/) and [Exposed](https://github.com/JetBrains/Exposed).
+  They show how to convert to and from Kotlin Dataframe and their respective tables.
+  * **JetBrains Exposed**: See the [exposed folder](./idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/exposed)
+    for an example of using Kotlin Dataframe with [Exposed](https://github.com/JetBrains/Exposed).
+  * **Apache Spark**: See the [spark folder](./idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/spark)
+    for an example of using Kotlin Dataframe with [Spark](https://spark.apache.org/).
+  * **Spark (with Kotlin Spark API)**: See the [kotlinSpark folder](./idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/kotlinSpark)
+    for an example of using Kotlin DataFrame with the [Kotlin Spark API](https://github.com/JetBrains/kotlin-spark-api).
+  * **Multik**: See the [multik folder](./idea-examples/unsupported-data-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/examples/multik)
+    for an example of using Kotlin Dataframe with [Multik](https://github.com/Kotlin/multik).

 ### Notebook examples
```

Lines changed: 73 additions & 0 deletions
New file (all 73 lines added):

```kotlin
plugins {
    application
    kotlin("jvm")

    id("org.jetbrains.kotlinx.dataframe")

    // only mandatory if `kotlin.dataframe.add.ksp=false` in gradle.properties
    id("com.google.devtools.ksp")
}

repositories {
    mavenLocal() // in case of local dataframe development
    mavenCentral()
}

dependencies {
    // implementation("org.jetbrains.kotlinx:dataframe:X.Y.Z")
    implementation(project(":"))

    // exposed + sqlite database support
    implementation(libs.sqlite)
    implementation(libs.exposed.core)
    implementation(libs.exposed.kotlin.datetime)
    implementation(libs.exposed.jdbc)
    implementation(libs.exposed.json)
    implementation(libs.exposed.money)

    // (kotlin) spark support
    implementation(libs.kotlin.spark)
    compileOnly(libs.spark)
    implementation(libs.log4j.core)
    implementation(libs.log4j.api)

    // multik support
    implementation(libs.multik.core)
    implementation(libs.multik.default)
}

/**
 * Runs the kotlinSpark/typedDataset example with java 11.
 */
val runKotlinSparkTypedDataset by tasks.registering(JavaExec::class) {
    classpath = sourceSets["main"].runtimeClasspath
    javaLauncher = javaToolchains.launcherFor { languageVersion = JavaLanguageVersion.of(11) }
    mainClass = "org.jetbrains.kotlinx.dataframe.examples.kotlinSpark.TypedDatasetKt"
}

/**
 * Runs the kotlinSpark/untypedDataset example with java 11.
 */
val runKotlinSparkUntypedDataset by tasks.registering(JavaExec::class) {
    classpath = sourceSets["main"].runtimeClasspath
    javaLauncher = javaToolchains.launcherFor { languageVersion = JavaLanguageVersion.of(11) }
    mainClass = "org.jetbrains.kotlinx.dataframe.examples.kotlinSpark.UntypedDatasetKt"
}

/**
 * Runs the spark/typedDataset example with java 11.
 */
val runSparkTypedDataset by tasks.registering(JavaExec::class) {
    classpath = sourceSets["main"].runtimeClasspath
    javaLauncher = javaToolchains.launcherFor { languageVersion = JavaLanguageVersion.of(11) }
    mainClass = "org.jetbrains.kotlinx.dataframe.examples.spark.TypedDatasetKt"
}

/**
 * Runs the spark/untypedDataset example with java 11.
 */
val runSparkUntypedDataset by tasks.registering(JavaExec::class) {
    classpath = sourceSets["main"].runtimeClasspath
    javaLauncher = javaToolchains.launcherFor { languageVersion = JavaLanguageVersion.of(11) }
    mainClass = "org.jetbrains.kotlinx.dataframe.examples.spark.UntypedDatasetKt"
}
```
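Assuming this build script sits in a Gradle subproject of the repository, the registered `JavaExec` tasks above would presumably be invoked by name from the project root (the exact subproject path is not shown in this commit, so a bare task name is used here):

```shell
./gradlew runSparkTypedDataset
./gradlew runSparkUntypedDataset
```

Each task pins its own Java 11 launcher via `javaToolchains`, so the examples run on Java 11 even though the build itself uses the Java 21 toolchain set elsewhere in this commit.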
Lines changed: 107 additions & 0 deletions (new file):

```kotlin
package org.jetbrains.kotlinx.dataframe.examples.exposed

import org.jetbrains.exposed.v1.core.BiCompositeColumn
import org.jetbrains.exposed.v1.core.Column
import org.jetbrains.exposed.v1.core.Expression
import org.jetbrains.exposed.v1.core.ExpressionAlias
import org.jetbrains.exposed.v1.core.ResultRow
import org.jetbrains.exposed.v1.core.Table
import org.jetbrains.exposed.v1.jdbc.Query
import org.jetbrains.kotlinx.dataframe.AnyFrame
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.convertTo
import org.jetbrains.kotlinx.dataframe.api.toDataFrame
import org.jetbrains.kotlinx.dataframe.codeGen.NameNormalizer
import org.jetbrains.kotlinx.dataframe.impl.schema.DataFrameSchemaImpl
import org.jetbrains.kotlinx.dataframe.schema.ColumnSchema
import org.jetbrains.kotlinx.dataframe.schema.DataFrameSchema
import kotlin.reflect.KProperty1
import kotlin.reflect.full.isSubtypeOf
import kotlin.reflect.full.memberProperties
import kotlin.reflect.typeOf

/**
 * Retrieves all columns of any [Iterable][Iterable]`<`[ResultRow][ResultRow]`>`, like [Query][Query],
 * from Exposed row by row and converts the resulting [Map] into a [DataFrame], cast to type [T].
 *
 * In notebooks, the untyped version works just as well due to runtime inference :)
 */
inline fun <reified T : Any> Iterable<ResultRow>.convertToDataFrame(): DataFrame<T> =
    convertToDataFrame().convertTo<T>()

/**
 * Retrieves all columns of an [Iterable][Iterable]`<`[ResultRow][ResultRow]`>` from Exposed, like [Query][Query],
 * row by row and converts the resulting [Map] of lists into a [DataFrame] by calling
 * [Map.toDataFrame].
 */
@JvmName("convertToAnyFrame")
fun Iterable<ResultRow>.convertToDataFrame(): AnyFrame {
    val map = mutableMapOf<String, MutableList<Any?>>()
    for (row in this) {
        for (expression in row.fieldIndex.keys) {
            map.getOrPut(expression.readableName) {
                mutableListOf()
            } += row[expression]
        }
    }
    return map.toDataFrame()
}

/**
 * Retrieves a simple column name from [this] [Expression].
 *
 * Might need to be expanded with multiple types of [Expression].
 */
val Expression<*>.readableName: String
    get() = when (this) {
        is Column<*> -> name
        is ExpressionAlias<*> -> alias
        is BiCompositeColumn<*, *, *> -> getRealColumns().joinToString("_") { it.readableName }
        else -> toString()
    }

/**
 * Creates a [DataFrameSchema] from the declared [Table] instance.
 *
 * This is not needed for conversion, but it can be useful to create a DataFrame [@DataSchema][DataSchema] instance.
 *
 * @param columnNameToAccessor Optional [MutableMap] which will be filled with entries mapping
 *   the SQL column name to the accessor name from the [Table].
 *   This can be used to define a [NameNormalizer] later.
 * @see toDataFrameSchemaWithNameNormalizer
 */
@Suppress("UNCHECKED_CAST")
fun Table.toDataFrameSchema(columnNameToAccessor: MutableMap<String, String> = mutableMapOf()): DataFrameSchema {
    // we use reflection to go over all `Column<*>` properties in the Table object
    val columns = this::class.memberProperties
        .filter { it.returnType.isSubtypeOf(typeOf<Column<*>>()) }
        .associate { prop ->
            prop as KProperty1<Table, Column<*>>

            // retrieve the SQL column name
            val columnName = prop.get(this).name
            // store the SQL column name together with the accessor name in the map
            columnNameToAccessor[columnName] = prop.name

            // get the column type from `val a: Column<Type>`
            val type = prop.returnType.arguments.first().type!!

            // and we add the name and column schema type to the `columns` map :)
            columnName to ColumnSchema.Value(type)
        }
    return DataFrameSchemaImpl(columns)
}

/**
 * Creates a [DataFrameSchema] from the declared [Table] instance with a [NameNormalizer] to
 * convert the SQL column names to the corresponding Kotlin property names.
 *
 * This is not needed for conversion, but it can be useful to create a DataFrame [@DataSchema][DataSchema] instance.
 *
 * @see toDataFrameSchema
 */
fun Table.toDataFrameSchemaWithNameNormalizer(): Pair<DataFrameSchema, NameNormalizer> {
    val columnNameToAccessor = mutableMapOf<String, String>()
    // note: the map must be passed through, otherwise it stays empty
    // and the NameNormalizer below would be a no-op
    return Pair(toDataFrameSchema(columnNameToAccessor), NameNormalizer { columnNameToAccessor[it] ?: it })
}
```
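A hypothetical end-to-end use of the helpers above, to make the conversion direction concrete. The table, schema, and names (`Customers`, `Customer`) are illustrative and not part of the commit; `transaction` and `selectAll` are the standard Exposed v1 JDBC calls, and the `@DataSchema` interface assumes KSP-generated accessors.

```kotlin
// A sketch: an Exposed table definition plus a matching DataFrame schema.
object Customers : Table("customers") {
    val id = integer("id")
    val name = varchar("name", 50)
}

@DataSchema
interface Customer {
    val id: Int
    val name: String
}

fun readCustomers(): DataFrame<Customer> =
    transaction {
        // Query is an Iterable<ResultRow>, so the typed overload applies:
        // rows are collected column-by-column, then cast to DataFrame<Customer>
        Customers.selectAll().convertToDataFrame<Customer>()
    }
```

Because the typed overload just delegates to the untyped one plus `convertTo<T>()`, the same pattern works without a schema interface when runtime inference is enough (e.g. in notebooks).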
