Add Spark TableProvider API Documentation and Databricks Integration Guide + Variant datatype support #5124
Merged
Commits (12)
- 69dd54e: Add comprehensive TableProvider API documentation with Databricks int… (ShimonSte)
- c9fd8a8: Fix MDX compilation error: close TabItem and Tabs tags in Databricks … (ShimonSte)
- fcb684e: Fix markdown linting: add explicit anchor IDs to headings (ShimonSte)
- 02d4777: Add Spark connector documentation improvements (ShimonSte)
- 3075793: Fix markdown linting errors (ShimonSte)
- 46b1740: removed data lakes by mistake (ShimonSte)
- 2c49e62: fixed linting (ShimonSte)
- ca6e9d5: Add Databricks integration documentation and screenshots (ShimonSte)
- 46f4253: Add Configuring ClickHouse Options section and partition overwrite li… (ShimonSte)
- 63d2dc9: Add partition overwrite limitation note to Catalog API Write data sec… (ShimonSte)
- 206ae1e: Update docs/integrations/data-ingestion/apache-spark/databricks.md (ShimonSte)
- f81743a: Update Databricks installation instructions (ShimonSte)
docs/integrations/data-ingestion/apache-spark/databricks.md (311 additions, 0 deletions)
---
sidebar_label: 'Databricks'
sidebar_position: 3
slug: /integrations/data-ingestion/apache-spark/databricks
description: 'Integrate ClickHouse with Databricks'
keywords: ['clickhouse', 'databricks', 'spark', 'unity catalog', 'data']
title: 'Integrating ClickHouse with Databricks'
doc_type: 'guide'
---

import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import ClickHouseSupportedBadge from '@theme/badges/ClickHouseSupported';

# Integrating ClickHouse with Databricks

<ClickHouseSupportedBadge/>

The ClickHouse Spark connector works seamlessly with Databricks. This guide covers Databricks-specific setup, installation, and usage patterns.

## API Selection for Databricks {#api-selection}

By default, Databricks uses Unity Catalog, which blocks Spark catalog registration. In this case, you **must** use the **TableProvider API** (format-based access).

However, if you disable Unity Catalog by creating a cluster with **No isolation shared** access mode, you can use the **Catalog API** instead. The Catalog API provides centralized configuration and native Spark SQL integration; a registration sketch follows the table below.

| Unity Catalog Status | Recommended API | Notes |
|---------------------|------------------|-------|
| **Enabled** (default) | TableProvider API (format-based) | Unity Catalog blocks Spark catalog registration |
| **Disabled** (No isolation shared) | Catalog API | Requires cluster with "No isolation shared" access mode |
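
If you are on a cluster with Unity Catalog disabled and want the Catalog API, the connector has to be registered as a Spark catalog before it is first used. The sketch below shows the idea in Python, assuming the `spark.sql.catalog.*` keys and the `com.clickhouse.spark.ClickHouseCatalog` class described in the Spark Native Connector guide; the host, credentials, and table names are placeholders. On Databricks these keys normally go into the cluster's Spark config; setting them on the session conf before the catalog is first referenced typically works as well, since catalogs are resolved lazily.

```python
# Minimal sketch: register ClickHouse as a Spark catalog named "clickhouse".
# The configuration keys and catalog class are taken from the Spark Native
# Connector guide; host and credentials below are placeholders.
spark.conf.set("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
spark.conf.set("spark.sql.catalog.clickhouse.host", "your-clickhouse-cloud-host.clickhouse.cloud")
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "https")
spark.conf.set("spark.sql.catalog.clickhouse.http_port", "8443")
spark.conf.set("spark.sql.catalog.clickhouse.user", "default")
spark.conf.set("spark.sql.catalog.clickhouse.password", dbutils.secrets.get(scope="clickhouse", key="password"))
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "true")

# Tables are then addressable as clickhouse.<database>.<table> in Spark SQL.
spark.sql("SELECT * FROM clickhouse.default.events LIMIT 10").show()
```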

## Installation on Databricks {#installation}

### Option 1: Upload JAR via Databricks UI {#installation-ui}

1. Build or [download](https://repo1.maven.org/maven2/com/clickhouse/spark/) the runtime JAR:

   ```bash
   clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
   ```

2. Upload the JAR to your Databricks workspace:
   - Go to **Workspace** → Navigate to your desired folder
   - Click **Upload** → Select the JAR file
   - The JAR will be stored in your workspace

3. Install the library on your cluster:
   - Go to **Compute** → Select your cluster
   - Click the **Libraries** tab
   - Click **Install New**
   - Select **DBFS** or **Workspace** → Navigate to the uploaded JAR file
   - Click **Install**

<Image img={require('@site/static/images/integrations/data-ingestion/apache-spark/databricks/databricks-libraries-tab.png')} alt="Databricks Libraries tab" />

<Image img={require('@site/static/images/integrations/data-ingestion/apache-spark/databricks/databricks-install-from-volume.png')} alt="Installing library from workspace volume" />

4. Restart the cluster to load the library

### Option 2: Install via Databricks CLI {#installation-cli}

```bash
# Upload JAR to DBFS
databricks fs cp clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar \
  dbfs:/FileStore/jars/

# Install on cluster
databricks libraries install \
  --cluster-id <your-cluster-id> \
  --jar dbfs:/FileStore/jars/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
```

### Option 3: Maven Coordinates (Recommended) {#installation-maven}

1. Navigate to your Databricks workspace:
   - Go to **Compute** → Select your cluster
   - Click the **Libraries** tab
   - Click **Install New**
   - Select the **Maven** tab

2. Add the Maven coordinates:

   ```text
   com.clickhouse.spark:clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }}
   ```

<Image img={require('@site/static/images/integrations/data-ingestion/apache-spark/databricks/databricks-maven-tab.png')} alt="Databricks Maven libraries configuration" />

3. Click **Install** and restart the cluster to load the library

## Using TableProvider API {#tableprovider-api}

When Unity Catalog is enabled (default), you **must** use the TableProvider API (format-based access) because Unity Catalog blocks Spark catalog registration. If you've disabled Unity Catalog by using a cluster with "No isolation shared" access mode, you can use the [Catalog API](/docs/integrations/data-ingestion/apache-spark/spark-native-connector#register-the-catalog-required) instead.

### Reading Data {#reading-data-table-provider}

<Tabs groupId="databricks_usage">
<TabItem value="Python" label="Python" default>

```python
# Read from ClickHouse using TableProvider API
df = spark.read \
    .format("clickhouse") \
    .option("host", "your-clickhouse-cloud-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "events") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .load()

# Schema is automatically inferred
df.display()
```

</TabItem>
<TabItem value="Scala" label="Scala">

```scala
val df = spark.read
  .format("clickhouse")
  .option("host", "your-clickhouse-cloud-host.clickhouse.cloud")
  .option("protocol", "https")
  .option("http_port", "8443")
  .option("database", "default")
  .option("table", "events")
  .option("user", "default")
  .option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
  .option("ssl", "true")
  .load()

df.show()
```

</TabItem>
</Tabs>
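
Because the Catalog API is unavailable under Unity Catalog, a ClickHouse table loaded through the TableProvider API is not directly addressable from Spark SQL. A common workaround is to register the DataFrame as a temporary view. The sketch below uses standard Spark APIs; the view and column names are purely illustrative.

```python
# Expose the DataFrame read via the TableProvider API to Spark SQL as a temp view.
df.createOrReplaceTempView("clickhouse_events")

# Query it like any other table; "user_id" is an illustrative column name.
top_users = spark.sql("""
    SELECT user_id, count(*) AS event_count
    FROM clickhouse_events
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
""")
top_users.show()
```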

### Writing Data {#writing-data-unity}

<Tabs groupId="databricks_usage">
<TabItem value="Python" label="Python" default>

```python
# Write to ClickHouse - the table will be created automatically if it doesn't exist.
# "order_by" is required when the connector creates a new table; on ClickHouse Cloud,
# "settings.allow_nullable_key" is required if the ORDER BY key contains nullable columns.
df.write \
    .format("clickhouse") \
    .option("host", "your-clickhouse-cloud-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "events_copy") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .option("order_by", "id") \
    .option("settings.allow_nullable_key", "1") \
    .mode("append") \
    .save()
```

</TabItem>
<TabItem value="Scala" label="Scala">

```scala
df.write
  .format("clickhouse")
  .option("host", "your-clickhouse-cloud-host.clickhouse.cloud")
  .option("protocol", "https")
  .option("http_port", "8443")
  .option("database", "default")
  .option("table", "events_copy")
  .option("user", "default")
  .option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
  .option("ssl", "true")
  .option("order_by", "id") // Required: specify ORDER BY when creating a new table
  .option("settings.allow_nullable_key", "1") // Required for ClickHouse Cloud if ORDER BY has nullable columns
  .mode("append")
  .save()
```

</TabItem>
</Tabs>

:::note
This example assumes preconfigured secret scopes in Databricks. For setup instructions, see the Databricks [Secret management documentation](https://docs.databricks.com/aws/en/security/secrets/).
:::

## Databricks-Specific Considerations {#considerations}

### Secret Management {#secret-management}

Use Databricks secret scopes to securely store ClickHouse credentials:

```python
# Access secrets
password = dbutils.secrets.get(scope="clickhouse", key="password")
```

For setup instructions, see the Databricks [Secret management documentation](https://docs.databricks.com/aws/en/security/secrets/).
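
To confirm from a notebook that the scope and key are visible before wiring them into connector options, a small check with the standard `dbutils.secrets` utilities is enough; the scope and key names below are simply the ones used throughout this guide.

```python
# List the secret scopes visible to this workspace user.
scope_names = [s.name for s in dbutils.secrets.listScopes()]
if "clickhouse" not in scope_names:
    raise ValueError("Secret scope 'clickhouse' not found - create it first (see the link above)")

# List the keys in the scope; secret values are redacted in notebook output.
print([k.key for k in dbutils.secrets.list("clickhouse")])

password = dbutils.secrets.get(scope="clickhouse", key="password")
```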

<!-- TODO: Add screenshot of Databricks secret scopes configuration -->

### ClickHouse Cloud Connection {#clickhouse-cloud}

When connecting to ClickHouse Cloud from Databricks:

1. Use **HTTPS protocol** (`protocol: https`, `http_port: 8443`)
2. Enable **SSL** (`ssl: true`)
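
Rather than repeating these connection settings in every read and write, you can keep them in one dictionary and reuse it. A minimal sketch; the host name and secret scope are placeholders carried over from the examples above.

```python
# Shared ClickHouse Cloud connection options, reused across reads and writes.
clickhouse_options = {
    "host": "your-clickhouse-cloud-host.clickhouse.cloud",
    "protocol": "https",
    "http_port": "8443",
    "database": "default",
    "user": "default",
    "password": dbutils.secrets.get(scope="clickhouse", key="password"),
    "ssl": "true",
}

df = (
    spark.read.format("clickhouse")
    .options(**clickhouse_options)
    .option("table", "events")
    .load()
)
```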

## Examples {#examples}

### Complete Workflow Example {#workflow-example}

<Tabs groupId="databricks_usage">
<TabItem value="Python" label="Python" default>

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark with ClickHouse connector
spark = SparkSession.builder \
    .config("spark.jars.packages", "com.clickhouse.spark:clickhouse-spark-runtime-3.4_2.12:0.9.0") \
    .getOrCreate()

# Read from ClickHouse
df = spark.read \
    .format("clickhouse") \
    .option("host", "your-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "source_table") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .load()

# Transform data
transformed_df = df.filter(col("status") == "active")

# Write to ClickHouse
transformed_df.write \
    .format("clickhouse") \
    .option("host", "your-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "target_table") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .option("order_by", "id") \
    .mode("append") \
    .save()
```

</TabItem>
<TabItem value="Scala" label="Scala">

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Initialize Spark with ClickHouse connector
val spark = SparkSession.builder
  .config("spark.jars.packages", "com.clickhouse.spark:clickhouse-spark-runtime-3.4_2.12:0.9.0")
  .getOrCreate()

// Read from ClickHouse
val df = spark.read
  .format("clickhouse")
  .option("host", "your-host.clickhouse.cloud")
  .option("protocol", "https")
  .option("http_port", "8443")
  .option("database", "default")
  .option("table", "source_table")
  .option("user", "default")
  .option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
  .option("ssl", "true")
  .load()

// Transform data
val transformedDF = df.filter(col("status") === "active")

// Write to ClickHouse
transformedDF.write
  .format("clickhouse")
  .option("host", "your-host.clickhouse.cloud")
  .option("protocol", "https")
  .option("http_port", "8443")
  .option("database", "default")
  .option("table", "target_table")
  .option("user", "default")
  .option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
  .option("ssl", "true")
  .option("order_by", "id")
  .mode("append")
  .save()
```

</TabItem>
</Tabs>

## Related Documentation {#related}

- [Spark Native Connector Guide](/docs/integrations/data-ingestion/apache-spark/spark-native-connector) - Complete connector documentation
- [TableProvider API Documentation](/docs/integrations/data-ingestion/apache-spark/spark-native-connector#using-the-tableprovider-api-format-based-access) - Format-based access details
- [Catalog API Documentation](/docs/integrations/data-ingestion/apache-spark/spark-native-connector#register-the-catalog-required) - Catalog-based access details