Commit 95490b0

Merge pull request #2825 from ClickHouse/add-pyspark-glue-example

AWS Glue - add Pyspark (Python) example + tabs

2 parents 5a1de6e + 93b5b80

File tree

1 file changed: +49 −0 lines
  • docs/en/integrations/data-ingestion/aws-glue/index.md

docs/en/integrations/data-ingestion/aws-glue/index.md

@@ -6,13 +6,19 @@ description: Integrate ClickHouse and Amazon Glue
 keywords: [ clickhouse, amazon, aws, glue, migrating, data ]
 ---
 
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
 # Integrating Amazon Glue with ClickHouse
 
 [Amazon Glue](https://aws.amazon.com/glue/) is a fully managed, serverless data integration service provided by Amazon Web Services (AWS). It simplifies the process of discovering, preparing, and transforming data for analytics, machine learning, and application development.
 
 
 Although there is no Glue ClickHouse connector available yet, the official JDBC connector can be leveraged to connect and integrate with ClickHouse:
 
+<Tabs>
+<TabItem value="Java" label="Java" default>
+
 ```java
 import com.amazonaws.services.glue.util.Job
 import com.amazonaws.services.glue.util.GlueArgParser
@@ -55,6 +61,49 @@ object GlueJob {
 }
 ```
 
+</TabItem>
+<TabItem value="Python" label="Python">
+
+```python
+import sys
+from awsglue.transforms import *
+from awsglue.utils import getResolvedOptions
+from pyspark.context import SparkContext
+from awsglue.context import GlueContext
+from awsglue.job import Job
+
+## @params: [JOB_NAME]
+args = getResolvedOptions(sys.argv, ['JOB_NAME'])
+
+sc = SparkContext()
+glueContext = GlueContext(sc)
+logger = glueContext.get_logger()
+spark = glueContext.spark_session
+job = Job(glueContext)
+job.init(args['JOB_NAME'], args)
+jdbc_url = "jdbc:ch://{host}:{port}/{schema}"
+query = "select * from my_table"
+# For cloud usage, please add ssl options
+df = (spark.read.format("jdbc")
+      .option("driver", 'com.clickhouse.jdbc.ClickHouseDriver')
+      .option("url", jdbc_url)
+      .option("user", 'default')
+      .option("password", '*******')
+      .option("query", query)
+      .load())
+
+logger.info("num of rows:")
+logger.info(str(df.count()))
+logger.info("Data sample:")
+logger.info(str(df.take(10)))
+
+job.commit()
+```
+
+</TabItem>
+</Tabs>
+
 For more details, please visit our [Spark & JDBC documentation](/en/integrations/apache-spark#read-data).
 

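The added Python example notes in passing that cloud usage requires SSL options, without showing them. As a hedged sketch of what that could look like (the `ssl` and `sslmode` property names come from the ClickHouse JDBC driver's configuration, and the hostname, port, and `cloud` flag here are illustrative assumptions, not part of this commit):

```python
# Sketch: build the option dict passed to spark.read.format("jdbc").
# The ssl/sslmode property names are assumed from the ClickHouse JDBC
# driver's configuration; verify against the driver version bundled
# with your Glue job before relying on them.

def clickhouse_jdbc_options(host, port, schema, user, password, cloud=False):
    """Return JDBC reader options for a ClickHouse connection."""
    options = {
        "driver": "com.clickhouse.jdbc.ClickHouseDriver",
        "url": f"jdbc:ch://{host}:{port}/{schema}",
        "user": user,
        "password": password,
    }
    if cloud:
        # ClickHouse Cloud accepts only TLS connections (typically on 8443),
        # so the driver must be told to use SSL.
        options["ssl"] = "true"
        options["sslmode"] = "strict"
    return options

# Hypothetical host/credentials for illustration only.
opts = clickhouse_jdbc_options("myhost.clickhouse.cloud", 8443,
                               "default", "default", "secret", cloud=True)

# In the Glue job, the dict would be applied to the reader like:
# reader = spark.read.format("jdbc")
# for key, value in opts.items():
#     reader = reader.option(key, value)
# df = reader.option("query", "select * from my_table").load()
```

Keeping the options in one dict makes it easy to toggle between a local (plain) and a Cloud (TLS) connection without duplicating the reader chain.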