Commit 93b5b80

add Pyspark (Python) example + tabs
1 parent 7465b03 commit 93b5b80

File tree

1 file changed: +49 −0 lines changed

  • docs/en/integrations/data-ingestion/aws-glue


docs/en/integrations/data-ingestion/aws-glue/index.md

@@ -6,13 +6,19 @@ description: Integrate ClickHouse and Amazon Glue
keywords: [ clickhouse, amazon, aws, glue, migrating, data ]
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Integrating Amazon Glue with ClickHouse

[Amazon Glue](https://aws.amazon.com/glue/) is a fully managed, serverless data integration service provided by Amazon Web Services (AWS). It simplifies the process of discovering, preparing, and transforming data for analytics, machine learning, and application development.

Although there is no Glue ClickHouse connector available yet, the official JDBC connector can be leveraged to connect and integrate with ClickHouse:

<Tabs>
<TabItem value="Java" label="Java" default>

```java
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.GlueArgParser
@@ -55,6 +61,49 @@ object GlueJob {
}
```

</TabItem>
<TabItem value="Python" label="Python">

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

jdbc_url = "jdbc:ch://{host}:{port}/{schema}"
query = "select * from my_table"
# For ClickHouse Cloud usage, add the relevant SSL options
df = (spark.read.format("jdbc")
    .option("driver", 'com.clickhouse.jdbc.ClickHouseDriver')
    .option("url", jdbc_url)
    .option("user", 'default')
    .option("password", '*******')
    .option("query", query)
    .load())

logger.info("num of rows:")
logger.info(str(df.count()))
logger.info("Data sample:")
logger.info(str(df.take(10)))

job.commit()
```

</TabItem>
</Tabs>

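In both tabs the ClickHouse JDBC driver has to be on the Glue job's classpath before the `jdbc` format can load anything. As a minimal sketch, assuming the shaded driver jar has already been uploaded to S3 (the bucket, paths, jar version, region, and role name below are all placeholder assumptions, not values from this page), the job could be registered with boto3 using Glue's `--extra-jars` parameter:

```python
# Minimal sketch: register the Glue job so the ClickHouse JDBC driver
# jar (uploaded to S3 beforehand) ends up on the job classpath.
# Bucket, paths, jar version, region, and role name are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="clickhouse-jdbc-read",
    Role="MyGlueServiceRole",  # IAM role with Glue and S3 permissions
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/clickhouse_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        # Glue downloads these jars and adds them to the classpath
        "--extra-jars": "s3://my-bucket/jars/clickhouse-jdbc-0.4.6-all.jar",
    },
)
```

When creating the job through the Glue console instead, the same jar path goes into the job's dependent JARs setting.
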
For more details, please visit our [Spark & JDBC documentation](/en/integrations/apache-spark#read-data).