# Spark's mapper transformations:
#
#   map:           1 -> 1
#   flatMap:       1 -> Many
#   mapPartitions: partition -> 1  (Many to 1: one result per partition)
#
# where Many = 0, 1, 2, 3, 4, ...
# and a partition holds many elements
#
# (map and flatMap are demonstrated below; a mapPartitions sketch is
#  appended at the end of this session.)
$ ./bin/pyspark
Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
Using Python version 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018 02:44:43)
SparkSession available as 'spark'.
>>>
>>>
>>> spark
<pyspark.sql.session.SparkSession object at 0x7f8fc593dba8>
>>> sc = spark.sparkContext
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>>
>>>
>>> data = [ [1, 2, 3], [4, 5, 6, 7] ]
>>> data
[[1, 2, 3], [4, 5, 6, 7]]
>>> data[0]
[1, 2, 3]
>>> data[1]
[4, 5, 6, 7]
>>>
>>> rdd = spark.sparkContext.parallelize(data)
>>> rdd.collect()
[[1, 2, 3], [4, 5, 6, 7]]
>>> rdd.count()
2
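>>>
>>> # Added sketch (not part of the original session): glom() collects each
>>> # partition into a list; with an explicit numSlices=2, each of the two
>>> # rows lands in its own partition.
>>> spark.sparkContext.parallelize(data, 2).glom().collect()
[[[1, 2, 3]], [[4, 5, 6, 7]]]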
>>>
>>> rdd_mapped = rdd.map(lambda x: x)
>>> rdd_mapped.collect()
[[1, 2, 3], [4, 5, 6, 7]]
>>> rdd_mapped.count()
2
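>>>
>>> # Added sketch (not part of the original session): map with a
>>> # non-identity function still emits exactly one output per input
>>> # element, so count() stays 2.
>>> rdd.map(lambda x: len(x)).collect()
[3, 4]
>>> rdd.map(lambda x: sum(x)).collect()
[6, 22]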
>>>
>>> rdd_flat_mapped = rdd.flatMap(lambda x: x)
>>> rdd_flat_mapped.collect()
[1, 2, 3, 4, 5, 6, 7]
>>> rdd_flat_mapped.count()
7
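>>> # Added sketch (not part of the original session): the "Many" in
>>> # 1 -> Many can differ per element, including zero outputs -- here
>>> # each n maps to range(n), so 0 produces nothing.
>>> sc.parallelize([0, 1, 2, 3]).flatMap(lambda n: range(n)).collect()
[0, 0, 1, 0, 1, 2]
>>> sc.parallelize([0, 1, 2, 3]).flatMap(lambda n: range(n)).count()
6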
>>> data = [ [1, 2, 3], [], [4, 5, 6, 7], [], [9] ]
>>> data
[[1, 2, 3], [], [4, 5, 6, 7], [], [9]]
>>> data[0]
[1, 2, 3]
>>> data[1]
[]
>>> data[3]
[]
>>> data[2]
[4, 5, 6, 7]
>>> data[3]
[]
>>> data[4]
[9]
>>> rdd = spark.sparkContext.parallelize(data)
>>> rdd.collect()
[[1, 2, 3], [], [4, 5, 6, 7], [], [9]]
>>> rdd.count()
5
>>> rdd_mapped = rdd.map(lambda x: x)
>>> rdd_mapped.collect()
[[1, 2, 3], [], [4, 5, 6, 7], [], [9]]
>>> rdd_mapped.count()
5
>>> rdd_flat_mapped = rdd.flatMap(lambda x: x)
>>> rdd_flat_mapped.collect()
[1, 2, 3, 4, 5, 6, 7, 9]
>>> rdd_flat_mapped.count()
8