-
Notifications
You must be signed in to change notification settings - Fork 476
/
Copy pathpyspark-session-2019-04-26.txt
executable file
·62 lines (59 loc) · 1.97 KB
/
pyspark-session-2019-04-26.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
Finding Average by Key using reduceByKey() Transformation
$ ./bin/pyspark
Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Python version 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018 02:44:43)
SparkSession available as 'spark'.
>>>
>>>
>>>
>>>
>>> data = [('k1', 3), ('k1', 4),('k1', 5),('k2', 7),('k2', 7),('k2', 7),('k3', 30),('k3', 30),('k3', 40),('k3', 50)]
>>> data
[('k1', 3), ('k1', 4), ('k1', 5), ('k2', 7), ('k2', 7), ('k2', 7), ('k3', 30), ('k3', 30), ('k3', 40), ('k3', 50)]
>>>
>>> pairs = spark.sparkContext.parallelize(data)
>>> pairs.collect()
[('k1', 3), ('k1', 4), ('k1', 5), ('k2', 7), ('k2', 7), ('k2', 7), ('k3', 30), ('k3', 30), ('k3', 40), ('k3', 50)]
>>> pairs.count()
10
>>> pairs2 = pairs.distinct()
>>> pairs2.count()
7
>>> pairs2.collect()
[('k1', 5), ('k3', 40), ('k1', 3), ('k3', 50), ('k2', 7), ('k1', 4), ('k3', 30)]
>>>
>>> tuples = pairs.map(lambda x: (x[0], (x[1], 1) ) )
>>> tuples.collect()
[('k1', (3, 1)), ('k1', (4, 1)), ('k1', (5, 1)), ('k2', (7, 1)), ('k2', (7, 1)), ('k2', (7, 1)), ('k3', (30, 1)), ('k3', (30, 1)), ('k3', (40, 1)), ('k3', (50, 1))]
>>>
>>> def adder(x, y):
... sum2 = x[0] + y[0]
... count = x[1] + y[1]
... return (sum2, count)
...
>>>
>>> x = (10, 2)
>>> y = (20, 4)
>>> r = adder(x, y)
>>> r
(30, 6)
>>>
>>> result = tuples.reduceByKey(adder)
>>> result.collect()
[('k1', (12, 3)), ('k3', (150, 4)), ('k2', (21, 3))]
>>> result = tuples.reduceByKey(lambda x, y: adder(x, y))
>>> result.collect()
[('k1', (12, 3)), ('k3', (150, 4)), ('k2', (21, 3))]
>>> avg = result.mapValues(lambda pair: float(pair[0])/float(pair[1]))
>>> avg.collect()
[('k1', 4.0), ('k3', 37.5), ('k2', 7.0)]
>>>