tutorial/pyspark-examples/rdds/pyspark-session-2019-04-26.txt

Finding Average by Key using reduceByKey() Transformation

$ ./bin/pyspark
Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018 02:44:43)
SparkSession available as 'spark'.
>>>
>>>
>>>
>>>
>>> data = [('k1', 3), ('k1', 4),('k1', 5),('k2', 7),('k2', 7),('k2', 7),('k3', 30),('k3', 30),('k3', 40),('k3', 50)]
>>> data
[('k1', 3), ('k1', 4), ('k1', 5), ('k2', 7), ('k2', 7), ('k2', 7), ('k3', 30), ('k3', 30), ('k3', 40), ('k3', 50)]
>>>
>>> pairs = spark.sparkContext.parallelize(data)
>>> pairs.collect()
[('k1', 3), ('k1', 4), ('k1', 5), ('k2', 7), ('k2', 7), ('k2', 7), ('k3', 30), ('k3', 30), ('k3', 40), ('k3', 50)]
>>> pairs.count()
10
>>> pairs2 = pairs.distinct()
>>> pairs2.count()
7
>>> pairs2.collect()
[('k1', 5), ('k3', 40), ('k1', 3), ('k3', 50), ('k2', 7), ('k1', 4), ('k3', 30)]
>>>
>>> tuples = pairs.map(lambda x: (x[0], (x[1], 1) ) )
>>> tuples.collect()
[('k1', (3, 1)), ('k1', (4, 1)), ('k1', (5, 1)), ('k2', (7, 1)), ('k2', (7, 1)), ('k2', (7, 1)), ('k3', (30, 1)), ('k3', (30, 1)), ('k3', (40, 1)), ('k3', (50, 1))]

>>>
>>> def adder(x, y):
...     sum2 = x[0] + y[0]
...     count = x[1] + y[1]
...     return (sum2, count)
...
>>>
>>> x = (10, 2)
>>> y = (20, 4)
>>> r = adder(x, y)
>>> r
(30, 6)
>>>
>>> result = tuples.reduceByKey(adder)
>>> result.collect()
[('k1', (12, 3)), ('k3', (150, 4)), ('k2', (21, 3))]
>>> result = tuples.reduceByKey(lambda x, y: adder(x, y))
>>> result.collect()
[('k1', (12, 3)), ('k3', (150, 4)), ('k2', (21, 3))]
>>> avg = result.mapValues(lambda pair: float(pair[0])/float(pair[1]))
>>> avg.collect()
[('k1', 4.0), ('k3', 37.5), ('k2', 7.0)]
>>>