Spark_Distributed_R/udfs.md
+6 −8
@@ -10,9 +10,8 @@
*[Leveraging Packages in Distributed R](#leveraging-packages-in-distributed-r)
*[Apache Arrow](#apache-arrow)

-___

-####Understanding UDFs
+### Understanding UDFs

Both `SparkR` and `sparklyr` support user-defined functions (UDFs) in R, which allow you to execute arbitrary R code across a cluster. The advantage here is the ability to distribute the computation of functions included in R's massive ecosystem of third-party packages. In particular, you may want to use a domain-specific package for machine learning or apply a specific statistical transformation that is not available through the Spark API. Running in-house custom R libraries on larger data sets is another place to use this family of functions.

@@ -36,7 +35,7 @@ The general best practice is to leverage the Spark API first and foremost, then

___

-####Distributed `apply`
+### Distributed `apply`

Between `sparklyr` and `SparkR` there are a number of options for how you can distribute your R code across a cluster with Spark. Functions can be applied to each *group* or each *partition* of a Spark DataFrame, or to a list of elements in R. In the following table you can see the whole family of distributed `apply` functions:

@@ -51,7 +50,7 @@ Between `sparklyr` and `SparkR` there are a number of options for how you can di

Let's work through these different functions one by one.

-#####`spark_apply`
+#### `spark_apply`

For the first example, we'll use **`spark_apply()`**.

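The `spark_apply()` cell itself falls outside the hunks shown above. As a minimal sketch of the pattern, assuming only a local Spark connection and the built-in `mtcars` data rather than the notebook's own airline example:

```r
library(sparklyr)

# Assumes a local Spark installation; the notebook would use a cluster connection.
sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark so there is something to apply over.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# spark_apply() ships the R closure to each partition of the Spark DataFrame,
# runs it in an R process on the workers, and returns a Spark DataFrame.
result <- spark_apply(
  mtcars_tbl,
  function(df) {
    df$hp_per_wt <- df$hp / df$wt  # arbitrary R code, executed per partition
    df
  }
)

head(result)
```

Because the closure runs in a fresh R session on each worker, any packages it relies on must be available there; `spark_apply()` exposes a `packages` argument for that purpose.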
@@ -102,7 +101,7 @@ head(resultsDF)
```


-#####`dapply` & `gapply`
+#### `dapply` & `gapply`

In `SparkR`, there are separate functions depending on whether you want to run R code on each partition of a Spark DataFrame (`dapply`), or each group (`gapply`). With these functions you **must** supply the schema ahead of time. In the next example we will recreate the first but use `gapply` instead.

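The `gapply()` cell is likewise outside the hunks shown here. A rough sketch of the pattern it follows, using a hypothetical grouped aggregation over the built-in `mtcars` data instead of the notebook's airline example, and assuming a running SparkR session:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# Unlike spark_apply(), the output schema must be declared up front.
schema <- structType(
  structField("cyl", "double"),
  structField("avg_mpg", "double")
)

# gapply() groups the SparkDataFrame by "cyl" and applies the function to the
# rows of each group, which arrive as a plain R data.frame.
result <- gapply(
  df,
  "cyl",
  function(key, x) {
    data.frame(cyl = key[[1]], avg_mpg = mean(x$mpg))
  },
  schema
)

head(result)
```

`dapply()` has the same shape, except that the function receives each *partition* as a data.frame and there is no grouping column.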
@@ -141,7 +140,7 @@ head(resultsDF)
6 XE XE_new
```

-#####`spark.lapply`
+#### `spark.lapply`

This final function is also from SparkR. It accepts a list and then uses Spark to apply R code to each element in the list across the cluster. As [the docs](https://spark.apache.org/docs/latest/api/R/spark.lapply.html) state, it is conceptually similar to `lapply` in base R, so it will return a **list** back to the driver.

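The accompanying code is again not in the diff; a small sketch of the `spark.lapply()` shape (a hypothetical model-tuning loop, not the notebook's example) would be:

```r
library(SparkR)
sparkR.session()

# Each element of the input vector becomes one task on the cluster; the
# function's return values are collected into a list on the driver, so the
# combined result must fit in driver memory.
costs <- spark.lapply(c(2, 3, 4, 5), function(k) {
  kmeans(iris[, 1:4], centers = k)$tot.withinss
})

str(costs)
```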
@@ -175,7 +174,6 @@ head(tidied)
6 XE XE_new
```

-%md
___

### Leveraging Packages in Distributed R
@@ -223,7 +221,7 @@ head(coefDF)
6 MSY DepDelay 0.981 0.00663 148. 0.
```

-## Apache Arrow
+###Apache Arrow

[Apache Arrow](https://arrow.apache.org/) is a project that aims to improve analytics processing performance by representing data in-memory in columnar format and taking advantage of modern hardware. The main purpose and benefit of the project can be summed up in the following image, taken from the project's homepage.
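The rest of the Arrow section is not shown in this diff. As one hedged illustration of how Arrow plugs into the workflow above: in recent `sparklyr` releases, attaching the `arrow` package is enough to switch R-to-JVM data transfer to the columnar Arrow format, which primarily speeds up `copy_to()`, `collect()`, and `spark_apply()` (assumes the `arrow` package is installed):

```r
library(sparklyr)
library(arrow)  # once attached, sparklyr serializes transfers with Arrow

sc <- spark_connect(master = "local")

# The spark_apply() call is unchanged; only the transfer format between the
# R workers and Spark differs, which matters most for wide or large data.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
result <- spark_apply(mtcars_tbl, function(df) {
  df$hp_per_wt <- df$hp / df$wt
  df
})
head(collect(result))
```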