
Commit aa910ab

Update udfs.md
1 parent a47d99d commit aa910ab

File tree

1 file changed: +6 -8 lines changed


Spark_Distributed_R/udfs.md

+6 -8
@@ -10,9 +10,8 @@
 * [Leveraging Packages in Distributed R](#leveraging-packages-in-distributed-r)
 * [Apache Arrow](#apache-arrow)
 
-___
 
-#### Understanding UDFs
+### Understanding UDFs
 
 Both `SparkR` and `sparklyr` support user-defined functions (UDFs) in R, which allow you to execute arbitrary R code across a cluster. The advantage here is the ability to distribute the computation of functions included in R's massive ecosystem of third-party packages. In particular, you may want to use a domain-specific package for machine learning or apply a specific statistical transformation that is not available through the Spark API. Running in-house custom R libraries on larger data sets is another common use case for this family of functions.
 
@@ -36,7 +35,7 @@ The general best practice is to leverage the Spark API first and foremost, then
 
 ___
 
-#### Distributed `apply`
+### Distributed `apply`
 
 Between `sparklyr` and `SparkR` there are a number of options for how you can distribute your R code across a cluster with Spark. Functions can be applied to each *group* or each *partition* of a Spark DataFrame, or to a list of elements in R. In the following table you can see the whole family of distributed `apply` functions:
 
@@ -51,7 +50,7 @@ Between `sparklyr` and `SparkR` there are a number of options for how you can di
 
 Let's work through these different functions one by one.
 
-##### `spark_apply`
+#### `spark_apply`
 
 For the first example, we'll use **`spark_apply()`**.
 
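To make the section being edited easier to follow, here is a minimal sketch of a `spark_apply()` call. The local connection and the `mtcars` data are placeholders for illustration only, not the dataset used in the guide.

```r
library(sparklyr)

# Hypothetical local connection; the guide assumes a cluster connection instead
sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark purely for illustration
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# spark_apply() ships the R closure to each partition of the Spark DataFrame,
# runs it on that partition as a regular R data.frame, and binds the results
results <- spark_apply(
  mtcars_tbl,
  function(df) {
    df$power_to_weight <- df$hp / df$wt  # arbitrary R code can run here
    df
  }
)

head(results)
```

By default the function is applied once per partition; supplying the `group_by` argument applies it once per group instead.
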
@@ -102,7 +101,7 @@ head(resultsDF)
 ```
 
 
-##### `dapply` & `gapply`
+#### `dapply` & `gapply`
 
 In `SparkR`, there are separate functions depending on whether you want to run R code on each partition of a Spark DataFrame (`dapply`) or on each group (`gapply`). With these functions you **must** supply the schema ahead of time. In the next example we will recreate the first one, but use `gapply` instead.
 
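A rough sketch of that schema requirement, again with `mtcars` standing in for the guide's data:

```r
library(SparkR)
sparkR.session()

sdf <- createDataFrame(mtcars)

# gapply() requires the output schema to be declared up front
schema <- structType(
  structField("cyl", "double"),
  structField("avg_mpg", "double")
)

result <- gapply(
  sdf,
  "cyl",                                  # grouping column(s)
  function(key, x) {                      # key = group value(s), x = R data.frame
    data.frame(cyl = key[[1]], avg_mpg = mean(x$mpg))
  },
  schema
)

head(result)
```

`dapply()` has the same schema requirement but applies the function to each partition rather than each group.
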
@@ -141,7 +140,7 @@ head(resultsDF)
 6 XE XE_new
 ```
 
-##### `spark.lapply`
+#### `spark.lapply`
 
 This final function is also from `SparkR`. It accepts a list and then uses Spark to apply R code to each element in the list across the cluster. As [the docs](https://spark.apache.org/docs/latest/api/R/spark.lapply.html) state, it is conceptually similar to `lapply` in base R, so it will return a **list** back to the driver.
 
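A minimal sketch of that pattern, assuming an active `SparkR` session (the formulas and `mtcars` here are illustrative only, not taken from the guide):

```r
library(SparkR)
sparkR.session()

# One list element per task; each worker fits a model and returns a scalar
formulas <- list(mpg ~ wt, mpg ~ hp, mpg ~ wt + hp)

r_squared <- spark.lapply(formulas, function(f) {
  summary(lm(f, data = mtcars))$r.squared
})

# spark.lapply() returns a plain R list on the driver
unlist(r_squared)
```
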
@@ -175,7 +174,6 @@ head(tidied)
 6 XE XE_new
 ```
 
-%md
 ___
 
 ### Leveraging Packages in Distributed R
@@ -223,7 +221,7 @@ head(coefDF)
 6 MSY DepDelay 0.981 0.00663 148. 0.
 ```
 
-## Apache Arrow
+### Apache Arrow
 
 [Apache Arrow](https://arrow.apache.org/) is a project that aims to improve analytics processing performance by representing data in memory in a columnar format and taking advantage of modern hardware. The main purpose and benefit of the project can be summed up in the following image, taken from the project's homepage.
 
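For `sparklyr`, enabling Arrow is typically just a matter of attaching the `arrow` package before transferring data (assuming a compatible combination of `arrow`, `sparklyr`, and Spark is installed); the UDF code itself is unchanged. An illustrative sketch, reusing the hypothetical names from above:

```r
# Attaching arrow alongside sparklyr switches copy_to(), collect(), and
# spark_apply() to Arrow-based columnar serialization
library(arrow)
library(sparklyr)

sc <- spark_connect(master = "local")   # hypothetical local connection

mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Same spark_apply() call as before, now with Arrow handling R <-> JVM transfer
results <- spark_apply(mtcars_tbl, function(df) {
  df$power_to_weight <- df$hp / df$wt
  df
})

head(results)
```
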
0 commit comments
