
Commit c2953f1

Deploy Apache Spark on Google Axion C4A virtual machine
Signed-off-by: odidev <[email protected]>
1 parent abac95f commit c2953f1

File tree

8 files changed: +571 additions, 0 deletions

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
---
title: Deploy Apache Spark on Google Axion C4A virtual machine

minutes_to_complete: 60

who_is_this_for: This is an introductory topic for software developers who want to migrate their Apache Spark workloads from x86_64 platforms to Arm-based platforms, and to Google Axion-based C4A virtual machines in particular.

learning_objectives:
- Provision an Arm virtual machine on Google Cloud Platform using the C4A Google Axion instance family, with RHEL 9 as the base image.
- Understand how to install and configure Apache Spark on Arm-based GCP C4A instances.
- Validate the functionality of Spark through baseline testing.
- Perform benchmarking to evaluate Apache Spark’s performance on Arm.

prerequisites:
- A [Google Cloud Platform (GCP)](https://cloud.google.com/free?utm_source=google&hl=en) account with billing enabled.
- A basic understanding of the Linux command line.
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).

author: Jason Andrews

##### Tags
skilllevels: Advanced
subjects: Performance and Architecture
cloud_service_providers: Google Cloud

armips:
- Neoverse

tools_software_languages:
- Apache Spark
- Python

operatingsystems:
- Linux

# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
further_reading:
    - resource:
        title: Google Cloud official website and documentation
        link: https://cloud.google.com/docs
        type: documentation

    - resource:
        title: Spark official website and documentation
        link: https://spark.apache.org/
        type: documentation

    - resource:
        title: The Scala programming language official website
        link: https://www.scala-lang.org
        type: website

weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # Indicates this should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
---
title: "About Google Axion C4A series and Apache Spark"

weight: 2

layout: "learningpathall"
---

## Google Axion C4A series

The Google Axion C4A series is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse V2 cores. Designed for high-performance, energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.

The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.

To learn more about Google Axion, refer to the blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu).

## Apache Spark

Apache Spark is an open-source, distributed computing system designed for fast, general-purpose big data processing.

It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for increased performance.

Spark is widely used for large-scale data analytics, machine learning, and real-time data processing. Learn more from the [Apache Spark official website](https://spark.apache.org/) and its [detailed official documentation](https://spark.apache.org/docs/latest/).
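
To give a flavor of these high-level APIs, below is a minimal word-count sketch in Scala. It assumes a `SparkSession` named `spark`, as provided automatically inside `spark-shell`, and the input path is a hypothetical placeholder:

```scala
// Minimal word-count sketch; `spark` is the SparkSession that spark-shell
// creates for you, and /tmp/sample.txt is a hypothetical input file.
val lines = spark.sparkContext.textFile("/tmp/sample.txt")

val counts = lines
  .flatMap(line => line.split("\\s+")) // split each line into words
  .map(word => (word, 1))              // pair each word with a count of 1
  .reduceByKey(_ + _)                  // sum the counts per distinct word

counts.take(10).foreach(println)       // print a small sample on the driver
```

The same driver code runs unchanged whether Spark executes in local mode or on a cluster, which is what makes the high-level API convenient for scaling workloads.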
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
---
title: Baseline Testing
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Now that Apache Spark is installed on your GCP C4A Arm virtual machine, perform a simple baseline test to validate that Spark runs correctly and produces the expected output.

## Spark Baseline Test

Create a simple Spark job file:

```console
nano ~/spark_baseline_test.scala
```

Add the following content to the **spark_baseline_test.scala** file:

```scala
val data = Seq(1, 2, 3, 4, 5)
val distData = spark.sparkContext.parallelize(data)

// Basic transformation and action
val squared = distData.map(x => x * x).collect()

println("Squared values: " + squared.mkString(", "))
```
Code Explanation:

This code is a basic Apache Spark example in Scala, demonstrating how to create an RDD (Resilient Distributed Dataset), perform a transformation, and collect the results.

What it does, step by step:

- **val data = Seq(1, 2, 3, 4, 5)**: Creates a local Scala sequence of integers.
- **val distData = spark.sparkContext.parallelize(data)**: Uses `parallelize` to convert the local sequence into a distributed RDD, so that Spark can operate on it in parallel across cluster nodes or CPU cores.
- **val squared = distData.map(x => x * x).collect()**: `map(x => x * x)` squares each element, and `.collect()` brings the transformed data back to the driver program as a regular Scala collection.
- **println("Squared values: " + squared.mkString(", "))**: Prints the squared values, joined by commas.
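
If you want to experiment further, the optional sketch below reuses `distData` from the script above in the same `spark-shell` session; the extra actions shown here (`reduce` and `getNumPartitions`) are standard RDD operations and are not part of the original baseline test:

```scala
// Optional follow-on checks, reusing `distData` from the baseline script.
val sumOfSquares = distData.map(x => x * x).reduce(_ + _) // aggregation runs on the executors
println("Sum of squares: " + sumOfSquares)                // 1 + 4 + 9 + 16 + 25 = 55

// In local mode the default partition count typically matches the available CPU cores.
println("Partitions: " + distData.getNumPartitions)
```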

### Run the Test in Spark Shell

Run the test in the interactive shell:

```console
spark-shell < ~/spark_baseline_test.scala
```

You should see output similar to:

```output
Squared values: 1, 4, 9, 16, 25
```

This confirms that Spark is working correctly with its driver, executor, and cluster manager in local mode.
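
As an optional extra check, you can also run one of the examples bundled with Spark; this assumes a standard Spark distribution with its `bin` directory on your `PATH`:

```console
run-example SparkPi 10
```

The job should finish with a line similar to `Pi is roughly 3.14`, confirming that `spark-submit` (which `run-example` wraps) also works on the Arm-based instance.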