---
title: Deploy Apache Spark on Google Axion C4A virtual machine

draft: true
cascade:
draft: true

minutes_to_complete: 60

who_is_this_for: This is an introductory topic for software developers who want to migrate their Apache Spark workloads from x86_64 platforms to Arm-based platforms, specifically Google Axion-based C4A virtual machines.

learning_objectives:
- Provision an Arm virtual machine on the Google Cloud Platform using the C4A Google Axion instance family, and RHEL 9 as the base image.
- Understand how to install and configure Apache Spark on Arm-based GCP C4A instances.
- Validate the functionality of Apache Spark through baseline testing.
- Perform benchmarking to evaluate Apache Spark’s performance on Arm.

prerequisites:
- A [Google Cloud Platform (GCP)](https://cloud.google.com/free) account with billing enabled.
- Basic understanding of Linux command line.
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).

author: Jason Andrews

##### Tags
skilllevels: Advanced
subjects: Performance and Architecture
cloud_service_providers: Google Cloud

armips:
- Neoverse

tools_software_languages:
- Apache Spark
- Python

operatingsystems:
- Linux

# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
further_reading:
- resource:
title: Google Cloud official website and documentation
link: https://cloud.google.com/docs
type: documentation

- resource:
title: Spark official website and documentation
link: https://spark.apache.org/
type: documentation

- resource:
title: The Scala programming language official website
link: https://scala-lang.org
type: website


weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # Indicates this should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: "About Google Axion C4A series and Apache Spark"

weight: 2

layout: "learningpathall"
---

## Google Axion C4A series

The Google Axion C4A series is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which uses Arm Neoverse V2 cores. Designed for high-performance, energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.

The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.

To learn more about Google Axion, refer to the blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu).

## Apache Spark

Apache Spark is an open-source, distributed computing system designed for fast and general-purpose big data processing.

It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for increased performance.

Spark is widely used for large-scale data analytics, machine learning, and real-time data processing. Learn more from the [Apache Spark official website](https://spark.apache.org/) and its [detailed official documentation](https://spark.apache.org/docs/latest/).
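
Spark’s high-level API mirrors the `map`/`flatMap`/`groupBy` style of ordinary Scala collections, which is why pipelines can be prototyped locally before being distributed. The sketch below is a plain-Scala illustration of that style (it does not require Spark; on a Spark RDD the `groupBy`/`size` step would typically be `reduceByKey`):

```scala
// Plain-Scala sketch of the word-count pattern that Spark's RDD API
// parallelizes across a cluster. No Spark installation is needed here.
val lines = Seq("spark on arm", "spark on axion")

// On an RDD this would be flatMap + map + reduceByKey; on a local
// collection, groupBy plays the reduceByKey role.
val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .groupBy(identity)
  .map { case (word, occurrences) => (word, occurrences.size) }

println(wordCounts.toSeq.sortBy(_._1).mkString(", "))
```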
---
title: Baseline Testing
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---


With Apache Spark installed on your GCP C4A Arm virtual machine, you can now run a simple baseline test to validate that Spark works correctly and produces the expected output.

## Spark Baseline Test

Create a simple Spark job file:
```console
nano ~/spark_baseline_test.scala
```
Add the following content to the **spark_baseline_test.scala** file:

```scala
val data = Seq(1, 2, 3, 4, 5)
val distData = spark.sparkContext.parallelize(data)

// Basic transformation and action
val squared = distData.map(x => x * x).collect()

println("Squared values: " + squared.mkString(", "))
```
### Code explanation

This is a basic Apache Spark example in Scala: it creates an RDD (Resilient Distributed Dataset), applies a transformation, and collects the results.

What it does, step by step:

- **val data = Seq(1, 2, 3, 4, 5)** : Creates a local Scala sequence of integers.
- **val distData = spark.sparkContext.parallelize(data)** : Uses parallelize to convert the local sequence into a distributed RDD (so Spark can operate on it in parallel across cluster nodes or CPU cores).
- **val squared = distData.map(x => x * x).collect()** : `map(x => x * x)` squares each element in the list, `.collect()` brings all the transformed data back to the driver program as a regular Scala collection.
- **println("Squared values: " + squared.mkString(", "))** : Prints the squared values, joined by commas.
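
Because `map` and `collect` mirror plain Scala collection operations, you can sanity-check the expected result locally, without Spark. This sketch runs the same lambda on a local `Seq`; the RDD version simply distributes that computation:

```scala
// Local check of what the Spark job should print: square each element
// of Seq(1..5) and join the results with commas.
val data = Seq(1, 2, 3, 4, 5)
val squared = data.map(x => x * x)   // same lambda as the RDD map

println("Squared values: " + squared.mkString(", "))
// → Squared values: 1, 4, 9, 16, 25
```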


### Run the Test in Spark Shell

Run the test in the interactive shell:
```console
spark-shell < ~/spark_baseline_test.scala
```
You should see output similar to:
```output
Squared values: 1, 4, 9, 16, 25
```
This confirms that Spark is working correctly with its driver, executor, and cluster manager in local mode.
