Run Spark applications on the Microsoft Azure Cobalt 100 processors #2191

Merged
merged 1 commit into from
Aug 19, 2025
@@ -0,0 +1,66 @@
---
title: Run Spark applications on the Microsoft Azure Cobalt 100 processors

minutes_to_complete: 60

who_is_this_for: This Learning Path introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm with minimal or no changes.

learning_objectives:
- Provision an Azure Arm64 virtual machine using the Azure console, with Ubuntu as the base image.
- Create an Azure Linux 3.0 Docker container.
- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container and on an Azure virtual machine built from a custom Azure Linux 3.0 image.
- Benchmark Spark inside the container and on the custom virtual machine.

prerequisites:
- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100-based instances (Dpsv6).
- A machine with [Docker](/install-guides/docker/) installed.
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).

author: Jason Andrews

### Tags
skilllevels: Advanced
subjects: Performance and Architecture
cloud_service_providers: Microsoft Azure

armips:
- Neoverse

tools_software_languages:
- Apache Spark
- Python
- Docker


operatingsystems:
- Linux

further_reading:
    - resource:
        title: Azure Virtual Machines documentation
        link: https://learn.microsoft.com/en-us/azure/virtual-machines/
        type: documentation
    - resource:
        title: Azure Container Instances documentation
        link: https://learn.microsoft.com/en-us/azure/container-instances/
        type: documentation
    - resource:
        title: Docker overview
        link: https://docs.docker.com/get-started/overview/
        type: documentation
    - resource:
        title: Spark official website and documentation
        link: https://spark.apache.org/
        type: documentation
    - resource:
        title: Hadoop official website
        link: https://hadoop.apache.org/
        type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,25 @@
---
title: "About the Cobalt 100 Arm-based processor and Apache Spark"

weight: 2

layout: "learningpathall"
---

## What is the Cobalt 100 Arm-based processor?

Azure Cobalt 100 is Microsoft's first-generation, in-house Arm-based processor. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.

To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

## Introduction to Azure Linux 3.0

Azure Linux 3.0 is Microsoft's in-house, lightweight Linux distribution optimized for running cloud-native workloads on Azure. Designed with performance, security, and reliability in mind, it is fully supported by Microsoft and tailored for containers, microservices, and Kubernetes. Azure Linux 3.0 natively supports the Arm64 (AArch64) architecture, so workloads run efficiently on energy-efficient Arm-based infrastructure, making it a strong choice for scalable and cost-effective cloud deployments.
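
To confirm that a virtual machine or container is actually running on Arm64, one quick check is to look at the machine architecture reported by the operating system. The following is a minimal sketch using only the Python standard library; the helper name `is_arm64` and the set `ARM64_NAMES` are illustrative and not part of this Learning Path.

```python
import platform

# Names commonly reported for 64-bit Arm: "aarch64" on Linux, "arm64" on macOS.
# This set is an assumption for illustration; extend it if your platform differs.
ARM64_NAMES = {"aarch64", "arm64"}


def is_arm64(machine: str) -> bool:
    """Return True if the given machine string names a 64-bit Arm architecture."""
    return machine.strip().lower() in ARM64_NAMES


# Report the architecture of the machine this script runs on.
machine = platform.machine()
print(f"{machine}: {'Arm64' if is_arm64(machine) else 'not Arm64'}")
```

On a Cobalt 100 virtual machine running Linux, the reported machine string is expected to be `aarch64`.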

## Apache Spark

Apache Spark is an open-source, distributed computing system designed for fast and general-purpose big data processing.

It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for increased performance.

Spark is widely used for large-scale data analytics, machine learning, and real-time data processing. Learn more from the [Apache Spark official website](https://spark.apache.org/) and its [detailed official documentation](https://spark.apache.org/docs/latest/).
@@ -0,0 +1,41 @@
---
title: Baseline Testing
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---


## Baseline Testing
With Apache Spark installed on your Arm virtual machine, run a simple baseline test to validate that Spark works correctly and produces the expected output.

Create a file named `test_spark.py` and add the following PySpark script to it:

```python
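from pyspark.sql import SparkSession
# Start (or reuse) a Spark session for this test application
spark = SparkSession.builder.appName("Test").getOrCreate()
# Build a two-row DataFrame and print it as a table
df = spark.createDataFrame([(1, "ARM64"), (2, "Azure")], ["id", "name"])
df.show()
# Shut down the Spark session cleanly
spark.stop()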
```
Run the script with:
```console
spark-submit test_spark.py
```
You should see output similar to:

```output
25/07/22 05:16:00 INFO CodeGenerator: Code generated in 10.545923 ms
25/07/22 05:16:00 INFO SparkContext: SparkContext is stopping with exitCode 0.
+---+-----+
| id| name|
+---+-----+
|  1|ARM64|
|  2|Azure|
+---+-----+
```
Output summary:

- Spark generated code for the query **(about 10.5 ms)** and executed a simple DataFrame operation.
- Spark displayed the test rows **(1, "ARM64")** and **(2, "Azure")**, then shut down cleanly **(exitCode 0)**. This confirms a working Spark deployment on Arm64.
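
If you want to automate this check rather than read the output by eye, one option is to parse the table that `df.show()` prints and compare it with the expected rows. The following is a minimal sketch, not part of the Learning Path itself: `parse_show_table` is a hypothetical helper, and the sample text mirrors the output above. It assumes that cell values do not contain the `|` character.

```python
def parse_show_table(text: str) -> list[dict[str, str]]:
    """Parse the ASCII table printed by DataFrame.show() into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Table rows start and end with "|"; log lines and "+---+" borders do not.
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    # The first table row is the header; the remaining rows are data.
    header, *data = rows
    return [dict(zip(header, values)) for values in data]


# Sample text that mirrors the spark-submit output shown above.
sample = """\
25/07/22 05:16:00 INFO CodeGenerator: Code generated in 10.545923 ms
+---+-----+
| id| name|
+---+-----+
|  1|ARM64|
|  2|Azure|
+---+-----+
"""

rows = parse_show_table(sample)
assert rows == [{"id": "1", "name": "ARM64"}, {"id": "2", "name": "Azure"}]
print(rows)
```

All parsed values are strings because the helper works on printed text only. For checks inside the Spark job itself, comparing the result of `df.collect()` with the expected rows is usually simpler.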