diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md
new file mode 100644
index 0000000000..587b792ea0
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md
@@ -0,0 +1,66 @@
+---
+title: Run Spark applications on Microsoft Azure Cobalt 100 processors
+
+minutes_to_complete: 60
+
+who_is_this_for: This Learning Path introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm with minimal or no changes.
+
+learning_objectives:
+  - Provision an Azure Arm64 virtual machine using the Azure console, with Ubuntu as the base image.
+  - Create an Azure Linux 3.0 Docker container.
+  - Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container and an Azure virtual machine created from a custom Azure Linux 3.0 image.
+  - Perform Spark benchmarking inside both the container and the custom virtual machine.
+
+prerequisites:
+  - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100-based instances (Dpsv6).
+  - A machine with [Docker](/install-guides/docker/) installed.
+  - Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).
+ +author: Jason Andrews + +### Tags +skilllevels: Advanced +subjects: Performance and Architecture +cloud_service_providers: Microsoft Azure + +armips: + - Neoverse + +tools_software_languages: + - Apache Spark + - Python + - Docker + + +operatingsystems: + - Linux + +further_reading: + - resource: + title: Azure Virtual Machines documentation + link: https://learn.microsoft.com/en-us/azure/virtual-machines/ + type: documentation + - resource: + title: Azure Container Instances documentation + link: https://learn.microsoft.com/en-us/azure/container-instances/ + type: documentation + - resource: + title: Docker overview + link: https://docs.docker.com/get-started/overview/ + type: documentation + - resource: + title: Spark official website and documentation + link: https://spark.apache.org/ + type: documentation + - resource: + title: Hadoop official website + link: https://hadoop.apache.org/ + type: website + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. 
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+--- 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md
new file mode 100644
index 0000000000..ef37aba819
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md
@@ -0,0 +1,25 @@
+---
+title: "About the Cobalt 100 Arm-based processor and Apache Spark"
+
+weight: 2
+
+layout: "learningpathall"
+---
+
+## What is the Cobalt 100 Arm-based processor?
+
+Azure’s Cobalt 100 is Microsoft’s first-generation, in-house Arm-based processor. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads, including web and application servers, data analytics, open-source databases, and caching systems. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+
+To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
+
+## Introduction to Azure Linux 3.0
+
+Azure Linux 3.0 is Microsoft's in-house, lightweight Linux distribution optimized for running cloud-native workloads on Azure. Designed with performance, security, and reliability in mind, it is fully supported by Microsoft and tailored for containers, microservices, and Kubernetes.
With native support for the Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on energy-efficient Arm-based infrastructure, making it a strong choice for scalable and cost-effective cloud deployments.
+
+## Apache Spark
+
+Apache Spark is an open-source, distributed computing system designed for fast, general-purpose big data processing.
+
+It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for increased performance.
+
+Spark is widely used for large-scale data analytics, machine learning, and real-time data processing. Learn more from the [Apache Spark official website](https://spark.apache.org/) and its [detailed official documentation](https://spark.apache.org/docs/latest/). 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md
new file mode 100644
index 0000000000..4438e06449
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md
@@ -0,0 +1,41 @@
+---
+title: Baseline Testing
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+## Baseline Testing
+Now that Apache Spark is installed on your Arm virtual machine, run a simple baseline test to validate that Spark works correctly and produces the expected output.
+
+To run a simple PySpark script, create a file named `test_spark.py` and add the following content to it:
+
+```python
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.appName("Test").getOrCreate()
+df = spark.createDataFrame([(1, "ARM64"), (2, "Azure")], ["id", "name"])
+df.show()
+spark.stop()
+```
+Execute it with:
+```console
+spark-submit test_spark.py
+```
+You should see output similar to:
+
+```output
+25/07/22 05:16:00 INFO CodeGenerator: Code generated in 10.545923 ms
+25/07/22 05:16:00 INFO SparkContext: SparkContext is stopping with exitCode 0.
++---+-----+
+| id| name|
++---+-----+
+|  1|ARM64|
+|  2|Azure|
++---+-----+
+```
+Output summary:
+
+- Spark successfully generated code (in about **10.5 ms**) and executed a simple DataFrame operation.
+- The test data **(1, "ARM64")** and **(2, "Azure")** is displayed before Spark shuts down cleanly **(exitCode 0)**, confirming a working Spark deployment on Arm64. 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md
new file mode 100644
index 0000000000..71a204abfc
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md
@@ -0,0 +1,247 @@
+---
+title: Spark Internal Benchmarking
+weight: 7
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Apache Spark Internal Benchmarking
+Apache Spark includes internal micro-benchmarks to evaluate the performance of core components such as SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing platforms such as x86_64 and Arm64.
+Below are the steps to run Spark’s built-in SQL benchmarks using the SBT-based framework.
+
+1. Clone the Apache Spark source code
+```console
+git clone https://github.com/apache/spark.git
+```
+This downloads the full Spark source, including the internal test suites and benchmarking tools.
+
+2. Check out the desired Spark version
+```console
+cd spark/ && git checkout v4.0.0
+```
+This switches to the stable Spark 4.0.0 release, which supports the latest internal benchmarking APIs.
+
+3. Build Spark with the benchmarking profile enabled
+```console
+./build/sbt -Pbenchmarks clean package
+```
+This compiles Spark and its dependencies with the benchmarks build profile enabled for performance testing.
+
+4.
Run a built-in benchmark suite +```console +./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark" +``` +This executes the **JoinBenchmark**, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution. + +You should see an output similar to: +```output +[info] Running benchmark: Join w long +[info] Running case: Join w long wholestage off +[info] Stopped after 2 iterations, 5297 ms +[info] Running case: Join w long wholestage on +[info] Stopped after 5 iterations, 4238 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:05:52.695 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] Join w long: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] Join w long wholestage off 2345 2649 429 8.9 111.8 1.0X +[info] Join w long wholestage on 842 848 5 24.9 40.2 2.8X +[info] Running benchmark: Join w long duplicated +[info] Running case: Join w long duplicated wholestage off +[info] Stopped after 2 iterations, 3931 ms +[info] Running case: Join w long duplicated wholestage on +[info] Stopped after 5 iterations, 4350 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:06:05.954 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] Join w long duplicated: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] 
------------------------------------------------------------------------------------------------------------------------ +[info] Join w long duplicated wholestage off 1965 1966 1 10.7 93.7 1.0X +[info] Join w long duplicated wholestage on 865 870 4 24.2 41.3 2.3X +[info] Running benchmark: Join w 2 ints +[info] Running case: Join w 2 ints wholestage off +[info] Stopped after 2 iterations, 216362 ms +[info] Running case: Join w 2 ints wholestage on +[info] Stopped after 5 iterations, 538414 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:22:16.697 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] Join w 2 ints: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] Join w 2 ints wholestage off 108110 108181 101 0.2 5155.1 1.0X +[info] Join w 2 ints wholestage on 107521 107683 109 0.2 5127.0 1.0X +[info] Running benchmark: Join w 2 longs +[info] Running case: Join w 2 longs wholestage off +[info] Stopped after 2 iterations, 7806 ms +[info] Running case: Join w 2 longs wholestage on +[info] Stopped after 5 iterations, 10771 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:22:41.568 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] Join w 2 longs: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] Join w 2 longs wholestage off 3867 3903 51 5.4 184.4 1.0X +[info] Join w 2 longs wholestage on 2061 2154 113 10.2 98.3 1.9X +[info] Running benchmark: Join w 2 longs duplicated +[info] Running case: Join w 
2 longs duplicated wholestage off +[info] Stopped after 2 iterations, 17850 ms +[info] Running case: Join w 2 longs duplicated wholestage on +[info] Stopped after 5 iterations, 26145 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:23:40.009 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] Join w 2 longs duplicated: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] Join w 2 longs duplicated wholestage off 8923 8925 4 2.4 425.5 1.0X +[info] Join w 2 longs duplicated wholestage on 5224 5229 8 4.0 249.1 1.7X +[info] Running benchmark: outer join w long +[info] Running case: outer join w long wholestage off +[info] Stopped after 2 iterations, 3070 ms +[info] Running case: outer join w long wholestage on +[info] Stopped after 5 iterations, 4178 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:23:52.993 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] outer join w long: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] outer join w long wholestage off 1531 1535 6 13.7 73.0 1.0X +[info] outer join w long wholestage on 833 836 3 25.2 39.7 1.8X +[info] Running benchmark: semi join w long +[info] Running case: semi join w long wholestage off +[info] Stopped after 2 iterations, 2152 ms +[info] Running case: semi join w long wholestage on +[info] Stopped after 5 iterations, 2569 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:24:02.077 ERROR 
org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] semi join w long: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] semi join w long wholestage off 1069 1076 10 19.6 51.0 1.0X +[info] semi join w long wholestage on 512 514 2 40.9 24.4 2.1X +[info] Running benchmark: sort merge join +[info] Running case: sort merge join wholestage off +[info] Stopped after 2 iterations, 1106 ms +[info] Running case: sort merge join wholestage on +[info] Stopped after 5 iterations, 2406 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:24:10.485 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] sort merge join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] sort merge join wholestage off 551 553 3 3.8 262.7 1.0X +[info] sort merge join wholestage on 478 481 3 4.4 227.8 1.2X +[info] Running benchmark: sort merge join with duplicates +[info] Running case: sort merge join with duplicates wholestage off +[info] Stopped after 2 iterations, 2426 ms +[info] Running case: sort merge join with duplicates wholestage on +[info] Stopped after 5 iterations, 5285 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:24:23.172 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] sort merge join with duplicates: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] 
------------------------------------------------------------------------------------------------------------------------------ +[info] sort merge join with duplicates wholestage off 1207 1213 9 1.7 575.4 1.0X +[info] sort merge join with duplicates wholestage on 1029 1057 18 2.0 490.5 1.2X +[info] Running benchmark: shuffle hash join +[info] Running case: shuffle hash join wholestage off +[info] Stopped after 2 iterations, 1170 ms +[info] Running case: shuffle hash join wholestage on +[info] Stopped after 5 iterations, 2037 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:24:31.312 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] shuffle hash join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------ +[info] shuffle hash join wholestage off 578 585 9 7.3 137.9 1.0X +[info] shuffle hash join wholestage on 385 408 13 10.9 91.8 1.5X +[info] Running benchmark: broadcast nested loop join +[info] Running case: broadcast nested loop join wholestage off +[info] Stopped after 2 iterations, 53739 ms +[info] Running case: broadcast nested loop join wholestage on +[info] Stopped after 5 iterations, 94642 ms +[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.6.92.2-2.azl3 +[info] 06:27:45.952 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: +[info] Unknown processor +[info] broadcast nested loop join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +[info] ------------------------------------------------------------------------------------------------------------------------- +[info] broadcast nested loop join wholestage off 26847 26870 32 0.8 1280.2 1.0X +[info] broadcast nested loop join wholestage on 
18857 18928 84 1.1 899.2 1.4X +[success] Total time: 1644 s (27:24), completed Jul 25, 2025, 6:27:46 AM +``` +### Benchmark Results Table Explained: + +- **Best Time (ms):** Fastest execution time observed (in milliseconds). +- **Avg Time (ms):** Average time across all iterations. +- **Stdev (ms):** Standard deviation of execution times (lower is more stable). +- **Rate (M/s):** Rows processed per second in millions. +- **Per Row (ns):** Average time taken per row (in nanoseconds). +- **Relative Speed comparison:** baseline (1.0X) is the slower version. + +{{% notice Note %}} +Benchmarking was performed in both an Azure Linux 3.0 Docker container and an Azure Linux 3.0 virtual machine. The benchmark results were found to be comparable. +{{% /notice %}} + +Accordingly, this Learning path includes benchmark results from virtual machines only, for both x86 and Arm64 platforms. +### Benchmark summary on x86_64: +The following benchmark results are collected on an x86_64 **D4s_v4 Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc**. 
+| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative | +|------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------| +| Join w long | Off | 3168 | 3185 | 24 | 6.6 | 151.1 | 1.0X | +| | On | 1509 | 1562 | 61 | 13.9 | 72.0 | 2.1X | +| Join w long duplicated | Off | 2490 | 2504 | 20 | 8.4 | 118.7 | 1.0X | +| | On | 1151 | 1181 | 27 | 18.2 | 54.9 | 2.2X | +| Join w 2 ints | Off | 217074 | 219364 | 3239 | 0.1 | 10350.9 | 1.0X | +| | On | 119692 | 119756 | 74 | 0.2 | 5707.4 | 1.8X | +| Join w 2 longs | Off | 4367 | 4401 | 49 | 4.8 | 208.2 | 1.0X | +| | On | 2952 | 3003 | 35 | 7.1 | 140.8 | 1.5X | +| Join w 2 longs duplicated | Off | 10255 | 10286 | 45 | 2.0 | 489.0 | 1.0X | +| | On | 7243 | 7300 | 36 | 2.9 | 345.4 | 1.4X | +| Outer join w long | Off | 2401 | 2422 | 30 | 8.7 | 114.5 | 1.0X | +| | On | 1544 | 1564 | 17 | 13.6 | 73.6 | 1.6X | +| Semi join w long | Off | 1344 | 1350 | 8 | 15.6 | 64.1 | 1.0X | +| | On | 673 | 685 | 12 | 31.2 | 32.1 | 2.0X | +| Sort merge join | Off | 1144 | 1145 | 1 | 1.8 | 545.6 | 1.0X | +| | On | 1177 | 1228 | 46 | 1.8 | 561.4 | 1.0X | +| Sort merge join w/ duplicates | Off | 2075 | 2113 | 55 | 1.0 | 989.4 | 1.0X | +| | On | 1704 | 1720 | 14 | 1.2 | 812.3 | 1.2X | +| Shuffle hash join | Off | 672 | 674 | 2 | 6.2 | 160.3 | 1.0X | +| | On | 524 | 525 | 1 | 8.0 | 124.9 | 1.3X | +| Broadcast nested loop join | Off | 36060 | 36103 | 62 | 0.6 | 1719.5 | 1.0X | +| | On | 31254 | 31346 | 78 | 0.7 | 1490.3 | 1.2X | + +### Benchmark summary on Arm64: +The following benchmark results were collected on an Arm64 **D4ps_v6 Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO**. 
+| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
+|----------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
+| Join w long | Off | 2345 | 2649 | 429 | 8.9 | 111.8 | 1.0X |
+| | On | 842 | 848 | 5 | 24.9 | 40.2 | 2.8X |
+| Join w long duplicated | Off | 1965 | 1966 | 1 | 10.7 | 93.7 | 1.0X |
+| | On | 865 | 870 | 4 | 24.2 | 41.3 | 2.3X |
+| Join w 2 ints | Off | 108110 | 108181 | 101 | 0.2 | 5155.1 | 1.0X |
+| | On | 107521 | 107683 | 109 | 0.2 | 5127.0 | 1.0X |
+| Join w 2 longs | Off | 3867 | 3903 | 51 | 5.4 | 184.4 | 1.0X |
+| | On | 2061 | 2154 | 113 | 10.2 | 98.3 | 1.9X |
+| Join w 2 longs duplicated | Off | 8923 | 8925 | 4 | 2.4 | 425.5 | 1.0X |
+| | On | 5224 | 5229 | 8 | 4.0 | 249.1 | 1.7X |
+| Outer join w long | Off | 1531 | 1535 | 6 | 13.7 | 73.0 | 1.0X |
+| | On | 833 | 836 | 3 | 25.2 | 39.7 | 1.8X |
+| Semi join w long | Off | 1069 | 1076 | 10 | 19.6 | 51.0 | 1.0X |
+| | On | 512 | 514 | 2 | 40.9 | 24.4 | 2.1X |
+| Sort merge join | Off | 551 | 553 | 3 | 3.8 | 262.7 | 1.0X |
+| | On | 478 | 481 | 3 | 4.4 | 227.8 | 1.2X |
+| Sort merge join with duplicates | Off | 1207 | 1213 | 9 | 1.7 | 575.4 | 1.0X |
+| | On | 1029 | 1057 | 18 | 2.0 | 490.5 | 1.2X |
+| Shuffle hash join | Off | 578 | 585 | 9 | 7.3 | 137.9 | 1.0X |
+| | On | 385 | 408 | 13 | 10.9 | 91.8 | 1.5X |
+| Broadcast nested loop join | Off | 26847 | 26870 | 32 | 0.8 | 1280.2 | 1.0X |
+| | On | 18857 | 18928 | 84 | 1.1 | 899.2 | 1.4X |
+
+### **Highlights from Azure Linux Arm64 virtual machine**
+
+- **Whole-stage codegen improves performance by up to 2.8×** on complex joins (e.g., with long columns).
+- **Simple joins (e.g., on integers)** show **negligible performance gain**, remaining close to 1.0×.
+- **Broadcast and shuffle-based joins** benefit moderately, with **1.4× to 1.5× improvements**.
+- **Overall**, enabling whole-stage codegen consistently improves performance across most join types.
+- **Benchmark results were consistent** across both the Docker container and the virtual machine on Arm64. 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md
new file mode 100644
index 0000000000..76aaec9f2b
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md
@@ -0,0 +1,34 @@
+---
+title: Set Up the Azure Linux 3.0 Environment
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+You can choose between working inside an Azure Linux 3.0 Docker container or inside a virtual machine created from the OS image.
+
+### Working inside Azure Linux 3.0 Docker container
+The Azure Linux Container Host is an operating system image optimized for running container workloads on Azure Kubernetes Service (AKS). Microsoft maintains the Azure Linux Container Host and based it on CBL-Mariner, an open-source Linux distribution created by Microsoft. To learn more about Azure Linux 3.0, see [What is Azure Linux Container Host for AKS](https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux).
+
+Azure Linux 3.0 supports AArch64. However, a standalone virtual machine image of Azure Linux 3.0 (CBL-Mariner 3.0) is not available for Arm. To use the default software stack provided by Microsoft, you can therefore create a Docker container with Azure Linux 3.0 as the base image and run the Spark application inside the container.
+
+#### Create Azure Linux 3.0 Docker Container
+The [Microsoft Artifact Registry](https://mcr.microsoft.com/en-us/artifact/mar/azurelinux/base/core/about) offers an up-to-date Docker image for Azure Linux 3.0.
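Before creating a container, you can optionally confirm that the published image includes an Arm64 variant by inspecting its multi-architecture manifest. This is a sketch and assumes Docker is installed with network access to the registry:

```console
# List the architectures declared in the image manifest;
# an "arm64" entry confirms Arm64 support.
docker manifest inspect mcr.microsoft.com/azurelinux/base/core:3.0 | grep '"architecture"'
```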
+
+To create a Docker container, install Docker and then run:
+
+```console
+sudo docker run -it --rm mcr.microsoft.com/azurelinux/base/core:3.0
+```
+The default container startup command is `bash`, and `tdnf` and `dnf` are the default package managers.
+
+### Working with Azure Linux 3.0 OS image
+The Azure Marketplace currently offers official Azure Linux 3.0 virtual machine images, published by Ntegral Inc., only for x64-based architectures; native Arm64 (AArch64) images are not yet officially available. For this Learning Path, you can therefore create your own custom Azure Linux 3.0 virtual machine image for AArch64 using the [AArch64 ISO for Azure Linux 3.0](https://github.com/microsoft/azurelinux#iso).
+
+Refer to [Create an Azure Linux 3.0 virtual machine with Cobalt 100 processors](https://learn.arm.com/learning-paths/servers-and-cloud-computing/azure-vm) for the details.
+
+Whether you are using an Azure Linux 3.0 Docker container or a virtual machine created from a custom Azure Linux 3.0 image, the deployment and benchmarking steps are the same.
+
+Once the setup is complete, you can proceed with the Spark installation. 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md
new file mode 100644
index 0000000000..983b8110e1
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md
@@ -0,0 +1,33 @@
+---
+title: Create an Arm-based cloud virtual machine using the Microsoft Cobalt 100 CPU
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Introduction
+
+There are several ways to create an Arm-based Cobalt 100 virtual machine: the Microsoft Azure console, the Azure CLI, or your choice of IaC (Infrastructure as Code) tool.
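If you prefer the Azure CLI route mentioned above, a minimal sketch looks like the following. The resource group, virtual machine name, and region are placeholder values, and the Ubuntu Arm64 image URN is an assumption; verify the exact URN for your region with `az vm image list`:

```console
# Create a resource group (name and region are placeholders)
az group create --name spark-rg --location eastus

# Create a Cobalt 100 (D4ps_v6) VM from an Ubuntu 24.04 Arm64 image
az vm create \
  --resource-group spark-rg \
  --name spark-arm-vm \
  --image Canonical:ubuntu-24_04-lts:server-arm64:latest \
  --size Standard_D4ps_v6 \
  --admin-username azureuser \
  --generate-ssh-keys
```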
This guide uses the Azure console to create a virtual machine with the Arm-based Cobalt 100 processor.
+
+This Learning Path focuses on the general-purpose D-series virtual machines. For details, read the Microsoft Azure guide on the [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series).
+
+If you have never used Microsoft Azure before, review the Microsoft [guide to create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu).
+
+#### Create an Arm-based Azure Virtual Machine
+
+Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create one, launch the Azure portal and navigate to Virtual Machines.
+
+Select “Create” and fill in details such as the name and region. Choose the image for your virtual machine (for example, Ubuntu 24.04) and select “Arm64” as the virtual machine architecture.
+
+In the “Size” field, click “See all sizes”, select the D-series v6 family of virtual machines, choose “D4ps_v6” from the list, and create the virtual machine.
+
+![Instance Screenshot](./instance-new.png)
+
+Once the virtual machine is ready and running, you can SSH into it using your private key and its public IP address.
+
+{{% notice Note %}}
+
+To learn more about Arm-based virtual machines in Azure, refer to “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/azure).
+
+{{% /notice %}} 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md
new file mode 100644
index 0000000000..4783931164
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md
@@ -0,0 +1,61 @@
+---
+title: Install Apache Spark on Microsoft Azure Cobalt 100 processors
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Install Apache Spark
+
+Install Java, Python, and essential tools on Azure Cobalt 100, then download, configure, and verify Apache Spark for use on the Arm-based platform.
+### Install Required Packages
+
+```console
+sudo tdnf update -y
+sudo tdnf install -y java-17-openjdk java-17-openjdk-devel git maven wget nano curl unzip tar
+sudo dnf install -y python3 python3-pip
+```
+Verify the Java and Python installations:
+```console
+java -version
+python3 --version
+```
+
+### Install Apache Spark on Arm
+```console
+wget https://downloads.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
+tar -xzf spark-3.5.6-bin-hadoop3.tgz
+sudo mv spark-3.5.6-bin-hadoop3 /opt/spark
+```
+### Set Environment Variables
+Add these lines to `~/.bashrc` (or `~/.zshrc`) to make the change persistent across terminal sessions:

+```console
+echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
+echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
+```
+Apply the changes immediately:
+
+```console
+source ~/.bashrc
+```
+
+### Verify Spark Installation
+
+```console
+spark-submit --version
+```
+You should see output like:
+
+```output
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/ '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 3.5.6
+      /_/
+
+Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 17.0.15
+```
+Spark installation is complete. You can now proceed with the baseline testing.
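Optionally, as an additional sanity check before moving on, you can run the SparkPi example bundled with the distribution. The glob below assumes the default layout under `/opt/spark`; the exact jar version suffix may differ:

```console
# Compute an approximation of Pi locally using all available cores
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[*]" \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
```

Near the end of the log output you should see a line similar to `Pi is roughly 3.14...`.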
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/instance-new.png b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/instance-new.png new file mode 100644 index 0000000000..285cd764a5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/instance-new.png differ