Apache Spark is written in the Scala programming language. To support Python, the Apache Spark community released PySpark, which lets you work with RDDs from Python as well. This is possible thanks to Py4J, a library that bridges the Python interpreter and the JVM.
- PySpark is the combination of Python and Spark
- Scalable across large clusters
- Up to 100x faster than Hadoop MapReduce for in-memory workloads
- Around 10x faster for disk-based workloads
- Performs computation in memory (RAM) rather than relying on repeated disk I/O, which speeds up processing
To install PySpark, run the following command:
pip install pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to PySpark
spark = SparkSession.builder \
    .appName("PySpark Example") \
    .getOrCreate()

# Distribute a Python list as an RDD and bring the elements back to the driver
data = ["Spark", "is", "awesome"]
rdd = spark.sparkContext.parallelize(data)
print(rdd.collect())  # Output: ['Spark', 'is', 'awesome']
A Databricks Cluster is a combination of computation resources and configurations on which you can run jobs and notebooks.
- All-purpose Clusters: Used for collaborative analysis in notebooks.
- Job Clusters: Created for running automated jobs and terminated after execution.
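For illustration only, here is a job-cluster specification of the kind passed when defining a Databricks job; the field names follow the Databricks Clusters API, while the runtime version and node type are placeholders, not recommendations:
job_cluster_spec = {
    "spark_version": "13.3.x-scala2.12",  # Databricks Runtime version (placeholder)
    "node_type_id": "i3.xlarge",          # worker instance type (placeholder)
    "num_workers": 2,                     # two worker nodes plus the driver
}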
Feature | Driver Node | Worker Node |
---|---|---|
Function | Runs the main function and schedules tasks. | Executes tasks assigned by the driver. |
Storage | Maintains metadata and application state. | Reads from and writes to data sources. |
- map() → Applies a function to each element
- flatMap() → Similar to map, but flattens the results
- filter() → Filters elements based on a condition
- groupBy() → Groups elements based on a key
Example:
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
squared_rdd = rdd.map(lambda x: x*x)
print(squared_rdd.collect()) # Output: [1, 4, 9, 16, 25]
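For the other transformations, a short sketch using the same SparkSession (the input list of strings is made up for illustration):
words_rdd = spark.sparkContext.parallelize(["hello world", "spark rdd"])
# flatMap applies the function and flattens the nested results into one RDD
print(words_rdd.flatMap(lambda line: line.split(" ")).collect())
# Output: ['hello', 'world', 'spark', 'rdd']
even_rdd = rdd.filter(lambda x: x % 2 == 0)  # keep only even numbers from the earlier RDD
print(even_rdd.collect())  # Output: [2, 4]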
- count() → Returns the number of elements
- collect() → Returns all elements to the driver
- take(n) → Returns the first n elements
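Applied to the rdd of [1, 2, 3, 4, 5] from the example above, these actions behave as follows:
print(rdd.count())    # Output: 5
print(rdd.collect())  # Output: [1, 2, 3, 4, 5]
print(rdd.take(3))    # Output: [1, 2, 3]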
Feature | Window Functions | GroupBy |
---|---|---|
Purpose | Used for row-based calculations, like ranking and moving averages. | Used for aggregations on groups of data. |
Scope | Works on a subset (window) of data within a group. | Works on entire groups of data. |
Example | ROW_NUMBER(), LAG(), LEAD() | SUM(), COUNT(), AVG() |
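A brief sketch of the difference, assuming a hypothetical DataFrame df with department and salary columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# groupBy collapses each department into a single aggregated row
df.groupBy("department").agg(F.avg("salary").alias("avg_salary")).show()

# A window function keeps every row and adds a per-department ranking
w = Window.partitionBy("department").orderBy(F.col("salary").desc())
df.withColumn("salary_rank", F.rank().over(w)).show()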
- Spark Core: Core engine for distributed computing.
- Spark SQL: Structured data processing using SQL.
- Spark Streaming: Real-time data processing.
- MLlib: Machine learning library.
- GraphX: Graph computations.
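As a small illustration of the Spark SQL component (the table and column names are made up):
# Register a DataFrame as a temporary view and query it with SQL
people = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()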
Type | Description |
---|---|
Batch Processing | Collects data and processes it in batches. Good for large datasets. |
Real-time Processing | Processes data as it arrives. Used in live analytics. |
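For real-time processing, a minimal Structured Streaming sketch using Spark's built-in rate source to generate test data:
# Read a continuous stream of test rows and print each micro-batch to the console
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
query = stream_df.writeStream.format("console").outputMode("append").start()
# query.awaitTermination()  # uncomment to keep the stream running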
A data pipeline performing Extract, Transform, Load operations. Example:
# Extract: read the raw CSV file
df = spark.read.csv("input.csv", header=True)
# Transform: derive a new column from an existing one
df = df.withColumn("new_col", df["existing_col"] * 10)
# Load: write the result out in Parquet format
df.write.format("parquet").save("output.parquet")
Feature | Data Warehouse | Data Lake |
---|---|---|
Data Type | Structured | Structured, Semi-structured, Unstructured |
Processing | Batch Processing | Batch & Real-time Processing |
Feature | Data Warehouse | Data Mart |
---|---|---|
Scope | Enterprise-wide | Specific project or department |
Data Size | Large | Small |
Usage | Aggregated data for analytics | Department-specific data |
Feature | Delta Lake | Data Lake |
---|---|---|
ACID Transactions | Yes | No |
Schema Enforcement | Yes | No |
Metadata Handling | Advanced | Basic |
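A short sketch of writing and reading a Delta table (assumes a Delta-enabled environment such as Databricks; the path is a placeholder):
# Writing in Delta format adds ACID transactions and schema enforcement on top of data lake files
df.write.format("delta").mode("overwrite").save("/tmp/delta/employees")
delta_df = spark.read.format("delta").load("/tmp/delta/employees")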
The process of combining data from multiple sources into a single, unified view for analytics and decision-making.
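A minimal sketch of such integration in PySpark, combining two hypothetical sources (the file names and the customer_id key are placeholders):
# Extract from two different sources and join them into one unified DataFrame
orders = spark.read.csv("orders.csv", header=True)
customers = spark.read.json("customers.json")
unified = orders.join(customers, on="customer_id", how="left")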
Delta Lake time travel lets you inspect and restore earlier versions of a table:
DESCRIBE HISTORY employee1; -- List the table's versions
SELECT * FROM employee1@v1; -- Query version 1
RESTORE TABLE employee1 TO VERSION AS OF 1; -- Restore the table to version 1
A view is a read-only logical table based on the result set of a query.
- Temporary View → Exists only in the current SparkSession.
- Global Temporary View → Accessible across sessions within the same Spark application.
Example:
df.createOrReplaceTempView("temp_view")
df.createGlobalTempView("global_view")
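Both views can then be queried with SQL; note that global temporary views are registered under the global_temp database:
spark.sql("SELECT * FROM temp_view").show()
spark.sql("SELECT * FROM global_temp.global_view").show()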
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Number the rows within each department, ordered by salary
window_spec = Window.partitionBy("department").orderBy("salary")
df = df.withColumn("row_number", row_number().over(window_spec))
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Plain Python function to apply to each value
def custom_function(value):
    return value.upper()

# Wrap it as a UDF, declaring the return type
uppercase_udf = udf(custom_function, StringType())
df = df.withColumn("uppercase_column", uppercase_udf(df["existing_column"]))
# Replace nulls with default values per column
df = df.na.fill({"age": 0, "name": "Unknown"})
# Drop any rows that still contain nulls
df = df.na.drop()
# Inner join df1 and df2 on their id columns
df1.join(df2, df1.id == df2.id, "inner").show()
PySpark is a powerful tool for distributed computing. Understanding its core concepts, RDD operations, Spark SQL, MLlib, and streaming capabilities enables efficient data processing at scale.