Introduction to PySpark RDD


In this chapter, we will start with RDDs, which are Spark’s core abstraction for working with data.

 

What is an RDD?

  • RDD = Resilient Distributed Dataset

An RDD is simply a collection of data distributed across the cluster; it is the fundamental data type in PySpark and the backbone of the framework.

 

Decomposing RDDs

The name RDD captures three important properties:

  • Resilient Distributed Datasets
    • Resilient: Ability to withstand failures
    • Distributed: Spanning across multiple machines
    • Datasets: Collections of partitioned data, e.g. arrays, tables, tuples, etc. (the sketch after this list shows the partitioning)
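To see the "distributed" and "partitioned" parts concretely, here is a minimal sketch. It assumes a SparkContext is already available as sc (as in the examples later in this article); the numbers and partition count are just for illustration.

# Assumes a SparkContext is already available as sc
# Split 100 numbers into 4 partitions across the cluster (or local cores)
numbers_RDD = sc.parallelize(range(100), 4)

# getNumPartitions() returns how many partitions the data is divided into
print(numbers_RDD.getNumPartitions())   # 4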

 

Creating RDDs

  • Parallelizing an existing collection of objects
  • External datasets:
    • Files in HDFS
    • Objects in an Amazon S3 bucket
    • Lines in a text file
  • From existing RDDs (all three approaches are sketched below)
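A minimal sketch of the three approaches. The file path README.md is just a placeholder, and sc is assumed to be an existing SparkContext.

# 1. Parallelize an existing Python collection
names_RDD = sc.parallelize(["Sam", "Mary", "Peter"])

# 2. Load an external dataset, e.g. lines of a text file
#    (the path is a placeholder; HDFS or S3 URLs work the same way)
lines_RDD = sc.textFile("README.md")

# 3. Build a new RDD from an existing one via a transformation
upper_RDD = names_RDD.map(lambda name: name.upper())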

 

RDD Operations in PySpark

PySpark supports two different types of RDD operations: transformations and actions. Transformations are operations on RDDs that return a new RDD, while actions are operations that perform some computation on the RDD and return a value to the driver program, as the short sketch below illustrates.
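A minimal sketch of the difference, again assuming the SparkContext sc used throughout this article (the sample numbers are made up):

temps_RDD = sc.parallelize([12, 25, 31, 8, 27])

# A transformation returns another RDD...
doubled_RDD = temps_RDD.map(lambda t: t * 2)

# ...while an action computes and returns a plain Python value to the driver
print(doubled_RDD.count())   # 5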

 

Transformations follow lazy evaluation: they are not executed until an action requires their result.

  • Basic RDD Transformations
    • map(), filter(), flatMap(), and union() (see the sketch after this list)
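A quick sketch of these transformations; map() is covered in the next example, and the sample data here is made up for illustration.

nums_RDD = sc.parallelize([1, 2, 3, 4])
lines_RDD = sc.parallelize(["hello world", "how are you"])

# filter() keeps only the elements that satisfy the condition
even_RDD = nums_RDD.filter(lambda x: x % 2 == 0)             # [2, 4]

# flatMap() can return several output elements per input element
words_RDD = lines_RDD.flatMap(lambda line: line.split(" "))  # ['hello', 'world', 'how', 'are', 'you']

# union() combines two RDDs into one
combined_RDD = nums_RDD.union(even_RDD)                      # [1, 2, 3, 4, 2, 4]

# Nothing above has executed yet (lazy evaluation); collect() forces it
print(combined_RDD.collect())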

 

Example

# Create map() transformation to cube numbers
numbRDD = sc.parallelize(range(1,10))
cubedRDD = numbRDD.map(lambda x: x ** 3)

# Collect the results
numbers_all = cubedRDD.collect()

# Print the numbers from numbers_all
for numb in numbers_all:
    print(numb)

Pair RDD in PySpark

  • Real-life datasets are usually key/value pairs
  • Each row is a key that maps to one or more values
  • Pair RDD is a special data structure for working with this kind of dataset
  • Pair RDD: the key is the identifier and the value is the data

Let’s walk through a few examples.

 

reduceByKey()

import pyspark

sc = pyspark.SparkContext("local", "First App")

# Build a pair RDD from regular strings: the name becomes the key, the age the value
my_list = ["Sam 23", "Mary 34", "Peter 25"]
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))

# reduceByKey() combines the values of identical keys with the given function
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x, y: x + y)
print(pairRDD_reducebykey.collect())
# e.g. [('Messi', 47), ('Ronaldo', 34), ('Neymar', 22)] -- order may vary

sortByKey()

# Swap key and value so the total becomes the key, then sort by it in descending order
pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
result = pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
print(result)
# e.g. [(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]

groupByKey()

# Group all airport codes under their country key
airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
regular_RDD = sc.parallelize(airports)
pair_RDD_group = regular_RDD.groupByKey().collect()

# groupByKey() returns an iterable of values per key, so convert it to a list to print
for cont, air in pair_RDD_group:
    print(cont, list(air))

 

See you in the next article.

