Introduction to PySpark RDD
In this chapter, we will start with RDDs, which are Spark’s core abstraction for working with data.
What is an RDD?
- RDD = Resilient Distributed Datasets
An RDD is simply a collection of data distributed across the cluster; it is the fundamental, backbone data type in PySpark.
Decomposing RDDs
The name RDD captures 3 important properties.
- Resilient Distributed Datasets
- Resilient: Ability to withstand failures
- Distributed: Spanning across multiple machines
- Datasets: Collection of partitioned data, e.g. arrays, tables, tuples (see the partition sketch below)
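The "partitioned" part is easy to see in practice. Below is a minimal sketch, assuming a SparkContext named sc is already available (it is created in the pair RDD example later in this article); the variable name numRDD and the partition count are just illustrative.

# Ask for 4 partitions explicitly; Spark splits the data across them
numRDD = sc.parallelize(range(1, 100), 4)
print(numRDD.getNumPartitions())   # prints 4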
Creating RDDs
- Parallelizing an existing collection of objects
- External datasets:
  - Files in HDFS
  - Objects in an Amazon S3 bucket
  - Lines in a text file
- From existing RDDs
Each of these approaches is sketched below.
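This is a minimal sketch, assuming a SparkContext named sc is already available (it is created in the pair RDD example later in this article); the variable names and the "sample.txt" path are placeholders for illustration.

# 1) Parallelizing an existing collection of objects
numRDD = sc.parallelize([1, 2, 3, 4, 5])

# 2) Loading an external dataset, e.g. lines in a text file
fileRDD = sc.textFile("sample.txt")

# 3) Creating a new RDD from an existing RDD via a transformation
doubledRDD = numRDD.map(lambda x: x * 2)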
RDD Operations in PySpark
PySpark supports two types of RDD operations: transformations and actions. Transformations are operations on RDDs that return a new RDD, while actions are operations that perform a computation on the RDD and return a result.

Transformations follow lazy evaluation: Spark does not execute them until an action is called.
- Basic RDD Transformations
- map(), filter(), flatMap(), and union()

Example
# Create map() transformation to cube numbers
numbRDD = sc.parallelize(range(1, 10))
cubedRDD = numbRDD.map(lambda x: x ** 3)

# Collect the results
numbers_all = cubedRDD.collect()

# Print the numbers from numbers_all
for numb in numbers_all:
    print(numb)
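The map() example above covers only one of the basic transformations. Here is a small sketch of filter(), flatMap(), and union(), again assuming the same SparkContext sc; the RDD variable names are just illustrative.

# filter() keeps only the elements that satisfy a condition
numbRDD = sc.parallelize(range(1, 10))
evenRDD = numbRDD.filter(lambda x: x % 2 == 0)

# flatMap() can return several output elements for each input element
wordsRDD = sc.parallelize(["hello world", "how are you"])
flatRDD = wordsRDD.flatMap(lambda line: line.split(" "))

# union() combines two RDDs into one
combinedRDD = numbRDD.union(evenRDD)

# Nothing has been computed so far (lazy evaluation);
# calling the collect() action triggers the actual work
print(evenRDD.collect())      # [2, 4, 6, 8]
print(flatRDD.collect())      # ['hello', 'world', 'how', 'are', 'you']
print(combinedRDD.collect())  # [1, 2, ..., 9, 2, 4, 6, 8]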

Pair RDD in PySpark
- Real-life datasets are usually key/value pairs
- Each row is a key that maps to one or more values
- Pair RDD is a special data structure for working with this kind of dataset
- Pair RDD: the key is the identifier and the value is the data
Let’s give an example. First, create a pair RDD from a regular RDD of strings:
import pyspark
sc = pyspark.SparkContext("local", "First App")

# Create a pair RDD of (name, age) tuples from a regular RDD of strings
my_list = ["Sam 23", "Mary 34", "Peter 25"]
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))
reduceByKey()
reduceByKey() merges the values for each key using the given function:

regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)])
# Sum the values for each key, e.g. the two "Messi" entries become ("Messi", 47)
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x, y: x + y)
print(pairRDD_reducebykey.collect())
sortByKey()
sortByKey() returns a pair RDD sorted by key. Here we first swap the (key, value) pairs so we can sort by the summed values:

pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
result = pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
print(result)

groupByKey()
groupByKey() groups all the values with the same key in the pair RDD:

airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
regular_RDD = sc.parallelize(airports)

# Group all airports by their country code
pair_RDD_group = regular_RDD.groupByKey().collect()
for cont, air in pair_RDD_group:
    print(cont, list(air))

See you in the next article.