
Introduction to PySpark RDD


In this chapter, we will start with RDDs, which are Spark’s core abstraction for working with data.

 

What is an RDD?

RDD (Resilient Distributed Dataset) is the fundamental data type in PySpark and the backbone of every Spark program. It is simply a collection of data distributed across the nodes of the cluster.

 

Decomposing RDDs

The name RDD captures 3 important properties: Resilient, meaning the data can be recomputed and recovered when a node fails; Distributed, meaning the data is spread across multiple nodes in the cluster; and Dataset, meaning it is a collection of partitioned data values.

 

Creating RDDs
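The two most common ways to create an RDD are parallelizing an existing Python collection with sc.parallelize() and loading an external dataset with sc.textFile(). A minimal sketch, assuming a running SparkContext named sc and a hypothetical file path:

# Create an RDD from an in-memory Python list
numRDD = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD from an external text file (the path here is just an example)
fileRDD = sc.textFile("/tmp/sample.txt")

print(type(numRDD))
print(type(fileRDD))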

 

RDD Operations in PySpark

PySpark supports two different types of operations on RDDs: transformations and actions. Transformations are operations that return a new RDD, while actions are operations that perform a computation on the RDD and return a value to the driver.

 

Transformations follow lazy evaluation: Spark does not execute them immediately, but waits until an action is called and only then computes the whole chain of transformations.

 

Example

# Create map() transformation to cube numbers
numbRDD = sc.parallelize(range(1,10))
cubedRDD = numbRDD.map(lambda x: x ** 3)

# Collect the results
numbers_all = cubedRDD.collect()

# Print the numbers from numbers_all
for numb in numbers_all:
    print(numb)
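This prints the cubes of the numbers 1 through 9. Keep in mind that collect() returns the entire RDD to the driver as a Python list, so it should only be used when the result is small enough to fit in memory.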

Pair RDD in PySpark

A pair RDD is a special kind of RDD in which each element is a key/value tuple; it is the structure that operations such as reduceByKey(), sortByKey() and groupByKey() work on. Let’s give an example: the code below builds a pair RDD of (name, age) pairs by splitting a list of strings.

 

import pyspark

# Create a SparkContext (skip this step if one already exists, e.g. in the pyspark shell)
sc = pyspark.SparkContext("local", "First App")

# Build a pair RDD of (name, age) tuples by splitting each string
my_list = ["Sam 23", "Mary 34", "Peter 25"]
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))
print(pairRDD_RDD.collect())

reduceByKey()

# reduceByKey() combines the values of each key with the given function
regularRDD = sc.parallelize([("Messi",23),("Ronaldo",34),("Neymar",22),("Messi",24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x+y)
print(pairRDD_reducebykey.collect())
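Because reduceByKey() merges the two "Messi" entries, the result contains ("Messi", 47) together with the unchanged Ronaldo and Neymar pairs (the ordering of the tuples may vary).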

sortByKey()

pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1],x[0]))
result = pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
print(result)
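sortByKey() always sorts by the key of each tuple, which is why the (name, total) pairs are reversed first; with ascending=False the result is ordered from the highest summed value to the lowest.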

groupByKey()

airports = [("US","JFK"),("UK","LHR"),("FR","CDG"),("US","SFO")]
regular_RDD = sc.parallelize(airports)
pair_RDD_group = regular_RDD.groupByKey().collect()

for cont,air in pair_RDD_group:
    print(cont,list(air))
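groupByKey() gathers every value that shares the same key into a single iterable, so the two US airports (JFK and SFO) end up in one group; the order in which the groups are printed may vary between runs.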

 

See you in the next article.
