Introduction to PySpark RDD


In this chapter, we will start with RDDs, which are Spark’s core abstraction for working with data.

 

What is an RDD?

  • RDD = Resilient Distributed Dataset

An RDD is simply a collection of data distributed across the cluster; it is the fundamental data type in PySpark and the backbone of the framework.

 

Decomposing RDDs

The name RDD captures three important properties:

  • Resilient Distributed Datasets
    • Resilient: Ability to withstand failures
    • Distributed: Spanning across multiple machines
    • Datasets: Collections of partitioned data, e.g. arrays, tables, tuples, etc. (the sketch after this list shows the partitioning)
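To see the "distributed" and "partitioned" parts concretely, here is a minimal sketch. It assumes a SparkContext is already available as sc (as in the examples later in this article); the numbers and partition count are just for illustration.

# Assumes a SparkContext is already available as sc
# Split 100 numbers into 4 partitions across the cluster (or local cores)
numbers_RDD = sc.parallelize(range(100), 4)

# getNumPartitions() returns how many partitions the data is divided into
print(numbers_RDD.getNumPartitions())   # 4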

 

Creating RDDs

  • Parallelizing an existing collection of objects
  • External datasets:
    • Files in HDFS
    • Objects in an Amazon S3 bucket
    • Lines in a text file
  • From existing RDDs (all three approaches are sketched below)
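A minimal sketch of the three approaches. The file path README.md is just a placeholder, and sc is assumed to be an existing SparkContext.

# 1. Parallelize an existing Python collection
names_RDD = sc.parallelize(["Sam", "Mary", "Peter"])

# 2. Load an external dataset, e.g. lines of a text file
#    (the path is a placeholder; HDFS or S3 URLs work the same way)
lines_RDD = sc.textFile("README.md")

# 3. Build a new RDD from an existing one via a transformation
upper_RDD = names_RDD.map(lambda name: name.upper())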

 

RDD Operations in PySpark

PySpark supports two different types of RDD operations: transformations and actions. Transformations are operations on RDDs that return a new RDD, while actions are operations that perform some computation on the RDD and return a value to the driver program, as the short sketch below illustrates.
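A minimal sketch of the difference, again assuming the SparkContext sc used throughout this article (the sample numbers are made up):

temps_RDD = sc.parallelize([12, 25, 31, 8, 27])

# A transformation returns another RDD...
doubled_RDD = temps_RDD.map(lambda t: t * 2)

# ...while an action computes and returns a plain Python value to the driver
print(doubled_RDD.count())   # 5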

 

Transformations follow lazy evaluation: they are not executed until an action requires their result.

  • Basic RDD Transformations
    • map(), filter(), flatMap(), and union() (see the sketch after this list)
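A quick sketch of these transformations; map() is covered in the next example, and the sample data here is made up for illustration.

nums_RDD = sc.parallelize([1, 2, 3, 4])
lines_RDD = sc.parallelize(["hello world", "how are you"])

# filter() keeps only the elements that satisfy the condition
even_RDD = nums_RDD.filter(lambda x: x % 2 == 0)             # [2, 4]

# flatMap() can return several output elements per input element
words_RDD = lines_RDD.flatMap(lambda line: line.split(" "))  # ['hello', 'world', 'how', 'are', 'you']

# union() combines two RDDs into one
combined_RDD = nums_RDD.union(even_RDD)                      # [1, 2, 3, 4, 2, 4]

# Nothing above has executed yet (lazy evaluation); collect() forces it
print(combined_RDD.collect())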

 

Example

# Create map() transformation to cube numbers
numbRDD = sc.parallelize(range(1,10))
cubedRDD = numbRDD.map(lambda x: x ** 3)

# Collect the results
numbers_all = cubedRDD.collect()

# Print the numbers from numbers_all
for numb in numbers_all:
    print(numb)

Pair RDD in PySpark

  • Real-life datasets are usually key/value pairs
  • Each row is a key that maps to one or more values
  • Pair RDD is a special data structure for working with this kind of dataset
  • Pair RDD: the key is the identifier and the value is the data

Let’s walk through a few examples.

 

reduceByKey()

import pyspark

sc = pyspark.SparkContext("local", "First App")

# Build a pair RDD from regular strings: the name becomes the key, the age the value
my_list = ["Sam 23", "Mary 34", "Peter 25"]
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))

# reduceByKey() combines the values of identical keys with the given function
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x, y: x + y)
print(pairRDD_reducebykey.collect())
# e.g. [('Messi', 47), ('Ronaldo', 34), ('Neymar', 22)] -- order may vary

sortByKey()

# Swap key and value so the total becomes the key, then sort by it in descending order
pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
result = pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
print(result)
# e.g. [(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]

groupByKey()

# Group all airport codes under their country key
airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
regular_RDD = sc.parallelize(airports)
pair_RDD_group = regular_RDD.groupByKey().collect()

# groupByKey() returns an iterable of values per key, so convert it to a list to print
for cont, air in pair_RDD_group:
    print(cont, list(air))

 

See you in the next article.

