
Big Data Analytics

Assignment

Submitted by: Vaibhav Singh
14B00033
CSC

Write a program in PySpark for each of the following questions:
1. To increment each number in a list by one.

l1 = sc.parallelize([1, 2, 3, 4, 5])
l1.collect()
l2 = l1.map(lambda x: x + 1)
l2.collect()

Output = [2,3,4,5,6]

2. To multiply each number in a list by 10.

l1 = sc.parallelize([1, 2, 3, 4, 5])
l1.collect()
l2 = l1.map(lambda x: x * 10)
l2.collect()

Output = [10,20,30,40,50]



3. To find the most commonly occurring words with their associated frequencies.

from operator import add

s=["a","b","a","c","a"]

s1=sc.parallelize(s)

s2 = s1.map(lambda x: (x, 1)).reduceByKey(add)

print(s2.collect())

Output = [("a",3),("b",1),("c",1)]



4. Find the frequency of each state:
State = ["delhi", "HP", "HR", "HR", "UP"]

from operator import add

s=["delhi","HP","HR","HR","UP"]

s1=sc.parallelize(s)

s2 = s1.map(lambda x: (x, 1)).reduceByKey(add)

print(s2.collect())

Output = [("delhi",1),("HP",1),("HR",2),("UP",1)]

5. To print the even numbers out of a list of numbers.

l1 = sc.parallelize([1, 2, 3, 4, 5, 6])
l1.collect()
l2 = l1.filter(lambda x: x % 2 == 0)
print(l2.collect())

Output = [2,4,6]

6. Write the Spark commands to perform a join operation between two files.
Each file contains a person's name, DOB, and age. Group the persons by age.

l1 = sc.textFile("/home/1.txt")
l2 = sc.textFile("/home/2.txt")
l3 = l1.map(lambda x: tuple(x.split()))
l4 = l3.map(lambda rec: (rec[0], rec[2]))   # (name, age)
l5 = l2.map(lambda x: tuple(x.split()))
l6 = l5.map(lambda rec: (rec[0], rec[2]))   # (name, age)
l7 = l6.join(l4)
print(l7.collect())

Output = [("a", ("23", "25")), ("s", ("20", "24")), ("m", ("21", "20"))]
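
The commands above only perform the join by name. The question also asks to group the persons by age; a minimal sketch of that extra step, assuming the same (name, DOB, age) file layout and reusing the hypothetical path, could be:

people = sc.textFile("/home/1.txt").map(lambda x: tuple(x.split()))
by_age = people.map(lambda rec: (rec[2], rec[0]))      # (age, name)
grouped = by_age.groupByKey().mapValues(list)          # age -> list of names
print(grouped.collect())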

7. Differentiate between map and flatMap.
Here is an example of the difference (shown in the Scala spark-shell):
val textFile = sc.textFile("README.md") // create an RDD of lines of text

// MAP:

textFile.map(_.length) // map over the lines:

res2: Array[Int] = Array(14, 0, 71, 0, 0, ...)

// -> one length per line

// FLATMAP:

textFile.flatMap(_.split(" ")) // split each line into words:

res3: Array[String] = Array(#, Apache, Spark, ...)

// -> multiple words per line, and multiple lines


// - but we end up with a single output array of words

map transforms an RDD of length N into another RDD of length N.
For example, it maps from N lines into N line-lengths.

flatMap (loosely speaking) transforms an RDD of length N into a collection of N collections,
then flattens these into a single RDD of results.
For example, flatMapping from a collection of lines to a collection of words:

["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]

The input and output RDDs will therefore typically be of different sizes.

(You may need to call collect() on the RDDs generated in the examples above - I have
omitted this for clarity)
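
Since the assignment uses PySpark, the same contrast can be sketched in Python as well; the small in-memory list below is an assumed stand-in for README.md:

lines = sc.parallelize(["aa bb cc", "", "dd"])

# map: exactly one output element per input line (here, its length)
print(lines.map(lambda line: len(line)).collect())         # [8, 0, 2]

# flatMap: each line yields several words, flattened into one RDD
print(lines.flatMap(lambda line: line.split()).collect())  # ['aa', 'bb', 'cc', 'dd']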
