Interview Questions on Apache Spark [Part 1]
Q2: Is there any point in learning MapReduce, then?
A: Yes, for the following reasons:
MapReduce is a paradigm used by many big data tools, including Spark. So, understanding the MapReduce paradigm and how to convert a problem into a series of MapReduce tasks is very important.
When the data grows beyond what can fit into the memory of your cluster, the Hadoop MapReduce paradigm is still very relevant.
Almost every other tool, such as Hive or Pig, converts its queries into MapReduce phases. If you understand MapReduce, you will be able to optimize your queries better.
Q3: When running Spark on YARN, do I need to install Spark on all nodes of the YARN cluster?
A: No. Since Spark runs on top of YARN, it utilizes YARN to launch and run its work on the cluster's nodes. So, you just have to install Spark on the node from which you submit applications.
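For illustration, here is a hedged sketch of how an application is typically submitted to YARN from that single node; the application file name and executor count are placeholders, not from the original text:
# Run from the one node where Spark is installed; YARN launches the executors on the cluster nodes.
spark-submit --master yarn --deploy-mode cluster --num-executors 4 my_app.py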
Q4: What are the downsides of Spark?
A:
Spark utilizes memory heavily, so the developer has to be careful. A casual developer might make the following mistakes:
She may end up running everything on the local node instead of distributing the work over the cluster.
She might hit some web service too many times, because the same code ends up running in parallel on many nodes of the cluster.
The first problem is well tackled by the Hadoop MapReduce paradigm: it ensures that the data your code is churning at any point of time is fairly small, so you cannot make the mistake of trying to handle the whole dataset on a single node.
The second mistake is possible in MapReduce too. While writing MapReduce code, a user may hit a service from inside map() or reduce() too many times. This overloading of a service is equally possible while using Spark.
Q5: What is a RDD?
A:
RDD stands for Resilient Distributed Dataset. It is a representation of data distributed over a network of machines which is:
Immutable - you can operate on the RDD to produce another RDD, but you can't alter it.
Partitioned / Parallel - the data in an RDD is operated on in parallel; any operation on an RDD is carried out by multiple nodes.
Resilient - if one of the nodes hosting a partition fails, another node takes over its data (Spark can recompute the lost partition from its lineage).
RDD provides two kinds of operations: Transformations and Actions.
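A minimal pyspark sketch of these properties, before we look at transformations and actions in detail (the data and partition count are just illustrative; sc is the SparkContext provided by the pyspark shell):
# Create an RDD from a local collection, split into 4 partitions
rdd = sc.parallelize(range(1, 11), 4)
print(rdd.getNumPartitions())        # 4 - the data is partitioned, so operations run in parallel
doubled = rdd.map(lambda x: 2 * x)   # produces a new RDD; rdd itself is never altered
print(doubled.collect())             # [2, 4, 6, ..., 20]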
Q6: What are Transformations?
A: Transformations are functions that are applied on an RDD (Resilient Distributed Dataset) and result in another RDD. A transformation is not executed until an action follows.
Examples of transformations are:
map() - applies the function passed to it on each element of the RDD, resulting in a new RDD.
filter() - creates a new RDD by picking those elements from the current RDD for which the function argument returns true.
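For instance, a minimal sketch combining both transformations (the sample numbers are just illustrative):
nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)                  # transformation: nothing is executed yet
evenSquares = squares.filter(lambda x: x % 2 == 0)   # another lazy transformation
print(evenSquares.collect())                         # [4, 16] - collect() is the action that triggers execution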
Q7: What are Actions?
A: An action brings the data from the RDD back to the local machine. Executing an action triggers the execution of all the previously created transformations. Examples of actions are:
reduce() - applies the function passed to it again and again until only one value is left. The function should take two arguments and return one value.
take(n) - brings the first n values of the RDD back to the local node.
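A minimal sketch of both actions (the sample numbers are just illustrative):
nums = sc.parallelize([5, 3, 8, 1])
total = nums.reduce(lambda x, y: x + y)   # 17 - the two-argument function is applied repeatedly
firstTwo = nums.take(2)                   # [5, 3] - only the first two values come back to the driver
print(total)
print(firstTwo)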
http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2
Q1: Say I have a huge list of numbers in RDD(say myrdd). And I wrote the following code to compute average:
def myAvg(x, y):
    return (x + y) / 2.0

avg = myrdd.reduce(myAvg)
What is wrong with it? And How would you correct it?
A: myAvg is commutative but not associative, so reduce() may combine partial results in a way that does not yield the true mean. One correct approach is to divide each element by the count and then sum:
cnt = myrdd.count()

def divideByCnt(x):
    return x / float(cnt)   # float() guards against integer division in Python 2

myrdd1 = myrdd.map(divideByCnt)
avg = myrdd1.reduce(lambda x, y: x + y)
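An alternative sketch, assuming the elements of myrdd are plain numbers: sum with an associative, commutative reducer and divide once on the driver:
total = myrdd.reduce(lambda x, y: x + y)
avg = total / float(myrdd.count())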
Q2: Say I have a huge list of numbers in a file in HDFS. Each line has one number, and I want to compute the square root of the sum of the squares of these numbers. How would you do it?
# We would first load the file as an RDD from HDFS on Spark
numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt");
# Define a function to compute the square of each number
def toSqInt(str):
    v = int(str)
    return v * v

# Run the function on the Spark RDD as a map transformation
nums = numsAsText.map(toSqInt)
# Run the summation as a reduce action
total = nums.reduce(lambda x, y: x + y)
# Finally compute the square root, for which we need to import math.
import math
print(math.sqrt(total))
Q3: Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?
import math

numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")

def toInt(str):
    return int(str)

nums = numsAsText.map(toInt)

def sqrtOfSumOfSq(x, y):
    return math.sqrt(x * x + y * y)

total = nums.reduce(sqrtOfSumOfSq)
print(total)
A: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer: combining two partial results x and y as sqrt(x*x + y*y) keeps the running value equal to the square root of the sum of squares seen so far. For example, sqrtOfSumOfSq(sqrtOfSumOfSq(3, 4), 12) = sqrt(5*5 + 12*12) = 13 = sqrt(9 + 16 + 144).
Q4: Could you compare the pros and cons of your approach (in Question 2 above) and my approach (in Question 3 above)?
A:
You are doing the squaring and the square root as part of the reduce action, while in my approach I am squaring in map() and summing in reduce().
My approach will be faster, because in your case the reducer code is heavy: it calls math.sqrt(), and the reducer code is generally executed approximately n-1 times over the Spark RDD.
The only downside of my approach is that there is a chance of integer overflow, because I am summing up the squares of (possibly large) numbers.
Q5: If you have to compute the total count of each of the unique words on Spark, how would you go about it?
A:
# This will load bigtextfile.txt as an RDD in Spark
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt")

# Define a function that can break each line into words
def toWords(line):
    return line.split()

# Run the toWords function on each element of the RDD as a flatMap transformation.
# We use flatMap instead of map because our function returns multiple values.
words = lines.flatMap(toWords)

# Convert each word into a (key, value) pair. Here the key will be the word itself and the value will be 1.
def toTuple(word):
    return (word, 1)

wordsTuple = words.map(toTuple)

# Now we can easily apply the reduceByKey() transformation.
def sum(x, y):
    return x + y

counts = wordsTuple.reduceByKey(sum)

# Finally, collect the (word, count) pairs to the driver and print them.
print(counts.collect())
Q6: In a very huge text file, you want to just check if a particular keyword exists. How would you do this using Spark?
A:
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt")

def isFound(line):
    if line.find("mykeyword") > -1:
        return 1
    return 0

foundBits = lines.map(isFound)
# Add up the 1s; if the total is greater than zero, the keyword exists in the file.
total = foundBits.reduce(lambda x, y: x + y)
if total > 0:
    print("FOUND")
else:
    print("NOT FOUND")
Q7: Can you improve the performance of the code in the previous answer?
A: Yes. The search does not stop even after the word we are looking for has been found. Our map code would keep executing on all the nodes, which is very inefficient.
We could utilize accumulators to record whether the word has been found and then stop the job. Something along these lines:
import thread, threading
from time import sleep

result = "Not Set"
lock = threading.Lock()
accum = sc.accumulator(0)

def map_func(line):
    # Introduce a delay to emulate slowness
    sleep(1)
    if line.find("Adventures") > -1:
        accum.add(1)
        return 1
    return 0

def start_job():
    global result
    try:
        sc.setJobGroup("job_to_cancel", "some description")
        lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt")
        result = lines.map(map_func)
        result.take(1)
    except Exception as e:
        result = "Cancelled"
    lock.release()

def stop_job():
    while accum.value < 3:
        sleep(1)
    sc.cancelJobGroup("job_to_cancel")

supress = lock.acquire()
supress = thread.start_new_thread(start_job, tuple())
supress = thread.start_new_thread(stop_job, tuple())
supress = lock.acquire()
In Spark, you can basically do everything from a single application / console (the pyspark or Scala console) and get the results immediately. Switching between 'running something on the cluster' and 'doing something locally' is fairly easy and straightforward. This also means less context switching for the developer and more productivity.
Further reading on standalone-mode high availability:
http://spark.apache.org/docs/latest/spark-standalone.html#high-availability
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/exercises/spark-exercise-standalone-master-ha.html
By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below.
Standby Masters with ZooKeeper
Overview
Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected.
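A hedged sketch of how this is typically configured, following the standalone high-availability documentation linked above; the ZooKeeper hostnames and the znode directory are placeholders:
# In conf/spark-env.sh on each Master node (hostnames and directory are illustrative)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"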