Sunday, August 2, 2015

Big Data Interview Questions and Answers




http://www.bigdatainterviewquestions.com/hadoop-interview-questions/
What is Speculative execution?
A job running on a Hadoop cluster is divided into many tasks. In a big cluster, some of these tasks could be running slowly for various reasons: hardware degradation, software misconfiguration, etc. When Hadoop sees a task that has been running for some time and has failed to make as much progress, on average, as the other tasks from the job, it launches a replica of that task on another node. This duplicate execution of a task is referred to as Speculative Execution.

When a task completes successfully all the duplicate tasks that are running will be killed. So if the original task completes before the speculative task, then the speculative task is killed; on the other hand, if the speculative task finishes first, then the original is killed.

What is the benefit of using counters in Hadoop?
Counters let you collect job-level statistics without any extra bookkeeping code. Assume you have a 100-node cluster and a job with 100 mappers running on 100 different nodes, and you would like to know each time an invalid record is seen in the Map phase. Rather than digging through the logs of every task, you can increment a counter from the map function; Hadoop aggregates the counters from all tasks and reports the totals in the job's final status.
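As a minimal sketch (the class name, counter enum and isValid() helper below are made up for the illustration), a mapper might increment such a counter like this:

// Hypothetical mapper that counts invalid records with a Hadoop counter.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordValidationMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // The counter name is an arbitrary label chosen for this example.
    public enum RecordQuality { INVALID_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (!isValid(line)) {
            // The framework aggregates this counter across all map tasks in the job.
            context.getCounter(RecordQuality.INVALID_RECORDS).increment(1);
            return; // skip the bad record
        }
        context.write(new Text(line), new LongWritable(1));
    }

    // Stand-in for whatever validation the job actually needs.
    private boolean isValid(String line) {
        return line != null && !line.trim().isEmpty();
    }
}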

What is the difference between an InputSplit and a Block?
A block is a physical division of data and does not take into account the logical boundaries of records, meaning you could have a record that starts in one block and ends in another. An InputSplit, on the other hand, respects the logical boundaries of records.

Can you change the number of mappers to be created for a job in Hadoop?
No. The number of mappers is determined by the number of input splits.

What are the functions of InputFormat?
Validate input data is present and check input configuration
Create InputSplits from blocks
Create RecordReader implementation to create key/value pairs from the raw InputSplit. These pairs will be sent one by one to their mapper.

What is a Record Reader?
A RecordReader uses the data within the boundaries defined by an InputSplit to generate key/value pairs. Each generated key/value pair is passed one by one to the mapper.

What is a sequence file in Hadoop?
A sequence file is used to store binary key/value pairs. Sequence files support splitting even when the data inside the file is compressed, which is not possible with a regular compressed file. You can choose record-level compression, in which only the value in each key/value pair is compressed, or block-level compression, in which multiple records are compressed together.
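As a rough illustration (the path and key/value types are made up for the example, and this assumes the Hadoop 2 style SequenceFile.Writer options), a sequence file with block-level compression can be written like this:

// Minimal sketch: write a SequenceFile with block-level compression.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq"); // example output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()));
        try {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i)); // binary key/value pair
            }
        } finally {
            writer.close();
        }
    }
}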

Assume you are doing a join and you notice that all of the reducers except one finish quickly, while that one reducer runs for a very long time. How do you address the problem in Pig?
Pig collects all of the records for a given key together on a single reducer. In many data sets, there are a few keys that have three or more orders of magnitude more records than other keys. This results in one or two reducers that will take much longer than the rest. To deal with this, Pig provides skew join.

In the first MapReduce job, Pig scans the second input and identifies keys that have so many records that they will not all fit in memory.

In the second MapReduce job, it does the actual join.

For all records except those with the keys identified in the first job, Pig does a standard join.

For the records with the keys identified in the first job, based on how many records were seen for a given key, those records are split across an appropriate number of reducers.

For the other input to the join, the one that is not split, only the records with the keys in question are replicated to each reducer that contains one of those keys.

What is the difference between SORT BY and ORDER BY in Hive?
ORDER BY performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.

SORT BY orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted. You will not achieve a total ordering of the dataset; total ordering is traded for better performance.

How do you do a file system check in HDFS?
The fsck command is used to do a file system check in HDFS. It is a very useful command to check the health of files, block names and block locations.
hdfs fsck /dir/hadoop-test -files -blocks -locations

What are the parameters of the mapper and reducer functions?
The map and reduce method signatures tell you a lot about the type of input and output your job will deal with. Assuming you are using TextInputFormat, the map function’s parameters could look like –

LongWritable (Input Key)
Text (Input Value)
Text (Intermediate Key Output)
IntWritable (Intermediate Value Output)

The four parameters for the reduce function could be (see the code sketch below the list) –
Text (Intermediate Key Output)
IntWritable (Intermediate Value Output)
Text (Final Key Output)
IntWritable (Final Value Output)
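A minimal word-count style sketch showing these type parameters with the newer org.apache.hadoop.mapreduce API (the class names TokenMapper and SumReducer are made up for the illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<LongWritable, Text, Text, IntWritable>: input key/value and intermediate key/value types.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE); // emit (word, 1)
        }
    }
}

// Reducer<Text, IntWritable, Text, IntWritable>: intermediate key/value in, final key/value out.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
}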

How do you override the replication factor?
There are a few ways to do this. Look at the illustrations below.
hadoop fs -setrep -R -w 5 hadoop-test
hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv
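The replication factor of an existing file can also be changed programmatically; a small sketch using the FileSystem API (the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Ask HDFS to keep 5 replicas of this existing file.
        boolean changed = fs.setReplication(new Path("hadoop-test/test.csv"), (short) 5);
        System.out.println("Replication factor changed: " + changed);
        fs.close();
    }
}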

http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html
What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster. The JobTracker runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service; if it goes down, all running jobs are halted. The JobTracker performs the following actions (from the Hadoop wiki):
  • Client applications submit jobs to the Job tracker.
  • The JobTracker talks to the NameNode to determine the location of the data
  • The JobTracker locates TaskTracker nodes with available slots at or near the data
  • The JobTracker submits the work to the chosen TaskTracker nodes.
  • The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  • A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
  • When the work is completed, the JobTracker updates its status.
  • Client applications can poll the JobTracker for information.
How does the JobTracker schedule a task?
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

What is a Task instance in Hadoop? Where does it run?
Task instances are the actual MapReduce tasks which are run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default a new task instance JVM process is spawned for each task.

How many Daemon processes run on a Hadoop system?
Hadoop comprises five separate daemons, each of which runs in its own JVM.
The following 3 daemons run on master nodes:
- NameNode - stores and maintains the metadata for HDFS.
- Secondary NameNode - performs housekeeping functions for the NameNode.
- JobTracker - manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.
The following 2 daemons run on each slave node:
- DataNode - stores actual HDFS data blocks.
- TaskTracker - responsible for instantiating and monitoring individual Map and Reduce tasks.

What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
Single instance of a Task Tracker is run on each Slave node. Task tracker is run as a separate JVM process.
Single instance of a DataNode daemon is run on each Slave node. DataNode daemon is run as a separate JVM process.
One or multiple Task Instances run on each slave node. Each task instance runs as a separate JVM process. The number of task instances can be controlled by configuration. Typically a high-end machine is configured to run more task instances.

What is the difference between HDFS and NAS?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. Following are differences between HDFS and NAS

In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computation.
HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.
How does the NameNode handle DataNode failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since the blocks it held will now be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes and the data never passes through the NameNode.

Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.

Can I set the number of reducers to zero?
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero, no reducers are executed, and the output of each mapper is stored in a separate file on HDFS. [This is different from the case when reducers are set to a number greater than zero, where the mappers' output (intermediate data) is written to the local file system (NOT HDFS) of each mapper's slave node.]

Where is the mapper output (intermediate key-value data) stored?
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

What are combiners? When should I use a combiner in my MapReduce Job?
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper node, which can greatly reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
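When the reduce operation is commutative and associative (a sum, for example), wiring the reducer in as the combiner is a one-line job setting. A minimal driver sketch, assuming the TokenMapper and SumReducer classes from the earlier word-count example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);   // from the earlier sketch
        job.setCombinerClass(SumReducer.class);  // reducer reused as combiner; may run zero or more times
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}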

What are the Writable and WritableComparable interfaces?
org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
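A minimal sketch of a custom WritableComparable key (the EventKey class and its fields are made up for the illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Example composite key (userId, timestamp) usable as a MapReduce key.
public class EventKey implements WritableComparable<EventKey> {
    private long userId;
    private long timestamp;

    public EventKey() { } // no-arg constructor required by the framework

    public EventKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException { // serialize the fields
        out.writeLong(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize the fields
        userId = in.readLong();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EventKey other) { // defines the sort order used during shuffle/sort
        int cmp = Long.compare(userId, other.userId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() { // used by the default hash-based partitioning
        return (int) (userId * 31 + timestamp);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EventKey)) return false;
        EventKey k = (EventKey) o;
        return userId == k.userId && timestamp == k.timestamp;
    }
}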

What is the Hadoop MapReduce API contract for a key and value Class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface.
The value must implement the org.apache.hadoop.io.Writable interface.
What is an IdentityMapper and an IdentityReducer in MapReduce?
org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.

When are the reducers started in a MapReduce job?
In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed when the mappers are not finished yet?
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer, which is done by the reduce process, so reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

What is HDFS? How is it different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. This is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

What is HDFS Block size? How is it different from traditional file system block size?
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS block size cannot be compared with the traditional file system block size, which is typically only a few kilobytes.

What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Only one NameNode process runs on any Hadoop cluster. The NameNode runs in its own JVM process, and in a typical production cluster it runs on a separate machine. The NameNode is a single point of failure for the HDFS cluster; when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
A DataNode stores data in the Hadoop file system, HDFS. Only one DataNode process runs on any Hadoop slave node, and it runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly when replicating data.

How does the client communicate with HDFS?
Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.

How are HDFS blocks replicated?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block in HDFS: 2 copies are stored on DataNodes on one rack and the 3rd copy on a different rack.

http://www.wiziq.com/blog/31-questions-for-hadoop-developers/
Q1. Name the most common InputFormats defined in Hadoop. Which one is the default?
The most common InputFormats defined in Hadoop are:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.

Q2. What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: It reads lines of text files and provides the byte offset of the line as the key to the Mapper and the actual line as the value.

KeyValueInputFormat: Reads text files and parses lines into key/value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value.

Q4. How is the splitting of a file invoked in the Hadoop framework?
It is invoked by the Hadoop framework by calling the getSplits() method of the InputFormat class (such as FileInputFormat) configured for the job.

Q7. After the Map phase finishes, the Hadoop framework does “Partitioning, Shuffle and sort”. Explain what happens in this phase?
Partitioning: It is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.

Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

Q8. If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.
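Conceptually, the default hash partitioning works like the sketch below (this mirrors the idea rather than copying the Hadoop source):

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of hash-based partitioning: the same key always lands on the same reducer.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take it modulo the reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}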

Q9. What is a Combiner?
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.

Q11. What are some typical functions of Job Tracker?
The following are some typical tasks of the JobTracker:
- Accepts jobs from clients
- It talks to the NameNode to determine the location of the data.
- It locates TaskTracker nodes with available slots at or near the data.
- It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker.

Q12. What is TaskTracker?
A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

Q13. What is the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.

Q14. Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop do?
It will restart the task again on some other TaskTracker and only if the task fails more than four (default setting and can be changed) times will it kill the job.

Q15. Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow down the whole job. What mechanism does Hadoop provide to combat this?
Speculative Execution.

Q16. How does speculative execution work in Hadoop?
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

Q18. What is Hadoop Streaming?  
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

Q19. What is the characteristic of the Streaming API that makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

Q20. What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
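With the newer MapReduce API the files are registered on the Job object; a minimal sketch (the HDFS path is just an example):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache example");
        // Ship a lookup file to every node that runs a task for this job.
        job.addCacheFile(new URI("/user/example/lookup.txt"));
        // ... set mapper/reducer/input/output as usual, then submit the job.
    }
}

Tasks can then open the cached file from their local working area instead of going back to HDFS for every read.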

Q21. What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
Distributed Cache is much faster for this use case. It copies the file to the slave nodes once, at the start of the job. If a TaskTracker then runs 10 or 100 Mappers or Reducers, they all use the same local copy of the cached file. On the other hand, if your MapReduce job reads the file from HDFS directly, then every Mapper opens it from HDFS, so a TaskTracker running 100 map tasks would read the file 100 times over the network. HDFS is also not very efficient when used this way.

Q.22 What mechanism does the Hadoop framework provide to synchronise changes made in the Distributed Cache during the runtime of the application?
This is a tricky question. There is no such mechanism. Distributed Cache by design is read only during the time of Job execution.

Q24. Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to the Hadoop job?
Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.
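For example, with FileInputFormat each additional call to addInputPath adds another input directory (the paths below are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultipleInputDirsExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-input example");
        // Each call adds one more directory to the job's input.
        FileInputFormat.addInputPath(job, new Path("/data/2014/"));
        FileInputFormat.addInputPath(job, new Path("/data/2015/"));
        // ... configure mapper/reducer/output as usual.
    }
}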

Q25. Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.

Q26. What will a Hadoop job do if you try to run it with an output directory that is already present?
The Hadoop job will throw an exception and exit.

Q27. How can you set an arbitrary number of mappers to be created for a job in Hadoop?
You cannot set it directly; the number of mappers is determined by the number of input splits.

Q28. How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically, by using the setNumReduceTasks method on the JobConf class (or on Job in the newer API), or set it as a configuration setting.
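A minimal sketch with the newer Job API (the job name is arbitrary); the same thing can be done from the command line through the mapreduce.job.reduces property (mapred.reduce.tasks in the old API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer count example");
        job.setNumReduceTasks(10); // request 10 reduce tasks for this job
        // ... rest of the job setup.
    }
}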

Q29. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner you will have to, at a minimum, do the following three things:
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the setPartitionerClass method, or add it to the job as a configuration setting (if your wrapper reads from a config file or Oozie); see the sketch below.
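A hedged sketch of such a partitioner (the routing rule and class name are made up for the example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Example rule: keys starting with a digit go to partition 0, everything else is hashed.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        String k = key.toString();
        if (!k.isEmpty() && Character.isDigit(k.charAt(0))) {
            return 0;
        }
        return (k.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It is then registered in the driver with job.setPartitionerClass(FirstCharPartitioner.class) (new API), or via the equivalent configuration setting.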

Q30. How did you debug your Hadoop code?  
There can be several ways of doing this, but the most common ones are:
- By using counters.
- The web interface provided by Hadoop framework.

Q31. Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for some reason?
This is an open-ended question, but most candidates who have written a production job should talk about some type of alert mechanism, such as an email being sent or their monitoring system raising an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, because unexpected data can very easily break the job.

http://www.edureka.co/blog/hadoop-interview-questions-hdfs-2/
What is a Namenode?

The Namenode is the master node on which the job tracker runs, and it holds the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and a single point of failure in HDFS.

Is Namenode also a commodity?

No. The Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS, so the Namenode has to be a high-availability machine.

What is metadata?

Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.

Why do we use HDFS for applications having large data sets and not when there are a lot of small files?

HDFS is more suitable for a large amount of data in a single file than for small amounts of data spread across multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to fill it with the unnecessary metadata that is generated for many small files. When a large amount of data is stored in a single file, the Namenode occupies less space. Hence, for optimized performance, HDFS favours large data sets over multiple small files.

If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?

In HDFS, blocks cannot be broken down. Before copying blocks from one machine to another, the master node will figure out the actual amount of space required, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

If a DataNode is full, how is it identified?

When data is stored in a datanode, the metadata of that data is stored in the Namenode, so the Namenode will identify when a datanode is full.

If datanodes increase, then do we need to upgrade the Namenode?

While installing the Hadoop system, the Namenode is sized based on the size of the cluster. Most of the time we do not need to upgrade the Namenode, because it does not store the actual data but just the metadata, so such a requirement rarely arises.

On what basis does the Namenode decide which datanode to write to?
As the Namenode has the metadata (information) related to all the datanodes, it knows which datanode is free.

What is the communication channel between client and namenode/datanode?
Clients communicate with the NameNode and DataNodes over TCP/IP using Hadoop's RPC and data-transfer protocols; SSH is used only by the cluster start-up scripts to launch daemons on remote nodes, not for client communication.

What is a Secondary Namenode? Is it a substitute to the Namenode?

The Secondary Namenode periodically reads the file system metadata from the Namenode's memory (the namespace image and edit log) and writes a merged checkpoint to disk. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes down.

What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?

In Gen 1 Hadoop, the Namenode is a single point of failure. In Gen 2 Hadoop, we have an Active and a Passive (standby) Namenode structure: if the active Namenode fails, the passive Namenode takes over.

What is MapReduce?
MapReduce is the ‘heart’ of Hadoop and consists of two parts – ‘map’ and ‘reduce’. Maps and reduces are programs for processing data. ‘Map’ processes the data first to give intermediate output, which is further processed by ‘Reduce’ to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduce operations.

Do we require two servers for the Namenode and the datanodes?

Yes, we need different servers for the Namenode and the datanodes. This is because the Namenode requires a highly capable system, as it stores information about the location of all the files stored in different datanodes, whereas datanodes can run on lower-end systems.

Why are the number of splits equal to the number of maps?

The number of maps is equal to the number of input splits because we want to process the key/value pairs of every input split, and each split is handled by exactly one map task.

Is Hadoop a database?
No. Hadoop is not a database. It's a filesystem (HDFS) and a distributed programming framework (MapReduce).

Which are the two types of ‘writes’ in HDFS?
There are two types of writes in HDFS: posted and non-posted writes. A posted write is when we write and forget about it, without worrying about the acknowledgement; it is similar to the traditional Indian post. In a non-posted write, we wait for the acknowledgement; it is similar to today's courier services. Naturally, a non-posted write is more expensive than a posted write, though both writes are asynchronous.

Why is ‘reading’ done in parallel in HDFS while ‘writing’ is not?
Reading is done in parallel because by doing so we can access the data fast. But we do not perform the write operation in parallel. The reason is that if we perform the write operation in parallel, then it might result in data inconsistency.

Miscs: http://insights.dice.com/2013/06/20/how-hadoop-works/
What happens when you invoke a read function in the operating system? How does the operating system perform IO?

Explain how Hadoop differs from a traditional database.

Give an example of a case where using Hadoop makes sense, and a case where using Hadoop is not the way to solve a particular problem.
Hadoop is useful in cases where you’re processing large quantities of hourly data and transforming data en masse.
It’s useful for offline batch operations.

How would you sort a very large data set that does not fit into memory?
Take a smaller chunk of the data and sort it in memory, so you partition the data into lots of little sorted data sets. Then merge those sorted lists into one big list before writing the results back to disk.

Have you ever participated in open source in any way?
http://insights.dice.com/2014/03/20/4-interview-questions-newcomers-hadoop/
Have you worked on a go-live project or a prototype?
“I had considerable experience as a data warehouse architect before taking classes to learn Hadoop. Then, to make sure I was ready to handle big data sets, I pulled massive amounts of historical data from the New York Stock Exchange and used the sample database to hone my analytical skills. I also used the data to create programs in MapReduce. You can see samples of my work by visiting my website.”

How many nodes can be in one cluster?
“Hadoop scales out nicely, so the load really depends on the structure and data warehouse configuration. Hadoop can easily handle 10 to 50 nodes.”

Which NoSQL databases have you worked with?
“There are four categories of NoSQL databases. The first is key-value stores. I’ve used Redis, primarily when working with semi-structured data. The second is column stores. I’ve used Cassandra when I needed scalability and high availability. The third is document databases. When I’ve needed to store and access semi-structured documents in formats like JSON, I’ve used CouchDB. Finally, there are graph databases like InfiniteGraph.”

Which tool have you used for monitoring nodes and clusters?
“I’ve used Nagios for monitoring servers and switches. And I’ve used Ganglia for monitoring the entire grid.”

