1. I have Hadoop 2.x and my configured block size is 128MB. Now, I have changed my block size to 150MB. So, will this change affect the files which are already present?
- No, this change will not affect the existing files; they will keep their 128MB block size. Once you restart the cluster after configuring the new block size, the change comes into effect, and all new files that you copy into HDFS will be written with a block size of 150MB.
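For reference, the block size is set through the dfs.blocksize property in hdfs-site.xml; a minimal sketch of the change described above (the value can also be given in plain bytes):
<property>
  <name>dfs.blocksize</name>
  <value>150m</value>
</property>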
2. HDFS works on the principle of ‘Write Once, Read Many Times.’ By this logic, can you overwrite a file which is already present in HDFS? If yes, explain how that can be done.
- Yes, we can overwrite a file which is already present in HDFS. Using the -f option with the put or copyFromLocal command, we can overwrite a file in HDFS:
hadoop fs -put -f <local path> <hdfs path>
3. What is meant by Safe Mode and when does the NameNode go into safe mode?
- Safe mode is a state in which you cannot write data to HDFS; HDFS is effectively read-only. You can read the data, but you cannot write into HDFS. During this phase, the NameNode loads the file system state from the fsimage and the edit logs into memory.
Every time you start the HDFS daemons, the NameNode goes into safe mode, collects block reports from the DataNodes, and checks whether the DataNodes are working.
If this process is interrupted by some internal or external problem, the NameNode can remain stuck in Safe Mode. In that case, you need to bring it out of Safe Mode explicitly by using the command: hdfs dfsadmin -safemode leave
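For reference, safe mode can also be inspected or toggled explicitly with dfsadmin:
hdfs dfsadmin -safemode get      # check whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode enter    # put the NameNode into safe mode manually
hdfs dfsadmin -safemode leave    # force the NameNode out of safe mode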
4. Currently, I am using Hadoop 2.x, but I want to upgrade to Hadoop 3.x. How can I upgrade without losing my data in HDFS?
- You can upgrade or downgrade to another Hadoop version without losing data as long as you have the NameNode's and DataNodes' current directories (with their VERSION files). While installing, you just need to point the new installation at these directories as the NameNode's metadata directory and the DataNodes' data directories.
Since we are not changing the cluster, Hadoop will recover your data by using the metadata present in those directories.
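As a sketch, these are the hdfs-site.xml properties you would point at your existing directories during the new installation (the paths shown are placeholders):
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hadoop/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hadoop/datanode</value>
</property>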
5. Where does the metadata of NameNode reside? Is it in-memory or on the disk?
- The answer is both. The metadata is persisted on disk, but once you start the Hadoop cluster, the NameNode loads it into memory for faster access. Updates that happen after the cluster starts are applied in memory and recorded in the edit log on disk, and they are merged back into the fsimage when a checkpoint occurs.
6. My Hadoop cluster is running fine, but unfortunately I have deleted the NameNode metadata directory. What will happen now? Will all my data be lost, and will the existing processes be disrupted?
- No! Everything will go on normally until you shut down your cluster, because once the Hadoop cluster is started, the NameNode's metadata is held in memory and the running daemons no longer depend on reading that local directory. So your data will still be accessible until you shut down the cluster, and nothing will happen to the existing processes either.
But once you shut down and restart the cluster, the NameNode has no metadata to load, so you will not be able to see any of your data.
7. Suppose I am using Hadoop 2.x with a block size of 128MB, and I am writing a 1GB file into the cluster; suddenly, after 200MB has been written, the process is stopped. What do you think will happen now? Will I be able to read the 200MB of data or not?
- You will be able to read only 128MB of data. A client can read only the complete blocks that were written into HDFS. You will not be able to read the remaining 72MB, because the write to that block was interrupted in between, and the remaining 824MB of the file was never written into HDFS at all.
While writing the data into HDFS, HDFS will simultaneously maintain replicas also. If your replication factor is 3, then the other 2 replicas will also be written simultaneously.
8. How can you troubleshoot if either of your NameNodes or DataNodes is not running?
- Check the clusterID in the NameNode's VERSION file and in the DataNode's VERSION file; both cluster IDs should match, otherwise there will be no synchronization between the NameNode and the DataNodes. So, if the two cluster IDs are different, you need to make them the same (typically by copying the NameNode's clusterID into the DataNode's VERSION file).
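As a quick check, you can compare the clusterID lines of the two VERSION files (the directory paths below are placeholders; use the dfs.namenode.name.dir and dfs.datanode.data.dir locations from your hdfs-site.xml):
grep clusterID /data/hadoop/namenode/current/VERSION
grep clusterID /data/hadoop/datanode/current/VERSION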
9. When does the reducer phase take place in a MapReduce job?
- The reduce phase has 3 steps: Shuffle, Sort, and Reduce. Shuffle phase is where the data is collected by the Reducer from each Mapper. This can happen while Mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the Mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage;
0–33% means it’s doing the shuffle
34–66% is sort
67 –100% is reduce
This is why your reducers will sometimes seem “stuck” at 33%: they are waiting for the Mappers to finish.
Reducers start shuffling once a configurable threshold percentage of Mappers has finished. You can change this parameter to make reducers start sooner or later.
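The threshold is controlled by the mapreduce.job.reduce.slowstart.completedmaps property (the fraction of map tasks that must complete before reducers are scheduled). As a sketch, assuming your driver uses ToolRunner so that -D options are picked up, a job could be submitted like this (the jar, class name, and paths are placeholders):
hadoop jar myjob.jar com.example.MyDriver -D mapreduce.job.reduce.slowstart.completedmaps=0.80 /input /output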
10. How can you chain MapReduce jobs?
A. Not every problem can be solved with a MapReduce program, but fewer still are those that can be solved with a single MapReduce job. Many problems can be solved with MapReduce, by writing several MapReduce steps which run in a series to accomplish a goal:
Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3
You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, call the next driver method, which creates a new JobConf object referring to different Mapper and Reducer classes, and so on. The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem.
Many problems, which at first seem impossible in MapReduce, can be accomplished by dividing one job into two or more.
Hadoop provides another mechanism for managing batches of jobs with dependencies between them. Rather than submitting a JobConf to the JobClient's runJob() or submitJob() methods, you can create org.apache.hadoop.mapred.jobcontrol.Job objects to represent each job; a Job takes a JobConf object as its constructor argument. Jobs can depend on one another through the use of the addDependingJob() method. The code:
x.addDependingJob(y)
says that Job x cannot start until Job y has successfully completed.
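A minimal sketch of this approach using the old org.apache.hadoop.mapred API (the driver class name, job configurations, and sleep-based polling are illustrative, not from the original post):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf(ChainDriver.class);   // configure Mapper1/Reducer1, input path, intermediate output path
        JobConf conf2 = new JobConf(ChainDriver.class);   // configure Mapper2/Reducer2, intermediate input path, final output path

        Job job1 = new Job(conf1);
        Job job2 = new Job(conf2);
        job2.addDependingJob(job1);                       // job2 will not start until job1 succeeds

        JobControl control = new JobControl("chain");
        control.addJob(job1);
        control.addJob(job2);

        Thread runner = new Thread(control);              // JobControl implements Runnable
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);                           // poll until both jobs complete
        }
        control.stop();
    }
}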
11. What are counters in MapReduce?
A. A Counter is generally used to keep track of the occurrences of any event. In the Hadoop Framework, whenever any MapReduce job gets executed, the Hadoop Framework initiates counters to keep track of the job statistics like the number of rows read, the number of rows written as output, etc.
These are the built-in counters in the Hadoop Framework. Additionally, we can also create and use our own custom counters.
Typically, some of the things that Hadoop counters track are:
- Number of Mappers and Reducers launched
- Number of bytes that get read and written
- The number of tasks that get launched and successfully run
- The amount of CPU and memory consumed, which helps judge whether it is appropriate for the job and the cluster nodes
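As a small illustration of a custom counter (the enum, the mapper class, and the "malformed record" condition are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    public enum RecordQuality { MALFORMED_RECORDS }       // custom counter

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().split(",").length < 3) {
            context.getCounter(RecordQuality.MALFORMED_RECORDS).increment(1);  // count bad rows
            return;
        }
        context.write(value, NullWritable.get());         // pass good rows through
    }
}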
12. Are there any other storage systems that can be used with MapReduce other than HDFS?
- Yes, Hadoop supports many other compatible file systems. With Hadoop 2.x you can use Amazon S3, and with Hadoop 3.x you can also use Microsoft's Azure Data Lake Storage or Azure Blob Storage. MongoDB has also released a Hadoop-MongoDB connector for integration.
13. Is this piece of code correct? If not explain where it went wrong.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static Text one = new Text(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
Yes, there is a fault in this code. The output value type declared in the Mapper's type parameters is IntWritable, but the value written to the context (one) is of type Text.
The data types of the output key and output value declared in the Mapper's type parameters must match the types of the key and value that are actually written to the context.
So, either change the output value type declared in the Mapper class parameters, or change the type of one to IntWritable.
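For comparison, a corrected version in the style of the standard word-count mapper declares one as an IntWritable:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // output value type now matches IntWritable
        }
    }
}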
14. What is the difference between HDFS block and an InputSplit, and explain how the input split is prepared in Hadoop?
- An HDFS block is a physical division of the data, while an InputSplit is a logical division. An InputSplit refers to the locations of the HDFS blocks it covers, and the InputFormat is responsible for creating the splits.
In general, if you have n nodes, HDFS will distribute the file's blocks across those n nodes. When you start a job, there will be one mapper per input split (by default, one per block). Hadoop schedules each Mapper, as far as possible, on a machine that holds the data it will process; this is known as data locality.
To cut a long story short: upload the data to HDFS and start an MR job, and Hadoop will take care of the optimized execution.
15. Can you find the top 10 records based on values using map reduce?
A. Yes, there is a design pattern called Top K records in Hadoop, using which, we can find out the top 10 records.
16. What is meant by a combiner, and where exactly is it used in MapReduce?
- A combiner acts like a mini reducer in Hadoop. The combiner aggregates each Mapper's output locally before it is sent to the Reducers.
So, because of the usage of the combiner, the burden on the Reducer will be less and the execution will happen faster.
The combiner is used after the Mapper and before the Reducer phases.
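In the driver, the combiner is wired in with a single call; for example, a word-count style job can reuse its reducer class (here assumed to be called IntSumReducer) as the combiner:
job.setCombinerClass(IntSumReducer.class);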
17. How can I get the output of a Hive query into a .csv file?
- You can save the output of a Hive query into a file by using the Insert overwrite statement as:
INSERT OVERWRITE LOCAL DIRECTORY '/home/acdgild/hiveql_output' select * from table;
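Note that the statement above uses Hive's default field delimiter; to get comma-separated output closer to a real .csv, you can add a row format clause (supported from Hive 0.11 onwards; the path and table are just examples):
INSERT OVERWRITE LOCAL DIRECTORY '/home/acdgild/hiveql_output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM table;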
18. Can I run a Hive query directly from the terminal without logging into the Hive shell?
- Yes, by using hive -e option, we can run any kind of Hive query directly from the terminal without logging into the Hive shell.
Here is an example:
hive -e 'select * from table'
You can also save the output into a file by redirecting it with the Linux ‘>’ operator, as shown below:
hive -e 'select * from table' > /home/acdgild/hiveql_output.tsv
19. Explain Cluster By vs. Order By vs. Sort By in Hive.
- CLUSTER BY guarantees global ordering, provided you're willing to join the multiple output files yourself.
The longer version:
- ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as the output.
- SORT BY x: orders data at each of the N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
- DISTRIBUTE BY x: ensures each of the N reducers gets non-overlapping ranges of x, but does not sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
- CLUSTER BY x: ensures each of the N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
Hence, CLUSTER BY is basically the more scalable version of ORDER BY.
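For example, assuming a table t with a column id, the following two queries are equivalent:
SELECT * FROM t DISTRIBUTE BY id SORT BY id;
SELECT * FROM t CLUSTER BY id;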
20. Explain the differences between Hive internal and External table.
- A Managed table is also called an Internal table. This is the default table type in Hive. When we create a table in Hive without specifying it as managed or external, we get a Managed table by default. If we create a table as a managed table, it will be created in a specific location in HDFS.
By default, managed table data is created under the /user/hive/warehouse directory in HDFS.
If we delete a Managed table, both the table data and the metadata for that table will be deleted from the HDFS.
An external table is used when the data is also consumed outside of Hive, or whenever we want to be able to drop the table's metadata while keeping the data as it is. Dropping an external table deletes only the table's schema (metadata); the data files remain in HDFS.
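A minimal sketch of the two kinds of tables (table names, columns, and the HDFS location are illustrative):
-- Managed (internal) table: DROP TABLE removes both the metadata and the data
CREATE TABLE managed_logs (id INT, msg STRING);

-- External table: DROP TABLE removes only the metadata; the files under /data/logs remain in HDFS
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
LOCATION '/data/logs';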
21. How can you select the current date and time using HiveQL?
SELECT from_unixtime(unix_timestamp());  -- current timestamp
SELECT CURRENT_DATE;                     -- current date
SELECT CURRENT_TIMESTAMP;                -- current timestamp
22. How can you skip the first line of the data set while loading it into a Hive table?
- While creating the Hive table, we can specify in the tblproperties to skip the first row and load the rest of the dataset. Here is an example for it.
create external table testtable (name string, message string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/testtable'
tblproperties ("skip.header.line.count"="1");
23. Explain the difference between COLLECT_LIST & COLLECT_SET and say where exactly can they be used in Hive.
- When you want to collect an array of values for a key, you can use these COLLECT_LIST & COLLECT_SET functions.
COLLECT_LIST will include duplicate values for a key in the list. COLLECT_SET will keep the unique values for a key in the list.
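For example, assuming a table orders(customer STRING, product STRING):
SELECT customer,
       collect_list(product) AS all_products,      -- keeps duplicates
       collect_set(product)  AS distinct_products  -- removes duplicates
FROM orders
GROUP BY customer;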
24. How can you run a Hive query in the Debug mode?
- Hive queries can be run in debug mode by starting your Hive console with the logger set to DEBUG, as follows:
hive --hiveconf hive.root.logger=DEBUG,console
Now all the queries you run in the Hive shell will run in debug mode, and you can also see the entire stack trace for each query.
25. How can you store the output of a Pig relation directly into Hive?
- Using the HCatStorer function of HCatalog, you can store the output of a Pig relation directly into a Hive table.
Similarly, you can load the data of a Hive table into a Pig relation for pre-processing using the HCatLoader function of HCatalog.
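A minimal Pig sketch (the database/table names and the filter column are placeholders; the loader/storer class names shown are the ones used by recent HCatalog releases, so check your version):
-- run with: pig -useHCatalog script.pig
A = LOAD 'default.source_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = FILTER A BY amount > 100;
STORE B INTO 'default.target_table' USING org.apache.hive.hcatalog.pig.HCatStorer();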
26. Can you process the data present in MongoDB using Pig?
- Yes, you can process the data present in MongoDB using Pig with the help of MongoDB Pig connector.
27. How many kinds of functions are available in Pig UDFs?
There are 3 kinds of functions available in Pig UDFs:
1. Eval function: General evaluation logic is written as an Eval function. It takes one record as input, evaluates it, and returns one result.
2. Aggregate function: These are Eval functions that work on a group of data; they take a bag as input and return a scalar value as output.
3. Filter function: These are Eval functions that return a Boolean value. If the record satisfies the condition, the function returns true; otherwise, it returns false.
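As an illustration, a simple Eval function that upper-cases its input could look like this (the class name is hypothetical); after REGISTERing the jar, it would be called from a FOREACH ... GENERATE statement:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                                  // skip empty or null records
        }
        return ((String) input.get(0)).toUpperCase();     // evaluate one record, return one result
    }
}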
28. How can you visualize the outcomes of a Pig relation?
- Apache Zeppelin provides one of the simplest ways to visualize the outcome of a Pig relation; support for visualizing Pig output was added in Zeppelin 0.7.0.
29. How can you load a file into HBase?
- One way to load a file is bulk loading with MapReduce.
Another way is to use Hive with the help of the hive-hbase storage handler: you load the data into a Hive table backed by HBase, and it is in turn reflected in HBase. You can refer to our blog on HBase Write Using Hive.
You can also write a shell script that iterates over the input and writes each line into the HBase table.
30. How can you transfer data present in MySQL to HBase?
One way to migrate data from MySQL to HBase is by using Sqoop.
You can also migrate data from MySQL to HBase using a MapReduce job.
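A hedged sketch of the Sqoop approach (the connection string, table, column family, and row key are placeholders):
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table customers \
  --hbase-table customers \
  --column-family cf \
  --hbase-row-key id \
  --hbase-create-table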
--------------------------------------------------------------------------------------------------------
1. What are the different types of file formats in Hive?
Ans. Different file formats which Hive can handle are:
- TEXTFILE
- SEQUENCEFILE
- RCFILE
- ORCFILE
2. Explain Indexing in Hive.
Ans. An index acts as a reference to the records. Instead of scanning all the records, Hive can consult the index to locate a particular record, so searches complete with minimal overhead and data lookups are faster.
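For example, in Hive versions that support indexing (the feature was removed in Hive 3.0), a compact index can be created like this (the table and column names are illustrative):
CREATE INDEX idx_emp_dept
ON TABLE employee (dept_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX idx_emp_dept ON employee REBUILD;   -- builds/refreshes the index data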
3. Explain about Avro File format in Hadoop.
Ans. Avro is one of the preferred data serialization systems because of its language neutrality.
Due to the lack of language portability in Hadoop's Writable classes, Avro becomes a natural choice, because it can handle multiple data formats that can be further processed by multiple languages.
Avro is most preferred for serializing the data in Hadoop.
It uses JSON for defining data types and protocols. It serializes data in a compact binary format.
Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
In short, Avro can be described as a file format introduced with Hadoop to store data in a predefined format. This file format can be used with any of Hadoop's tools, like Pig and Hive.
4. Does Hive support transactions?
Ans. Yes, Hive supports transactions from Hive 0.13 onwards, with some restrictions.
5. Explain about Top-k Map-Reduce design pattern.
Ans. The Top-K MapReduce design pattern is used to find the top k records from a given dataset.
This design pattern achieves this by defining a ranking function or comparison function between two records that determines whether one is higher than the other. We can apply this pattern to use MapReduce to find the records with the highest value across the entire data set.
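A minimal sketch of the pattern for k = 10 (the input layout, field positions, and class name are assumptions, not from the post): each Mapper keeps only its local top 10 in a TreeMap and emits it in cleanup(); a single Reducer then applies the same logic over all the mappers' lists to produce the global top 10.

import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Top10Mapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private final TreeMap<Long, Text> top = new TreeMap<>();   // sorted by the ranking value

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String[] fields = value.toString().split("\t");        // assume "id<TAB>value" lines
        long metric = Long.parseLong(fields[1]);
        top.put(metric, new Text(value));                      // ties overwrite; acceptable for a sketch
        if (top.size() > 10) {
            top.remove(top.firstKey());                        // evict the smallest, keep only 10
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Text record : top.descendingMap().values()) {
            context.write(NullWritable.get(), record);         // this mapper's local top 10
        }
    }
}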
6. Explain Hive Storage Handlers.
Ans. Storage Handlers are a combination of an InputFormat, an OutputFormat, a SerDe, and specific code that Hive uses to treat an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage Handlers are not limited to NoSQL databases; a storage handler can be designed for several different kinds of data stores.
7. Explain partitioning in Hive.
Ans. Table partitioning means dividing the table data into parts based on the values of particular columns, so that input records are segregated into different HDFS directories based on those column values.
8. What is the use of Impala?
Ans. Cloudera’s Impala is a massively parallel processing (MPP) SQL-like query engine that allows users to execute low latency SQL Queries for the data stored in HDFS and HBase, without any data transformation or movement.
The main goal of Impala is to make SQL on Hadoop operations, fast and efficient to appeal to new categories of users and open up Hadoop to new types of use cases. Impala makes SQL queries simple enough to be accessible to analysts who are familiar with SQL and to those using business intelligence tools that run on Hadoop.
9. Explain how to choose between Managed & External tables in Hive.
Ans. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing to multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life cycle of the table and data.
10. What are the different methods in Mapper class and order of their invocation?
Ans. There are 3 methods in Mapper.
*map() –> executes for each record of the input (each line, in text input format)
*setup() –> executes once per map task (i.e., once per input split), at the beginning
*cleanup() –> executes once per map task, at the end
Order of invocation:
setup() – 1
map() – 2
cleanup() – 3
11. What is the purpose of Record Reader in Hadoop?
Ans. In MapReduce, data is divided into input splits. The RecordReader converts the byte-oriented input provided by the InputSplit into a record-oriented view for the Mapper task to process. It thus takes responsibility for handling record boundaries and presenting the task with keys and values.
12. What details are present in FSIMAGE?
Ans. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage. The FsImage is stored as a file in the NameNode's local file system.
The NameNode keeps an image of the entire file system namespace and the file-to-block map in memory. This key metadata item is designed to be compact, so that a NameNode with 4GB of RAM is sufficient to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk.
It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up.
13. Why do we need bucketing in Hive?
Ans. Bucketing is a simple idea: you create multiple buckets, read each record, and place it into one of the buckets based on some logic, usually a hashing algorithm. This lets you organize your data by decomposing it into multiple parts. You might wonder, if we can achieve the same thing using partitioning, why bother with bucketing? There is one difference. When we partition, we create a partition for each unique value of the column, which may give rise to a situation where you need thousands of tiny partitions. With bucketing, you can limit the number of parts to a value you choose and decompose your data into that many buckets. In Hive, a partition is a directory, but a bucket is a file.
14. What is a Sequence File in Hadoop?
Ans. In addition to text files, Hadoop also supports binary files. Among these binary formats, the Sequence File is a Hadoop-specific file format that stores serialized key/value pairs.
15. How do you copy files from one cluster to another cluster?
Ans. With the help of the DistCp command, we can copy files from one cluster to another.
The most common invocation of DistCp is an inter-cluster copy:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
-----------------------------------------------------------------------------------------------------
1. Can you join or transform tables/columns when importing using Sqoop?
Yes. Sqoop supports free-form query imports, so you can apply SQL (including joins and column transformations) while importing the data.
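For example, a free-form query import with a join (the connection string, tables, and columns are hypothetical; the $CONDITIONS token is required by Sqoop whenever --query is used):
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --query 'SELECT o.id, o.amount, c.name FROM orders o JOIN customers c ON o.cust_id = c.id WHERE $CONDITIONS' \
  --split-by o.id \
  --target-dir /user/hadoop/orders_enriched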
2. What is the importance of indexing in Hive and how does this relate to Partition and Bucketing?
The goal of Hive indexing is to improve the speed of query lookup on certain columns of a table. Without an index, queries with predicates like ‘WHERE tab1.col1 = 10’ load the entire table or partition and process all the rows. However, if an index exists for col1, then only a portion of the file needs to be loaded and processed.
Indexes become even more essential when the tables grow extremely large, and as you now undoubtedly know, Hive thrives on large tables. We can index tables that are partitioned or bucketed.
Bucketing:
Bucketing is usually used for join operations, as you can optimize joins by bucketing records by a specific ‘key’ or ‘id’. In this way, when you want to do a join operation, records with the same ‘key’ will be in the same bucket and then the join operation will be faster. This is similar to a technique for decomposing data sets into more manageable parts.
Partitioning:
Whenever you want the data contained in a table to be split across multiple sections (directories) of a Hive table, the use of partitioning is highly suggested. See the sketch after this section for an example.
The entries for the various columns of the dataset are segregated and then stored in their respective partition. When we write the query to fetch the values from the table, only the required partitions of the table are queried, which reduces the time taken by the query to yield the result.
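A sketch combining both ideas (the table, columns, and bucket count are illustrative):
CREATE TABLE sales (
  order_id INT,
  customer_id INT,
  amount DOUBLE
)
PARTITIONED BY (order_date STRING)          -- one directory per date value
CLUSTERED BY (customer_id) INTO 32 BUCKETS  -- 32 files per partition, hashed by customer_id
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';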
3. How many types of joins are present in Hadoop, and when should you use them?
In Hadoop, there are two types of joins: the map-side join and the reduce-side join.
Map Side Join:
Joining of datasets in the map phase is called a map-side join. A map-side join is preferred when you need to join one large dataset with one small dataset. Map-side joins are faster because they avoid the sort and shuffle phase; a technique called Distributed Cache is used, in which the smaller dataset is shipped to all the nodes so that each Mapper can join against it locally. The size of the smaller dataset is therefore limited by the memory available on the nodes.
Reduce Side Join:
Joining of datasets in the reduce phase is called a reduce-side join. When both datasets are large, we use a reduce-side join. Reduce-side joins are less efficient than map-side joins because the datasets have to go through the sort and shuffle phase.
4. How can you optimize Hive queries?
Follow the blog link below for tips on optimizing your Hive queries:
https://kiranvasadibigdata.blogspot.com/p/join-optimization-in-apache-hive.html
5. What are combiners in Hadoop?
Combiner class can summarize the map output records with the same key, and the output (key value collection) of the combiner will be sent over the network to the actual Reducer task as an input. This will help to cut down the amount of data shuffled between the mappers and the reducers.
6. What is the difference between Combiner and in-mapper combiner in Hadoop?
You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it’s shuffled across the network to the various cluster Reducers.
The in-mapper combiner takes this optimization a bit further: the aggregated values are not even written to local disk; the aggregation occurs in memory, in the Mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of org.apache.hadoop.mapreduce.Mapper.
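A minimal word-count style sketch of in-mapper combining (the class name and tokenization are illustrative): counts are aggregated in a HashMap inside the Mapper and emitted only in cleanup().

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String token : value.toString().split("\\s+")) {
            counts.merge(token, 1, Integer::sum);   // aggregate in memory, emit nothing yet
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        Text word = new Text();
        IntWritable sum = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            word.set(e.getKey());
            sum.set(e.getValue());
            context.write(word, sum);               // emit pre-aggregated counts once per mapper
        }
    }
}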
7. Let's consider this scenario: if I have a folder consisting of n files (datasets) and I want to apply the same mapper and reducer logic, what should I do?
The traditional FileInputFormat passes each row as input to the mapper. If you instead want to take a whole file as a single input record, you can use a WholeFileInputFormat-style custom InputFormat, which presents the entire file as the input to the mapper.
8. Suppose you have 50 mappers and 1 reducer; how will your cluster perform? And if the job takes a lot of time, how can you reduce it?
If there are 50 mappers and 1 reducer, the whole program will take a long time to run, because the single reducer needs to collect all of the mappers' output before it can process it. To mitigate this, we can do two things:
A. If possible, you can add a combiner so that the amount of output coming from the mapper will be reduced and the load on the reducer also will get reduced.
B. You can enable map output compression so that the size of the data going to the reducer will be less.
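For point B, map output compression is controlled by two properties, set in mapred-site.xml or per job (Snappy is a common codec choice, assuming the native library is available):
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>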
9. Explain some string functions in Hive
String functions perform operations on String data type columns. The various string functions are as follows:
ascii(string A) – Returns the ASCII value of the first character of the string.
concat(string A, string B, ...) – Concatenates the given strings.
substr(string A, int start) – Returns the substring starting from the given index until the end.
upper(string A) – Returns the string converted to upper case.
lower(string A) – Returns the string converted to lower case.
trim(string A) – Returns the string with the spaces trimmed from both ends.
10. Can you create a table in Hive which can skip the header lines from the dataset?
Yes, we can include the skip.header.line.count property inside the tblproperties while creating the table.
For example:
CREATE TABLE Employee (Emp_Number Int, Emp_Name String, Emp_sal Int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="1");
11. What are the binary storage formats available in Hive?
The default storage format in Hive is plain text (TEXTFILE), but Hive also supports binary formats such as SequenceFile, Avro data files, RCFile, ORC files, and Parquet files.
12. Can you use multiple Hive instances at the same time? If yes, how can you do that?
By default, Hive comes with Derby database. So, you cannot use multiple instances with Derby database. However, if you change the Hive metastore as MySQL, then you can use multiple Hive instances at the same time.
You can refer to the post – MySQL Metastore Integration with Hive, to know how to configure Hive metastore as MySQL.
13. Is there any testing available in Pig? If yes, how can you do it?
Yes, we can unit test Pig scripts, for example with PigUnit.
14. Can you run Pig scripts using Java? If yes, how can you do it?
Yes, it is possible to embed Pig scripts inside a Java code.
You can refer to the post – Embedding Pig in Java, to know how Pig scripts can be run using Java.
15. Can you automate a Flume job to run for a stipulated time? If yes, how can you do that?
A Flume job can be run for a stipulated time from a Java program. For this, Flume provides an Application class that lets you start an agent from Java code:
import org.apache.flume.node.Application;
import org.apache.log4j.BasicConfigurator;

public class flume {
    public static void main(String[] args) {
        // equivalent to: flume-ng agent -n TwitterAgent -f flume.conf
        String[] args1 = new String[] { "agent", "-nTwitterAgent", "-fflume.conf" };
        System.setProperty("hadoop.home.dir", "/");
        BasicConfigurator.configure();
        Application.main(args1);
    }
}
The code above runs the Flume configuration file from a Java program. We can automate it by running this code inside a thread or a scheduler.
Hope this post has been useful in helping you prepare for that big interview. In the case of any queries, feel free to comment below and we will get back to you at the earliest.