Beginner’s Guide to Apache Spark

Introduction to Apache Spark

In this blog we will discuss the basics of Spark’s functionality and its installation. Apache Spark is a cluster computing framework that runs on top of the Hadoop ecosystem and handles different types of data. It is a one-stop solution to many problems. Spark has rich resources for handling data and, most importantly, it is 10-100x faster than Hadoop’s MapReduce. It attains this speed of computation through its in-memory primitives: data is cached in memory (RAM) and all computations are performed in-memory.
Spark’s rich set of components covers almost all the workloads of the Hadoop ecosystem. For example, we can perform batch processing in Spark as well as real-time data processing using its own streaming engine, Spark Streaming.
We can perform various functions with Spark:
  • SQL operations: It has its own SQL engine called Spark SQL, which covers the features of both SQL and Hive.
  • Machine learning: It has its own machine learning library, MLlib, so machine learning can be performed without the help of Apache Mahout.
  • Graph processing: Graph processing is performed using the GraphX component.
All the above features are in-built in Spark.
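As a minimal illustration of the Spark SQL component, the sketch below (Scala, typed into spark-shell against a Spark 1.x build) registers a small JSON dataset as a table and queries it with plain SQL; the file name people.json and its columns are illustrative assumptions, not part of this guide.
import org.apache.spark.sql.SQLContext

// spark-shell already provides sc (the SparkContext); build an SQLContext on top of it.
val sqlContext = new SQLContext(sc)

// Hypothetical input file with one JSON record per line, e.g. {"name":"Ravi","age":30}
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")

// Run a plain SQL query over the temporary table.
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()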
Spark can be run on different types of cluster managers, such as Hadoop YARN and Apache Mesos, and it has its own standalone scheduler to get started when no other framework is available. Spark also provides easy access to data stored in many systems: for example, it can read from HDFS, HBase, MongoDB and Cassandra, and it can store data in the local file system as well.

Resilient Distributed Datasets

A Resilient Distributed Dataset (RDD) is a simple and immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In Spark, all operations are performed on RDDs.
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
Let’s now look at the features of Resilient Distributed Datasets:
  • In Hadoop, data is stored as blocks spread across different DataNodes. In Spark, instead of following that approach, we split RDDs into partitions stored on the worker nodes (DataNodes), which are computed in parallel across all the nodes.
  • In Hadoop we need to replicate the data for fault recovery, but in Apache Spark replication is not required, because lost RDD partitions can be recomputed.
  • RDDs load the data for us and are resilient, which means they can be recomputed.
  • RDDs support two types of operations: transformations, which create a new dataset from an existing RDD, and actions, which return a value to the driver program after performing a computation on the dataset.
  • RDDs keep track of the transformations used to build them (their lineage). If a node fails, the lost RDD partitions can be rebuilt on other nodes, in parallel.

RDDs can be created in two different ways:

  • Referencing an external dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.
  • By parallelizing a collection of objects (a list or a set) in the driver program.
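A short spark-shell (Scala) sketch of both approaches; the file path input.txt is a placeholder.
// 1) Reference an external dataset (a local file, an HDFS path, etc.)
val lines = sc.textFile("input.txt")

// 2) Parallelize an existing collection in the driver program
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))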
Lazy evaluation in RDDs
If you create an RDD from an existing RDD, that is called a transformation, and the new RDD will not be materialized until you call an action. Spark delays the computation until you actually want the result: in an interactive session there could be situations where you type something wrong and have to correct it, and evaluating every step immediately would only add unnecessary delays. Deferring execution also lets Spark optimize the required calculations and take intelligent decisions, which is not possible with line-by-line code execution. Apache Spark also recovers from failures and slow workers.
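The spark-shell (Scala) sketch below illustrates this: the transformations only record the lineage, and nothing runs until the action is called.
val numbers = sc.parallelize(1 to 1000000)

// Transformations: nothing is executed yet; Spark only records how to compute these RDDs.
val evens   = numbers.filter(_ % 2 == 0)
val squared = evens.map(n => n.toLong * n)

// Action: this is the point where Spark actually schedules and runs the job.
val total = squared.count()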
Architecture of Apache Spark
An Apache Spark application contains two programs: a driver program and the worker programs. A cluster manager sits in between to interact with the workers on the cluster nodes, and the Spark context keeps in touch with the worker nodes through the cluster manager.
The Spark context is like a master and the Spark workers are like slaves. Workers contain the executors that run the job; if any dependencies or arguments have to be passed, the Spark context takes care of that. RDD partitions reside on the Spark executors. You can also run Spark applications locally using local threads, and if you want to take advantage of a distributed environment you can take the help of S3, HDFS or any other storage system.
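A minimal sketch of how a driver program creates its Spark context; the application name and master URL are placeholders — local[2] runs everything in local threads, while a URL such as spark://master-host:7077 would point at a standalone cluster.
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")        // placeholder application name
      .setMaster("local[2]")      // local threads; use e.g. spark://master-host:7077 on a cluster
    val sc = new SparkContext(conf)

    // ... build RDDs and run jobs here ...

    sc.stop()
  }
}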

Life cycle of an Apache Spark program:

  1. Some input RDDs are created from external data or by parallelizing the collection of objects in the driver program.
  2. These RDDs are lazily transformed into new RDDs using transformations like filter() or map().
  3. Spark caches any intermediate RDDs that will need to be reused.
  4. Actions such as count() and collect() are launched to kick off a parallel computation, which is then optimized and executed by Spark.
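Putting the four steps together in one spark-shell (Scala) sketch; the file name error-logs.txt is a placeholder.
// 1. Create an input RDD from external data.
val logs = sc.textFile("error-logs.txt")

// 2. Lazily transform it with filter()/map().
val errors = logs.filter(line => line.contains("ERROR"))

// 3. Cache an intermediate RDD that will be reused.
errors.cache()

// 4. Launch actions; the first one kicks off the actual computation.
val howMany = errors.count()
val allErrors = errors.collect()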
Let’s now discuss the steps to install Spark on your cluster:

Step-by-step process to install Spark
Before installing Spark, Scala needs to be installed on the system. We need to follow the below steps to install Scala.
1. Open the terminal in your CentOS system.
To download Scala, type the below command:
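For example, the Scala 2.11.1 archive can be fetched with wget from the scala-lang.org archive (the exact URL is an assumption; any mirror that provides scala-2.11.1.tgz will work):
wget https://www.scala-lang.org/files/archive/scala-2.11.1.tgz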
2. Extract the downloaded tar file using the below command:
tar -xvf scala-2.11.1.tgz
After extracting, specify the path of Scala in the .bashrc file.
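Assuming Scala was extracted into the home directory (the path below is an assumption and should match wherever you extracted it), the lines to add look like this:
export SCALA_HOME=$HOME/scala-2.11.1
export PATH=$PATH:$SCALA_HOME/bin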
After setting the path, we need to save the file and type the below command:
source .bashrc
The above command wraps up the Scala installation. We then need to install Spark.
To install Apache Spark in CentOS, we need to follow the below steps to download and install a single-node Spark cluster.
1. Open the browser and go to the Spark download link.
The file will be downloaded into the Downloads folder.
2. Go to the Downloads folder and untar the downloaded file using the below command:
tar -xvf spark-1.5.1-bin-hadoop2.6.tgz
After untarring the file, we need to move it to the home folder using the below command:
sudo mv spark-1.5.1-bin-hadoop2.6 /home/acadgild
Now the folder has been moved to the home directory.
We need to update the path for Spark in .bashrc in the same way as we did for Scala, and then apply the change by typing the command source .bashrc.
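For reference, assuming Spark was moved to /home/acadgild as in the previous step, the lines added to .bashrc would look like this:
export SPARK_HOME=/home/acadgild/spark-1.5.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin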
Make a folder named ‘work’ in HOME using the below command:
mkdir work
Inside the work folder, we need to make another folder named ‘sparkdata’ using the below command:
mkdir work/sparkdata
We need to give 777 permissions to the sparkdata folder using the below command:
chmod 777 $HOME/work/sparkdata
Now move into the conf directory of the Spark folder using the below commands:
cd spark-1.5.1-bin-hadoop2.6
cd conf
Type the command ls to see the files inside the conf folder.
There will be a file named spark-env.sh.template; we need to copy it to a file named spark-env.sh using the below command:
cp spark-env.sh.template spark-env.sh
Edit the spark-env.sh file using the below command:
gedit spark-env.sh
and make the configuration as follows.
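A minimal sketch of the kind of entries spark-env.sh needs for this single-node setup is shown below; the Java and Scala paths are assumptions and must match your own installation, and SPARK_WORKER_INSTANCES=2 matches the two workers started later in this guide.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk   # assumption: adjust to your Java installation
export SCALA_HOME=$HOME/scala-2.11.1               # assumption: adjust to your Scala installation
export SPARK_MASTER_IP=localhost                   # single-node setup
export SPARK_WORKER_INSTANCES=2                    # start two worker instances on this node
export SPARK_WORKER_MEMORY=1g                      # memory per worker (assumption)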

Note: Make sure that you give the paths of Java and Scala correctly. After editing, save and close the file.

Let’s follow the below steps to start the Spark single-node cluster. Move to the sbin directory of the Spark folder using the below command:
cd spark-1.5.1-bin-hadoop2.6/sbin
Inside sbin, type the below command to start the master and worker daemons:
./start-all.sh
Now the Apache Spark single-node cluster will start with one master and two workers.
You can check whether the cluster is running by using the below command:
jps
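If the daemons started correctly, the jps output will typically include entries like the following (the process IDs are placeholders):
4321 Master
4355 Worker
4389 Worker
4412 Jps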
If the Master and Worker daemons are running, you have successfully started the Spark single-node cluster.
We hope this blog helped you get a basic understanding of Spark and the ways to install it.

