Hadoop 3.x comes with native support for using Microsoft Azure Data Lake Storage as a storage system in place of HDFS.
In this blog, we will discuss how to integrate your Azure Data Lake with Hadoop.
Azure Data Lake uses OAuth 2.0 to authenticate requests, so you need to create an application (user) in Azure Active Directory and grant it access to your Data Lake.
OAuth2 Support
Azure Data Lake Storage requires an OAuth2 bearer token to be present in the HTTPS header of every request, as per the OAuth2 specification. A valid OAuth2 bearer token must be obtained from Azure Active Directory for users who have access to the Azure Data Lake Storage account.
Azure Active Directory (Azure AD) is Microsoft's multi-tenant, cloud-based directory and identity management service.
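For reference, you can request such a token yourself over HTTPS. The sketch below uses curl with the tenant ID, client ID, and client secret you will generate in the next section (the resource URI for Data Lake is https://datalake.azure.net/); the Hadoop client performs this token exchange for you automatically, so this is only a way to sanity-check your credentials.
curl -X POST https://login.microsoftonline.com/<YOUR_TENANT_ID>/oauth2/token \
  --data-urlencode "grant_type=client_credentials" \
  --data-urlencode "client_id=<YOUR_CLIENT_ID>" \
  --data-urlencode "client_secret=<YOUR_CLIENT_SECRET>" \
  --data-urlencode "resource=https://datalake.azure.net/"
The JSON response contains an access_token field, which is the bearer token.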
Creating a service principal using Azure Active Directory
1. Open your Azure portal and click on Azure Active Directory.
2. Click on Add.
3. Provide the necessary details and remember the Name.
4. Click on the user (application) you have just created.
5. Your Application ID is your Client ID; note it down, then click on Settings and then on Keys.
6. Enter a name for your key, select the duration you want for that key, and click on Save.
7. Note down the value of the key; this will be your Client Secret.
8. In App registrations, click on Endpoints (just beside Add) and note down the OAuth 2.0 Token Endpoint; this will be your Token Refresh URL.
9. Open your Data Lake portal, click on Access control (IAM), and click on Add to add the user you created in Active Directory.
10. Select the Owner role.
11. In the Add user blade, search for the name you created in Azure Active Directory, select the user, and click on Ok.
Now you can finally see the user in your Azure Data Lake Storage portal (a command-line alternative for creating the same service principal is sketched below).
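If you prefer the command line, the same service principal can be created with the Azure CLI. This is only a sketch, and the application name below (hadoop-adls-app) is a placeholder:
az ad sp create-for-rbac --name hadoop-adls-app
The output contains appId (your Client ID), password (your Client Secret), and tenant; you still need to grant it the Owner role on your Data Lake account, either in the portal as above or with az role assignment create.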
With these credentials, you can communicate with Data Lake Storage using Hadoop commands.
To summarize, here is how what you have generated so far maps to the Hadoop 3 configuration:
Application ID — Client ID
OAuth 2.0 Token Endpoint — OAuth 2.0 Refresh URL
Key value — OAuth 2.0 Credential (Client Secret)
Now add these properties to your core-site.xml for the changes to take effect:
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>YOUR TOKEN ENDPOINT</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>YOUR CLIENT ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>YOUR CLIENT SECRET</value>
</property>
<property>
  <name>fs.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.Adl</value>
</property>
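If you would rather not keep the client secret in plain text, Hadoop's credential provider facility can hold it in a keystore instead. A minimal sketch, assuming a keystore location of your choosing (the path below is only an example):
hadoop credential create dfs.adls.oauth2.credential -provider jceks://file/usr/local/hadoop/etc/hadoop/adls.jceks
You would then point hadoop.security.credential.provider.path at that keystore in core-site.xml and drop the plain-text dfs.adls.oauth2.credential property.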
After adding these properties, save and close the file. Then open your hadoop-env.sh file and add the Hadoop tools directory to the classpath (the Azure Data Lake support comes from the Hadoop tools library):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
After adding the class path, save and close the hadoop-env.sh file
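As a quick sanity check, you can confirm that the Data Lake connector jar is present on that path (the exact file name varies by release):
ls $HADOOP_HOME/share/hadoop/tools/lib/ | grep azure-datalake
You should see a hadoop-azure-datalake-<version>.jar entry.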
Now, even without starting the HDFS daemons, you can interact with your ADL storage; the examples below work with the data already present in my ADL account.
To interact with your ADL storage, you need to provide its adl:// URI in each Hadoop command.
Let us first query our ADL using Hadoop commands.
Here we list the contents of the root directory using the ls command:
hadoop fs -ls adl://acdkiran.azuredatalakestore.net/
Next, we list the contents of the Datasets directory:
hadoop fs -ls adl://acdkiran.azuredatalakestore.net/Datasets
Let us now create a directory using the mkdir command:
hadoop fs -mkdir adl://acdkiran.azuredatalakestore.net/Test
You can see that we have successfully created the directory Test in ADL
Let us now copy some files from our local storage to ADL storage using the put command
hadoop fs -put Downloads/Datasets/tweets.rar adl://acdkiran.azuredatalakestore.net/Test
You can see that the tweets.rar file has been copied into ADL storage successfully.
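You can also verify the copy from the command line, for example:
hadoop fs -ls adl://acdkiran.azuredatalakestore.net/Test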
Let us confirm the same in our Azure portal as well.
In the Azure portal you can see that the Test directory has been created and the data has been populated successfully.
We hope this blog helped you understand how to integrate Hadoop 3.x with Azure Data Lake Storage and how to interact with ADL storage using Hadoop commands.