Hadoop 3.x comes with native support for using Microsoft Azure Data Lake Storage as a storage system in place of HDFS.
In this blog, we will discuss how to integrate your Azure Data Lake with Hadoop.
Azure Data Lake uses OAuth 2.0 to authenticate requests, so you need to create an application (user) in Azure Active Directory and grant it access to your Data Lake.
OAuth2 Support
Azure Data Lake Storage requires an OAuth2 bearer token to be present in the HTTPS header of every request, as per the OAuth2 specification. A valid OAuth2 bearer token must be obtained from Azure Active Directory for users who have access to the Azure Data Lake Storage account.
Azure Active Directory (Azure AD) is Microsoft's multi-tenant, cloud-based directory and identity management service.
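For reference, you can request such a token yourself over HTTPS. The sketch below uses curl with the tenant ID, client ID, and client secret you will generate in the next section (the resource URI for Data Lake is https://datalake.azure.net/); the Hadoop client performs this token exchange for you automatically, so this is only a way to sanity-check your credentials.
curl -X POST https://login.microsoftonline.com/<YOUR_TENANT_ID>/oauth2/token \
  --data-urlencode "grant_type=client_credentials" \
  --data-urlencode "client_id=<YOUR_CLIENT_ID>" \
  --data-urlencode "client_secret=<YOUR_CLIENT_SECRET>" \
  --data-urlencode "resource=https://datalake.azure.net/"
The JSON response contains an access_token field, which is the bearer token.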
Creating a service principal using Azure Active Directory
1. Open your Azure portal and click on Azure Active Directory.
2. Click on Add.
3. Provide the necessary details and remember the Name.
4. Click on the user (application) you have just created.
5. Your Application ID is your Client ID; note it down, then click on Settings and then on Keys.
6. Enter a name for your key, select the duration you want for that key, and click on Save.
7. Note down the value of the key; this will be your Client Secret.
8. In App registrations, click on Endpoints (just beside Add) and note down the OAuth 2.0 Token Endpoint; this will be your Token Refresh URL.
9. Open your Data Lake portal, click on Access control (IAM), and click on Add to add the user you created in Active Directory.
10. Select the Owner role.
11. In the Add user blade, search for the name you created in Azure Active Directory, select the user, and click on Ok.
Now you can finally see the user in your Azure Data Lake Storage portal (a command-line alternative for creating the same service principal is sketched below).
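If you prefer the command line, the same service principal can be created with the Azure CLI. This is only a sketch, and the application name below (hadoop-adls-app) is a placeholder:
az ad sp create-for-rbac --name hadoop-adls-app
The output contains appId (your Client ID), password (your Client Secret), and tenant; you still need to grant it the Owner role on your Data Lake account, either in the portal as above or with az role assignment create.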
With these credentials, you can communicate with Data Lake Storage using Hadoop commands.
To summarize, here is how what you have generated so far maps to the Hadoop 3 configuration:
Application ID — Client ID
OAuth 2.0 Token Endpoint — OAuth 2.0 Refresh URL
Key value — OAuth 2.0 Credential (Client Secret)
Now add these properties to your core-site.xml for the changes to take effect:
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>YOUR TOKEN ENDPOINT</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>YOUR CLIENT ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>YOUR CLIENT SECRET</value>
</property>
<property>
  <name>fs.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.Adl</value>
</property>
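If you would rather not keep the client secret in plain text, Hadoop's credential provider facility can hold it in a keystore instead. A minimal sketch, assuming a keystore location of your choosing (the path below is only an example):
hadoop credential create dfs.adls.oauth2.credential -provider jceks://file/usr/local/hadoop/etc/hadoop/adls.jceks
You would then point hadoop.security.credential.provider.path at that keystore in core-site.xml and drop the plain-text dfs.adls.oauth2.credential property.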
After adding these properties, save and close the file. Then open your hadoop-env.sh file and add the Hadoop tools directory to the classpath (the Azure Data Lake support comes from the Hadoop tools library):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
After adding the class path, save and close the hadoop-env.sh file
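As a quick sanity check, you can confirm that the Data Lake connector jar is present on that path (the exact file name varies by release):
ls $HADOOP_HOME/share/hadoop/tools/lib/ | grep azure-datalake
You should see a hadoop-azure-datalake-<version>.jar entry.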
Now, even without starting the HDFS daemons, you can interact with your ADL storage; the examples below work with the data already present in my ADL account.
To interact with your ADL storage, you need to provide its adl:// URI in each Hadoop command.
Let us first query our ADL using Hadoop commands.
Here we list the contents of the root directory using the ls command:
hadoop fs -ls adl://acdkiran.azuredatalakestore.net/
Next, we list the contents of the Datasets directory:
hadoop fs -ls adl://acdkiran.azuredatalakestore.net/Datasets
Let us now create a directory using the mkdir command:
hadoop fs -mkdir adl://acdkiran.azuredatalakestore.net/Test
You can see that we have successfully created the directory Test in ADL
Let us now copy some files from our local storage to ADL storage using the put command
hadoop fs -put Downloads/Datasets/tweets.rar adl://acdkiran.azuredatalakestore.net/Test
You can see that the tweets.rar file has been copied into ADL storage successfully.
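You can also verify the copy from the command line, for example:
hadoop fs -ls adl://acdkiran.azuredatalakestore.net/Test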
Let us confirm the same in our Azure portal as well.
In the Azure portal you can see that the Test directory has been created and the data has been populated successfully.
We hope this blog helped you understand how to integrate Hadoop 3.x with Azure Data Lake Storage and how to interact with ADL storage using Hadoop commands.