Microsoft Azure Data Factory



Azure Data Factory is a Microsoft cloud service on the Azure platform that integrates data from many different sources. It is a good fit when you need to build hybrid extract-transform-load (ETL), extract-load-transform (ELT) and data integration pipelines.
What does Azure Data Factory do?
It allows you to:
  • Copy data from many supported sources, both on-premises and in the cloud
  • Transform the data (see the sections below)
  • Publish the copied and transformed data, sending it to a destination data storage or analytics engine
  • Monitor the data flows using a rich graphical interface
What doesn’t Azure Data Factory do?
Data Factory isn’t SSIS (SQL Server Integration Services) in the cloud. It has fewer database-specific features and focuses on broader data transformation and movement, including big datasets and data lake operations.
Data Factory can, however, run your SSIS packages in the cloud once they are built in SSIS. This lets you combine Data Factory’s scalability with SSIS’s advanced ETL features.
Why do I need Azure Data Factory?
Data Factory is an enabler for any cloud project. In almost any cloud project you will need to move data across networks (on-premises network and cloud) and across services (e.g. from and to different Azure storage services).
Data Factory is particularly valuable for organizations that are taking their first steps in the cloud and therefore need to connect on-premises data with the cloud. For this, Azure Data Factory has an Integration Runtime engine: a gateway service that can be installed on-premises and that provides performant and secure transfer of data to and from the cloud.
How does it differ from other ETL Tools?
Data Factory is one option for a cloud ETL (or ELT) tool. Several features distinguish Azure Data Factory from other tools:
  • It can run SSIS packages
  • It auto-scales based on the given workload (it is a fully managed PaaS product)
  • It can run pipelines as often as once per minute
  • It bridges on-premises and Azure cloud environments through a gateway
  • It can handle big data volumes
  • It can connect to and work with other compute services (Azure Batch, HDInsight) to run truly big data computations during ETL
In our experience, the best alternative to Azure Data Factory is Apache Airflow, which has its advantages but also its disadvantages. Contact us for more details.
How do I work with Azure Data Factory?
Azure Data Factory is a UI-driven tool that offers a graphical overview for creating and managing activities and pipelines. It doesn’t require coding skills, although complex transformations do require Azure Data Factory experience.
Important features:
  • Azure Data Factory has built-in connectors for nearly all common on-premises data sources, including MySQL, SQL Server and Oracle databases
  • Azure Data Factory supports branching, where the output of one activity can be a trigger for the start of another activity.
    - e.g. first copy the data from on-premise to Blob, then merge all blobs
  • Azure Data Factory supports tumbling window triggers and event triggers. A tumbling window trigger is particularly relevant for creating partitioned data, for example in a Data Lake set-up where your data is automatically stored in daily partitioned blobs (e.g. YYYY/MM/DD/Blob.csv).
    An event trigger applies when an event, such as a new blob arriving in Blob Storage, should automatically start a transformation.
  • Azure Data Factory supports parameters, so values can be passed dynamically between datasets, pipelines and triggers. For example, the destination file can be named after the pipeline or after the date of the data slice.
  • Azure Data Factory can run a pipeline as often as once per minute. It therefore doesn’t support real-time processing, but it does enable close to real-time processing.
  • Azure Data Factory provides monitoring and alerting. The execution of the different pipelines can be monitored through the UI, and you can set up alerts (linked to Azure Monitor) if anything fails.
  • Azure Data Factory can work well with Azure Databricks to schedule ML algorithms. Read more about this in this insight.
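The branching and daily-partition features listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not an official template: the pipeline, activity and parameter names are hypothetical, the dictionary only mirrors the general shape of a pipeline definition (activities chained through `dependsOn`), and required fields such as inputs, outputs and type properties are left out for brevity.

```python
from datetime import datetime


def daily_partition_path(window_start: datetime, filename: str = "Blob.csv") -> str:
    """Return a YYYY/MM/DD/<filename> path for the day the window starts in."""
    return f"{window_start:%Y/%m/%d}/{filename}"


def build_pipeline(window_start: datetime) -> dict:
    """Sketch a two-step pipeline: copy to Blob first, then merge the blobs."""
    copy_step = {
        "name": "CopyOnPremToBlob",  # hypothetical activity name
        "type": "Copy",
    }
    merge_step = {
        "name": "MergeBlobs",  # hypothetical activity name
        "type": "Custom",  # activity type chosen only for illustration
        # Branching: this step starts only after the copy step succeeded.
        "dependsOn": [
            {"activity": "CopyOnPremToBlob", "dependencyConditions": ["Succeeded"]}
        ],
    }
    return {
        "name": "DailyCopyAndMerge",  # hypothetical pipeline name
        "properties": {
            # The partition path is passed in as a pipeline parameter.
            "parameters": {
                "outputPath": {
                    "type": "String",
                    "defaultValue": daily_partition_path(window_start),
                }
            },
            "activities": [copy_step, merge_step],
        },
    }


if __name__ == "__main__":
    pipeline = build_pipeline(datetime(2024, 3, 5))
    print(pipeline["properties"]["parameters"]["outputPath"]["defaultValue"])
```

In a real pipeline the partition path would more likely be derived inside Data Factory from the trigger's window start time; computing it in Python here just makes the naming scheme explicit.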
How does Azure Data Factory work with other Azure resources?
One of the main advantages of Azure Data Factory is that it integrates well with other Azure compute and storage resources. This is the exact purpose of linked services: they define the connection to external resources. There are two kinds of linked services you can define:
  • Data store services, which connect to e.g. Azure SQL Database, Azure SQL Data Warehouse, an on-premises database, a Data Lake, a file system, a NoSQL database, etc.
  • Compute services, which transform and enrich data: e.g. Azure HDInsight, Azure Machine Learning, a stored procedure in any SQL database, a Data Lake Analytics U-SQL activity, Azure Databricks and/or Azure Batch (using a Custom Activity)
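To make the two kinds of linked services more concrete, here is a rough Python sketch that builds one definition of each kind as a dictionary. The service names and placeholder values are hypothetical, and the property names reflect our understanding of the linked service JSON format; check them against the official connector documentation before relying on them. Authentication settings are deliberately left out.

```python
import json


def data_store_linked_service(name: str, connection_string: str) -> dict:
    """Sketch of a data store linked service (here: Azure Blob Storage)."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {"connectionString": connection_string},
        },
    }


def compute_linked_service(name: str, workspace_url: str, cluster_id: str) -> dict:
    """Sketch of a compute linked service (here: Azure Databricks)."""
    return {
        "name": name,
        "properties": {
            "type": "AzureDatabricks",
            "typeProperties": {
                "domain": workspace_url,
                "existingClusterId": cluster_id,
                # Credentials are omitted; never hard-code secrets here.
            },
        },
    }


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    store = data_store_linked_service("MyBlobStore", "<connection-string>")
    compute = compute_linked_service("MyDatabricks", "<workspace-url>", "<cluster-id>")
    print(json.dumps([store, compute], indent=2))
```

Datasets and activities then refer to these definitions by name, which keeps connection details in one place.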
Data Factory pricing is usage-based: you pay for the number of “activities” (data processing steps) run per month, and integration runtime usage is charged per hour, depending on the machine and the number of nodes used.
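As a rough illustration of how these two cost components add up, the Python sketch below estimates a monthly bill. The rates in the example are made-up placeholders, not actual Azure prices, and billing activities per 1,000 runs is an assumption made for this example; the real price list has more dimensions than this simplified model.

```python
def estimate_monthly_cost(
    activity_runs: int,
    price_per_1000_runs: float,
    runtime_hours: float,
    nodes: int,
    price_per_node_hour: float,
) -> float:
    """Estimate a monthly cost from activity runs and integration runtime usage."""
    # Component 1: number of activities executed during the month.
    activity_cost = activity_runs / 1000 * price_per_1000_runs
    # Component 2: integration runtime hours, multiplied by the node count.
    runtime_cost = runtime_hours * nodes * price_per_node_hour
    return round(activity_cost + runtime_cost, 2)


if __name__ == "__main__":
    # Placeholder figures only: 30,000 runs at 1.00 per 1,000 runs,
    # plus 100 runtime hours on 2 nodes at 0.50 per node-hour.
    print(estimate_monthly_cost(30_000, 1.00, 100, 2, 0.50))  # prints 130.0
```

Plugging in the current rates from the Azure pricing page gives a first-order estimate before committing to a design.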
Should I use Azure Data Factory or SSIS?
Use the right tool for the right purpose. The overview below shows that the two are complementary. They are also built that way: Azure Data Factory can deploy, manage and run SSIS packages in managed Azure-SSIS Integration Runtimes.
Based on your current platform/solution:

Tool                          Hybrid On-Prem & Azure Solution   Azure Solution   On-Prem Only Solution
Azure Data Factory (ADF V2)   Yes                               Yes              No
Integration Services (SSIS)   Yes                               Yes              Yes
Based on type of data:

Tool                          Small data   Close to real-time data (every minute)   Big Data
Azure Data Factory (ADF V2)   Yes          Yes                                      Yes
Integration Services (SSIS)   Yes          No                                       No




5 comments:

  1. Azure Data Factory is a fully managed data integration service that enables enterprise data-driven scenarios. These include ETL (Extract, Transform and Load) functions, data movement, and data synchronization between cloud and on-premises data stores. Azure Data Factory is fully managed, so you don't need to worry about managing infrastructure. It's also elastic, so you can scale up or down based on your load. And it's always available, so you don't have to worry about downtime or maintenance windows.

  2. Big Data is defined as the large volume of different types of data. This data is generated by companies, researchers, and individuals and this data is stored in a variety of storage devices. The data can be analyzed in a variety of ways using different technologies. Big Data technologies can be analyzed to discover business trends, to improve quality of services, or to manage servers.
