
Data Movement And Transformation Through ADF

Authors: Avinash Kumar Pandey & Pushpender Singh


What is ADF?

ADF, which stands for "Azure Data Factory", is Microsoft's cloud-based ETL (extract, transform, load) service on Azure. It moves data from source to sink through pipelines and offers a range of data transformation capabilities.


Characteristics of ADF

ADF possesses several notable features. It is serverless, which eliminates the need to manage underlying infrastructure. It is delivered as a platform as a service (PaaS), so the computing platform, including the operating system, database, web server, and more, is provided for you. It also offers a low-code/no-code authoring experience and is billed on a pay-as-you-go basis, so you only pay for what you use.


Elements of ADF

ADF mainly deals with five different elements:

  • Linked Service, which establishes connections with respective data stores.

  • Dataset, which identifies data residing in different stores.

  • Activities, which are particular tasks performed in the ADF pipeline (e.g., copy activity, data flow, etc.).

  • Pipeline, which is a logical grouping of activities that connects sources with their respective sinks.

  • Trigger, which initiates pipeline runs.
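For those who script their factories rather than use the Studio UI, the same five elements map onto model classes in the Azure Data Factory management SDK for Python. The snippet below is only an illustrative mapping and assumes the azure-mgmt-datafactory package is installed:

# Illustrative mapping of the five ADF elements onto SDK model classes
# (azure-mgmt-datafactory). This is a reference sketch, not a full pipeline.
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,  # Linked Service: connection to a data store
    DelimitedTextDataset,           # Dataset: identifies data residing in a store
    CopyActivity,                   # Activity: a single task inside a pipeline
    PipelineResource,               # Pipeline: a logical grouping of activities
    ScheduleTrigger,                # Trigger: initiates pipeline runs
)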

Typical ADF Pipeline diagram


Our Business Case

Our source data store is Azure Blob Storage, and we'll be moving data from there to Azure Data Lake Gen2, which will serve as our staging area. Once we complete the necessary data transformations, we'll load the data from Data Lake Gen2 into the target Snowflake tables. To accomplish all of this, we will use Azure Data Factory exclusively.



Requirements

The following Azure and Snowflake resources are required to perform the task described above:

  • Azure Resource Group

  • Azure Blob Storage

  • Azure Data Lake Storage Gen2

  • Azure Data Factory

  • Snowflake Account

Our Data

Let's take a moment to examine our raw data, which consists of a sample movie dataset comprising only four records.



Pipeline Implementation Steps

1. To implement the pipeline, we will start by creating a Resource Group, followed by a Storage Account and Blob Storage, where our raw data is currently located. This can be accomplished through the Azure Portal. Once Blob Storage is created, the next step is to create another Storage Account for Data Lake Gen2, which will serve as our staging area. To enable Gen2 capabilities, we need to activate the 'hierarchical namespace' option during storage account creation, which can be found under the 'Advanced' tab.
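If you prefer to script this step instead of clicking through the portal, the sketch below uses the Azure management SDKs for Python. It is a minimal, hedged example: the subscription ID, group name, account name, and region are placeholders, and exact parameter names may vary slightly between SDK versions.

# Hedged sketch: create a resource group and a Data Lake Gen2 storage account.
# All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

credential = DefaultAzureCredential()
subscription_id = "<subscription-id>"

# Resource group
rg_client = ResourceManagementClient(credential, subscription_id)
rg_client.resource_groups.create_or_update("adf-demo-rg", {"location": "eastus"})

# Staging storage account; is_hns_enabled=True is the SDK equivalent of ticking
# 'hierarchical namespace' under the 'Advanced' tab in the portal.
storage_client = StorageManagementClient(credential, subscription_id)
poller = storage_client.storage_accounts.begin_create(
    "adf-demo-rg",
    "adfdemodatalake",
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="eastus",
        is_hns_enabled=True,
    ),
)
poller.result()  # block until the deployment completes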



2. After creating the Storage Account and Data Lake, we can view them as shown below:



3. After setting up the Blob Storage Account and Data Lake, the next step is to create a Data Factory and launch its studio once the deployment is successful.


4. Once we're in the ADF working environment, our first task is to ensure that we have the necessary Linked Services for our pipeline. To do so, go to the Manage section and create three Linked Services (one each for Azure Blob Storage, Azure Data Lake Gen2, and Snowflake) that we can use when establishing connections with the 'SOURCE' and 'SINK'.

(Note: the authentication type should be 'SAS' for the storage account when creating its linked service.)
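For reference, the same three linked services can be registered programmatically with azure-mgmt-datafactory. This is a hedged sketch only: the SAS URI, account key, connection string, and resource names are placeholders, and the model arguments should be checked against your SDK version.

# Hedged sketch: register the Blob Storage, Data Lake Gen2, and Snowflake
# linked services. All secrets and names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureBlobStorageLinkedService,
    AzureBlobFSLinkedService,      # Azure Data Lake Storage Gen2
    SnowflakeLinkedService,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "adf-demo-rg", "adf-demo-factory"

# Blob Storage source, authenticated with a SAS URI (matching the note above)
adf.linked_services.create_or_update(rg, factory, "ls_blob",
    LinkedServiceResource(properties=AzureBlobStorageLinkedService(
        sas_uri="<blob-sas-uri>")))

# Data Lake Gen2 staging area
adf.linked_services.create_or_update(rg, factory, "ls_adls",
    LinkedServiceResource(properties=AzureBlobFSLinkedService(
        url="https://<account>.dfs.core.windows.net",
        account_key="<account-key>")))

# Snowflake target
adf.linked_services.create_or_update(rg, factory, "ls_snowflake",
    LinkedServiceResource(properties=SnowflakeLinkedService(
        connection_string="<snowflake-connection-string>")))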




5. After setting up all of the necessary Linked Services, our next step is to create our master pipeline, which will load data from Blob Storage (as the source) to Data Lake Storage (as the staging area), and subsequently from Data Lake Storage to Snowflake (as the target) after performing data transformations. To accomplish this, follow these steps:

  1. Navigate to Author -> Pipeline (you can name it anything; in our case, we named it 'Master Load').

  2. Navigate to Activities and drag and drop the 'Execute Pipeline' component onto the canvas area.
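The scripted counterpart of this step, again as a hedged sketch (the pipeline names mirror the ones used in this walkthrough; the resource group and factory names are placeholders):

# Hedged sketch: the 'Master Load' pipeline whose only activity executes a
# child pipeline -- the Execute Pipeline component dropped on the canvas.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecutePipelineActivity, PipelineReference,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

master_load = PipelineResource(activities=[
    ExecutePipelineActivity(
        name="Execute Pipeline1",
        pipeline=PipelineReference(reference_name="Data Loading",
                                   type="PipelineReference"),
        wait_on_completion=True,  # block until the child pipeline finishes
    ),
])
adf.pipelines.create_or_update("adf-demo-rg", "adf-demo-factory",
                               "Master Load", master_load)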



6. Under the 'Execute Pipeline1' component, we need to create a sub-pipeline (you can name it anything; in our case, we named it 'Data Loading') which will be assigned to our main component.


7. In the sub-pipeline, we will use two components: 'Get Metadata', which provides metadata such as column count, file size, and last modified date, and 'Copy Data'. To configure the 'Copy Data' component, follow these steps:

  1. Go to Source and add the linked service that was created for the Blob Storage.

  2. Select the file type (in this case, delimited as we have a CSV file type).

  3. Go to Sink and add the linked service created for Data Lake Gen2.

  4. Select the file type (again, delimited, as we want a CSV file as our output).

  5. Go to Mapping and map the data by importing the schema.
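For completeness, here is what the 'Data Loading' sub-pipeline might look like when defined through the SDK instead of the canvas. This is a hedged sketch: the dataset names are hypothetical placeholders for delimited-text datasets built on the linked services created earlier.

# Hedged sketch: the 'Data Loading' sub-pipeline -- a Get Metadata activity
# followed by a Copy activity from Blob Storage (CSV) to Data Lake Gen2 (CSV).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, GetMetadataActivity, CopyActivity,
    DelimitedTextSource, DelimitedTextSink, DatasetReference,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

src = DatasetReference(reference_name="ds_blob_movies_csv", type="DatasetReference")
stg = DatasetReference(reference_name="ds_adls_movies_csv", type="DatasetReference")

data_loading = PipelineResource(activities=[
    # Surface column count, file size, last modified date, etc. for the source file
    GetMetadataActivity(
        name="Get Metadata1",
        dataset=src,
        field_list=["columnCount", "size", "lastModified"],
    ),
    # Copy the delimited file from Blob Storage into the Data Lake staging area
    CopyActivity(
        name="Copy Data1",
        inputs=[src],
        outputs=[stg],
        source=DelimitedTextSource(),
        sink=DelimitedTextSink(),
    ),
])
adf.pipelines.create_or_update("adf-demo-rg", "adf-demo-factory",
                               "Data Loading", data_loading)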

Once completed, our sub-pipeline looks as follows:



8. To proceed, we need to assign the sub-pipeline (i.e., the 'Data Loading' pipeline) to the Execute Pipeline1 component. After that, we can validate, publish, and trigger the pipeline. Once everything runs successfully, our sample movie data will be available in the Data Lake container (stage).
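Validation and publishing are done in the Studio, but triggering a run (and watching its status) can also be scripted. A hedged sketch using the same management client (resource names are placeholders):

# Hedged sketch: trigger the 'Master Load' pipeline and poll its run status --
# the scripted counterpart of 'Trigger now' plus the Monitor tab.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "adf-demo-rg", "adf-demo-factory"

run = adf.pipelines.create_run(rg, factory, "Master Load", parameters={})
while True:
    status = adf.pipeline_runs.get(rg, factory, run.run_id).status
    print("Pipeline run status:", status)
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(15)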



9. We need to add another 'Execute Pipeline' component to our canvas, but before that, ensure that we are on the 'Master Load' pipeline. Drag and drop the component onto the canvas area, as shown below:



10. Under "Execute Pipeline2," we will create another sub-pipeline called "Transformation" and attach it to "Execute Pipeline2," similar to how we did it for "Execute Pipeline1." Then, go to the "Transformation" pipeline and add the "Data Flow" component by dragging and dropping it onto the canvas area. Double-click on it to access another canvas where we will build our entire transformation workflow.




To configure the data transformation workflow, follow these steps:


a. Configure the source dataset: under the Source component, go to Source settings and select the dataset from Data Lake Gen2. Then, go to Projection and import the schema.





We can preview the data at any point by enabling Data Flow Debug.



b. To filter the movies to a specific range of years, follow these steps:


  1. Select the Filter component in the Data Flow canvas.

  2. Go to Filter settings and under Filter on, write the expression "toInteger(Year) >= 1900 && toInteger(Year) <= 2010". This will filter the movies whose year is between 1900 and 2010.
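As a sanity check, the same filtering logic is expressed below in pandas. This runs outside ADF and is only meant to illustrate what the data flow expression does; 'movies.csv' is a hypothetical local copy of the staged file.

# Illustration only: the year filter from the data flow expression, in pandas.
import pandas as pd

movies = pd.read_csv("movies.csv")      # hypothetical local copy of the staged CSV
year = pd.to_numeric(movies["Year"])    # equivalent of toInteger(Year)
filtered = movies[(year >= 1900) & (year <= 2010)]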



c. To normalize addresses that span multiple lines, we can use the Derived Column component to replace line breaks with commas. To do this, follow these steps:


  1. Select the Derived Column component.

  2. In the Derived Column settings, click "Add Column."

  3. Select the Address column.

  4. Write the expression "regexReplace(regexReplace(Address,'[\n]',','),'[\r]',',')" in the Expression field.



d. To separate the comma-separated values in the address data, we use another Derived Column component (named 'Split Values' in our flow). Follow the steps below:

  1. Select the 'Split Values' Derived Column component.

  2. In the Derived Column's settings, add a new column and select the Address column.

  3. In the expression box, write "split(Address, ',')[1]" to extract the first segment of the comma-separated address (array indexes in the data flow expression language start at 1; use [2] for the second segment).



e. To perform data aggregation and store the results separately, we can use an Aggregate Component. In our use case, we have performed yearly aggregation of ratings. To do this, select the Aggregate Component and go to its settings. Then, group the data by year and aggregate it by adding a column with the expression "avg(toInteger(Rating))".
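To make steps c, d, and e concrete, the same chain of transformations is sketched below in pandas. This is an illustration outside ADF, building on the 'filtered' frame from the earlier sketch; column names follow the sample dataset.

# Illustration only: derived column, split, and aggregate (steps c-e) in pandas.
import pandas as pd

df = filtered.copy()  # 'filtered' comes from the earlier filter sketch

# c. Replace line breaks in multi-line addresses with commas
#    (mirrors regexReplace(regexReplace(Address,'[\n]',','),'[\r]',','))
df["Address"] = df["Address"].str.replace(r"[\n\r]", ",", regex=True)

# d. Split the comma-separated address and keep one segment
#    (ADF data flow arrays are 1-based; pandas/Python indexing is 0-based)
df["AddressPart"] = df["Address"].str.split(",").str[0]

# e. Year-wise average rating, mirroring avg(toInteger(Rating)) grouped by Year
yearly_rating = (
    df.assign(Rating=pd.to_numeric(df["Rating"]))
      .groupby("Year", as_index=False)["Rating"]
      .mean()
)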






f. To configure the Sink component, we used two different sinks to load data into two separate Snowflake tables. The first table stores the year-wise rating while the second table stores the data after address splitting. To achieve this, follow the steps below:

  1. Select Sink

  2. Under Dataset, add the linked service (Snowflake)

  3. Select Mapping

  4. Enable Auto mapping

  5. Inspect the data mapping

  6. Preview the data to ensure everything is correct.



g. As mentioned earlier, to store the different result sets in their respective Snowflake tables, add another branch off the main flow from the Split Values component and then add a second Sink component. Follow these steps to achieve this:

  1. Select Sink component

  2. Under Dataset, add a linked service for Snowflake

  3. Configure mapping for the data by auto-mapping

  4. Inspect the mapping and data preview to ensure everything is correct.




h. After all the transformation components are configured correctly, the final data flow for the transformation pipeline is shown below:



11. To execute the 'Execute Pipeline2' component, follow the previously mentioned steps: select 'Validate', then 'Publish', and finally 'Trigger' the pipeline.


12. After the pipeline runs successfully, the transformed data will be available in our existing Snowflake tables.
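A quick way to confirm the load is to query the target tables directly with the Snowflake connector for Python. This is a hedged sketch: the table names and connection parameters below are placeholders, not the actual objects used in this walkthrough.

# Hedged sketch: verify row counts in the two Snowflake tables populated by the sinks.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
try:
    cur = conn.cursor()
    for table in ("YEARLY_RATING", "MOVIES_ADDRESS_SPLIT"):  # placeholder table names
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        print(table, "row count:", cur.fetchone()[0])
finally:
    conn.close()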




Conclusion

To sum up, this business case highlights the importance of an efficient data integration process and demonstrates how Azure Data Factory can help organizations achieve it. It shows the benefits of a cloud-based architecture, such as scalability, along with the rich transformation capabilities Azure Data Factory offers to shape and format data as it moves between source and destination systems. The solution can process data in bulk or near real time and transform it into the desired format before loading it into Snowflake.

