Azure Data Factory – Using Data Flow Joins

Data flow is a visual data transformation feature in Azure Data Factory that requires no coding. No coding does not mean no effort: we still have to put together visual transformation components to build the data flow. Some examples are Join, Select, Union, Exists, Sort and Filter.

We can pull data from database tables using a SQL join or a SQL query. But when the data sources hold data in different formats such as JSON, CSV, Excel and XML, the transformation can be done easily with data flows.

This article covers the inner join and the transformation components it needs in a data flow.

Prerequisites

  1. At least one data source, for example Azure Blob Storage.
  2. An Azure Data Factory Instance.
  3. Understanding of pipelines, datasets and linked services.

Post outline

  1. Create Azure Blob Storage Linked Service <AzureStorageLinkedService>.
  2. Create datasets using the Azure Storage Linked Service.
  3. Create a data flow <InnerJoinDataFlow> with three sources.
  4. Implement the inner join using Join, Select and Sink transformations.
  5. Invoke the data flow from a pipeline.
  6. Verify the output in the blob container.

Data Preparation

To keep the focus on the data flow, the linked service and datasets are assumed to be already created. The sample data can be created from the JSON given below and uploaded to an Azure Storage blob container.

Employee.json

[
  {
    "EmployeeID": 1,
    "FirstName": "James",
    "LastName": "David",
    "City": "New York",
    "DateOfBirth": "1991-12-01"
  },
  {
    "EmployeeID": 2,
    "FirstName": "Mike",
    "LastName": "Jane",
    "City": "Singapore",
    "DateOfBirth": "1992-06-03"
  },
  {
    "EmployeeID": 3,
    "FirstName": "Juan",
    "LastName": "James",
    "City": "Tokyo",
    "DateOfBirth": "1980-06-26"
  },
  {
    "EmployeeID": 4,
    "FirstName": "Sarah",
    "LastName": "Carlos",
    "City": "London",
    "DateOfBirth": "1986-08-12"
  },
  {
    "EmployeeID": 5,
    "FirstName": "Garcia",
    "LastName": "Paul",
    "City": "Berlin",
    "DateOfBirth": "1996-01-31"
  },
  {
    "EmployeeID": 6,
    "FirstName": "Brown",
    "LastName": "Daniel",
    "City": "New York",
    "DateOfBirth": "1988-09-22"
  },
  {
    "EmployeeID": 7,
    "FirstName": "Williams",
    "LastName": "James",
    "City": "Houston",
    "DateOfBirth": "1981-10-25"
  },
  {
    "EmployeeID": 8,
    "FirstName": "Hover",
    "LastName": "Martin",
    "City": "Los Angeles",
    "DateOfBirth": "1995-01-26"
  },
  {
    "EmployeeID": 9,
    "FirstName": "Jeffey",
    "LastName": "Den",
    "City": "Philadelphia",
    "DateOfBirth": "1998-03-13"
  },
  {
    "EmployeeID": 10,
    "FirstName": "Marker",
    "LastName": "Jonas",
    "City": "Chicago",
    "DateOfBirth": "2000-11-17"
  },
  {
    "EmployeeID": 11,
    "FirstName": "Adnres",
    "LastName": "Kevin",
    "City": "Michigan",
    "DateOfBirth": "1998-09-04"
  },
  {
    "EmployeeID": 12,
    "FirstName": "Scarlet",
    "LastName": "Holding",
    "City": "Paris",
    "DateOfBirth": "1999-05-19"
  },
  {
    "EmployeeID": 13,
    "FirstName": "Blaze",
    "LastName": "May",
    "City": "Rome",
    "DateOfBirth": "2001-04-11"
  }
]

Departments.json

[
  {
    "DepID": 1,
    "DepName": "Human Resource"
  },
  {
    "DepID": 2,
    "DepName": "Finance and Tax"
  },
  {
    "DepID": 3,
    "DepName": "Information Technology"
  },
  {
    "DepID": 4,
    "DepName": "Communication"
  },
  {
    "DepID": 5,
    "DepName": "Department of Home"
  },
  {
    "DepID": 6,
    "DepName": "Department of Justice"
  },
  {
    "DepID": 7,
    "DepName": "Labor Law"
  },
  {
    "DepID": 8,
    "DepName": "Sport Department"
  },
  {
    "DepID": 9,
    "DepName": "Bio Chemical"
  },
  {
    "DepID": 10,
    "DepName": "Child Education"
  }
]

EmployeeDepartmentMapping.json

[
  {
    "MappingID": 1,
    "EmployeeeID": 1,
    "DepartmentID": 1
  },
  {
    "MappingID": 2,
    "EmployeeeID": 2,
    "DepartmentID": 5
  },
  {
    "MappingID": 3,
    "EmployeeeID": 3,
    "DepartmentID": 4
  },
  {
    "MappingID": 4,
    "EmployeeeID": 5,
    "DepartmentID": 2
  },
  {
    "MappingID": 5,
    "EmployeeeID": 6,
    "DepartmentID": 3
  },
  {
    "MappingID": 6,
    "EmployeeeID": 4,
    "DepartmentID": 6
  },
  {
    "MappingID": 7,
    "EmployeeeID": 7,
    "DepartmentID": 8
  },
  {
    "MappingID": 8,
    "EmployeeeID": 9,
    "DepartmentID": 1
  },
  {
    "MappingID": 9,
    "EmployeeeID": 8,
    "DepartmentID": 9
  },
  {
    "MappingID": 10,
    "EmployeeeID": 11,
    "DepartmentID": 4
  },
  {
    "MappingID": 11,
    "EmployeeeID": 10,
    "DepartmentID": 10
  },
  {
    "MappingID": 12,
    "EmployeeeID": 13,
    "DepartmentID": 7
  },
  {
    "MappingID": 13,
    "EmployeeeID": 12,
    "DepartmentID": 9
  }
]

Define Data Sources

First, we need to create the Blob Storage linked service and then create three datasets on top of it. Let's suppose the datasets are Employees, Departments and EmployeeDepartmentMapping.
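For reference, here is a minimal sketch of what the linked service and one of the dataset definitions could look like in JSON. The connection string and the container name (employee-data here) are placeholders, so replace them with your own values.

AzureStorageLinkedService (linked service)

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "<your storage account connection string>"
    }
  }
}

Employees (dataset)

{
  "name": "Employees",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "type": "Json",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "employee-data",
        "fileName": "Employee.json"
      }
    }
  }
}

The Departments and EmployeeDepartmentMapping datasets follow the same pattern, pointing at Departments.json and EmployeeDepartmentMapping.json.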

Secondly, create a new data flow named <InnerJoinDataFlow> from the data flow option: hover over Data flows and select Mapping Data Flow.

In the next step, add three sources, Employees, Departments and EmpDepMapping, to the data flow by clicking Add Source in the center panel.

Then apply the required source settings for all the data sources as shown in the screenshots below, and make sure to test the connection for each source.
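If you open the data flow in code view at this point, the three sources should appear under typeProperties roughly as in the fragment below. This is only a sketch, and the dataset names are assumed to match the datasets created earlier.

"sources": [
  {
    "name": "Employees",
    "dataset": { "referenceName": "Employees", "type": "DatasetReference" }
  },
  {
    "name": "Departments",
    "dataset": { "referenceName": "Departments", "type": "DatasetReference" }
  },
  {
    "name": "EmpDepMapping",
    "dataset": { "referenceName": "EmployeeDepartmentMapping", "type": "DatasetReference" }
  }
]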

Implement Inner Join

To implement the inner join, we need three more transformation components: Join, Select and Sink. Click the "+" sign on each transformation component to chain the next one and complete the join across the three data sources.

Enable the Data flow debug option to turn on Data Preview for each transformation component, that is, at every stage of the data flow. After enabling it, click Refresh to preview the data. If the data preview does not display any data, something is wrong and should be resolved before moving forward.

The following video shows the settings for the Join, Select and Sink transformations.

Inner Join Settings and Complete Flow
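The same settings can also be read from the data flow's JSON. The abridged fragment below is only a sketch of the two Join transformations: it assumes the join conditions are Employees.EmployeeID = EmpDepMapping.EmployeeeID (the mapping file really does spell the column EmployeeeID) and EmpDepMapping.DepartmentID = Departments.DepID, the stream names JoinEmployeeMapping and JoinDepartments are made up, and the script generated by the designer will contain more settings as well as the Select and Sink steps.

{
  "name": "InnerJoinDataFlow",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "scriptLines": [
        "Employees, EmpDepMapping join(EmployeeID == EmployeeeID,",
        "     joinType:'inner',",
        "     broadcast: 'auto')~> JoinEmployeeMapping",
        "JoinEmployeeMapping, Departments join(DepartmentID == DepID,",
        "     joinType:'inner',",
        "     broadcast: 'auto')~> JoinDepartments"
      ]
    }
  }
}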

Now the data flow is ready, but to invoke it we need an Azure Data Factory pipeline.

Invoke data flow from pipeline

Go to Pipelines > New pipeline > Activities > Move & transform and drag a Data flow activity onto the canvas. The Adding data flow panel opens: choose Use existing data flow, select the data flow created above from the drop-down, and click OK to apply it.
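In code view, the pipeline boils down to a single Data flow (ExecuteDataFlow) activity that references the data flow. A trimmed sketch is shown below; the pipeline and activity names are just examples.

{
  "name": "InnerJoinPipeline",
  "properties": {
    "activities": [
      {
        "name": "RunInnerJoinDataFlow",
        "type": "ExecuteDataFlow",
        "typeProperties": {
          "dataFlow": {
            "referenceName": "InnerJoinDataFlow",
            "type": "DataFlowReference"
          }
        }
      }
    ]
  }
}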

Click Debug to kick off the pipeline with the data flow.

Verify data in blob container

Go to your storage account > blob container to verify the output of the data flow pipeline. The pipeline creates four kinds of files: started, committed, success and part files. The data is contained in the part file, and the output is shown below.
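For example, assuming the Select transformation keeps the employee columns plus the department name, the record for EmployeeID 1 in the part file would look roughly like this:

{
  "EmployeeID": 1,
  "FirstName": "James",
  "LastName": "David",
  "City": "New York",
  "DateOfBirth": "1991-12-01",
  "DepName": "Human Resource"
}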

So we can see that the final output is based on the inner join between Employee.json, Departments.json and EmployeeDepartmentMapping.json.

There are other join types as well, for example Full Outer, Left Outer, Right Outer and Cross Join; which one to use depends on the requirement. Selecting the required join type in the Join transformation and publishing the changes applies it to the data flow.
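In the underlying script this is only a different joinType value. For instance, turning the first join of the earlier sketch into a left outer join changes its scriptLines entries to something like the following (same assumptions and made-up names as before):

"Employees, EmpDepMapping join(EmployeeID == EmployeeeID,",
"     joinType:'leftOuter',",
"     broadcast: 'auto')~> JoinEmployeeMapping",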

This is an overview of joins in data flow, but data flows are capable of handling more complex and real-time scenarios.

I hope this post gives you an idea of how to implement data flows in Azure Data Factory.

If you have any suggestions or feedback, please put them in the comment box.

Happy Learning 🙂
