
Generally, the Azure Data Factory (ADF) aggregate transformation is used to perform computations such as COUNT, SUM, MIN, and MAX, using ADF data flow expressions. However, the aggregate transformation can also be combined with a select transformation to remove duplicate data. Let’s see how we can achieve it.
Data Source
Just to keep things simple, we will use the JSON below as the data source; it contains duplicate rows (the entries for EmployeeID 2, 9, and 13 each appear twice). Create an Employee.json file from this JSON.
[
  {
    "EmployeeID": 1,
    "FirstName": "James",
    "LastName": "David",
    "City": "New York",
    "DateOfBirth": "1991-12-01"
  },
  {
    "EmployeeID": 2,
    "FirstName": "Mike",
    "LastName": "Jane",
    "City": "Singapore",
    "DateOfBirth": "1992-06-03"
  },
  {
    "EmployeeID": 2,
    "FirstName": "Mike",
    "LastName": "Jane",
    "City": "Singapore",
    "DateOfBirth": "1992-06-03"
  },
  {
    "EmployeeID": 3,
    "FirstName": "Juan",
    "LastName": "James",
    "City": "Tokyo",
    "DateOfBirth": "1980-06-26"
  },
  {
    "EmployeeID": 4,
    "FirstName": "Sarah",
    "LastName": "Carlos",
    "City": "London",
    "DateOfBirth": "1986-08-12"
  },
  {
    "EmployeeID": 5,
    "FirstName": "Garcia",
    "LastName": "Paul",
    "City": "Berlin",
    "DateOfBirth": "1996-01-31"
  },
  {
    "EmployeeID": 6,
    "FirstName": "Brown",
    "LastName": "Daniel",
    "City": "New York",
    "DateOfBirth": "1988-09-22"
  },
  {
    "EmployeeID": 7,
    "FirstName": "Williams",
    "LastName": "James",
    "City": "Houston",
    "DateOfBirth": "1981-10-25"
  },
  {
    "EmployeeID": 8,
    "FirstName": "Hover",
    "LastName": "Martin",
    "City": "Los Angeles",
    "DateOfBirth": "1995-01-26"
  },
  {
    "EmployeeID": 9,
    "FirstName": "Jeffey",
    "LastName": "Den",
    "City": "Philadelphia",
    "DateOfBirth": "1998-03-13"
  },
  {
    "EmployeeID": 9,
    "FirstName": "Jeffey",
    "LastName": "Den",
    "City": "Philadelphia",
    "DateOfBirth": "1998-03-13"
  },
  {
    "EmployeeID": 10,
    "FirstName": "Marker",
    "LastName": "Jonas",
    "City": "Chicago",
    "DateOfBirth": "2000-11-17"
  },
  {
    "EmployeeID": 11,
    "FirstName": "Adnres",
    "LastName": "Kevin",
    "City": "Michigan",
    "DateOfBirth": "1998-09-04"
  },
  {
    "EmployeeID": 12,
    "FirstName": "Scarlet",
    "LastName": "Holding",
    "City": "Paris",
    "DateOfBirth": "1999-05-19"
  },
  {
    "EmployeeID": 13,
    "FirstName": "Blaze",
    "LastName": "May",
    "City": "Rome",
    "DateOfBirth": "2001-04-11"
  },
  {
    "EmployeeID": 13,
    "FirstName": "Blaze",
    "LastName": "May",
    "City": "Rome",
    "DateOfBirth": "2001-04-11"
  }
]
Dataflow Implementation
Data flow implementation requires an Azure Data Factory and a Storage Account instance. Perform the steps below to set up the environment for implementing a data flow.
- Create a resource group <demo>.
- Create a Storage Account <dataflowtransformation>, add a container named <jsons>, and upload the Employee.json file.
- Create an Azure Data Factory <data-flow-demo>.
- Create a blob storage linked service <AzureStorageLinkedService> for the storage account <dataflowtransformation> and test the connection in ADF.
- Create a blob storage JSON dataset <Employees> using the linked service, browse to Employee.json, and preview the data in ADF.
- Create a blob storage JSON dataset <GroupbyOutputDataSet> using the linked service in ADF. This dataset will be used by the sink transformation.
The steps to create these services and components are not covered in detail here, but a complete Azure Data Factory walkthrough with a copy activity is available here.
Now, we are all set to create a mapping data flow. To create one, go to Factory Resources > Data Flows > New mapping data flow <RemoveDuplicateDataflow>. The data flow requires a Source, an Aggregate, a Select, and a Sink transformation; the required settings for each are covered below. Click the “+” sign to add transformations.
Data flow Diagram
[Data flow diagram: Source → Aggregate → Select → Sink]
Source Transform
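In the source transformation, select the Employees dataset and, since Employee.json is a JSON array, set the document form to “Array of documents” under the JSON settings, then import the projection so the downstream transformations can see the five columns. If you open the script view (the Script button on the data flow canvas), the source should look roughly like the sketch below; the stream name EmployeeSource and the column types are assumptions based on the sample data:

source(output(
        EmployeeID as integer,
        FirstName as string,
        LastName as string,
        City as string,
        DateOfBirth as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    documentForm: 'arrayOfDocuments') ~> EmployeeSource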
Aggregate Transform
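The aggregate transformation does the actual deduplication. One common configuration, which mirrors the distinct-rows snippet in the data flow script documentation, is to group on a hash of all columns and keep the first value of every column via a column pattern. A rough sketch; the helper column name RowFingerprint and the stream name DistinctRows are my own choices:

EmployeeSource aggregate(groupBy(RowFingerprint = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows

Grouping on the sha2 fingerprint treats two rows as duplicates only when every column matches, and first($$) keeps one copy of each column per group.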
Select Transform
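The select transformation then drops the helper RowFingerprint column so that only the original employee columns reach the sink. Approximately:

DistinctRows select(mapColumn(
        EmployeeID,
        FirstName,
        LastName,
        City,
        DateOfBirth
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SelectEmployeeColumns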
Sink Transform
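The sink writes the distinct rows through the GroupbyOutputDataSet dataset. To produce one named output file such as distinctemployee.json, set the sink’s file name option to output to a single file, which forces a single output partition. A sketch, with the caveat that the exact options the UI generates may differ:

SelectEmployeeColumns sink(allowSchemaDrift: true,
    validateSchema: false,
    partitionFileNames:['distinctemployee.json'],
    partitionBy('hash', 1)) ~> SinkDistinctEmployees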
After performing all the steps shown above, the data flow is ready. The next step is to create a pipeline <DataFlowPipeline>, drag and drop a Data flow activity, associate it with the data flow <RemoveDuplicateDataflow>, and make sure to publish the changes. Publishing is a mandatory step.

To run the pipeline, select the pipeline and click the Debug option. The pipeline execution will start and complete after some time. Let’s verify the output.
Output
To validate the pipeline execution output, go to the Storage Account <dataflowtransformation>, navigate to the container <jsons>, and open the output file <distinctemployee.json>. It should contain 13 unique employee records; the duplicated entries for EmployeeID 2, 9, and 13 are each reduced to a single row.

Finally, as we know, a GROUP BY operation is CPU-intensive in any T-SQL-like language, especially for large datasets. If we group the data on multiple columns, it becomes an even more time-consuming activity for the database server.
So, if ADF deals with large datasets, you may want to avoid GROUP BY while fetching the data and instead use the aggregate transformation along with the select transformation to remove duplicate data.
So, we have seen how we can clean up duplicate records and produce unique output from a data source. Here, both the source input and the target output are in JSON format, but if you want, you can change the output format by using a dataset of the corresponding format type, such as CSV or Excel.
Hope this is helpful.
Thanks for the post. Very helpful.
Is there a way to remove the duplicates on the basis of one column only?
It would work in much the same way. Let’s say you want to deduplicate on the “FirstName” column only: group by FirstName in the aggregate transformation, and take an aggregate such as first() of the remaining columns so that one row is kept per FirstName.
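A sketch of that single-column variant in data flow script, assuming we keep the first row encountered for each FirstName (stream names are again my own):

EmployeeSource aggregate(groupBy(FirstName),
    each(match(name != 'FirstName'), $$ = first($$))) ~> DistinctByFirstName

The column pattern matches every column except the group-by column and keeps its first value, so the output contains exactly one row per distinct FirstName.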