In the post Data Flow joins in Azure Data Factory, the Select transformation passed all the columns from the input sources to Blob storage, which served as the sink. However, this is rarely a real-world requirement, especially when there are many input data sources.
In a real-world scenario, we only need to send the useful columns to the sink. This can be achieved using the column mapping feature of the Data Flow. Two types of mapping are supported: Fixed Mapping and Rule-based Mapping.
We will walk through some useful features of the Select transformation that help in real-world implementations. Let's assume a pipeline has already been created as described in Data Flow joins in Azure Data Factory.
Remove Unwanted Columns
When joins are performed across multiple input data sources, all the columns are mapped by default. The Fixed Mapping feature of the Select transformation lets you keep only the required columns and delete the unnecessary mappings.
Once you have decided which columns to remove, select them, click Delete, and publish the changes. Deleted and published mappings can be restored by clicking the Reset option.
If columns are removed in one Select transformation, they will not be available to any downstream joins or Select transformations.
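Conceptually, a fixed mapping is just an explicit list of the columns that survive the Select transformation. A minimal Python sketch of that behavior (the column names here are hypothetical, not taken from the pipeline above):

```python
# Rows produced by the join, keyed by column name (hypothetical columns).
joined_rows = [
    {"EmpId": 1, "Name": "John", "DeptId": 10, "DeptName": "Sales", "AuditFlag": "Y"},
    {"EmpId": 2, "Name": "Maria", "DeptId": 20, "DeptName": "HR", "AuditFlag": "N"},
]

# Fixed mapping: only these columns are passed downstream.
fixed_mapping = ["EmpId", "Name", "DeptName"]

def apply_fixed_mapping(rows, columns):
    """Keep only the mapped columns; everything else is dropped for downstream steps."""
    return [{col: row[col] for col in columns} for row in rows]

sink_rows = apply_fixed_mapping(joined_rows, fixed_mapping)
print(sink_rows[0])  # {'EmpId': 1, 'Name': 'John', 'DeptName': 'Sales'}
```

As in ADF, once a column is dropped here, no later step can see it.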
The names and types of the input and output columns can also be verified from the Inspect tab of the Select transformation.
The Data preview option can be used for verification whenever needed, but Data flow debug must be enabled in the ADF authoring tool. So, let's run the pipeline and see the final output below.
Change Column Names
The Fixed Mapping feature also allows you to rename columns, similar to the [AS] clause in SQL Server. Suppose we want to change the column names in the sink as shown; update the mapped column names and publish the changes.
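Renaming via fixed mapping behaves like a dictionary from source column name to sink column name, much like SQL's AS. A small Python sketch (all names are hypothetical examples):

```python
# Fixed mapping with renames: source column -> sink column (hypothetical names).
rename_mapping = {"EmpId": "EmployeeId", "Name": "EmployeeName", "DeptName": "Department"}

def apply_renames(rows, mapping):
    """Emit only the mapped columns, under their new sink names."""
    return [{new: row[old] for old, new in mapping.items()} for row in rows]

rows = [{"EmpId": 1, "Name": "John", "DeptName": "Sales"}]
print(apply_renames(rows, rename_mapping))
# [{'EmployeeId': 1, 'EmployeeName': 'John', 'Department': 'Sales'}]
```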
Let’s run the pipeline to verify the output in blob storage.
So, we can see that the sink column names have changed as per the mapping configured in the data flow.
Rule-based Mapping
Rule-based Mapping can be used when the columns follow a definite pattern; based on that pattern, all matching columns can be mapped in one go.
Pattern matching is available on the name, position, and type of the columns. By default, rule-based mapping is applied when there are more than 50 input columns. Although we have a small result set, rule-based mapping can still be applied.
Suppose we only want to map, and send to the sink, the string columns whose names are fewer than 8 characters long.
So, click on the second Select transformation, choose Select all, delete the fixed mapping columns, and then add a rule-based mapping. To define it, enter the condition and name expression as shown, or copy and paste the highlighted values into the respective text boxes.
- type == 'string' && length(name) < 8 – the matching condition evaluated for each input column by the rule engine.
- $$ – the output name expression; $$ resolves to the name of each matched column, so matched columns keep their original names.
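The rule engine effectively evaluates the condition against each input column's name and type, and maps every match under the name produced by the name expression ($$ keeps the original name). A Python sketch of that semantics, using a hypothetical schema:

```python
# Hypothetical input schema: column name -> data type.
schema = {
    "EmpName": "string",
    "Department": "string",
    "City": "string",
    "Salary": "integer",
}

def rule_based_mapping(schema, condition):
    """Keep every column whose (name, type) satisfies the rule condition.

    The name expression here is '$$', i.e. each matched column keeps
    its original name."""
    return [name for name, ctype in schema.items() if condition(name, ctype)]

# Equivalent of: type == 'string' && length(name) < 8
condition = lambda name, ctype: ctype == "string" and len(name) < 8

print(rule_based_mapping(schema, condition))  # ['EmpName', 'City']
```

Here "Department" is a string but its name is 10 characters long, so it is dropped along with the non-string "Salary".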
So, let’s publish the changes and run the pipeline to verify the output in blob storage.
We can see that the sink data has changed as per the rule configured in the Select transformation.
There are also options to skip duplicate input and output columns, which are useful when data sources have many columns and duplicate column names are likely due to joins and lookup transformations.
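Skipping duplicates roughly amounts to keeping the first occurrence of each column name when joins or lookups produce name clashes. A rough Python sketch of that behavior (the column lists are hypothetical):

```python
# Columns arriving from a join: both sides contribute an 'Id' column (hypothetical).
input_columns = ["Id", "Name", "Id", "DeptName"]

def skip_duplicates(columns):
    """Keep the first occurrence of each column name and drop later duplicates,
    roughly what the skip-duplicate options do."""
    seen, result = set(), []
    for col in columns:
        if col not in seen:
            seen.add(col)
            result.append(col)
    return result

print(skip_duplicates(input_columns))  # ['Id', 'Name', 'DeptName']
```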
So, we have seen some important features of the Select transformation that will certainly be useful when implementing Data Flows in Azure Data Factory.