Azure Data Factory and its building blocks

Azure Data Factory is a cloud-based ETL (extract, transform, load) and data integration service that can process data from multiple data sources. It helps you create data workflows between a source and a destination data store to establish data movement.

In Azure Data Factory, these workflows between source and destination data stores are built as pipelines.

Azure Data Factory Main Components

  • Pipeline – A pipeline is the data flow channel between the source and destination data stores. A Data Factory instance can have multiple pipelines, each performing the set of work assigned to it. The pipeline is the top-level logical container; the other components work together underneath it to carry out their responsibilities (the first sketch after this list shows one end to end).
  • Activity – An activity is a unit of work performed while processing data from source to destination. The most common is the Copy activity, which copies data; the broader categories are data movement, data transformation, and control activities. An activity uses a linked service and datasets to do its work.
  • Datasets – A dataset represents the data to work on: the complete source data, or a subset of it, that is copied to the destination data store. A dataset needs a data store connector, called a linked service, to retrieve or store the data. A copy pipeline involves two datasets: a source dataset and a sink dataset.
  • Linked Service – A linked service works like a connection string, establishing the connection to a data store. Linked services can connect to databases such as Microsoft SQL Server, Oracle, and MySQL, to Azure Storage services, and to many other stores. Through datasets, linked services handle the retrieving and saving of data during pipeline processing.
  • Triggers – There is nothing special here: triggers simply start the data processing pipeline, which then performs its defined activities. A pipeline can also be run on demand, as the second sketch after this list shows.
  • Parameters – Parameters are configuration values passed in at run time. For example, if a pipeline needs to switch between different databases, this can be achieved by parameterizing the database connection string.
  • Control flow – Control flow orchestrates pipeline activities: activities can be executed in sequence or branched along different paths. The Data Factory UI supports these control activities (see the third sketch after this list).
  • Variables – As in any programming language, variables store static or runtime values inside a pipeline. They can be passed between pipelines, data flows, and so on, and a variable's value can be set with a Set Variable activity inside the pipeline.
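
To make these pieces concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK, modeled on Microsoft's Python quickstart. It creates a linked service, a source and a sink dataset, and a pipeline with a single Copy activity. The resource group, factory name, subscription ID, and connection string are placeholders; treat the exact model names and signatures as assumptions that may vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureStorageLinkedService, BlobSink, BlobSource,
    CopyActivity, DatasetReference, DatasetResource, LinkedServiceReference,
    LinkedServiceResource, PipelineResource,
)

# Placeholder names -- substitute your own resources.
rg, df = "my-resource-group", "my-data-factory"
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Linked service: the connection definition for the data store.
storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."))
client.linked_services.create_or_update(rg, df, "BlobStorageLS", storage_ls)

# Datasets: named views of the data, bound to the linked service.
ls_ref = LinkedServiceReference(type="LinkedServiceReference",
                                reference_name="BlobStorageLS")
client.datasets.create_or_update(rg, df, "InputDS", DatasetResource(
    properties=AzureBlobDataset(linked_service_name=ls_ref,
                                folder_path="input", file_name="data.csv")))
client.datasets.create_or_update(rg, df, "OutputDS", DatasetResource(
    properties=AzureBlobDataset(linked_service_name=ls_ref,
                                folder_path="output")))

# Pipeline: the top-level container, here with a single Copy activity
# that reads from the source dataset and writes to the sink dataset.
copy = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDS")],
    source=BlobSource(), sink=BlobSink())
client.pipelines.create_or_update(rg, df, "CopyPipeline",
                                  PipelineResource(activities=[copy]))
```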
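
Parameters and triggers then control how and when that pipeline runs. The second sketch below reuses client, rg, df, and the Copy activity from the first; it declares a pipeline parameter, starts an on-demand run that supplies a value for it, and attaches a daily schedule trigger. Again, the model names follow the SDK I know from the quickstart and may differ in other versions.

```python
from datetime import datetime, timedelta

from azure.mgmt.datafactory.models import (
    ParameterSpecification, PipelineReference, PipelineResource,
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerPipelineReference,
    TriggerResource,
)

# Declare a run-time parameter on the pipeline. In a real pipeline the
# datasets or activities would reference it with the expression
# @pipeline().parameters.outputFolder; here it is only declared.
client.pipelines.create_or_update(rg, df, "CopyPipeline", PipelineResource(
    activities=[copy],
    parameters={"outputFolder": ParameterSpecification(type="String")}))

# On-demand run: parameter values are passed at execution time.
run = client.pipelines.create_run(rg, df, "CopyPipeline",
                                  parameters={"outputFolder": "output/adhoc"})
print("Started run:", run.run_id)

# Schedule trigger: starts the pipeline once a day.
trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day", interval=1,
        start_time=datetime.utcnow() + timedelta(minutes=5)),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(type="PipelineReference",
                                             reference_name="CopyPipeline"),
        parameters={"outputFolder": "output/daily"})])
client.triggers.create_or_update(rg, df, "DailyTrigger",
                                 TriggerResource(properties=trigger))
client.triggers.begin_start(rg, df, "DailyTrigger").result()  # activate it
```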
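
Finally, a third sketch for control flow and variables, again a hypothetical example reusing client, rg, and df from above. It defines a pipeline variable and chains two Set Variable activities with a "Succeeded" dependency, so the second activity runs only after the first completes successfully.

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency, PipelineResource, SetVariableActivity,
    VariableSpecification,
)

# Two Set Variable activities chained in sequence: the dependency
# condition makes "MarkDone" wait for "MarkStarted" to succeed.
first = SetVariableActivity(name="MarkStarted",
                            variable_name="status", value="started")
second = SetVariableActivity(
    name="MarkDone", variable_name="status", value="done",
    depends_on=[ActivityDependency(activity="MarkStarted",
                                   dependency_conditions=["Succeeded"])])

client.pipelines.create_or_update(rg, df, "ControlFlowDemo", PipelineResource(
    variables={"status": VariableSpecification(type="String")},
    activities=[first, second]))
```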

So, a Data Factory pipeline must contain at least one activity to do any kind of processing.

I hope this gives you a good overview of Azure Data Factory.

If you have any suggestions or feedback, please leave them in the comments.

Happy Learning 🙂
