Azure Data Factory Data Flow Use Case Series 1


Extract nested JSON data from an API and load into Azure Synapse Analytics using Azure Data Factory

In this blog, I will show you, step by step, how to extract nested JSON data from a REST API and load it into Azure Synapse Analytics using Azure Data Factory.

Requirement: Recently, one of my projects required extracting data from a REST API every 30 minutes and loading the updated data into Azure Synapse Analytics as a table. I would like to share one of the methods I used.

Overall high-level pipeline setting

The pipeline consists of three steps: the first step loads the API data into Blob storage as .json files; the second step loads the blob files into Azure Synapse Analytics; the last step implements incremental logic to update the table inside Azure Synapse Analytics and incorporates the API data into downstream ETL steps.

If we load the API data directly into Azure Synapse Analytics, the nested JSON files cannot be mapped correctly to Azure Synapse Analytics tables, so we need the Flatten transformation inside Azure Mapping Data Flows (https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten) to unroll array values into individual rows. The details of setting up this step are covered later in this blog.

Step 1: Loading API data into blob

Because one API call returns more than one page, we need an Until activity to iterate over the pages of the API call.

This iteration requires three variables: PageCount, Counter, and Increment.

We use a Web activity to get the API metadata and assign totalPages to the PageCount variable. Note that we only request the last two days' data from the REST API, since we want to set up an incremental load.

PageCount is assigned the dynamic value @string(activity('Get_RESTAPI_Metadata').output.paginationMetadata.totalPages). This determines how many pages/iterations the Until activity will perform.

In the Until activity settings, we use the expression @less(variables('PageCount'),variables('Counter')) to determine when the iteration stops: the Until activity loops until its expression evaluates to true, i.e., until Counter exceeds PageCount.

Then, inside the Until activity, we have the following setup.

The first activity performs the actual copy from the REST API to Blob storage, one page at a time. After each copy we increase the Counter value by 1 using @string(add(int(variables('Increment')),1)); because a Set Variable activity cannot reference the variable it is setting, the helper variable Increment holds the intermediate value.
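The Until loop above can be sketched in plain Python to show the control flow. The `fetch_page()` helper and its URL are hypothetical stand-ins for the Copy activity's REST call; the variable names mirror the pipeline variables described above.

```python
# Sketch of the Until-activity pagination logic in plain Python.
# fetch_page() is a hypothetical stand-in for the Copy activity's REST call.
import json

def fetch_page(page: int) -> dict:
    """Stand-in for the REST call, e.g. GET https://api.example.com/data?page={page}."""
    return {"page": page, "items": [{"id": page * 10}]}

def run_until_loop(total_pages: int) -> list:
    """Mimic the Until activity: copy one page per iteration."""
    page_count = total_pages   # PageCount variable
    counter = 1                # Counter variable, starts at 1
    saved_files = []
    # Until condition: loop stops once less(PageCount, Counter) is true
    while not (page_count < counter):
        body = fetch_page(counter)            # Copy activity: REST API -> Blob
        saved_files.append(json.dumps(body))  # one .json blob per page
        counter = counter + 1                 # Increment/Counter bump
    return saved_files

files = run_until_loop(3)
print(len(files))  # prints 3: one blob file per page
```

With PageCount = 3 the loop runs for Counter = 1, 2, 3 and stops at 4, matching the @less(variables('PageCount'),variables('Counter')) termination condition.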

Step 2: Transform json files from blob and load into Azure Synapse Analytics stage table

After the Until activity completes, the JSON files for all pages of the API call have been loaded into Blob storage. Next, we transform those files and load them into a table inside Azure Synapse Analytics. For this transformation, we use the Flatten transformation.

The Flatten transformation was introduced on 2020-Mar-19 to the existing set of powerful transformations available in Azure Data Factory (ADF) Mapping Data Flows – https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten.

Here is the setting:

The Flatten transformation is powerful yet very simple to use. All we need to do is select the array we want to unroll from the drop-down list and choose the Unroll root. By default, the Flatten transformation unrolls an array to the top of the hierarchy it exists in.
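To make the "unroll" behavior concrete, here is a minimal Python sketch of what flattening does to a nested record: each element of the chosen array becomes its own row, carrying the parent fields with it. The field names (`customer`, `orders`, `sku`) are purely illustrative, not from the API in this post.

```python
# Minimal sketch of what the Flatten transformation does: unroll a nested
# array into one row per element. Field names here are illustrative only.

def flatten(records, unroll_by):
    """Unroll the array at `unroll_by`, copying parent fields onto each row."""
    rows = []
    for rec in records:
        parent = {k: v for k, v in rec.items() if k != unroll_by}
        for item in rec.get(unroll_by, []):
            row = dict(parent)  # keep top-level fields on every output row
            row.update(item)    # merge in the array element's fields
            rows.append(row)
    return rows

data = [{"customer": "A", "orders": [{"sku": 1}, {"sku": 2}]},
        {"customer": "B", "orders": [{"sku": 3}]}]
print(flatten(data, "orders"))
# [{'customer': 'A', 'sku': 1}, {'customer': 'A', 'sku': 2}, {'customer': 'B', 'sku': 3}]
```

Two nested records become three flat rows, which is exactly the shape a Synapse table can accept.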

Step 3: Updating table inside Azure Synapse Analytics

Since we load only the last two days' data, we need to append to and update the full stage table. One of the methods I used is a simple stored procedure, as follows:
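The stored procedure itself is not reproduced here, but the incremental pattern it describes, replacing the rows that fall inside the two-day window and appending the fresh extract, can be sketched as follows. The table and column names are assumptions, and SQLite stands in for Synapse so the logic is runnable.

```python
# Hedged sketch of the delete-then-insert incremental pattern, using SQLite
# as a stand-in for Azure Synapse Analytics. Table/column names are assumed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage (id INTEGER PRIMARY KEY, load_date TEXT, val TEXT)")
conn.execute("CREATE TABLE incoming (id INTEGER, load_date TEXT, val TEXT)")

# Existing stage table: an old row plus a row inside the 2-day window that changed.
conn.executemany("INSERT INTO stage VALUES (?,?,?)",
                 [(1, "2020-03-01", "old"), (2, "2020-03-19", "stale")])
# Incoming last-2-days extract: the refreshed row plus a brand-new one.
conn.executemany("INSERT INTO incoming VALUES (?,?,?)",
                 [(2, "2020-03-19", "fresh"), (3, "2020-03-20", "new")])

# Step 1: delete stage rows that fall inside the incremental window.
conn.execute("DELETE FROM stage WHERE load_date >= '2020-03-19'")
# Step 2: append the freshly extracted rows.
conn.execute("INSERT INTO stage SELECT * FROM incoming")

rows = conn.execute("SELECT id, val FROM stage ORDER BY id").fetchall()
print(rows)  # [(1, 'old'), (2, 'fresh'), (3, 'new')]
```

The same two statements, wrapped in a stored procedure, keep the stage table current without reloading history on every 30-minute run.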

Finally, this stage table can be used for the downstream ETL. 

Please contact me with any questions; comments and suggestions are welcome.

Bo Huang

About

I am a Senior Cloud Data/AI Solutions Consultant and Architect with expertise in Business Intelligence, Data Modeling, Advanced Analytics, AI/ML, and Low-code/no-code app development. I have more than 14 years of experience in data collection, analytics, modeling, interpretation, and visualization to support business operations. I am well versed in the implementation of the full life cycle of traditional BI as well as Advanced Analytics projects: requirements gathering, design thinking, development, maintenance of data warehouses, ETL, dimensional modeling, AI/ML model training and deployment, and data visualization. I have a master's degree in computer information systems, which gave me the foundation to be a good technology consultant. I was awarded Microsoft Certified Solutions Expert (MCSE): Data Management and Analytics, with several certifications on AI/ML from Microsoft as well as from Google Cloud Platform. Throughout my career as a technology consultant, I have been recognized for my work by my employers as well as my clients. I have received several prestigious awards, such as the Superhero Award from Avanade in 2019 and the Rock Star Award from Accenture in 2018.

