Extract nested JSON data from an API and load into Azure Synapse Analytics using Azure Data Factory


Azure Data Factory Use Case Series – 1

In this blog, I will show you step by step how to extract nested JSON data from a REST API and load it into Azure Synapse Analytics using Azure Data Factory.

Requirement: One of my recent projects required extracting data from a REST API every 30 minutes to pick up updated records and loading it into Azure Synapse Analytics as a table. I would like to share one of the methods I used.

Overall high-level pipeline setting

This includes three steps: first, load the API data into Blob storage as .json files; second, transform the Blob files and load them into Azure Synapse Analytics; finally, implement incremental logic to update the table inside Azure Synapse Analytics and incorporate the API data into downstream ETL steps.

If we load the API data directly into Azure Synapse Analytics, the nested JSON files cannot be correctly mapped to Azure Synapse Analytics tables, so we need to use the Flatten transformation inside Azure Mapping Data Flows (https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten) to unroll array values into individual rows. The details of setting up this step are covered later in this blog.
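As a hypothetical illustration of the problem (the field names under `items` are invented for this example), a single API page might look like this:

```json
{
  "paginationMetadata": { "totalPages": 12, "pageNumber": 1 },
  "items": [
    { "orderId": 101, "sku": "A-100", "qty": 2 },
    { "orderId": 102, "sku": "B-200", "qty": 1 }
  ]
}
```

A tabular sink expects one row per item, but the items live inside an array. Flatten unrolls the `items` array so that each element becomes its own row with columns `orderId`, `sku`, and `qty`.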

Step 1: Loading API data into blob

Because one API call returns more than one page, we need an Until activity to iterate through the pages of the API call.

To drive this iteration, we need three variables: PageCount, Counter, and Increment.

We use a Web activity to get the API metadata and assign the total number of pages to the PageCount variable. Note that we only request the last 2 days' data from the REST API, since we want to set up an incremental load.
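As an illustration only (the endpoint and query parameters here are purely hypothetical; only the ADF expression functions are real), a dynamic URL that requests the last two days of data could be built like this:

```
@concat('https://api.example.com/v1/records',
    '?modifiedSince=', formatDateTime(addDays(utcnow(), -2), 'yyyy-MM-dd'),
    '&pageSize=100')
```

The Copy activity inside the loop would then append a page parameter such as `&page=@{variables('Counter')}` to fetch one page per iteration.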

PageCount is given the dynamic value @string(activity('Get_RESTAPI_Metadata').output.paginationMetadata.totalPages). This determines how many pages/iterations the Until activity will perform.

In the Until activity settings, we use the expression @less(int(variables('PageCount')), int(variables('Counter'))) to determine when the iteration stops (ADF pipeline variables are strings, so we cast them to integers to get a numeric rather than lexicographic comparison).
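A trimmed sketch of the Until activity definition (the activity name is hypothetical, and the inner activities from the next section would go inside the `activities` array):

```json
{
  "name": "Until_LastPage",
  "type": "Until",
  "typeProperties": {
    "expression": {
      "value": "@less(int(variables('PageCount')), int(variables('Counter')))",
      "type": "Expression"
    },
    "timeout": "0.01:00:00",
    "activities": []
  }
}
```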

Then inside the Until activity, we have the following setting.

The first activity performs the actual copy from the REST API to Blob storage, one page at a time. We then increase the Counter value by 1 after each copy using @string(add(int(variables('Increment')), 1)). (The helper variable Increment exists because a Set Variable activity cannot reference the variable it is assigning.)
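Because ADF forbids a Set Variable activity from referencing its own target variable, the increment is done in two steps: Counter is computed from Increment, and Increment is then synced back to Counter. A sketch of the two Set Variable activities (activity names are illustrative):

```json
[
  {
    "name": "Set_Counter",
    "type": "SetVariable",
    "typeProperties": {
      "variableName": "Counter",
      "value": { "value": "@string(add(int(variables('Increment')), 1))", "type": "Expression" }
    }
  },
  {
    "name": "Sync_Increment",
    "type": "SetVariable",
    "typeProperties": {
      "variableName": "Increment",
      "value": { "value": "@variables('Counter')", "type": "Expression" }
    }
  }
]
```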

Step 2: Transform json files from blob and load into Azure Synapse Analytics stage table

After the Until activity, the JSON files from all pages of the API call have been loaded into Blob storage. Next, we transform those files and load them into a table inside Azure Synapse Analytics. For this transformation, we use the Flatten transformation.

The Flatten transformation was introduced on 2020-Mar-19 into the existing set of powerful transformations available in Azure Data Factory (ADF) Mapping Data Flows – https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten.

Here is the setting:

The Flatten transformation is powerful but very simple to use. All we need to do is select the array to unroll from the drop-down list, along with the Unroll root. By default, the Flatten transformation unrolls an array to the top of the hierarchy it exists in.
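Under the hood, the data flow script generated for such a Flatten looks roughly like this (the source name, array name, and column names are hypothetical; see the docs link above for the exact `foldDown` syntax):

```
source1 foldDown(unroll(items),
    mapColumn(
        orderId = items.orderId,
        sku = items.sku,
        qty = items.qty
    ),
    skipDuplicateMapInputs: false,
    skipDuplicateMapOutputs: false) ~> FlattenItems
```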

Step 3: Updating table inside Azure Synapse Analytics

Since we only load the last two days' data, we need to append to and update the full stage table. One of the methods I used is a simple stored procedure, as follows:
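The procedure itself is not reproduced here, but a hedged sketch of the upsert logic would look like this. The table names DBO.STG_INC (the two-day incremental load) and DBO.STG_FULL (the full stage table) come from the discussion below; the key column `Id` is hypothetical and should be replaced with the table's real business key:

```sql
-- Sketch only: refresh DBO.STG_FULL from the two-day incremental table DBO.STG_INC.
CREATE PROCEDURE DBO.USP_UPDATE_STG_FULL
AS
BEGIN
    -- Drop rows that re-arrived in the incremental load...
    DELETE FROM DBO.STG_FULL
    WHERE Id IN (SELECT Id FROM DBO.STG_INC);

    -- ...then append the fresh versions.
    INSERT INTO DBO.STG_FULL
    SELECT * FROM DBO.STG_INC;
END
```

A delete-then-insert pattern like this is a common alternative to MERGE on Synapse dedicated SQL pools, where MERGE support arrived later.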

Finally, this stage table can be used for the downstream ETL. 

Please contact me with any questions; comments and suggestions are welcome.

Bo Huang

About

I am a Senior Cloud Data/AI solutions Consultant and Architect with expertise in Business Intelligence, Data Modeling, Advanced Analytics, AI/ML, and Low-code/no-code app development. I have more than 14 years of experience in data collection, analytics, modeling, interpretation, and visualization to support business operations. I am well versed in the implementation of the full life cycle of traditional BI as well as Advanced Analytics projects: requirements gathering, design thinking, development, maintenance of data warehouses, ETL, dimensional modeling, AI/ML model training and deployment, and data visualization. I have a master's degree in computer information systems, which gave me the foundation to be a good technology consultant. I was awarded Microsoft Certified Solutions Expert (MCSE): Data Management and Analytics, with several certifications in AI/ML from Microsoft as well as from Google Cloud Platform. Throughout my career as a technology consultant, I have been recognized for my work by my employers as well as my clients. I have received several prestigious awards, such as the Superhero Award from Avanade in 2019 and the Rock Star Award from Accenture in 2018.

More on Bo Huang.


  1. Leonardo Araujo September 29, 2021 Reply

    Hi Bo Huang, thank you for sharing !

    I have three questions regarding your ETL.

    First, in step 1, I didn't find any mention of the Delete activity (Delete 1). If I understand correctly, its function isn't described. What does the Delete activity do?

    Second, in step 2, in the Azure Synapse sink, which table are you sinking into? DBO.STG_FULL or DBO.STG_INC?

    Finally, on the last step, “PL_ETL”. What does the pipeline PL_ETL do?

    Best regards,
    Leonardo

    • Bo Huang October 1, 2021 Reply

      Hi Leonardo,
      Thanks for your comments. To answer your question:
      1) The Delete activity deletes all files in Blob storage from the previous API call, like TRUNCATE and LOAD on a table.
      2) In the sink, we load data into table DBO.STG_INC (this has just the last 2 days' data), and then use the DBO.STG_INC table to update the full stage table DBO.STG_FULL.
      3) PL_ETL is for the downstream ETL steps to incorporate this REST API data with other tables to create, for example, some dimension and/or fact tables.
      Let me know if you have any more comments and suggestions.

  2. Lucian Howard November 20, 2021 Reply

    Hi Bo Huang,

    I also have a few questions regarding this setup

    I am rather new to ADF and I need to understand when you parameterize your url.

    Can you give an example of what the Get Metadata URL should look like in the end? I get confused where your base URL is blue on the second line – {utcnow()}&…… what is supposed to be there? I would appreciate it if you could share an example of what the final URL string looks like – obviously a dummy URL.

    Thanks
    L