
Extract nested JSON data from an API and load into Azure Synapse Analytics using Azure Data Factory

Sogeti Labs
October 15, 2021

Azure Data Factory Use Case Series – 1

In this blog, I will show you, step by step, how to extract nested JSON data from a REST API and load it into Azure Synapse Analytics using Azure Data Factory.

Requirement: In a recent project, I needed to extract data from a REST API every 30 minutes to pick up updated data and load it into Azure Synapse Analytics as a table. I would like to share one of the methods I used.

Overall high-level pipeline setup

This includes three steps: the first step loads the API data into Blob storage as .json files; the second step loads the blob files into Azure Synapse Analytics (ADW); the last step implements the incremental logic for updating the table inside Azure Synapse Analytics and incorporates the API data into the downstream ETL steps.
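The original post showed the full pipeline as a screenshot. As a rough sketch of how these stages could be laid out in the pipeline JSON (the pipeline and activity names are illustrative, apart from Get_RESTAPI_Metadata and PL_ETL, which appear later in this post and in the comments; the Delete activity that clears the previous files is mentioned in the comments; most properties are omitted):

    {
      "name": "PL_API_to_Synapse",
      "properties": {
        "activities": [
          { "name": "Delete_Previous_Blob_Files", "type": "Delete" },
          { "name": "Get_RESTAPI_Metadata", "type": "WebActivity" },
          { "name": "Set_PageCount", "type": "SetVariable" },
          { "name": "Until_All_Pages_Copied", "type": "Until" },
          { "name": "Flatten_And_Load_Stage", "type": "ExecuteDataFlow" },
          { "name": "Update_Full_Stage_Table", "type": "SqlServerStoredProcedure" },
          { "name": "PL_ETL", "type": "ExecutePipeline" }
        ]
      }
    }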

If we load the API data directly into Azure Synapse Analytics, the nested JSON files cannot be correctly mapped to Azure Synapse Analytics tables, so we need to use the Flatten transformation inside ADF Mapping Data Flows (https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten) to unroll array values into individual rows. The details of setting up this step are covered later in this blog.

Step 1: Loading API data into blob

Because one API call can return more than one page, we need to use an Until activity to iterate over the pages of the API call.

To drive this iteration, we need three variables: PageCount, Counter, and Increment.

We use a Web activity to get the API metadata and assign totalPages to the PageCount variable. Please note that we only request the last two days' data from the REST API, since we want to set up an incremental load.
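For illustration, the URL for these API calls could be built with a dynamic expression along the following lines. The parameter names (DateRangeStart, DateRangeEnd, pagesize) match those discussed in the comments below; the exact query-string layout of the real API is not shown in this post, so treat this as a sketch:

    @concat('https://{your api url}/',
        'DateRangeStart=', adddays(utcnow(), -2),
        '&DateRangeEnd=', utcnow(),
        '&pagesize=500')

Inside the Until activity, the Copy activity's URL would additionally append the current page number from the Counter variable (the actual paging parameter name is not shown in the post).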

PageCount is given a dynamic value: @string(activity('Get_RESTAPI_Metadata').output.paginationMetadata.totalPages). Basically, this sets up how many pages/iterations the Until activity will perform.
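A rough sketch of the corresponding variable declarations and the Set Variable activity (the types and default values are assumptions; ADF pipeline variables are strings here):

    "variables": {
        "PageCount": { "type": "String", "defaultValue": "1" },
        "Counter":   { "type": "String", "defaultValue": "1" },
        "Increment": { "type": "String", "defaultValue": "1" }
    }

    {
        "name": "Set_PageCount",
        "type": "SetVariable",
        "dependsOn": [ { "activity": "Get_RESTAPI_Metadata", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
            "variableName": "PageCount",
            "value": {
                "value": "@string(activity('Get_RESTAPI_Metadata').output.paginationMetadata.totalPages)",
                "type": "Expression"
            }
        }
    }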

In the Until activity settings, we use the expression @less(variables('PageCount'),variables('Counter')) to determine when the iteration will stop (the Until activity keeps looping until its expression evaluates to true).

Then inside the Until activity, we have the following setup.

The first activity is the actual Copy Data step from the REST API to Blob storage, one page at a time. Then we increase the Counter value by 1 after each copy using @string(add(int(variables('Increment')),1)). (A Set Variable activity cannot reference the variable it is assigning, which is why the helper variable Increment is needed; see the sketch below.)
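A simplified sketch of the Until activity and this counter-increment pattern follows. The Copy activity details and dataset references are omitted, and the int() casts in the Until expression are an addition here, because pipeline variables are strings and less() would otherwise compare them as text:

    {
        "name": "Until_All_Pages_Copied",
        "type": "Until",
        "typeProperties": {
            "expression": {
                "value": "@less(int(variables('PageCount')), int(variables('Counter')))",
                "type": "Expression"
            },
            "activities": [
                { "name": "Copy_Page_To_Blob", "type": "Copy" },
                {
                    "name": "Set_Counter",
                    "type": "SetVariable",
                    "dependsOn": [ { "activity": "Copy_Page_To_Blob", "dependencyConditions": [ "Succeeded" ] } ],
                    "typeProperties": {
                        "variableName": "Counter",
                        "value": { "value": "@string(add(int(variables('Increment')), 1))", "type": "Expression" }
                    }
                },
                {
                    "name": "Set_Increment",
                    "type": "SetVariable",
                    "dependsOn": [ { "activity": "Set_Counter", "dependencyConditions": [ "Succeeded" ] } ],
                    "typeProperties": {
                        "variableName": "Increment",
                        "value": { "value": "@variables('Counter')", "type": "Expression" }
                    }
                }
            ]
        }
    }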

Step 2: Transform JSON files from blob and load into an Azure Synapse Analytics stage table

After the Until activity, the JSON files from all the pages of the API call have been loaded into Blob storage. Next, we transform those files and load them into a table inside Azure Synapse Analytics. For this transformation, we use the Flatten transformation.

The Flatten transformation was introduced on 2020-Mar-19 to the existing set of powerful transformations available in Azure Data Factory (ADF) Mapping Data Flows – https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten.

Here is the setting:

The Flatten transformation is powerful but very simple to use. We just need to pick the array we want to unroll by selecting it in the drop-down list, and set the Unroll root if needed. By default, the Flatten transformation unrolls an array to the top of the hierarchy it exists in.

Step 3: Updating table inside Azure Synapse Analytics

Since we only load the last two days' data, we need to append to and update the full stage table. One of the methods I used is a simple stored procedure, as follows:
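The procedure itself was shown as a screenshot in the original post. Below is a minimal sketch of what such a procedure could look like, assuming the flattened data lands in DBO.STG_INC and the full table is DBO.STG_FULL (as described in the comments), with a placeholder Id column standing in for the real business key and the procedure name chosen for illustration:

    CREATE PROCEDURE DBO.USP_UPDATE_STG_FULL
    AS
    BEGIN
        -- Remove rows in the full stage table that are superseded by the incremental load
        DELETE FROM DBO.STG_FULL
        WHERE EXISTS (SELECT 1
                      FROM DBO.STG_INC
                      WHERE DBO.STG_INC.Id = DBO.STG_FULL.Id);

        -- Append the last two days' rows from the incremental stage table
        INSERT INTO DBO.STG_FULL
        SELECT *
        FROM DBO.STG_INC;
    END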

Finally, this stage table can be used for the downstream ETL. 

Please contact me with any questions; comments and suggestions are welcome.

About the author

SogetiLabs gathers distinguished technology leaders from around the Sogeti world. It is an initiative explaining not how IT works, but what IT means for business.

    Comments

    4 thoughts on “Extract nested JSON data from an API and load into Azure Synapse Analytics using Azure Data Factory”

    1. Hi Bo Huang, thank you for sharing!
      I have three questions regarding your ETL.
      First, on step 1, I didn’t find any mention of the Delete activity (Delete 1). Please let me know if I have understood correctly, but I couldn’t work out what the Delete activity is for. What does the Delete activity do?
      Second, on step 2, in the Azure Synapse sink, which table are you sinking into? DBO.STG_FULL or DBO.STG_INC?
      Finally, on the last step, “PL_ETL”: what does the pipeline PL_ETL do?
      Best regards,
      Leonardo

      1. Hi Leonardo,
        Thanks for your comments. To answer your question:
        1) The Delete activity deletes all files in Blob storage from the previous API call, similar to a TRUNCATE and LOAD of a table.
        2) In the sink, we load data into the table DBO.STG_INC (this has just the last 2 days’ data), and then use DBO.STG_INC to update the full stage table DBO.STG_FULL.
        3) PL_ETL is for the downstream ETL steps to incorporate this REST API data with other tables to create, for example, some dimension and/or fact tables.
        Let me know if you have any more comments and suggestions.

    2. Hi Bo Huang,
      I also have a few questions regarding this setup
      I am rather new to ADF and I need to understand how you parameterize your URL.
      Can you give an example of what the Get Metadata URL should look like in the end? I get confused where your base URL is covered in blue on the second line, {utcnow()}&…………… – what’s supposed to be here? I would appreciate it if you could share an example of what the final URL string looks like (obviously a dummy URL).
      Thanks
      L

      1. Hi Lucian,
        Sorry for the late response, but I just saw this comment.
        The part covered by the blue bar in the third line of the URL is just another parameter for filtering (you can think of it as business line, geographic location, etc.).
        It serves the same function as the other parameters (DateRangeStart, pagesize, etc.).
        If we run the pipeline today, 2021-12-15, the final URL will look like this (if we ignore that blue bar parameter):
        https://{your api url}/DateRangeStart=2021-12-13&DateRangeEnd=2021-12-15&pagesize=500
        Above I simplified the values for DateRangeStart and DateRangeEnd, since we are actually using utcnow(), which contains not only the date but also the time in UTC.
