In Advanced Analytics, the possibility of covering data analysis in real time is more in demand. In the case of the Microsoft cloud, we have several tools that allow us to work in that direction. Today we are going to make a brief summary about it, and I will comment on some additional capabilities that are not normally covered such as the importance of defining the type of Windowing, the possibility of building your own UDFs (User Define Functions), as well as the use of functions by default they exist to handle anomaly detection.
In Azure, the current options for working with data in real time are:
HDInsight with Spark Streaming
HDInsight with Storm
Apache Spark in Azure Databricks
Azure Stream Analytics
Focusing on Stream Analytics, we see that within this umbrella we can work with different components, both as data sources, as storage or visualization parts, in addition to the connection with Azure ML in case we want to progress in the use of predictive models. Comment that as Reference Data there are two possibilities, connect to Blob Storage or Azure SQL Database.
But let’s move on, as I have previously commented, in this article we will talk about Windowing and in subsequent entries we will delve into UDF and anomaly detection. These options allow us to cover aspects that are not usually discussed in Azure Stream Analytics presentations.
What is Windowing?
It is the possibility that Azure offers to cover the requirement of creating subsets of data within the Stream based on the timestamp, with the purpose of performing operations such as COUNT, AVG, etc.
As types of Windowing we have four different possibilities:
Let’s look at each of the cases to better understand the best use case for each.
In this case, a subset is created using the TumblingWindow (‘time unit’, integer) function. Example:
SELECT COUNT (*) FROM Input GROUP BY Tumbling Window (second, 10)
In the diagram we see how the result of the query would look using this first type of Window. We would create a subset of the Stream every 10 seconds, we would count the number of events in the different subsets returning the result. There is no overlap here, so an event only belongs to one window. Single: 3, 1, 2
In this case, an example would be to obtain the temperature averages for a data set that covers five seconds.
In the second case, the subset that is created is based on the two parameters that are passed to the function. These are the size of the window and in the second case, the starting value of each subset, so that the same event can be found in several windows.
SELECT COUNT (*) FROM Input GROUP BY Hopping Window (second, 10, 5)
In the diagram we see how the result of the query would be. Single: 3, 3, 2, 2, 2
In this case, an example would be to obtain the number of events that occurred in the last 10 seconds, when the new counts should appear every 5 seconds. Hence the windows overlap.
NOTE: What would happen if we passed (Second, 10, 10)? Well, we would get the same window as using the previous Tumbling Window function.
In the case of Sliding Window, we pass the value of the window size and the time unit to the function, as in the first option. However, the difference between them is that the size of the window is covered from the event that occurred.
SELECT COUNT (*) FROM Input GROUP BY Sliding Window (second, 10)
That is, it goes backwards, so it has at least one event and these events can belong to more than one window. Result: 2, 2, 2, 1
This would be a clear example in case we wanted to obtain the total result of events that occurred in the last 10 seconds.
In this case, the values that are passed to the function are those of the unit of measure, as well as the wait time and the value of the maximum size of the window.
SELECT COUNT (*) FROM Input GROUP BY Session Window (second, 5, 10)
In other words, in this case, when an event appears, it begins to count the waiting seconds. If a new event arrives before the expiration, the account is started again. In the example we see that after the second event, the 5 seconds wait is exceeded, which closes the window. As we see this closure occurs at the value of 12, this is because the windows are calculated based on the maximum value of time set. That is, in the value 10, 20, 30, etc. That is why when the window of the second set is closed, it does so when it reaches the value of 30. Result: 2, 5
This would be a clear example in case we wanted to obtain the total result of events that occurred with a difference of 5 seconds between them.
In the latter case, the total of events that happen in the same time stamp is obtained. For them the function System. Timestamp () is used
SELECT COUNT (*) FROM Input GROUP BY System. Timestamp ()
This case is the easiest to understand, since it performs a typical grouping of values.
The possibility of configuring exactly the values of your windows, as well as defining the behavior you want to use to obtain the expected result, allows us to better handle the potential of Azure Stream Analytics. It allows us to better understand the tools with which we work in Advanced Analytics and provide the user with the correct approach for their developments.
More information in:
Reza Salehi, “Building Streaming Data Pipelines in Microsoft Azure.” June 2020.
Cover photo: Serpstat at Pexels