When processing real-time events, one need stands out: the ability to aggregate them as quickly and simply as possible. Azure Stream Analytics, through its windowing functions, provides five ways to aggregate events for better data analysis and, consequently, better decision making in your business. Understanding how to use them is essential.
This is also a key topic in some Azure certification exams, and understanding it well will help you get more questions right. Along the way, some tips with analogies will help you memorize and quickly recall the definition of each function.
Tumbling Window – the Train
TumblingWindow( timeunit, windowsize, [offset] )
“How many pizza orders did I get every 10 minutes?”
The tumbling window is the best starting point for understanding what a windowing function is, and it is also the most commonly used one. A tumbling window aggregates events into fixed-size, contiguous, non-overlapping time windows.
Analogy tip: Imagine a train whose wagons are all the same size and coupled to each other with no space in between. That is what a tumbling window looks like. Every event that happens in the time frame falls inside one, and only one, window.
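The "one, and only one, window" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how Stream Analytics is implemented: the order timestamps are made up, and the sketch assumes that windows are aligned to midnight and that an event exactly on a boundary belongs to the window that ends there.

```python
from collections import Counter
from datetime import datetime, timedelta


def tumbling_window_end(ts: datetime, size: timedelta) -> datetime:
    """Return the end of the single tumbling window that contains ts.

    Assumption: windows are aligned to midnight and cover (start, end].
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # Ceiling division: how many whole windows are needed to reach ts.
    windows = -(-elapsed // size)
    return midnight + windows * size


# Hypothetical pizza orders on a single day.
day = datetime(2024, 1, 1)
orders = [day.replace(hour=12, minute=m) for m in (1, 4, 9, 10, 13, 27)]

# Each order is counted in exactly one 10-minute window.
counts = Counter(tumbling_window_end(ts, timedelta(minutes=10)) for ts in orders)
for window_end, total in sorted(counts.items()):
    print(window_end.strftime("%H:%M"), total)
```

With these made-up orders, the windows ending at 12:10, 12:20 and 12:30 hold 4, 1 and 1 orders, and the counts add up to the total number of orders because no wagon overlaps another.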
Hopping Window – the Domino
HoppingWindow( timeunit, windowsize, hopsize, [offset] )
“Every 10 minutes, give me the number of pizza orders I got in the last 30 minutes”
The hopping window is similar to the tumbling window, except that the windows do not have to be contiguous. It aggregates events in a fixed-size time window, but you choose a separate interval (the hop size) at which that information is updated.
Another example: “Every 20 minutes, give me the number of pizza orders I got in the last 10 minutes”.
Analogy tip: Have you ever seen those extremely satisfying YouTube videos of domino pieces falling? They are a good analogy for a hopping window! Each window has the same size, like a domino piece, and the windows can overlap (like at the end of the videos, when the pieces lie on top of each other) or stand apart (like at the beginning of the videos, when they are ready to fall). So, if the windows overlap, every event is captured and can potentially be in more than one window. If they are separated, some events can potentially be left out of the final reported windows.
Note: If windowsize and hopsize are equal, the hopping window works exactly like a tumbling window.
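To see how one event can land in several overlapping windows, here is a small Python sketch of the idea. It is a simplified model, not the Stream Analytics implementation: it assumes windows end at multiples of the hop size counted from midnight and cover (end - windowsize, end], and the timestamp is made up.

```python
from datetime import datetime, timedelta


def hopping_window_ends(ts: datetime, size: timedelta, hop: timedelta) -> list:
    """Return the end of every hopping window that contains ts.

    Assumption: a window ends at each multiple of hop since midnight
    and covers (end - size, end].
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # First window end at or after ts (ceiling division on the hop size).
    end = midnight + (-(-elapsed // hop)) * hop
    ends = []
    while end - size < ts:  # the window still starts before the event
        ends.append(end)
        end += hop
    return ends


order = datetime(2024, 1, 1, 12, 4)  # hypothetical pizza order

# "Every 10 minutes, the orders of the last 30 minutes": three windows.
overlapping = hopping_window_ends(order, timedelta(minutes=30), timedelta(minutes=10))
print([end.strftime("%H:%M") for end in overlapping])  # ['12:10', '12:20', '12:30']

# Same window size and hop size: exactly one window, like a tumbling window.
same_size = hopping_window_ends(order, timedelta(minutes=10), timedelta(minutes=10))
print([end.strftime("%H:%M") for end in same_size])  # ['12:10']
```

The second call shows the note above in action: when the hop equals the window size, each event belongs to a single window.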
Sliding Window – the Candidates
SlidingWindow( timeunit, windowsize )
“Tell me whenever I got 3 or more pizza orders in under 10 minutes”
This is the trickiest one to understand. You still have a fixed-size time window, but now you do not decide when it starts or when it ends. Instead, Stream Analytics considers two windows for each event: the one that starts exactly at the event time and the one that ends exactly at the event time.
If you are working with millions of events, that means twice as many windows, which can be a very large number. That’s why sliding windows are usually combined with a filter that excludes the windows that are not relevant to you.
Analogy tip: Whenever an event is found in the time frame, Stream Analytics treats it as the end of what I call a “candidate window”. It then checks whether there are three or more events in the past 10 minutes. If the answer is yes, the window is valid; if the answer is no, the window is discarded. It applies the same rule to the “candidate window” that starts at the event time, checking whether there are three or more events in the next 10 minutes. So each event is part of at least two “candidate windows”, but those can be discarded, leaving the event out of the final reported windows. A “candidate window” is also discarded if it contains the same events as a previous valid window (to avoid duplicate windows).
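The “candidate window” idea can be sketched in Python as follows. This is a deliberately simplified model: it only looks at the candidate windows that end at an event time, it assumes each window covers (end - windowsize, end], and the order timestamps are made up.

```python
from datetime import datetime, timedelta


def busy_moments(timestamps, size: timedelta, threshold: int) -> list:
    """Return (window_end, count) for every candidate window that passes the filter.

    Simplification: only candidate windows that END at an event time are
    evaluated, each covering (end - size, end].
    """
    ordered = sorted(timestamps)
    results = []
    for end in sorted(set(ordered)):
        # Count the events that fall inside the window ending at this event.
        count = sum(1 for ts in ordered if end - size < ts <= end)
        if count >= threshold:  # the filter that discards irrelevant windows
            results.append((end, count))
    return results


day = datetime(2024, 1, 1)
orders = [day.replace(hour=12, minute=m) for m in (0, 3, 7, 25, 40, 41, 49)]

# "Tell me whenever I got 3 or more pizza orders in under 10 minutes"
for window_end, count in busy_moments(orders, timedelta(minutes=10), 3):
    print(window_end.strftime("%H:%M"), count)
```

With these made-up orders, only the candidate windows ending at 12:07 and 12:49 survive the filter, each holding three orders. The lonely order at 12:25 belongs to candidate windows too, but all of them are discarded.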
Session Window – the… Session
SessionWindow(timeunit, timeoutSize, maxDurationSize) [OVER (PARTITION BY partitionKey)]
“Give me the average of pizza orders that occur within 5 minutes of each other, with a maximum window size of 20 minutes”
A session window is the only type in which the windows can differ in size from one another. A session starts when an event happens, or immediately after the previous session is ended by the max duration. (If the previous session ended by timeout, the next one waits for a new event.) The session extends whenever another event occurs, and it ends when no event occurs within the timeout or when the max duration is exceeded, whichever comes first.
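The timeout part of this behavior can be sketched in Python. The sketch below ignores the max duration entirely and assumes that a gap exactly equal to the timeout still extends the session; both are simplifications made for illustration, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def sessions_by_timeout(timestamps, timeout: timedelta) -> list:
    """Group events into sessions using only the timeout rule.

    Simplifications: the max duration is ignored, and a gap exactly equal
    to the timeout is assumed to extend the session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)  # close enough: the session extends
        else:
            sessions.append([ts])  # too late: a new session starts
    return sessions


day = datetime(2024, 1, 1)
orders = [day.replace(hour=12, minute=m) for m in (0, 3, 7, 20, 22, 40)]

for session in sessions_by_timeout(orders, timedelta(minutes=5)):
    print(session[0].strftime("%H:%M"), "to", session[-1].strftime("%H:%M"), len(session))
```

With a 5-minute timeout, these made-up orders form three sessions of different lengths, holding 3, 2 and 1 orders.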
The timeout concept is intuitive, but the max duration is a little tricky. The session does not end the moment the max duration is reached. Instead, Stream Analytics checks at regular intervals whether the window size is still smaller than the max duration. If it is, the session window continues until the next check (as long as new events keep occurring). If it is not, the window size has reached or exceeded the max duration, so the session ends and another one starts instantly.
And what is that interval? It is the max duration itself, measured against the time of day rather than against the start of each session. So, if the max duration is set to 10 minutes and your events start around 12:00 pm, the checks happen at 12:10 pm, 12:20 pm, 12:30 pm, and so on, no matter when the session window starts. This means the real maximum duration of a session window is twice the max duration minus 1 microsecond.
If this example appears in a certification exam, you should probably set the max duration to 20 minutes (as the official Microsoft page recommends). In practice, however, if the real maximum window size you want is 20 minutes, you should set the max duration to 10 minutes; otherwise, your windows can last up to 40 minutes.
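The arithmetic behind "twice the max duration minus 1 microsecond" can be checked with a short Python sketch. It models only the checking rule described above, assuming that checks are aligned to midnight and that events keep arriving so the timeout never ends the session. The function name and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def forced_session_end(start: datetime, max_duration: timedelta) -> datetime:
    """Return the first check at which the session is no longer smaller than max_duration.

    Assumptions: checks happen at multiples of max_duration since midnight,
    and events keep arriving, so the timeout never triggers.
    """
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # First check strictly after the session start.
    check = midnight + ((start - midnight) // max_duration + 1) * max_duration
    while check - start < max_duration:
        check += max_duration  # still smaller than max duration: keep going
    return check


ten = timedelta(minutes=10)

# A session that starts exactly on a check lasts exactly the max duration.
on_check = datetime(2024, 1, 1, 12, 10)
print(forced_session_end(on_check, ten) - on_check)  # 0:10:00

# A session that starts 1 microsecond after a check survives the next check.
late = on_check + timedelta(microseconds=1)
print(forced_session_end(late, ten) - late)  # 0:19:59.999999
```

Under these assumptions, setting the max duration to 20 minutes gives a worst case of 40 minutes minus 1 microsecond, which is why a max duration of 10 minutes is the safer choice when the session must never exceed 20 minutes.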
Snapshot Window – the Twins
System.Timestamp()
“Give me all the pizza orders that happen at the same time”
The snapshot window is as easy to understand as the tumbling window. It aggregates events that occur at the exact same time. That is it!
Analogy tip: Snapshot windows report the twin events, that is, the events that share the exact same timestamp.
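In Python terms, a snapshot window behaves like grouping by the timestamp itself. The sketch below is just an illustration with made-up order IDs and timestamps.

```python
from collections import defaultdict
from datetime import datetime


def snapshot_groups(events) -> dict:
    """Group (timestamp, order_id) pairs that share the exact same timestamp."""
    groups = defaultdict(list)
    for ts, order_id in events:
        groups[ts].append(order_id)
    return dict(groups)


# Hypothetical orders: A and B are "twins", C stands alone.
events = [
    (datetime(2024, 1, 1, 12, 0), "A"),
    (datetime(2024, 1, 1, 12, 0), "B"),
    (datetime(2024, 1, 1, 12, 5), "C"),
]

for ts, order_ids in snapshot_groups(events).items():
    print(ts.strftime("%H:%M"), order_ids)
```

Orders A and B end up in the same group because their timestamps are identical, while order C forms a group of its own.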
I hope this article helps you better understand windowing functions. It is only an overview of the subject, though. To learn more, including specifications and other details, visit Microsoft's official page.
by Alessandro Melo
Business Assurance & IT Consultant @ Passio Consulting