Wednesday, June 9, 2021

Deploying Azure Digital Twin at Scale

I am part of the team working on the digital transformation of Microsoft campuses across the globe. The idea is to provide experiences like Room Occupancy, People Count, Thermal Comfort, etc. to all our employees. To power these experiences, we deal with real estate such as Buildings, the IoT Devices and Sensors installed in those buildings, and of course Employees. Our system ingests data from all kinds of IoT Devices and Sensors, processes it, and stores it in Azure Digital Twin (ADT for short).

 

Because we deal with Microsoft buildings across the globe, we handle a lot of data. Customers often ask us about using ADT at scale: the challenges we faced and how we overcame them.

 

In this article, we will be sharing how we worked with ADT at scale. We will touch upon some of the areas that we focused on:

  • Our approach to handling
    • Model, Twin and Relationship limits that impact schema and data.
    • transaction limits like DT API, DT Query API, etc. that deal with continuous workload.
  • Scale Testing ADT under load.
  • Design Optimizations.
  • Monitoring and Alerting for a running ADT instance.

So, let’s begin our journey at scale.

 

ADT Limits

Before we get into how to deal with scale, let’s understand the limits that ADT imposes. ADT has various limits, categorized under Functional Limits and Rate Limits.

 

Functional limits refer to the limits on the number of objects that can be created like Models, Twins, and Relationships to name a few. These primarily deal with the schema and amount of data that can be created within ADT.

 

Rate limits refer to the limits that are put on ADT transactions. These are the transactional limitations that every developer should be aware of, as the code we write will directly lead to consumption of these rates.

 

Many of the limits defined by ADT are “Adjustable”, which means the ADT team is open to changing these limits based on your requirements via a support ticket.

 

For further details and a full list of limitations, please refer to Service limits - Azure Digital Twins | Microsoft Docs.

 

Functional Limits (Model, Twin and Relationships)

The first step in dealing with scale is to figure out whether ADT can hold the entire dataset that we want to store. This starts with the schema, followed by the data.

 

Schema

The first step, as with any system, is to define the schema, which is the structure needed to hold the data. In ADT, the schema is defined in terms of Models. Once the Models were defined, it was straightforward to see that the number of Models was within the ADT limits.

 

While the initial set of Models may be within the limit, it is also important to ensure that there is enough capacity left for future enhancements like new models or different versions of existing models.

 

One lesson we learned during this exercise was to regularly clean up old Models or old Model versions after they are decommissioned, to free up capacity.

 

Data

Once the schema was taken care of, we checked the amount of data that we wanted to store. For this, we looked at the number of Twins and the Relationships needed for each Twin. There are limits on incoming and outgoing relationships, so we needed to assess their impact as well.

 

During the data check, we ran into a challenge: one kind of twin had more incoming relationships than the ADT limits allow. This meant that we had to go back and modify the original schema. We restructured it by removing the incoming relationship and instead adding a property, which flattened the schema.

 

To elaborate on our scenario: we were trying to create a relationship linking each Employee to the Building where they sit and work. For some bigger buildings, the number of employees exceeded the supported ADT limit for incoming relationships, so we added a property to the Employee model that refers to the Building directly, instead of using a relationship.
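The check described above can be sketched in a few lines of Python. This is only an illustration, not our actual tooling: the limit constant, the function name, and the sample data are all assumptions, and the real per-twin incoming relationship limit should be taken from the service limits page.

```python
from collections import Counter

# Hypothetical placeholder value; look up the real per-twin incoming
# relationship limit in the ADT service limits documentation.
MAX_INCOMING_RELATIONSHIPS = 5000


def buildings_over_limit(employee_to_building, limit=MAX_INCOMING_RELATIONSHIPS):
    """Return {building: employee_count} for buildings that would exceed the limit."""
    counts = Counter(employee_to_building.values())
    return {building: n for building, n in counts.items() if n > limit}


# Illustrative data only: employee id -> building id
assignments = {f"emp-{i}": "building-A" for i in range(6000)}
assignments.update({f"emp-x{i}": "building-B" for i in range(200)})

over = buildings_over_limit(assignments)
# Any building listed in `over` cannot be modeled with one incoming
# relationship per employee; the building id would instead be stored
# as a property on each Employee twin.
```

Running a check like this before loading data shows early whether a relationship-based design fits or whether a flattened property is needed.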

 

With that out of the way, we moved on to checking the number of Twins. Keep in mind that the Twin limit applies to all kinds of twins, including those with incoming and outgoing relationships. Estimating the number of twins in our system was easier, as we knew the number of buildings and other related data that would flow into the system.

 

As in the case of Models, we also looked at our future growth to ensure we have enough buffer to accommodate new buildings.

 

Pro Tip: We wrote a tool to simulate the creation of twins and relationships as per our requirements, to test out the limits. The tool was also of great help in benchmarking our needs. Don’t forget to clean up afterwards.

 

Rate Limits (Twin APIs / Query APIs / Query Units /…)

Now that we knew ADT could handle the data we were going to store, the next step was to check the rate limits. Most ADT rate limits are defined per second, e.g., the Twin API operations RPS limit, Query Units consumed per second, etc.

 

Before we get into the details of rate limits, here’s a simplified view of our main sensor reading processing pipeline, which is where the bulk of the processing happens and which therefore contributes most of the load. The main processing component is implemented as an Azure Function that sits between IoT Hub and ADT and ingests the sensor readings into ADT.

 

[Figure: Processing Pipeline]

 

 

This is very similar to the approach suggested by the ADT team for ingesting IoT Hub telemetry. Please refer to Ingest telemetry from IoT Hub - Azure Digital Twins | Microsoft Docs for more information on the approach and a few implementation details.

 

To identify the load that our Azure Function would process continuously, we looked at the sensor readings that would be ingested across the buildings. We also looked at the frequency at which each sensor sends a reading, which gave us a specific load per hour and per day.

 

From this we identified our daily load, which in turn gave us a per-second load, or RPS as we call it. We expected to process around 100 sensor readings per second. Our load was mostly consistent throughout the day, as we process data from buildings all over the world.

Load (RPS)        100
Load (per day)    ~8.5 million readings

 

Once the load data was available, we needed to convert the load into the total number of ADT operations. So, we identified the number of ADT operations we would perform for every sensor reading that is ingested. For each reading, we identified the following items:

  • Operation type i.e., Twin API vs Query API operation
  • Number of ADT operations required to process every reading.
  • Query charge (or query unit consumption) for Query operations. This is available as part of the response header in the ADT client method.

 

This analysis gave us the following per-reading figures.

Twin API operations per reading       2
Query API operations per reading      2
Avg Query Units consumed per query    30

 

Multiplying the above numbers by the load gave us the expected ADT operations per second.
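As a worked example, the multiplication can be written out using the figures quoted above. This covers only the sensor reading pipeline; the variable names are ours for illustration.

```python
# Figures quoted earlier in this article
load_rps = 100                  # sensor readings per second
twin_ops_per_reading = 2        # Twin API operations per reading
query_ops_per_reading = 2       # Query API operations per reading
avg_query_units_per_query = 30  # average Query Units per query

twin_api_rps = load_rps * twin_ops_per_reading
query_api_rps = load_rps * query_ops_per_reading
query_units_per_second = query_api_rps * avg_query_units_per_query
readings_per_day = load_rps * 60 * 60 * 24

print(twin_api_rps)            # 200
print(query_api_rps)           # 200
print(query_units_per_second)  # 6000
print(readings_per_day)        # 8640000
```

The daily figure of 8.64 million is consistent with the approximate 8.5 million readings per day mentioned earlier.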

 

Apart from the sensor reading processing, we also had a few other components like background jobs, APIs for querying data, etc. We added the ADT usage from these components on top of the processing numbers calculated above to get the final numbers.

 

Armed with these numbers, we calculated the actual ADT consumption we would hit when we went live. Since these numbers were within the ADT rate limits, we were good.

 

Again, as with Models and Twins, we must ensure there is some buffer; otherwise, future growth will be restricted.

 

Additionally, when a service limit is reached, ADT will throttle the requests. For more suggestions from the ADT team on working with limits, please refer to Service limits - Azure Digital Twins | Microsoft Docs.
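One common way to cope with throttling is to retry with exponential backoff. The sketch below is a generic pattern rather than our production code: ThrottledError is a hypothetical stand-in for whatever error your ADT client raises when a request is throttled (typically surfaced as an HTTP 429 response), and the attempt count and delays are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the error raised on a throttled (HTTP 429) request."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on throttling, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller decide what to do
            # Delay doubles on each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Backoff only smooths over short bursts. If throttling is sustained, the load has to come down or the limit has to be raised.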

 

Scale

For future scale requirements, the first thing we did was figure out our future load projections. We expected to grow to up to twice the current rate in the next 2 years. So, we simply doubled all the numbers that we got above to get our future scale needs.
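A minimal sketch of this projection follows. The current rates are derived from the figures in this article (100 readings per second, 2 Twin API and 2 Query API operations per reading, 30 Query Units per query). The limit values are made-up placeholders for illustration only and are not the real ADT limits.

```python
GROWTH_FACTOR = 2  # expected growth over the next 2 years

# Current pipeline rates derived from the figures in this article
current = {
    "twin_api_rps": 100 * 2,
    "query_api_rps": 100 * 2,
    "query_units_per_second": 100 * 2 * 30,
}

# Hypothetical limits, NOT the real ADT values; use the service
# limits page or your agreed adjusted limits instead.
placeholder_limits = {
    "twin_api_rps": 1000,
    "query_api_rps": 1000,
    "query_units_per_second": 20000,
}


def project(rates, factor):
    """Scale every rate by the growth factor."""
    return {name: value * factor for name, value in rates.items()}


def utilization(rates, limits):
    """Fraction of each limit consumed by the given rates."""
    return {name: rates[name] / limits[name] for name in rates}


projected = project(current, GROWTH_FACTOR)
usage = utilization(projected, placeholder_limits)
```

Any entry in usage that approaches 1.0 is a candidate for a design change or a limit adjustment request.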

 

The ADT team provides a template (sample below) that helps in organizing all this information in one place.

[Figure: Scale template]

 

 

 

Once the load numbers and future projections were available, we worked with the ADT team on the adjustable limits and on getting the green light for the scale numbers.

 

Performance Test

Based on the above numbers, and after getting the green light from the ADT team, we ran performance tests on our system to validate the scale and the ADT limits. We used AKS-based clusters to simulate the ingestion of sensor readings, while also running all our other background jobs at the same time.

 

We ran multiple rounds of performance runs for different loads, such as X and 2X, and gathered metrics on ADT performance. We also ran some endurance tests, where we ran varying loads continuously for a day or two to measure the performance.

 

Design Optimization (Cache / Edge Processing)

Typically, sensors send a lot of unchanged readings. For example, a motion sensor will report false for a long duration, and once it changes to true it will most likely stay true for some time before going back to false. As such, we don’t need to process every reading, and such “no change” readings can be filtered out.

 

With this principle in mind, we added a Cache component to our processing pipeline, which helped in reducing the load on ADT. Using the Cache, we were able to reduce our ADT operations by around 50%. This helped us support a higher load, with the added advantage of faster processing.
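The idea can be sketched as a small "last value" filter. This is a simplified illustration: the class name and the in-memory dictionary are assumptions, standing in for whatever cache a real pipeline would use.

```python
_MISSING = object()  # sentinel so that a first reading of None still counts as a change


class ChangeFilter:
    """Remembers the last value per sensor and flags only changed readings."""

    def __init__(self):
        self._last_values = {}

    def should_process(self, sensor_id, value):
        """Return True if the reading differs from the last one seen for this sensor."""
        if self._last_values.get(sensor_id, _MISSING) == value:
            return False  # "no change" reading: skip the ADT operations
        self._last_values[sensor_id] = value
        return True


change_filter = ChangeFilter()
motion_readings = [False, False, True, True, False]
decisions = [change_filter.should_process("motion-1", r) for r in motion_readings]
# Only the first reading and the two state changes are forwarded to ADT.
```

Since the function that ingests readings may run as several instances, a shared cache is likely a better fit than per-process memory in practice.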

 

Another change we made to optimize our sensor traffic was to add edge processing. We introduced an Edge module that acts as the Gateway between the Device, where the readings are generated, and IoT Hub, which acts as storage.

 

The Gateway module processes the data closer to the actual physical devices and helped us filter out certain readings based on defined rules, e.g., filtering out health readings from telemetry readings. We also used this module to enrich the sensor readings being sent to IoT Hub, which helped in reducing the overall processing time.
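A rough sketch of such a gateway rule is shown below. The field names ("type", "deviceId"), the metadata lookup, and the function name are all hypothetical; the real message shape depends on the devices and the Edge module.

```python
def process_at_edge(reading, device_metadata):
    """Drop health readings; enrich telemetry with known device metadata.

    Returns the enriched reading, or None if it should not be forwarded.
    """
    if reading.get("type") == "health":
        return None  # rule: health readings are not sent on as telemetry
    enriched = dict(reading)  # copy so the original message is left untouched
    enriched.update(device_metadata.get(reading["deviceId"], {}))
    return enriched


# Illustrative metadata and messages only
metadata = {"dev-1": {"buildingId": "building-A", "floor": 3}}
readings = [
    {"deviceId": "dev-1", "type": "telemetry", "value": 21.5},
    {"deviceId": "dev-1", "type": "health", "status": "ok"},
]
forwarded = [r for r in (process_at_edge(x, metadata) for x in readings) if r is not None]
```

Enriching at the edge means the downstream function does not have to look up the same metadata for every reading.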

 

Pro-active Monitoring and Alerts

All said and done, we have tested everything at scale and things look good. But does that mean we will never run into a problem or never reach a limit? The answer is “No”. Since there is no guarantee, we need to prepare ourselves for such an eventuality.

 

ADT provides various Out of the Box (OOB) metrics that help in tracking the Twin Count and the RPS for Twin API or Query API operations. We can always write our own code to track more metrics if required, but in our case we didn’t need to, as the OOB metrics fulfilled our requirements.

 

To proactively monitor the system behavior, we created a dashboard in our Application Insights for monitoring ADT, where we added widgets to track the consumption of each of the important ADT limits. Here's what our widgets look like:

                            Min    Avg    Max
Query API Operations RPS    50     80     400
Twin API Operations RPS     100    200    500

 

Twin and Model Count

Model Count    Model Consumption %    Twin Count    Twin Consumption %
100            10%                    80 K          40%

 

To be notified of any discrepancy in the system, we have alerts configured to raise a flag when we consistently hit the ADT limits over a period. As an example, we have alerts defined at various levels for Twin count, say a warning at 80% and a critical error at 95% of capacity, so that people can act.
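The threshold logic behind such an alert can be sketched as follows. In practice this would be configured as an alert rule on the OOB metrics rather than written as code; the function name is hypothetical, and the twin limit of 200,000 is inferred from the sample widget above (80 K twins shown as 40%), not a stated limit.

```python
def twin_capacity_alert(twin_count, twin_limit, warning=0.80, critical=0.95):
    """Classify twin capacity usage as 'ok', 'warning' or 'critical'."""
    usage = twin_count / twin_limit
    if usage >= critical:
        return "critical"
    if usage >= warning:
        return "warning"
    return "ok"


# Limit inferred from the sample widget (80 K twins = 40%); an assumption.
TWIN_LIMIT = 200_000

status_now = twin_capacity_alert(80_000, TWIN_LIMIT)
status_after_runaway_job = twin_capacity_alert(170_000, TWIN_LIMIT)
```

This sketch checks a single data point. A real alert would evaluate the metric over a time window to avoid firing on short spikes.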

 

An example of how this helped us: once, due to some bug (or was it a feature?), a piece of code kept adding unwanted Twins overnight. In the morning we started getting alerts about Twin capacity crossing the 80% threshold, which notified us of the issue and led us to fix it and clean up.

 

Summary

I would leave you with a simple summary: when dealing with ADT at scale, work within the various limits set by ADT and design your system with them in mind. Plan to performance test your system to catch issues early and incorporate changes as required. Set up monitoring and alerting, so that you can track system behavior regularly.

 

Last but not least, keep in mind that dealing with scale is not a one-time thing but continuous work: you need to keep evolving, testing, and optimizing your system as it grows and as you add more features to it.

 

Posted at https://sl.advdat.com/3zhlnCE