Thursday, March 17, 2022

Anomalies, alarms, and faults in smart buildings and smart manufacturing

Much confusion, or at least hyperbole, abounds in the marketing literature about detecting problems in building or manufacturing environments using IoT technologies and advanced analytics. It might be helpful to explain some of the terms and issues, so let’s start with some of the basic ones.




Problems lead to various issues: wasting energy, wasting money, randomizing resources, unscheduled downtime, employee discomfort, and so on. Very broadly speaking, there are two classes of problems:

  • Things that are broken or failing. This includes things like air-conditioners that are not working at all, or not working properly, conveyor belts that are unable to start, and water faucets that are leaking.
  • Things that are misconfigured. This includes things like thermostats that were adjusted manually by an employee, fans that were set to manual override of the proper settings, and heating and cooling systems operating independently and against each other.




Using IoT we can often detect anomalous patterns in data, which may be leading indicators of problems, symptoms of existing problems, or just due to environmental or other factors – such as hotter than usual weather. Anomaly detection can be as simple as an Azure Stream Analytics job with a query:


SELECT device, temp, time FROM Input WHERE temp >= 90


or as sophisticated as using a machine learning model such as that published on the Azure website at . Note that an anomaly per se is not a problem; it is something that might indicate a problem.
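To make the contrast with the fixed threshold concrete, here is a minimal Python sketch of a statistical detector that sits between the simple query and a full machine learning model. It flags a reading when it is far from the rolling mean of the previous few readings. The function name, window size, z-score threshold, and sample temperatures are all assumptions chosen for illustration; this is not the Azure model.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Skip the test when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
        recent.append(value)


# Hypothetical temperature readings; the spike to 88 never reaches 90.
temps = [71.0, 72.0, 71.5, 72.5, 71.0, 72.0, 88.0, 72.0, 71.5]
anomalies = list(rolling_zscore_anomalies(temps))
print(anomalies)
```

In this made-up series the jump to 88 is flagged even though it would never satisfy temp >= 90, which is the kind of pattern a fixed threshold cannot see.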




Alarms are things that people are supposed to act upon. There is a long-standing set of alarm definitions from the OPC Foundation, which you can read at . Another well-established set of alarms can be found at . Plenty of other definitions exist, and system integrators (SIs) define which data patterns are worth raising as alarms. At a base level, an SI creates an algorithm that evaluates data from sensors and issues a notification if a condition is met. For example, an SI might create an algorithm that says: if the temperature of a computer room goes above 90 degrees Fahrenheit, send an alarm to the building engineers. Note again, though, that alarms per se are not problems; they are notices to people that there is something an engineer thought worth investigating.
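The computer room rule can be sketched in a few lines of Python. The class name, the notify callback, and the choice to notify only when the alarm first becomes active (rather than on every reading) are assumptions made for this illustration, not features of any particular product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LimitAlarm:
    """Hypothetical high-limit alarm that notifies once per excursion."""
    name: str
    high_limit: float
    notify: Callable[[str], None]
    active: bool = False

    def evaluate(self, value: float) -> bool:
        in_alarm = value > self.high_limit
        # Notify only on the transition from normal to alarm.
        if in_alarm and not self.active:
            self.notify(f"{self.name}: {value} > {self.high_limit}")
        self.active = in_alarm
        return in_alarm


sent = []  # stands in for a message to the building engineers
alarm = LimitAlarm("Computer room temperature (F)", 90.0, sent.append)
states = [alarm.evaluate(t) for t in (85.0, 91.0, 93.0, 88.0, 90.5)]
print(states, sent)
```

Two notifications are produced for the five readings: one when the temperature first passes 90, and one when it passes 90 again after returning to normal.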


An example of alarm management software is the Hyper Alarm Server in GENESIS64, a commercial product released by ICONICS, the company for which I work. The following screenshot shows the creation of an alarm that is raised if the noise in a room exceeds 30 decibels:




In this case, the user fills in a very simple form to create a ‘limit’ alarm, where an alarm is issued based upon the sound volume in the IoT data point received from a remote sensor. In Hyper Alarm Server, the limit alarm is one that implements the following expression:


IFQ {{DataSource}} >= {{HiValue}}
THEN 1 /* Hi Alarm condition code */
ELSE 0 /* normal condition code */


As you can see, this simply compares a data value (DataSource) to a constant (HiValue), both of which are defined in the form above. The end user dashboard might look something like this, with various alarms:






Faults are not as well codified by the various consortia, and some engineers use the terms alarms and faults interchangeably. If we try to pry alarms and faults apart, the difference is typically that alarms are perceived as calls to action, whereas faults are indications that there are problems which may or may not rise to the level of requiring action. However, that is a difference in how they are acted upon, not a difference in how they are identified. In most IoT systems, the difference between modules for identifying alarms and those for identifying faults is simply that the fault detection systems allow for more complex algorithms. For example, if you want to detect that the performance of a conveyor belt is degrading over time, you may need to create an algorithm that compares multiple data streams (belt speed, gear speed, energy consumed) over time and raises a fault when there is a change of more than a certain percentage.
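As a rough illustration of the conveyor example, the Python sketch below compares energy consumed per unit of belt speed in a baseline period against a recent period and raises a fault when the ratio has grown by more than a set percentage. The efficiency measure, the window sizes, the 10 percent limit, and the sample data are all assumptions; gear speed is left out to keep the sketch short, and a real rule would be designed by a subject matter expert.

```python
from statistics import mean


def efficiency(belt_speed, energy_kw):
    """Energy used per unit of belt speed; higher means less efficient."""
    return [e / s for s, e in zip(belt_speed, energy_kw) if s > 0]


def degradation_fault(belt_speed, energy_kw, baseline_n, recent_n, max_change_pct):
    """Return (fault, change_pct) comparing recent efficiency to a baseline."""
    ratios = efficiency(belt_speed, energy_kw)
    baseline = mean(ratios[:baseline_n])
    recent = mean(ratios[-recent_n:])
    change_pct = (recent - baseline) / baseline * 100.0
    return change_pct > max_change_pct, change_pct


# Hypothetical samples: speed is constant but energy use creeps up.
speed = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
energy = [4.0, 4.0, 4.0, 4.0, 5.0, 5.0]
fault, change = degradation_fault(speed, energy, baseline_n=3, recent_n=2,
                                  max_change_pct=10.0)
print(fault, change)
```

No single reading here would trip a limit alarm; the fault only becomes visible when two streams are compared across time.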


Continuing with the noise problem, an example of fault detection using the FDDWorX module in GENESIS64 would be adding a time dimension. The following screenshot shows creating a fault definition that triggers if the noise in a room exceeds 30 decibels for more than a minute:




Compare the expression in the alarm to that in the fault rule:


IF (TRUEFORDURATION((<<DAT>> >= 30), 60000))


Here we have enhanced the alarm by saying that the noise must stay over 30 decibels for a full minute before a notice is issued. This is a very simple example that takes advantage of the TRUEFORDURATION function in FDDWorX, which is not available in Hyper Alarm Server.
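For readers unfamiliar with this kind of function, the following Python sketch shows one way a "true for duration" check could behave over timestamped samples. It reflects my reading of the rule above (the condition must hold continuously for the given number of milliseconds), not the actual FDDWorX implementation; the function name and sample data are hypothetical.

```python
def true_for_duration(samples, predicate, duration_ms):
    """For each (timestamp_ms, value) sample, report whether the predicate
    has held continuously for at least duration_ms."""
    results = []
    true_since = None  # timestamp at which the current true run started
    for timestamp_ms, value in samples:
        if predicate(value):
            if true_since is None:
                true_since = timestamp_ms
            results.append(timestamp_ms - true_since >= duration_ms)
        else:
            true_since = None  # any false sample resets the timer
            results.append(False)
    return results


# Hypothetical noise readings in decibels, one every 20 seconds.
samples = [(0, 25), (20_000, 35), (40_000, 36),
           (60_000, 34), (80_000, 33), (100_000, 20)]
flags = true_for_duration(samples, lambda db: db >= 30, 60_000)
print(flags)
```

The noise first reaches 30 decibels at the 20-second mark, so the fault is raised only at 80 seconds, once a full minute has elapsed, and clears as soon as the level drops.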

For users of ICONICS, the value of FDDWorX lies in being able to create expressions that would detect issues that are hard to define simply. Consider the case of monitoring the fans in a parking garage, and trying to distinguish between a broken fan and one that has been put into manual override by a maintenance technician:




To detect if the fan is broken, you simply need to compare whether it is turning to whether it is supposed to be turning. To detect whether it has been put into manual override, you might use an expression in a fault rule like the following, from a training session at Microsoft:


IF(TRUEFORDURATION(<<MODE>> == 1 && <<SF CMD>> == 0 &&
(<<SF STS>> == 1 || <<KW>> >= 1 || <<SSP 1>> > 25 || <<SSP 4>> > 25), 3600000)) THEN 1 ELSE 0


or better yet,


MIN (MAX(0,8760 * <<KW>> * <<DUTY CYCLE>>),
8760 * 0.85 * <<DESIGN KW>>) * <<UTILITY RATE>>


I don’t pretend to understand the engineering involved in defining or optimizing the fault rules, but suffice it to point out that alarms typically involve simple algorithms, whereas faults can involve anything from simple to pretty complex ones. Understand, however, that a complex solution is not necessarily better than a simple one, and a company needs qualified subject matter experts to identify the best way to detect problems.
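One plausible reading of the MIN/MAX expression above is that it estimates an annual cost: 8760 is the number of hours in a year, so 8760 * KW * DUTY CYCLE looks like yearly energy use, capped at 85 percent of the design load and multiplied by a utility rate. The Python sketch below simply restates that arithmetic. The units (kW, a cost per kWh) and the sample numbers are assumptions on my part, not something stated in the training material.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def annual_cost(kw, duty_cycle, design_kw, utility_rate):
    """Restate MIN(MAX(0, 8760*KW*DUTY), 8760*0.85*DESIGN_KW) * RATE."""
    # Estimated yearly energy, never allowed to go negative.
    estimated_kwh = max(0.0, HOURS_PER_YEAR * kw * duty_cycle)
    # Upper bound: 85 percent of the design load running all year.
    cap_kwh = HOURS_PER_YEAR * 0.85 * design_kw
    return min(estimated_kwh, cap_kwh) * utility_rate


# Hypothetical fan: 2 kW measured, running half the time, 5 kW design load,
# at an assumed rate of 0.12 per kWh.
typical = annual_cost(kw=2.0, duty_cycle=0.5, design_kw=5.0, utility_rate=0.12)
# An implausibly high reading is limited by the design-load cap.
capped = annual_cost(kw=10.0, duty_cycle=1.0, design_kw=5.0, utility_rate=0.12)
print(round(typical, 2), round(capped, 2))
```

Read this way, the expression does not detect the override itself; it puts a rough price on leaving the fan running.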


Finally, much is written about the effort required to implement IoT systems to collect and identify data from buildings and equipment, but this IT problem can be relatively small compared to the effort to develop and tune a set of alarms and fault rules that optimize a company’s maintenance operations. In a small project with which I was involved for a factory, it took about a week to implement the data collection, tagging, and object identification pieces, and about five weeks for the building engineers to load and tune the fault rules so they identified the important faults (not too few, not too many) and did not produce large quantities of false positives. The moral of this is that customers would do well to use good commercial software that allows the subject matter experts to focus on the operational tasks and not get bogged down in debugging home-grown IT systems!
