DAQ Room Temperature sources

DAQ Room Environmental Monitoring Notes

I. Existing Sources of Information

There is a temperature and humidity probe attached to the PDU used with the Online Linux Pool.  The probe is located at the bottom of the rack with onl15-onl30 (and others), near the front of the rack.  This is likely to be a relatively cool spot in the room without much air flow.

Readings from this probe are recorded in Ganglia (gathered by a cron job on onl12 using snmp):  Temperature & Humidity
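A collector of this sort is simple to sketch.  The following is a minimal illustration (not the actual onl12 cron job): it parses a reading out of snmpget output and pushes it into Ganglia with gmetric.  The hostname, community string, OID, and metric name are all placeholders; the real probe would use the PDU vendor's MIB.

```python
import re
import subprocess

def parse_snmp_value(snmp_output):
    """Extract the numeric reading from a line of snmpget output,
    e.g. 'SOME-MIB::probeTemperature.1 = INTEGER: 24' -> 24.0"""
    match = re.search(r'=\s*\w+:\s*(-?\d+(?:\.\d+)?)', snmp_output)
    if match is None:
        raise ValueError("no numeric value in: %r" % snmp_output)
    return float(match.group(1))

def push_to_ganglia(metric, value, units):
    """Publish one reading as a Ganglia metric via the gmetric CLI."""
    subprocess.check_call(["gmetric", "--name", metric,
                           "--value", str(value),
                           "--type", "float", "--units", units])

# Cron usage (placeholder host and OID, runnable only where snmpget
# and gmetric are installed):
#   out = subprocess.check_output(
#       ["snmpget", "-v1", "-c", "public", "pdu.example.bnl.gov",
#        "1.3.6.1.4.1.99999.1.1.0"]).decode()
#   push_to_ganglia("pdu_probe_temp", parse_snmp_value(out), "Celsius")
```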

As I look at this on June 20, 2018, the temperatures are clearly moving with a daily cycle, presumably tracking the outside air temperatures:


Note that I have tried to approximately match the horizontal positioning and scaling of the graphs below to these x-axis times, so that they line up when scrolling vertically, but this may not render well in all browsers.

Slow Controls has a hygrometer and temperature probe in the DAQ room.  This probe is in the DC rack row, near the top of one of the racks, where it is in the exhaust plume of several DAQ and trigger computers.  The data from the Slow Controls device is available in multiple places.  The basic Slow Controls monitoring page displays the most recently recorded values for temperature and humidity (in Fahrenheit) -- see "Weather DAQ".  If one has access to caget on the starp network, the relevant Process Variables for the DAQ Room are "TemperatureF", "DewPointF", and "RHumidity".  The Online Status Viewer has historical records for these PVs under Conditions_sc / Environment - db  and a 24-hour summary under Environment / DAQ Room.
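For scripting against these PVs, something along the following lines would work.  This is only a sketch: the caget call requires EPICS tools and the starp network, so it is shown but not exercised here; the PV names are the ones listed above.

```python
import subprocess

def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def read_pv(pv_name):
    """Fetch one PV value with caget -t (terse output: just the value)."""
    out = subprocess.check_output(["caget", "-t", pv_name])
    return float(out)

# Example (starp network only):
#   temp_f = read_pv("TemperatureF")
#   print(fahrenheit_to_celsius(temp_f))
```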

As can be seen in the following plot over the same time period as above, the temperatures at this location are typically higher and swing more than those above, but follow the same trends.  It is worth noting that neither of these probes has been calibrated since purchase, nor have they been used simultaneously in the same air to determine any offsets between them.  (There is some deviation in the shapes during the afternoon of June 19, with a large drop into June 20 in the Slow Controls sensor.  Perhaps the end of the run (on the 18th) led to some computers being turned off and/or other changes in the heat dynamics of the room.)  (TODO:  The Celsius temperature scale below is wrong!  Have notified Dmitry A. about the bug.)

BNL has a fairly comprehensive unified monitoring (and control?) system for a lot of infrastructure components throughout the lab, including the Bldg. 1006 air handling units.  See http://emcs-main.b459.bnl.gov.  I think few people in STAR have access to this - I do, but I vaguely recall being asked not to share the credentials I was given, and I do not know whom others should ask to get access.  Below, for instance, is a temperature chart mostly overlapping the time window shown above; the temperature here is measured by a sensor on the east wall of the DAQ Room, behind the work table near the door to the Assembly Building.  The system is unable to hold the temperature at the cooling set point (67 deg. F) throughout this period, only coming close in the early morning of June 16, which, not surprisingly, had the coolest outside air temperature during this time period.

And for comparison, here is a record of the outside air temperature, taken from the Brookhaven Meteorology Services web site:

Meanwhile, lots of devices these days have temperature readings of some sort - most computers have the CPU temperatures available one way or another, and sometimes other temperature measurements (e.g., power supplies, RAID controller batteries).  High temperature readings may even trigger CPUs or other equipment to slow down or halt outright in an attempt to prevent permanent damage to the hardware.  Individual disk drive temperatures are usually available in their SMART data.  Network switches such as our two Dell PowerConnect 6248s have temperature sensors; in the case of those two switches, the readings are available through their web interfaces, their command line interfaces, and presumably SNMP.  (TODO: confirm SNMP access.)
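As an illustration of pulling a drive temperature out of SMART data, the sketch below parses `smartctl -A /dev/sdX` output for the common Temperature_Celsius attribute (ID 194).  Attribute names and raw-value formats vary by vendor and model, so treat this as an example rather than something universal.

```python
import re

def smart_temperature(smartctl_output):
    """Return the raw Temperature_Celsius value from `smartctl -A`
    output, or None if the attribute is absent."""
    for line in smartctl_output.splitlines():
        if "Temperature_Celsius" in line:
            # The raw value is the 10th whitespace-separated field,
            # sometimes followed by extras like '(Min/Max 21/48)'.
            for field in line.split()[9:]:
                if re.match(r'^\d+$', field):
                    return int(field)
    return None
```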

To gather some of this available data from individual computers, starting with the Online Linux Pool, I have added two data collectors.  The first is a Ganglia metric for the difference between the average CPU core temperature and the "high" temperature alarm level (so a lower numerical value means a *higher* CPU temperature - close to zero or negative is worrisome).  Click here for an example.  That is only a single metric, based on the average core temperature, but to record all of the core temperatures (and any other sensors available to lm_sensors), I have set up sensord to keep recent values.  See here for instance.
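A rough illustration of that "safety margin" metric follows; it is not the actual collection script.  It parses typical coretemp output from the `sensors` command (the exact format can vary by driver) and reports the average over cores of (high threshold minus current temperature), so smaller numbers mean hotter cores.

```python
import re

# Matches lines like:
#   Core 0:       +45.0 C  (high = +80.0 C, crit = +90.0 C)
# where the degree symbol is matched loosely, since encodings vary.
CORE_RE = re.compile(
    r'Core \d+:\s+\+?(-?\d+\.\d+).C\s+\(high = \+?(-?\d+\.\d+).C')

def core_temp_margin(sensors_output):
    """Average over cores of (high alarm level - current temperature).
    Close to zero or negative is worrisome."""
    margins = [float(high) - float(cur)
               for cur, high in CORE_RE.findall(sensors_output)]
    if not margins:
        raise ValueError("no core temperature lines found")
    return sum(margins) / len(margins)
```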

One particular set of hardware temperatures is already being recorded in Ganglia - the HLT Xeon Phi processors.  For example: the l416 machine Xeon Phi 1 Temperature graphs

Though not being recorded at this time, the Online Linux Pool machines (onl01-onl30, Dell PowerEdge 2950s) have ambient temperature values available via IPMI, for instance:  ipmitool sdr | grep "Ambient Temp"  (TODO: Add these to Ganglia?)
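If these were to be collected, a parser for the ipmitool output might look like the sketch below.  The sensor name and line format here follow typical `ipmitool sdr` output on these machines, but can differ across hardware generations.

```python
import re

def ambient_temp(sdr_output):
    """Return the 'Ambient Temp' reading in degrees C from
    `ipmitool sdr` output, or None if not found."""
    for line in sdr_output.splitlines():
        if line.startswith("Ambient Temp"):
            match = re.search(r'(-?\d+(?:\.\d+)?)\s*degrees C', line)
            if match:
                return float(match.group(1))
    return None
```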

And of course there is the thermometer hanging in the DA rack row, visible from the STAR Control Room.

II. Existing Alarm Capabilities

Currently no alerts are triggered by the OLP PDU probe (though it should be easy to add emails, as mentioned below).

The Slow Controls Alarm Handler has alarms for the DAQ Room hygrometer readings.  The alarm values for these readings at this time are:

DAQ Room hygrometer alarm settings (as of June 26, 2018)

PV             LoLo   Low    High   HiHi
TemperatureF    60     68     95    100
DewPointF      none   none    65     66
RHumidity (%)  none   none    80     85
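For reference, here is how four limits of this kind map onto alarm states, assuming the columns are the usual EPICS low-low / low / high / high-high limits (an assumption, but the standard Alarm Handler convention):

```python
def alarm_state(value, lolo=None, low=None, high=None, hihi=None):
    """Classify a reading against EPICS-style alarm limits.
    A limit of None (the 'none' entries above) disables that check."""
    if hihi is not None and value >= hihi:
        return "HIHI"   # major alarm, too high
    if lolo is not None and value <= lolo:
        return "LOLO"   # major alarm, too low
    if high is not None and value >= high:
        return "HIGH"   # minor alarm
    if low is not None and value <= low:
        return "LOW"    # minor alarm
    return "OK"
```

For example, with the TemperatureF limits above, a reading of 97 would be a minor "HIGH" alarm, and 101 a major "HIHI" alarm.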

With the Control Room occupied only about half of the year (and of course most likely to be empty during the warmest months), relying on the audible AH sounds to draw attention is dubious.  Jarda has recently added an email alert feature to the Alarm Handler system.  starsupport @ bnl.gov is one of several subscribers to alarm alerts.  The Slow Controls expert(s) can add additional email subscribers if there is interest.  Note, however, that there are no separate subscriber lists for individual PVs or subsystems - during the shutdown period (e.g., summer and fall) only a few alarms are active, but during STAR operations, alarms for many subsystems are active!

The script that is adding the CPU core temperature safety margin to Ganglia is also set to send out emails if any core temperature approaches the high alert value.

The BNL monitoring certainly has alarm capabilities, but I don't know what the alarm levels are or who gets notified (or how).  There are thousands of alarms in their system; I was told that they have a "critical" alarm list which gets attention - at least some of the DAQ Room Air Handler alarms should be designated as "critical".  In the ideal case, such an alarm would trigger a nearly immediate response from F&O without any need for STAR personnel to be involved.  In practice to date, however, someone from STAR has always had to "raise the alarm" to get anyone's attention and get repairs started (which are often quick once the proper personnel are available).

Most of the disk drives in the S&C managed computers have SMART monitoring enabled, which will trigger emails at extreme temperatures (with the thresholds set by the manufacturers, and potentially varying with each model or even firmware version).

III.  Possible Actions/Improvements

TODO: The OLP-PDU collection script could easily be modified to send emails for temperature readings above a given threshold.  Most cell service providers have email-to-SMS gateways, so this could also trigger SMS text messages as another way to get somebody's attention.
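The alerting piece could be as simple as the following sketch.  The threshold, sender, recipients, carrier gateway domain, and SMTP host are all placeholders; carrier email-to-SMS gateways accept mail addressed to <number>@<carrier-gateway-domain>.

```python
import smtplib
from email.message import EmailMessage

THRESHOLD_C = 30.0  # placeholder alarm threshold, degrees C

def build_alert(reading_c, recipients):
    """Compose an alert email for a high temperature reading."""
    msg = EmailMessage()
    msg["Subject"] = "DAQ Room PDU probe: %.1f C exceeds %.1f C" % (
        reading_c, THRESHOLD_C)
    msg["From"] = "olp-monitor@example.bnl.gov"  # placeholder sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("Temperature reading %.1f C is above the %.1f C "
                    "threshold." % (reading_c, THRESHOLD_C))
    return msg

def maybe_alert(reading_c):
    """Send an alert (including SMS via a carrier gateway) if needed."""
    if reading_c > THRESHOLD_C:
        msg = build_alert(reading_c,
                          ["starsupport@bnl.gov",          # from the notes above
                           "5551234567@txt.example.com"])  # placeholder SMS gateway
        # smtplib.SMTP("localhost").send_message(msg)  # uncomment to send
```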

Any temperature or humidity measurements that can be extracted to a bare number one way or another (parsing SNMP output, scraping web pages, caget, or whatever) can be dropped into Ganglia or the online databases, or trigger an email or other scriptable actions.  It is just a matter of deciding how often to collect, and what massaging / consolidation to do, if any (e.g., averaging all disk temperatures within a single computer).  Adding some hardware temperature measurements to Slow Controls (with AH configuration) and/or Icinga is another possibility for additional monitoring records and alarm paths.

Future UPS purchases can include network cards and temperature probes (though they aren't cheap) - we have such a unit in the Control Room already (ups7.starp.bnl.gov, reachable only from starp; temperature data is under Status >> Universal I/O in its web interface).

Can buy individual temperature and humidity units with network interfaces (just one example of many possibilities: ITHX-W3) to measure temperatures in additional locations (and presumably generate alerts).

Can buy full-featured systems with multiple integrated probes and fancy software for monitoring (as is done in the RACF and many large computing facilities).  In late 2015, with some follow-up in early 2016, we investigated a monitoring system from SynapSense (which is used by the RACF) for use in the DAQ Room.  It would have added sensor units in 11 racks, with wireless connectivity to a central server running the SynapSense software.  The cost was over $12,000, about half of which was for services - travel, engineering and installation labor, plus training for us.  The hardware was only $4,700 - we might have been able to install it on our own and merge it into the existing RACF system, saving thousands of dollars (at the cost of some of our time for the labor), but this was never nailed down and the matter was dropped.