Thursday, 18 August 2016

WARNING: hpasmd: System Overheating (Zone 4, Location CPU, Temperature 85C) - SOLVED and EXPLAINED

If you find this message in your server log, you are probably looking for the reason why your HP server sometimes reboots itself for no apparent reason.

WARNING: hpasmd: System Overheating (Zone 4, Location CPU, Temperature 85C)

So the reason is overheating! As you know, every machine has something called a working temperature. The hardware manufacturer always states it in that boring PDF that you never read. If the machine runs outside that temperature range, it can malfunction. To avoid that, some manufacturers build in protection mechanisms. In the case of HP servers, that mechanism is a server reboot.

Reasons for system overheating on HP servers

 

System overheating can have various causes, from very high load on a particular CPU to too much dust covering the cooling fans. It can be a one-time thing (once every few months) or happen every day. It all depends.

You have to rule out the possible causes one by one. Monitor the temperatures and try to correlate them with CPU load. HP has a tool called hpasm that lets you monitor various things. For temperatures, enter hpasmcli and use show temp:

hpasmcli> show temp
Sensor   Location     Temp       Threshold
------   --------     ----       ---------
#1       CPU#1        46C/114F   85C/185F
#2       CPU#2        43C/109F   85C/185F
#3       CPU#3        48C/118F   85C/185F
#4       CPU#4        50C/122F   85C/185F
#5       I/O_ZONE     37C/98F    60C/140F
#6       AMBIENT      22C/71F    40C/104F
#7       SYSTEM_BD    39C/102F   60C/140F
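If you want to watch these numbers over time instead of reading them by hand, the output is easy to parse. Below is a minimal Python sketch, not an official HP tool: the function names parse_show_temp and near_threshold and the 10 degree margin are my own assumptions, and it only understands rows in the format shown above. I believe the text can be captured non-interactively with something like hpasmcli -s "show temp", but check that against your hpasm version.

```python
import re

# Matches one sensor row of `show temp` output, for example:
# "#1       CPU#1        46C/114F   85C/185F"
# Rows in any other format (headers, separators, empty sensors) are skipped.
ROW = re.compile(r"^#(\d+)\s+(\S+)\s+(\d+)C/\d+F\s+(\d+)C/\d+F$")


def parse_show_temp(text):
    """Return a list of (sensor, location, temp_c, threshold_c) tuples."""
    readings = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            sensor, location, temp_c, threshold_c = match.groups()
            readings.append((int(sensor), location, int(temp_c), int(threshold_c)))
    return readings


def near_threshold(readings, margin_c=10):
    """Return readings that are within margin_c degrees C of their threshold.

    margin_c is a hypothetical safety margin, pick whatever suits your server.
    """
    return [r for r in readings if r[3] - r[2] <= margin_c]


# Sample text copied from the output shown in this post.
SAMPLE = """\
Sensor   Location              Temp       Threshold
------   --------              ----       ---------
#1        CPU#1                46C/114F   85C/185F
#2        CPU#2                43C/109F   85C/185F
#3        CPU#3                48C/118F   85C/185F
#4        CPU#4                50C/122F   85C/185F
#5        I/O_ZONE           37C/98F    60C/140F
#6        AMBIENT            22C/71F    40C/104F
#7        SYSTEM_BD       39C/102F   60C/140F
"""

if __name__ == "__main__":
    for sensor, location, temp_c, threshold_c in parse_show_temp(SAMPLE):
        headroom = threshold_c - temp_c
        print(f"#{sensor} {location}: {temp_c}C, {headroom}C below threshold")
```

Run it from cron every few minutes, log the result next to the load average, and you can see whether the temperature spikes follow CPU load or not.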

If you have ruled out high load as the cause of the overheating, then it is probably an airflow problem. How to fix this? Shut down the server, open it and start the vacuum cleaner. :) No, I am not joking. You may find something like this:

[Image: CPU heat sink]

The picture shows the CPU heat sink before and after vacuum cleaning. As you can see, there was a wall of dust blocking normal airflow. This was the reason for the system overheating and the server reboots.

If you still think that I am joking, take a look at this time graph of CPU temperatures.

As you can see, there is a significant drop in CPU temperatures after the dust removal!

The point is: keep your servers clean :)
