If you find this message in your server log, you are probably looking for the reason why your HP server sometimes reboots itself for no apparent reason.
WARNING: hpasmd: System Overheating (Zone 4, Location CPU, Temperature 85C)
So the reason is overheating! As you know, every machine has a working temperature range. The hardware manufacturer always states it in that boring PDF that you never read. If the machine runs outside that range, it can malfunction. To avoid that, some manufacturers build in protective mechanisms. In the case of HP servers, that mechanism is a server reboot.
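If you want to check how often this happens, you can pull the zone, location, and temperature out of each warning. Here is a minimal Python sketch. The pattern is based only on the one example line above, and the name parse_overheat is made up for this example, so adjust it if your hpasmd version words the message differently.

```python
import re

# Pattern inferred from the single example line in this post; other
# hpasmd versions may format the warning differently.
OVERHEAT_RE = re.compile(
    r"hpasmd: System Overheating \(Zone (?P<zone>\d+), "
    r"Location (?P<location>[^,]+), Temperature (?P<temp_c>\d+)C\)"
)


def parse_overheat(line):
    """Return zone, location and temperature from one log line, or None."""
    m = OVERHEAT_RE.search(line)
    if m is None:
        return None
    return {
        "zone": int(m.group("zone")),
        "location": m.group("location").strip(),
        "temp_c": int(m.group("temp_c")),
    }


sample = "WARNING: hpasmd: System Overheating (Zone 4, Location CPU, Temperature 85C)"
print(parse_overheat(sample))  # {'zone': 4, 'location': 'CPU', 'temp_c': 85}
```

Feed it the lines of your own log file and count the non-None results per day to see whether you have a one-time event or a daily problem.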
Reasons for System Overheating on HP servers
The reasons for system overheating vary, from very high load on a particular CPU to too much dust blocking the fans. Overheating can happen once every few months or every day. It all depends.
You have to rule out the possible causes one by one. Monitor temperatures and try to correlate them with CPU load. HP has a tool called hpasm that lets you monitor various things. For temperatures, enter the hpasm CLI and use show temp:
hpasmcli> show temp
Sensor   Location     Temp       Threshold
------   --------     ----       ---------
#1       CPU#1        46C/114F   85C/185F
#2       CPU#2        43C/109F   85C/185F
#3       CPU#3        48C/118F   85C/185F
#4       CPU#4        50C/122F   85C/185F
#5       I/O_ZONE     37C/98F    60C/140F
#6       AMBIENT      22C/71F    40C/104F
#7       SYSTEM_BD    39C/102F   60C/140F
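To see at a glance how close each sensor is to its limit, you can parse the show temp output and subtract the reading from the threshold. Below is a small Python sketch, assuming the row layout shown above. The names parse_show_temp and headroom are made up for this example, and rows that do not match (for example sensors that report no value) are simply skipped.

```python
import re

# One data row of `show temp` as printed in this post, e.g.
# "#1 CPU#1 46C/114F 85C/185F". Other hpasmcli versions may differ.
ROW_RE = re.compile(r"^#(\d+)\s+(\S+)\s+(\d+)C/\d+F\s+(\d+)C/\d+F\s*$")


def parse_show_temp(text):
    """Return a list of (sensor, location, temp_c, threshold_c) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            rows.append((int(m.group(1)), m.group(2),
                         int(m.group(3)), int(m.group(4))))
    return rows


def headroom(rows):
    """Map location -> degrees Celsius left before the threshold."""
    return {loc: limit - temp for _, loc, temp, limit in rows}


SAMPLE = """\
Sensor Location Temp Threshold
------ -------- ---- ---------
#1 CPU#1 46C/114F 85C/185F
#2 CPU#2 43C/109F 85C/185F
#3 CPU#3 48C/118F 85C/185F
#4 CPU#4 50C/122F 85C/185F
#5 I/O_ZONE 37C/98F 60C/140F
#6 AMBIENT 22C/71F 40C/104F
#7 SYSTEM_BD 39C/102F 60C/140F
"""

# Print sensors sorted by remaining headroom, tightest first.
margins = headroom(parse_show_temp(SAMPLE))
for loc, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{loc:10s} {margin:3d} C below threshold")
```

With the readings above, AMBIENT has the least headroom (18 C), and CPU#4 is the CPU closest to its threshold (35 C).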
If you have ruled out high load as the cause of the overheating, it is probably an airflow problem with the fans. How to fix this? Shut down the server, open it, and start the vacuum cleaner. :) No, I am not joking. You may find something like this:
On the left is the CPU heat sink before vacuum cleaning, and on the right is the CPU heat sink after vacuum cleaning. As you can see, there was a wall of dust blocking normal airflow. This was the reason for the system overheating and the server reboots.
If you still think that I am joking, take a look at this time graph of CPU temperatures.
As you can see, there is a significant drop in CPU temperatures after dust removal!
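If you want a graph like that for your own server, or you want to compare temperature with CPU load before blaming dust, a small logger is enough. The Python sketch below is only an illustration: it assumes hpasmcli accepts a command through -s when run as root, that the output has the same layout as the show temp example earlier in this post, and that the machine runs a Unix-like OS (for os.getloadavg). The function names and the cpu_temp.csv file name are made up for this example.

```python
import csv
import os
import re
import subprocess
import time

# Matches data rows such as "#1 CPU#1 46C/114F 85C/185F".
ROW_RE = re.compile(r"^#\d+\s+(\S+)\s+(\d+)C/\d+F\s+\d+C/\d+F")


def read_show_temp():
    """Run hpasmcli non-interactively (assumption: `-s` works on your version)."""
    result = subprocess.run(
        ["hpasmcli", "-s", "show temp"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def sample(reader=read_show_temp):
    """Return [unix_time, 1-minute load average, hottest CPU temp in C or None]."""
    cpu_temps = []
    for line in reader().splitlines():
        m = ROW_RE.match(line.strip())
        if m and m.group(1).startswith("CPU"):
            cpu_temps.append(int(m.group(2)))
    hottest = max(cpu_temps) if cpu_temps else None
    return [int(time.time()), os.getloadavg()[0], hottest]


def log_forever(path="cpu_temp.csv", interval_s=60):
    """Append one sample per interval to a CSV file until interrupted."""
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        while True:
            writer.writerow(sample())
            fh.flush()
            time.sleep(interval_s)
```

Call log_forever() from a root shell and leave it running for a few days. If the hottest CPU temperature climbs while the load stays flat, load is probably not your problem.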
The point is: keep your servers clean :)

