Friday 7 October 2016

Reboot LPAR from HMC CLI EXPLAINED and SOLVED

If you are an AIX admin, you probably have experience with the Hardware Management Console, or HMC for short. You can do all HMC tasks via the web application or via the command line.

The web application is a Java application. If you recently updated your web browser or Java on your PC, you may find that you can no longer access the HMC web application because of browser or Java security restrictions. What then?

Reboot LPAR by using HMC CLI

Then you connect to the HMC via PuTTY. A regular ssh connection! Your user must have admin privileges!

Today I could not connect to the HMC via the web application, and paging space on one LPAR was so low that the LPAR stopped responding. I needed to reboot it! How do you do that in the HMC CLI?

There are two things you need to know: the name of the machine where the LPAR is located, and the LPAR name.

#:lssyscfg -r sys -F name:state
 MACHINE1:Running
 MACHINE2:Running

 Ok, so we have two machines.


#:lssyscfg -m MACHINE1 -r lpar -F name:state
 LPAR1:Running 
 LPAR2:Running

 Ok, now we have the LPAR names on the machine.

In case you need to restart LPAR1 on MACHINE1 immediately, here is the command:
  
chsysstate -r lpar -m MACHINE1 -o shutdown --immed --restart -n LPAR1
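
If you do this more than once, the steps above can be wrapped in two small helpers that you run on your own workstation. This is only a sketch: the function names lpar_state and reboot_cmd are made up for this example, and the first one simply parses the name:state output shown above.

```shell
#!/bin/bash
# Print the state of one LPAR.
# Reads the output of `lssyscfg -m MACHINE -r lpar -F name:state` on stdin.
lpar_state() {
    awk -F: -v lpar="$1" '
        { gsub(/^[ \t]+|[ \t]+$/, "", $1); gsub(/[ \t]+$/, "", $2) }
        $1 == lpar { print $2 }'
}

# Print (not run) the command that shuts an LPAR down immediately and restarts it.
# $1 = machine name, $2 = LPAR name.
reboot_cmd() {
    printf 'chsysstate -r lpar -m %s -o shutdown --immed --restart -n %s\n' "$1" "$2"
}
```

A possible use, assuming an admin user hscroot and a host called hmc (both are placeholders for your own values): ssh hscroot@hmc "$(reboot_cmd MACHINE1 LPAR1)". Printing the command first and reading it once before sending it is cheap insurance when the target is a production LPAR.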
 

 

Thursday 18 August 2016

WARNING: hpasmd: System Overheating (Zone 4, Location CPU, Temperature 85C) - SOLVED and EXPLAINED

If you find this message in your server log, you are probably looking for the reason why your HP server sometimes reboots itself for no apparent reason.

WARNING: hpasmd: System Overheating (Zone 4, Location CPU, Temperature 85C)

So the reason is overheating! As you know, every machine has a specified operating temperature range. The hardware manufacturer always states it in that boring PDF that you never read. If the temperature goes outside that range, the machine can malfunction. To avoid that, some manufacturers build in protection mechanisms. In the case of HP servers, that mechanism is a server reboot.

HP servers System Overheating reason

 

Reasons for system overheating vary, from very high load on a particular CPU to too much dust blocking the fans. Overheating can happen once every few months or every day. It all depends.

You have to rule out the possible reasons one by one. Monitor temperatures and try to correlate them with CPU load. HP has a tool called hpasm with which you can monitor various things. For temperatures, enter the hpasm CLI and use show temp:

hpasmcli> show temp
Sensor   Location     Temp       Threshold
------   --------     ----       ---------
#1       CPU#1        46C/114F   85C/185F
#2       CPU#2        43C/109F   85C/185F
#3       CPU#3        48C/118F   85C/185F
#4       CPU#4        50C/122F   85C/185F
#5       I/O_ZONE     37C/98F    60C/140F
#6       AMBIENT      22C/71F    40C/104F
#7       SYSTEM_BD    39C/102F   60C/140F
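
If you want to watch this over time instead of reading the table by eye, a small filter can pick out the sensors that are getting close to their threshold. This is a rough sketch that assumes the column layout shown above; the function name hot_sensors and the default margin of 10 degrees are my own choices for this example.

```shell
#!/bin/bash
# Print "location temp threshold" (in degrees C) for every sensor that is
# within MARGIN degrees of its threshold. Reads `show temp` output on stdin.
# $1 = MARGIN in degrees C (default 10).
hot_sensors() {
    awk -v margin="${1:-10}" '
        $1 ~ /^#[0-9]+$/ {
            temp = $3; limit = $4
            sub(/C.*/, "", temp)     # "46C/114F" becomes "46"
            sub(/C.*/, "", limit)    # "85C/185F" becomes "85"
            # Skip sensors that report no numeric reading.
            if (temp ~ /^[0-9]+$/ && limit ~ /^[0-9]+$/ && limit - temp <= margin)
                print $2, temp, limit
        }'
}
```

If your hpasmcli supports running a single command with -s, something like hpasmcli -s "show temp" | hot_sensors 10 from cron gives you a simple history to compare against CPU load.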

If you have ruled out high load as the cause of the overheating, then it's probably an air flow issue. How to fix this? Shut down the server, open it and start the vacuum cleaner. :) No, I am not joking. You may find something like this:
WARNING: hpasmd: System Overheating

 On the left is the CPU heat sink before vacuum cleaning and on the right is the CPU heat sink after vacuum cleaning. As you can see, there was a wall of dust blocking normal air flow. This was the reason for the overheating and the server reboots.

In case you still think that I am joking, take a look at this time graph of CPU temperatures.

  As you can see, there is a significant drop in CPU temperatures after the dust removal!

The point is: keep your servers clean :)

BIOS update on HP Proliant DL 580 G4 - how to find the right (very old) BIOS ROM package, and a few other tips and tricks

Ok, ok, I know.. It's 2016... And this is a server from 2007... 9 years... It's more than 7 years in Tibet :) Just kidding!

OK, so you have an old server that you need up and running, but you need to update the BIOS ROM. The BIOS ROM update procedure is pretty simple: you go to the manufacturer's web site or portal, find the necessary package, follow the instructions and that's it. Basically it is, but... there are a few things you need to know before you say "Ok, it's a 5 minute job". Good preparation can take days for something to become a 5 minute job. Always keep that in mind.

BIOS update on HP Proliant DL 580 G4

For those who don't know, HP makes servers. And good ones! There are several server classes such as rack servers, blade servers, etc. Every year or two HP puts new servers on the market! I have been working with HP servers for 8+ years now. In case you are not familiar with HP naming: ProLiant DL is the server line, 580 is the server model and G4 means Generation 4. Every few years HP puts out a new generation of servers. As of 2015 there is Generation 9. I hope you get the picture. G4 is very old!

So, to update BIOS ROM you need to follow these steps:

Installation:
The SCEXE components are self-extracting executable files. The SCEXE file unpacks itself, flashes the ROM, and cleans up.

To flash a ROM:
Download the file to the target server.

Execute sh CPxxxxxx.scexe where CPxxxxxx.scexe represents the filename of the component.


BIOS update on HP Proliant DL 580 G4

By the way, there are two types of update files: .exe and .scexe. Exe is for Windows environments and scexe is for Linux environments.
As you can see, there is a small padlock next to Obtain software. This means that the package is not "free" to download. When you press Obtain software you get this pop-up window:
BIOS update on HP Proliant DL 580 G4

So let's suppose that you don't have this. How do you get this file, or any other firmware update file for that particular old server? The answer is the Firmware update DVD. You can find one for every HP server. Just download it. It's an ISO file. You can burn that .iso file to a DVD, boot from it and update the BIOS ROM or some other firmware. Or you can mount the ISO file and find the one package that you need. Easy? It sounds easy, but... as I mentioned in the first part of this post, good preparation is what makes a BIOS ROM update a 5 minute job. Why? Because on the HP website for the HP DL 580 G4 you can download Firmware update DVD 10.10. It would be perfectly logical to find the BIOS ROM package CP009618.scexe there. But... there is no such file on that .iso.

[root@]# mount -t iso9660 FW1010.2012_0530.49.iso /mnt/ -o loop
[root@]# cd /mnt/
[root@]# find . -name CP009618.scexe
[root@]#


Why? My guess is that HP has to keep up with new hardware, and one way to do that is to reduce support for old hardware.

So... now what?

If you look closely, the version contains a date: the file CP009618.scexe is listed as version 2008.06.10 (2 Sep 2008). My idea was to find a Firmware update DVD from around 2008 and look for the file there. And....

[root]# mount -t iso9660 FW840.2009_0209.17.iso /mnt/ -o loop
[root@]# cd /mnt/
[root@]# find . -name CP009618.scexe
./compaq/swpackages/CP009618.scexe
[root@]#


This is on Firmware update DVD 8.4.
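
If you have several Firmware update DVD images lying around, the mount-and-find routine above can be repeated in a loop. The following is a sketch under a few assumptions: the function names find_component and search_isos are invented for this example, the loop needs root because of mount, and it relies only on the find and mount commands already used above.

```shell
#!/bin/bash
# Print the path of a component inside an already mounted DVD tree, if present.
# $1 = directory where the ISO is mounted, $2 = file name to look for.
find_component() {
    find "$1" -name "$2" -print
}

# Loop-mount each ISO in turn (needs root) and report which ones contain the component.
# $1 = file name to look for, remaining arguments = ISO files.
search_isos() {
    local name="$1" iso mnt hit
    shift
    mnt=$(mktemp -d) || return 1
    for iso in "$@"; do
        mount -t iso9660 -o loop,ro "$iso" "$mnt" || continue
        hit=$(find_component "$mnt" "$name")
        [ -n "$hit" ] && echo "$iso: $hit"
        umount "$mnt"
    done
    rmdir "$mnt"
}
```

For example: search_isos CP009618.scexe FW*.iso. Each hit is printed together with the ISO it was found in, so you know which DVD version to keep.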


There it was... The package for the BIOS ROM update for the HP Proliant DL 580 G4!
From now on it's a 5 minute job.
Copy the file to the target server and follow the instructions.

[root]# sh CP009618.scexe
Online ROM Flash Engine Version: Linux-3.5.0-0
Name: HP ProLiant DL580 G4 (P59)
Software Version: 06/10/2008

The software is installed but is not up to date.


Current Version: 08/10/2007



Do you want to upgrade the software to a newer version (y/n) ?y

Flash in progress do not interrupt or your system may become unusable.
Working.............................................
The installation procedure completed successfully.

A reboot is required to finish the installation completely.
Do you want to reboot your system now? yes
Do you want to reboot your system now? yes


For the BIOS ROM update to take effect, you have to reboot the server.

Wednesday 27 July 2016

Linux TAR restore files from tape unknown number of files, file size, block size and block number SOLVED

Recently I had a simple but time-consuming request, with very little information from the people the request came from! Do you also have the feeling that there is always too little information when you need to do something for someone else?

Linux TAR restore files from tape unknown number of files, file size, block size and block number


And the request was... (wait for it...) to restore files from old LTO2 tapes. There was no information whatsoever about how many files were on a tape, or anything else that could help me quickly restore those archives to a local file system. The only thing I knew was that the archives were made with old backup software that we don't have anymore, not with the tar command. Ok, I know what you will say: "Why don't you just restore the files one by one?" Well, I started by doing exactly that, but it turned out that there were many files of 5 to 10 MB in size on a 200 GB LTO2 tape.

How do you restore an archive from a tape that you know nothing about? Well, I used the dd command. It's easy. For this job dd needs two parameters: block size and block count.

dd if=/dev/tape_device of=restored_file bs=block_size count=block_count

The dd command is stupid: it only does what you tell it to do. It copies block by block from the input file (if) to the output file (of), starting at the current tape position, using the block size given in bs, and it copies as many blocks as you define in count.

How do you get bs and count for an unknown tape? Use the mt command and math! The mt command with the status option gives you the current block number. By positioning the tape at the beginning of a file and then at the beginning of the next file, you can get the block count of that file.
Here is an example for the first file on the tape:

Rewind the tape to the beginning:
#mt -f /dev/IBMtape0n rewind
#mt -f /dev/IBMtape0n status
SCSI 2 tape drive:
File number=-1, block number=1, partition=0.
Tape block size 0 bytes. Density code 0x44 (no translation).
Soft error count since last status=0
General status bits on (1000000):
 ONLINE
 
#:mt -f /dev/IBMtape0n asf 1
Position the tape at the beginning of the second file. Remember that the first file has position 0, the second position 1, etc.
#mt -f /dev/IBMtape0n status
SCSI 2 tape drive:
File number=-1, block number=23442, partition=0.
Tape block size 0 bytes. Density code 0x44 (no translation).
Soft error count since last status=0
General status bits on (1000000):
 ONLINE

By subtracting the block number at the beginning of the first file from the block number at the beginning of the second file, you get the block count of the first file.
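
With the two status outputs above, that is 23442 - 1 = 23441 blocks. The same arithmetic as a small sketch; the function names are made up for this example, and the size helper takes the block size in KB as a parameter because the block size depends on how the tape was written.

```shell
#!/bin/bash
# Extract the block number from `mt status` output read on stdin.
block_number() {
    sed -n 's/.*block number=\([0-9]*\).*/\1/p'
}

# Block count of a file.
# $1 = block number at the start of the file, $2 = block number at the start of the next file.
block_count() {
    echo $(( $2 - $1 ))
}

# Approximate file size in MB (rounded down).
# $1 = block count, $2 = block size in KB.
size_mb() {
    echo $(( $1 * $2 / 1024 ))
}
```

For example, mt -f /dev/IBMtape0n status | block_number prints just the number, which is easier to handle in a script than the full status text.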
Ok, we have block count! What about block size? As Harry Callahan would say "Do you feel lucky?". I used 64k block size and it worked for me.
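
If you would rather not rely on luck, one common trick is to read a single block with a deliberately large buffer and see how many bytes come back. This is only a sketch, not what I did on these tapes. It assumes the drive is in variable block mode (the "Tape block size 0 bytes" line in the status output above suggests that) and that the real block size is not larger than the 256k buffer; the function name probe_block_size is made up.

```shell
#!/bin/bash
# Read one block from the given device (or file) and print its size in bytes.
# In variable block mode a single read returns exactly one tape block,
# as long as the buffer (256k here) is at least as large as the block.
probe_block_size() {
    local src="$1" tmp size
    tmp=$(mktemp) || return 1
    dd if="$src" of="$tmp" bs=256k count=1 2>/dev/null
    size=$(wc -c < "$tmp" | tr -d ' ')
    rm -f "$tmp"
    echo "$size"
}
```

Note that the probe moves the tape forward by one block, so position the tape again with mt asf before starting the real restore.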

Ok, so now you can restore one file at a time! But can this go faster? Hell, yeah! Ok, I will explain this line by line:


#!/bin/bash
# Restore every file from the tape with dd. The block count of each file is
# computed from the block numbers that mt reports at the file boundaries.
TAPE=/dev/IBMtape0n

# Print the current block number, taken from the "block number=NNN," field of mt status.
block_number() {
    mt -f "$TAPE" status | grep number | awk '{print $4}' | awk -F',' '{print $1}' | awk -F'=' '{print $2}'
}

# Loop over files 0..100 on the tape. There can be more or fewer, it's up to your tape.
for i in {0..100}
do
    echo
    echo "$i"
    echo xxxxxxxxxxxxxxxxxxx
    # Position the tape at the beginning of file $i.
    mt -f "$TAPE" asf "$i"
    # Variable a is the block number at the beginning of file $i.
    a=$(block_number)
    date
    echo "$a"
    # Position the tape at the beginning of the next file ($i+1).
    mt -f "$TAPE" asf $((i+1))
    # Variable b is the block number at the beginning of file $i+1.
    b=$(block_number)
    echo "$b"
    date
    # Variable c is the block count of file $i.
    c=$((b - a))
    echo "$c"
    # Go back to the beginning of file $i.
    mt -f "$TAPE" asf "$i"
    # Print the file size in MB, assuming 64k blocks.
    echo "Start restore. File size in MB is"
    echo $((c * 64 / 1024))
    # Start the restore.
    dd if="$TAPE" of="backup_$i" bs=64k count="$c"
    echo Restore is done
    echo xxxxxxxxxxxxxxxxxxx
    echo
done

I hope that this will save you time!

Wednesday 23 March 2016

IBM HMC Filesystems to run fsck manually - SOLVED and EXPLAINED

Ok, we have IBM pSeries 570 servers managed by an HMC, version V7R3.5.0.4. After a reboot, the HMC failed to respond. When we finally got to the location, it turned out that the HMC boot was stuck at the file system check. The boot had stopped because a file system was corrupted and fsck had to be run manually on the problematic system partition.

Is HMC running on Linux or AIX?

The HMC runs on Linux. This version of the HMC runs on some (I think) Red Hat-like Linux distro. So when you think about the HMC as a server, it's just a Linux server with the HMC application. In our case this is

localhost:~> uname -a
Linux localhost 2.6.16.46-211-smp #1 SMP Mon May 10 23:14:53 UTC 2010 i686 i686 i386 GNU/Linux



HMC after reboot filesystem check manually

 

This was the screenshot that we had:
hmc filesystem check failed reboot boot fsck

As we can see from the picture above, the problem was a corrupted file system on the /dev/sda3 device, which is mounted on /var. As suggested, you need to RUN fsck MANUALLY. To do this you first need the root password. In case you don't have the root password, just reset it. I think I wrote a post about that. Ok, so you are logged in and you need to do a file system check on your HMC server.
Run this command:

localhost:~>fsck.ext3 -y /dev/sda3

The -y option answers yes automatically whenever fsck finds an inode that needs correction.

It looks something like this:

If everything is good and the file system check does not return a Segmentation fault or error 11, just reboot the server and everything should be fine. Your HMC server will boot normally.
But what if you get a Segmentation fault or error 11 after running fsck?
So, what now? We tried rebooting a few times, but the file system error persisted.
Here is one answer from an IBM support page:

Question


When prompted to manually run fsck on HMC, what filesystems should be specified?

Cause


HMC filesystem corruption.

Answer


When prompted to run a manual fsck on the HMC, fsck should be run against
the following ext3 filesystems:
/dev/sda2 / ext3
/dev/sda3 /var ext3
/dev/sda7 /dump ext3
/dev/sda8 /extra ext3
/dev/sda6 /hmcdump ext3
/dev/sda5 /mnt/upgrade ext3

If fsck fails to run cleanly for any of the filesystems,
it will be necessary to reinstall the HMC for full recovery.

So, basically this means reinstallation. Do you want to do this? I don't.

HMC Filesystems to run fsck manually - SOLVED and EXPLAINED

So you can call IBM support, or you can think a little differently. First, this is a Linux fsck issue, so we started searching Google for a possible solution. I found here (https://bugzilla.redhat.com/show_bug.cgi?id=607843) that this is an e2fsprogs bug. e2fsprogs is the package that provides file system checking tools such as e2fsck and fsck.ext3. The HMC server has e2fsprogs 1.38, and this bug is in it. So you need a more recent version of e2fsprogs to run fsck on /dev/sda3 and boot the HMC server normally. To do this, you have two options:

  1. update e2fsprogs package
  2. boot rescue cd and run fsck
We chose option 2. For rescue purposes I use SystemRescue CD. It's really easy to use, fast and I just like it.
Depending on the hardware, you need to boot from CD/DVD. This HMC is on an IBM x3550 system. During startup, when you are prompted to choose the boot order, press F12. You will see the possible boot devices. Choose CD. Press Enter.
(READ THIS SECTION AND SAVE 20 MINUTES OF YOUR LIFE)
Wait. After 10-15 seconds, the server returned to the boot device menu. Repeat. Same thing happened. First guess: faulty CD media. I inserted another SystemRescue CD, and the same thing happened. Ok, two faulty CDs, it's possible, right? We found an Ubuntu 10.x server installation CD. Inserted that CD, repeated the boot from CD process, and the same thing happened. Ok, three faulty CDs. Hm... Maybe something else is the problem here? So I inserted a CentOS 5.4 installation DVD. And it started to boot from it. Possible reason? We think that for some reason the server (even at this basic, pre-OS level) is unable to boot from some Linux CDs/DVDs such as Ubuntu or SystemRescue.
We booted CentOS 5.4 in rescue mode. How do you enter rescue mode? Search the Internet. When rescue mode boots, the file system partitions of the server are mounted under /mnt/sysimage/mount_point.
We have a file system check issue on /dev/sda3, and its mount point is /var. Because /dev/sda3 is already mounted under /mnt/sysimage/var, you need to unmount it before you run fsck manually.
An important thing to know is that CentOS 5.4 was a bad choice, because it has e2fsprogs version 1.38 and the same bug appears during the fsck run.
So you need a newer rescue system. We inserted Red Hat Enterprise Linux 6.0 and repeated the rescue boot process. Type these commands:

1. Unmount the partition with problems. In our case this is /dev/sda3, which has the /var mount point on the HMC server and is mounted on /mnt/sysimage/var in the running rescue OS.
rescue#:umount /mnt/sysimage/var
2. Run the file system check.
rescue#:fsck.ext3 -y /dev/sda3
When you execute this command you will see the version of e2fsprogs. On RHEL 6.0 the e2fsprogs version is 1.41.
During execution, fsck fixes all inode errors, and the last output should be something like Errors fixed.
3. Reboot the server. Eject the CD/DVD media. Watch the HMC server boot normally.
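
The same two steps apply to the other ext3 partitions from the IBM answer quoted earlier. Here is a small sketch that only prints the commands for a given mount point, so you can read them before typing them in the rescue shell. The function name fsck_plan and the SYSIMAGE variable are made up for this example; the device list is copied from the IBM answer.

```shell
#!/bin/bash
# Partitions and mount points as listed in the IBM answer (device:mount point).
HMC_FILESYSTEMS="/dev/sda2:/ /dev/sda3:/var /dev/sda7:/dump /dev/sda8:/extra /dev/sda6:/hmcdump /dev/sda5:/mnt/upgrade"

# Print (not run) the umount and fsck commands for one HMC mount point.
# SYSIMAGE is where the rescue system mounted the HMC disks (default /mnt/sysimage).
fsck_plan() {
    local wanted="$1" sysimage="${SYSIMAGE:-/mnt/sysimage}" entry dev mp
    for entry in $HMC_FILESYSTEMS; do
        dev="${entry%%:*}"
        mp="${entry#*:}"
        if [ "$mp" = "$wanted" ]; then
            echo "umount ${sysimage}${mp}"
            echo "fsck.ext3 -y ${dev}"
            return 0
        fi
    done
    echo "unknown mount point: $wanted" >&2
    return 1
}
```

For example, fsck_plan /var prints exactly the two commands used above. If the rescue system did not mount the partitions at all, only the fsck line is needed.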
I really hope that this will save you some time and money!


28.9.2017. Appendix

Today I had a situation that needed a file system check on the / partition, which is on /dev/sda2. A small tip that will save you some time: when entering rescue mode in RHEL 6.0 (I used this one, you can try a newer version), you will be asked whether you want to mount the partitions from the hard disk. It is important to press SKIP in this step. Why? If you choose to mount the partitions from the hard disk into the rescue system, / (/dev/sda2) will be mounted on /mnt/sysimage/, /var (/dev/sda3) will be mounted on /mnt/sysimage/var, etc. The problem is that /sys, /proc and /dev from the rescue system are also mounted under /mnt/sysimage/. To do the file system check you need to unmount the problematic partition. You cannot unmount /mnt/sysimage because /proc, /sys and /dev are mounted there and you can't unmount them! This is why it's important to choose SKIP when you are asked this question!
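
If you forgot to press SKIP and want to see what is in the way, you can list everything that is mounted below /mnt/sysimage. A small sketch, assuming the usual /proc/mounts format where the mount point is the second field; the function name mounts_below is made up for this example.

```shell
#!/bin/bash
# Print every mount point located below a directory.
# Reads lines in /proc/mounts format on stdin. $1 = directory, e.g. /mnt/sysimage.
mounts_below() {
    awk -v top="${1%/}" 'index($2, top "/") == 1 { print $2 }'
}
```

In the rescue shell: mounts_below /mnt/sysimage < /proc/mounts. As long as this prints anything, umount /mnt/sysimage will fail.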