Wednesday, 23 March 2016

IBM HMC Filesystems to run fsck manually - SOLVED and EXPLAINED

OK, we have IBM pSeries 570 servers managed by an HMC running version V7R3.5.0.4. After a reboot, the HMC stopped responding. When we finally got on site, we found that the boot had stopped at the file system check: a file system was corrupted, and fsck had to be run manually on the affected system partition.

Is HMC running on Linux or AIX?

HMC runs on Linux. This version of HMC runs on some (I think) Red Hat-like Linux distro. So when you think of the HMC as a server, it's just a Linux server with the HMC application on top. In our case this is:

localhost:~> uname -a
Linux localhost 2.6.16.46-211-smp #1 SMP Mon May 10 23:14:53 UTC 2010 i686 i686 i386 GNU/Linux



Manual filesystem check on HMC after reboot

 

This is the screen that we saw:
[Screenshot: HMC boot stopped at a failed filesystem check, asking for fsck to be run manually]

As we can see from the picture above, the problem was a corrupted file system on the /dev/sda3 device, which is mounted on /var. As the message suggested, you need to RUN fsck MANUALLY. To do this you first need the root password. If you don't have the root password, just reset it; I think I wrote a post about that. OK, so you are logged in and you need to run a file system check on your HMC server.
Run this command:

localhost:~>fsck.ext3 -y /dev/sda3

The -y option answers "yes" automatically to every question fsck asks, for example when it finds an inode that needs correction.

This looks something like this:

If everything goes well and the file system check does not end with a Segmentation fault (error 11), just reboot the server and everything should be fine. Your HMC server will boot normally.
But what if you do get a Segmentation fault or error 11 after running fsck?
So, what now? We tried rebooting a few times, but the file system error persisted.
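
To tell the clean case from the crash case without reading the screen, one option is to look at the exit status of fsck. The sketch below is not part of the original troubleshooting session, and describe_fsck_status is a hypothetical helper name. It decodes the bit values listed in the fsck man page (1, 2, 4, 8, 16, 32, 128) and assumes the usual shell convention that a process killed by signal 11 is reported as status 139 (128 + 11).

```shell
#!/bin/sh
# Hypothetical helper: turn an fsck exit status into a readable message.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "clean: no errors"
        return 0
    fi
    # Shells report "killed by signal N" as 128 + N; SIGSEGV is signal 11.
    if [ "$status" -eq 139 ]; then
        echo "crashed: segmentation fault (signal 11)"
        return 0
    fi
    msg=""
    # fsck's exit status is a sum of the bit values below (see fsck(8)).
    if [ $((status & 1)) -ne 0 ]; then msg="$msg errors-corrected"; fi
    if [ $((status & 2)) -ne 0 ]; then msg="$msg reboot-needed"; fi
    if [ $((status & 4)) -ne 0 ]; then msg="$msg errors-left-uncorrected"; fi
    if [ $((status & 8)) -ne 0 ]; then msg="$msg operational-error"; fi
    if [ $((status & 16)) -ne 0 ]; then msg="$msg usage-error"; fi
    if [ $((status & 32)) -ne 0 ]; then msg="$msg cancelled"; fi
    if [ $((status & 128)) -ne 0 ]; then msg="$msg shared-library-error"; fi
    echo "${msg# }"
}

# Example: on the HMC one would run
#   fsck.ext3 -y /dev/sda3; describe_fsck_status $?
describe_fsck_status 1
describe_fsck_status 139
```

A status that includes errors-left-uncorrected, or the crash case, is the situation the rest of this post deals with.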
IBM's support site gives this answer:

Question


When prompted to manually run fsck on HMC, what filesystems should be specified?

Cause


HMC filesystem corruption.

Answer


When prompted to run a manual fsck on the HMC, fsck should be run against
the following ext3 filesystems:
/dev/sda2 / ext3
/dev/sda3 /var ext3
/dev/sda7 /dump ext3
/dev/sda8 /extra ext3
/dev/sda6 /hmcdump ext3
/dev/sda5 /mnt/upgrade ext3

If fsck fails to run cleanly for any of the filesystems,
it will be necessary to reinstall the HMC for full recovery.

So, basically this means reinstallation. Do you want to do this? I don't.
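
For reference, checking every filesystem from IBM's list can be sketched as a small loop. This is only an illustration: the device and mount point pairs are copied from the IBM answer above, while check_hmc_filesystems and the DRY_RUN switch are hypothetical names introduced here. By default the loop only prints the commands, so nothing is touched until DRY_RUN is set to 0.

```shell
#!/bin/sh
# Device:mount-point pairs copied from the IBM answer quoted above.
HMC_FILESYSTEMS="/dev/sda2:/
/dev/sda3:/var
/dev/sda7:/dump
/dev/sda8:/extra
/dev/sda6:/hmcdump
/dev/sda5:/mnt/upgrade"

# Print (DRY_RUN=1, the default) or run fsck.ext3 for each device.
check_hmc_filesystems() {
    printf '%s\n' "$HMC_FILESYSTEMS" | while IFS=: read -r dev mnt; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "fsck.ext3 -y $dev # $mnt"
        else
            # </dev/null keeps fsck from consuming the list being read by the loop.
            fsck.ext3 -y "$dev" </dev/null ||
                echo "fsck of $dev ($mnt) exited with status $?" >&2
        fi
    done
}

check_hmc_filesystems
```

The filesystems must not be mounted read-write while they are checked, which is why this only makes sense from the maintenance prompt or a rescue system.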

HMC Filesystems to run fsck manually - SOLVED and EXPLAINED

So you can call IBM support, or you can think a little differently. First, this is a Linux fsck issue, so we started searching Google for a possible solution. I found here (https://bugzilla.redhat.com/show_bug.cgi?id=607843) that this is an e2fsprogs bug. e2fsprogs is the package that provides the ext2/ext3 file system checking tools such as e2fsck and fsck.ext3. The HMC server has e2fsprogs 1.38, and the bug is in that version. So you need a more recent version of e2fsprogs to run fsck on /dev/sda3 and boot the HMC server normally. To do this, you have two options:

  1. update e2fsprogs package
  2. boot rescue cd and run fsck
We chose option 2. For rescue purposes I use SystemRescueCd. It's really easy to use, fast, and I just like it.
Depending on the hardware, you need to boot from CD/DVD. This HMC runs on an IBM x3550 system. During startup, press F12 when you are prompted to choose the boot device. You will see the possible boot devices. Choose CD and press Enter.
(READ THIS SECTION AND SAVE 20 MINUTES OF YOUR LIFE)
Wait. After 10-15 seconds, the server returns to the boot device menu. Repeat. The same thing happens. First guess: faulty CD media. I inserted another SystemRescueCd, and the same thing happened. OK, two faulty CDs, it's possible, right? We found an Ubuntu 10.x server installation CD, inserted it, and repeated the boot-from-CD process: same result. OK, three faulty CDs. Hm... Maybe something else is the problem here? So I inserted a CentOS 5.4 installation DVD, and the server started to boot from it. Possible reason? We think that, for some reason, the server (even at this basic, pre-OS level) is unable to boot certain Linux CD/DVD images such as Ubuntu and SystemRescueCd, while Red Hat-style installation media work.
We booted CentOS 5.4 in rescue mode. How do you enter rescue mode? On CentOS 5 media, type linux rescue at the boot: prompt; for other releases, search the Internet. While booting into rescue mode, the file system partitions on the server are mounted under /mnt/sysimage/mount_point.
We have the file system check issue on /dev/sda3, whose mount point is /var. Because /dev/sda3 is already mounted under /mnt/sysimage/var, you need to unmount it before you run fsck manually.
An important thing to know is that CentOS 5.4 was a bad choice, because it also ships e2fsprogs 1.38, and the same bug hits during the fsck run.
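
To avoid finding this out halfway through, the e2fsprogs version of a rescue system can be checked before running fsck. The sketch below is an illustration, not something we ran at the time: parse_e2fsck_version and version_at_least are hypothetical helper names, and the 1.41 threshold is only an assumption taken from this post (1.38 crashed for us, the 1.41 on RHEL 6.0 worked); the exact release that fixed the bug is not known to me.

```shell
#!/bin/sh
# Extract "1.41.12" from a banner line such as "e2fsck 1.41.12 (17-May-2010)".
parse_e2fsck_version() {
    printf '%s\n' "$1" | sed -n 's/^e2fsck \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if the major.minor part of $1 is at least the major.minor part of $2.
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}
    if [ "$have_major" -gt "$want_major" ]; then return 0; fi
    if [ "$have_major" -lt "$want_major" ]; then return 1; fi
    [ "$have_minor" -ge "$want_minor" ]
}

# Only query the real tool when it is installed; e2fsck -V prints its version.
if command -v e2fsck >/dev/null 2>&1; then
    banner=$(e2fsck -V 2>&1)
    ver=$(parse_e2fsck_version "$banner")
    # 1.41 is an assumed threshold based on this post, not a documented fix version.
    if [ -n "$ver" ] && version_at_least "$ver" 1.41; then
        echo "e2fsprogs $ver: newer than the 1.38 that crashed on the HMC"
    else
        echo "e2fsprogs ${ver:-unknown}: may still hit the segfault"
    fi
fi
```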
So you need a rescue system with a newer version. We inserted a Red Hat Enterprise Linux 6.0 DVD and repeated the rescue boot process. Type these commands:

1. Unmount the partition with problems. In our case this is /dev/sda3, whose mount point is /var on the HMC server and /mnt/sysimage/var in the running rescue OS.
rescue# umount /mnt/sysimage/var
2. Run the file system check.
rescue# fsck.ext3 -y /dev/sda3
When you execute this command you will see the version of e2fsprogs. On RHEL 6.0 the e2fsprogs version is 1.41.
During execution of this command, fsck will fix the inode errors, and the last output should be something like Errors fixed.
Reboot the server. Eject the CD/DVD media. Watch the HMC server boot normally.
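
The two steps above can be sketched as a small helper that first looks up where the device is mounted and then prints the commands to run. This is a hedged illustration: mount_point_of and plan_fsck are hypothetical names, the mount table is assumed to be in /proc/mounts format, and the function only prints the commands instead of executing them.

```shell
#!/bin/sh
# Print the mount point of device $1, or nothing if it is not mounted.
# $2 is a mount table in /proc/mounts format (defaults to /proc/mounts).
mount_point_of() {
    awk -v dev="$1" '$1 == dev { print $2; exit }' "${2:-/proc/mounts}"
}

# Print the commands needed to check device $1: umount first if it is mounted.
plan_fsck() {
    dev=$1
    table=${2:-/proc/mounts}
    mp=$(mount_point_of "$dev" "$table")
    if [ -n "$mp" ]; then
        echo "umount $mp"
    fi
    echo "fsck.ext3 -y $dev"
}

# Example run against the live mount table, when one is available.
if [ -r /proc/mounts ]; then
    plan_fsck /dev/sda3
fi
```

In the rescue environment described here, this would print the umount of /mnt/sysimage/var followed by the fsck.ext3 command.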
I really hope that this will save you some time and money!


28.9.2017. Appendix

Today I had a situation that needed a file system check on the / partition, which is on /dev/sda2. A small tip that will save you some time: when entering rescue mode in RHEL 6.0 (I used this one; you can try a release newer than 6.0), you will be asked whether you want to mount the partitions from the hard disk. It is important to choose SKIP at this step. Why? If you let the rescue system mount the partitions from the hard disk, / (/dev/sda2) will be mounted on /mnt/sysimage/, /var (/dev/sda3) will be mounted on /mnt/sysimage/var, and so on. The problem is that /sys, /proc and /dev from the rescue system are also mounted under /mnt/sysimage/. To do the file system check you need to unmount the problematic partition, but you cannot unmount /mnt/sysimage because /proc, /sys and /dev are mounted there, and you can't unmount them! This is why it's important to choose SKIP when this question comes up!
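
If you did let the rescue system mount everything, a quick way to see what is in the way is to list the mounts nested under /mnt/sysimage. The sketch below is an illustration under assumptions: nested_mounts is a hypothetical helper name, and the mount table is assumed to be in /proc/mounts format.

```shell
#!/bin/sh
# List mount points nested under mount point $1.
# $2 is a mount table in /proc/mounts format (defaults to /proc/mounts).
nested_mounts() {
    awk -v root="$1" 'index($2, root "/") == 1 { print $2 }' "${2:-/proc/mounts}"
}

# Example run against the live mount table, when one is available.
if [ -r /proc/mounts ]; then
    nested_mounts /mnt/sysimage
fi
```

Any line printed here is a mount sitting on top of /mnt/sysimage, which is why the umount of / is refused; choosing SKIP avoids the situation altogether.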