08 April 2012

Problem (and workaround) - Oracle Enterprise Linux 6.2 and upgrading to UEK 2 with multipath devices

I went ahead and upgraded the Unbreakable Enterprise Kernel from version 1 to version 2 today, but afterwards the system refused to boot. Shortly after the "Welcome to Oracle Enterprise Linux" banner the boot halted. The first error was "wwids file is read-only. Can't write wwid", and then the boot process bombed out - it never (re-)mounted the root filesystem read-write.

I was left with the infamous enter root password or Control-D to continue prompt (it's been a while since I last saw that - yikes!).

So I logged in as root and then ran mount -o remount,rw /, which mounted the root filesystem in read-write mode.
The wwid errors previously logged on the screen suggested that the issue was to do with multipath.

I therefore tried to stop the multipathd daemon with "service multipathd stop", which on the face of it failed: it complained that a filesystem (root) was already mounted on a multipath device. As it turns out, that error message was also the hint to what was wrong. However, a ps at this stage revealed that the multipathd daemon wasn't running.

Checking /etc/multipath/wwids, I saw only my expected three (non-boot/root) disks.

Then I started the multipathd daemon again with "service multipathd start" - and lo and behold, the wwids and bindings files were updated with new WWIDs: those for the disk containing root/boot.
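The steps up to this point can be summarised as a small shell sketch. The commands are the ones described above; the wrapper names (run, recover_multipath_boot) and the DRY_RUN switch are made up for illustration and are not part of any tool. Treat it as a rough outline under those assumptions rather than a tested recovery script.

```shell
#!/bin/sh
# Sketch of the recovery sequence from the maintenance shell.
# DRY_RUN=1 prints each command instead of executing it (illustrative helper).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_multipath_boot() {
    # Make the root filesystem writable so the wwids file can be updated.
    run mount -o remount,rw /
    # This reported a failure in my case (root is on a multipath device),
    # so carry on regardless of the exit status.
    run service multipathd stop || true
    # Starting the daemon lets it record the newly discovered WWIDs.
    run service multipathd start
    # Inspect the result; the new root/boot WWIDs should now be listed.
    run cat /etc/multipath/wwids
}
```

Running DRY_RUN=1 recover_multipath_boot only echoes the four commands, which is a safe way to check the order before doing it for real as root.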

So, something has changed in the kernel between UEK1 and UEK2 such that certain (SATA-based in my case) disks are now considered for multipath, and the multipath library tries to write the newly discovered WWIDs to the wwids file. However, because it's early in the boot process, the root filesystem is not yet mounted read-write, causing this to fail.

After Googling for a few seconds I found this: http://www.redhat.com/archives/dm-devel/2011-July/msg00102.html (option 3), which describes exactly this behaviour (though here it happens too early in the boot process to succeed, given the read-only root filesystem). It also highlights a chicken-and-egg problem with multipath when new multipath devices (for root/boot) are suddenly discovered after a kernel upgrade.

On a final note, I could probably have blacklisted the root/boot device in /etc/multipath.conf, but I didn't bother looking into this once the issue was resolved by letting multipathd write the new WWIDs. However, if a failed drive needs to be replaced, the replacement will have a different WWID, so this could presumably happen again.
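For reference, a blacklist entry in /etc/multipath.conf would look roughly like the fragment below. This is untested on the system described here, and the WWID shown is a placeholder, not a real value.

```
# /etc/multipath.conf (fragment)
blacklist {
    # Placeholder: replace with the WWID of the root/boot disk
    wwid "<wwid-of-root-disk>"
}
```

Blacklisting by WWID pins the exclusion to one specific disk, so a replacement drive with a different WWID would need its own entry.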

1 comment:

Gino said...

Replying here since it was the first result in google:

You need to rebuild the initramfs file with dracut (with the multipath directories present).

Once your server is booted and all paths discovered correctly do:

dracut --force --add multipath --include /etc/multipath /etc/multipath