* sdb failure - mdadm: no devices found for /dev/md0
From: Andy Bailey @ 2009-07-27 16:19 UTC
  To: linux-raid

Recently we had a disk failure on a SATA disk, /dev/sdb. It was in a
mirror with /dev/sda: md0 held /boot and root combined, and swap was
mirrored as well. The system has Fedora Core 10 installed with recent
updates to the kernel and mdadm tools.

My plan for the disk swap is below; we got as far as step 4, rebooting.

On reboot, grub displayed only a grub prompt on a black screen, i.e. no
grub boot menu.

We changed back to the failing sdb disk.

The grub menu appeared; however, upon booting we got:
mdadm: no devices found for /dev/md0
mdadm: /dev/md2 has been started with 1 drive (out of 2)

and other messages that I didn't note down verbatim:
bad superblock on /dev/md0
/dev/root device does not exist

I could boot from the rescue disk; it detected the Linux installation
and mounted it fine, and mdstat looked fine (see the second mdstat
output below).

The only "weird" thing that happened to md0, and not to the other
devices, is that when the sdb disk started to fail I did

mdadm /dev/md0 --grow -n 3

and added to it another partition from sdb that I had failed and
removed from another raid array. I didn't zero the superblock of the
third partition before adding it; I didn't think it was necessary.
Could that be the problem?
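
In hindsight, a minimal sketch of what I think the sequence should have
looked like (sdb11 being the re-used partition; note that
--zero-superblock destroys the md metadata on that partition, so the
device name needs double-checking first):

# check which array the stale superblock still claims to belong to
mdadm --examine /dev/sdb11
# wipe the old md metadata before re-using the partition
mdadm --zero-superblock /dev/sdb11
# then grow the mirror to 3 devices and add the cleaned partition
mdadm /dev/md0 --grow -n 3
mdadm /dev/md0 --add /dev/sdb11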

All the partitions in mdadm.conf were specified by UUID.

Is it possible that the UUID somehow changed from the value expected by
the initrd's mdadm.conf?
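
A sketch of how I would check, assuming a gzipped cpio initrd and GNU
cpio (paths as seen from the rescue environment, where the real /boot
is mounted under /mnt/boot):

# UUID actually stored in the md superblock on disk
mdadm --examine /dev/sda1 | grep UUID
# UUID the initrd's embedded mdadm.conf expects for md0
zcat /mnt/boot/initrd-2.6.27.24-170.2.68.fc10.x86_64.img | \
    cpio -i --to-stdout '*etc/mdadm.conf' 2>/dev/null | grep md0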

I tried adding the kernel argument md=0,/dev/sda1,/dev/sdb1; no change.

Does this override the initrd's mdadm.conf? If not, why not?
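
For reference, this is roughly what the grub kernel line looked like
with that argument added (root=/dev/md0 matches the boot+root-on-md0
layout described above; the exact line is from memory):

kernel /boot/vmlinuz-2.6.27.24-170.2.68.fc10.x86_64 ro root=/dev/md0 md=0,/dev/sda1,/dev/sdb1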

I tried remaking the initrd from the rescue disk:

chroot /mnt/sysimage
cd /boot
mkinitrd initrdraid {kernel version}

The mkinitrd script didn't create anything and printed no error
message, so I never got to test this out.
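
If I attempt this again, a sketch of what I would try from the rescue
shell; the bind mounts are only my guess at why mkinitrd produced
nothing (a chroot without a live /dev or /proc), and -f/-v are just
force and verbose:

mount --bind /dev  /mnt/sysimage/dev
mount --bind /proc /mnt/sysimage/proc
mount --bind /sys  /mnt/sysimage/sys
chroot /mnt/sysimage /bin/bash
cd /boot
mkinitrd -v -f initrdraid-2.6.27.24-170.2.68.fc10.x86_64.img 2.6.27.24-170.2.68.fc10.x86_64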

When I didn't have the rescue disk look for the root partitions:

I could create the mdadm.conf on the rescue root using
mdadm --examine --scan --config=partitions > /etc/mdadm.conf
mdadm -Av /dev/md0

The md0 array appeared with the sda1 partition and could be fsck'd and
mounted.
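
Concretely, something like this (the mount point is arbitrary):

mkdir -p /mnt/root
fsck -n /dev/md0          # read-only check first
mount /dev/md0 /mnt/root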

Workaround (after two days of work): recover from backup!

For now the system has the boot+root partition and swap off raid, until
I can figure out what went wrong.

Can anyone shed some light on what could have happened?

Specifically, how is it that swapping the failing sdb for a new sdb,
and then putting the failing sdb back again, can cause a problem?

I took photos of the screen if anyone needs more info, and I have
backups of root.

Thanks in advance,

Andy Bailey

--------------------------------------------------------------------
Plan

mdadm --set-faulty /dev/md0 /dev/sdb1
mdadm --set-faulty /dev/md0 /dev/sdb11

mdadm --set-faulty /dev/md1 /dev/sdb2
mdadm --set-faulty /dev/md2 /dev/sdb3
mdadm --set-faulty /dev/md3 /dev/sdb6
mdadm --set-faulty /dev/md4 /dev/sdb5
mdadm --set-faulty /dev/md5 /dev/sdb10

mdadm --set-faulty /dev/md6 /dev/sdb9
mdadm --set-faulty /dev/md7 /dev/sdb8


mdadm --remove /dev/md0 /dev/sdb1
mdadm --remove /dev/md0 /dev/sdb11

mdadm --remove /dev/md1 /dev/sdb2
mdadm --remove /dev/md2 /dev/sdb3
mdadm --remove /dev/md3 /dev/sdb6
mdadm --remove /dev/md4 /dev/sdb5
mdadm --remove /dev/md5 /dev/sdb10

mdadm --remove /dev/md6 /dev/sdb9
mdadm --remove /dev/md7 /dev/sdb8

grep sdb /proc/mdstat
check that nothing appears

poweroff

3 swap in the new disk for the old one in sata slot 1

4 check that the bios detects the disk

5 boot to multiuser

6 as root

sfdisk /dev/sdb < /root/sfdisk.sdb

fdisk /dev/sdb
command: p
check that the partition types are "fd"
(option t, partition #, hexadecimal code fd)


mdadm --add /dev/md0 /dev/sdb1

mdadm --add /dev/md1 /dev/sdb2
mdadm --add /dev/md2 /dev/sdb3
mdadm --add /dev/md3 /dev/sdb6
mdadm --add /dev/md4 /dev/sdb5
mdadm --add /dev/md5 /dev/sdb10


monitor with
watch "cat /proc/mdstat"

when the above have finished resyncing:

mdadm --add /dev/md6 /dev/sdb9
mdadm --add /dev/md7 /dev/sdb8
mdadm --add /dev/md8 /dev/sdb7
mdadm --add /dev/md9 /dev/sdb11

---------------------------------------------------------------------------------------
This is the first mail message from the mdadm monitor after we failed sdb1.
P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4] 
md3 : active raid1 sdb6[1] sda6[0]
      102398208 blocks [2/2] [UU]
      
md4 : active raid1 sda5[0] sdb5[1]
      102398208 blocks [2/2] [UU]
      
md5 : active raid1 sda10[0] sdb10[1]
      20482752 blocks [2/2] [UU]
      
md6 : active raid1 sda9[0] sdb9[1]
      51199040 blocks [2/2] [UU]
      
md7 : active raid1 sda8[0] sdb8[1]
      51199040 blocks [2/2] [UU]
      bitmap: 0/196 pages [0KB], 128KB chunk

md8 : active raid1 sda7[0]
      51199040 blocks [2/1] [U_]
      bitmap: 0/196 pages [0KB], 128KB chunk

md9 : active raid1 sda11[0]
      35784640 blocks [2/1] [U_]
      bitmap: 2/137 pages [8KB], 128KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      30716160 blocks [2/2] [UU]
      
md2 : active raid1 sda3[0] sdb3[1]
      12289600 blocks [2/2] [UU]
      
md0 : active raid1 sdb11[2] sda1[0] sdb1[3](F)
      30716160 blocks [3/2] [U_U]
-----------------------------------------------
This is the last message after failing all sdb partitions

Personalities : [raid1] [raid6] [raid5] [raid4] 
md3 : active raid1 sdb6[2](F) sda6[0]
      102398208 blocks [2/1] [U_]
      
md4 : active raid1 sda5[0] sdb5[2](F)
      102398208 blocks [2/1] [U_]
      
md5 : active raid1 sda10[0] sdb10[2](F)
      20482752 blocks [2/1] [U_]
      
md6 : active raid1 sda9[0] sdb9[2](F)
      51199040 blocks [2/1] [U_]
      
md7 : active raid1 sda8[0] sdb8[1](F)
      51199040 blocks [2/1] [U_]
      bitmap: 0/196 pages [0KB], 128KB chunk

md8 : active raid1 sda7[0]
      51199040 blocks [2/1] [U_]
      bitmap: 0/196 pages [0KB], 128KB chunk

md9 : active raid1 sda11[0]
      35784640 blocks [2/1] [U_]
      bitmap: 2/137 pages [8KB], 128KB chunk

md1 : active raid1 sda2[0] sdb2[2](F)
      30716160 blocks [2/1] [U_]
      
md2 : active raid1 sda3[0] sdb3[2](F)
      12289600 blocks [2/1] [U_]
      
md0 : active raid1 sdb11[3](F) sda1[0] sdb1[4](F)
      30716160 blocks [3/1] [U__]
      
unused devices: <none>





* Re: sdb failure - mdadm: no devices found for /dev/md0
From: Andy Bailey @ 2009-07-28 13:05 UTC
  To: linux-raid

More info:

In the initrd, mdadm.conf has:
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90
UUID=392f9510:37d5d89a:d9143456:0dceb00d

In /etc/mdadm.conf (on that root partition):
ARRAY /dev/md0 level=raid1 num-devices=2
UUID=2d10ee45:1a407729:ec85deb0:7e9ea950

The rest of the UUID entries are identical in the two mdadm.confs.
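
For completeness, a sketch of how I plan to bring the two copies back
in sync once the arrays are assembled correctly (not something that was
run at the time):

# print ARRAY lines with the UUIDs of the arrays as actually assembled
mdadm --detail --scan
# after fixing the md0 line in /etc/mdadm.conf to match, rebuild the
# initrd so that its embedded copy agrees
mkinitrd -f /boot/initrd-2.6.27.24-170.2.68.fc10.x86_64.img 2.6.27.24-170.2.68.fc10.x86_64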

So it's now clear what the problem was.

However, I still don't know how the problem happened.

What could have caused the UUID for the root filesystem to change?
Anaconda installed it to md0, and the only thing we had done since was
grow the array to 3 devices and add a third partition, as I mentioned
in my previous email.

Could yum have installed a kernel upgrade that changed the UUID?

Also, why didn't the kernel option md=0,/dev/sda1,/dev/sdb1 override
the settings in mdadm.conf?

Is the syntax correct?
If it's not correct, doesn't the kernel print an error message?

The kernel we were running was 2.6.27.24-170.2.68.fc10.x86_64

It also doesn't explain why mkinitrd did nothing when run from the
rescue disk: no delay, no message, no file.

On the new system it works as expected:

mkinitrd test 2.6.27.24-170.2.68.fc10.x86_64

creates the initrd "test" in the current directory, and if there's a
typo it reports:

No modules available for kernel ....

Any ideas?

Thanks in advance

Andy Bailey

--------------------------------------------------------------------
[root@servidor cron.daily]# cd /mnt/boot/

[root@servidor boot]# zcat initrd-2.6.27.24-170.2.68.fc10.x86_64.img > /tmp/initrd
[root@servidor boot]# cd /tmp/
[root@servidor tmp]# mkdir init
[root@servidor tmp]# file initrd 
initrd: ASCII cpio archive (SVR4 with no CRC)
[root@servidor tmp]# man cpio
[root@servidor tmp]# cd init
[root@servidor init]# cpio -i --make-directories < ../initrd 
15858 blocks
[root@servidor init]# cd etc

[root@servidor etc]# cat mdadm.conf 

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root

ARRAY /dev/md1 level=raid1 num-devices=2 metadata=0.90
UUID=ed127569:52ff37b7:da473d79:03ebe556
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90
UUID=392f9510:37d5d89a:d9143456:0dceb00d
ARRAY /dev/md2 level=raid1 num-devices=2 metadata=0.90
UUID=c964f385:5b6e5d18:2d631c3c:e70b1b22
ARRAY /dev/md9 level=raid1 num-devices=2 metadata=0.90
UUID=1ab3d6b5:575a8a2b:dfa789fa:805a401f
ARRAY /dev/md8 level=raid1 num-devices=2 metadata=0.90
UUID=318d1ff4:dd2b9ccc:7effe40e:a3e412fc
ARRAY /dev/md7 level=raid1 num-devices=2 metadata=0.90
UUID=0114a14e:b6a45be8:3a719518:7cb31784
ARRAY /dev/md6 level=raid1 num-devices=2 metadata=0.90
UUID=03254e0d:86de6727:2ff70881:773bec64
ARRAY /dev/md5 level=raid1 num-devices=2 metadata=0.90
UUID=644a897a:15ef0e34:833f147a:71729df0
ARRAY /dev/md4 level=raid1 num-devices=2 metadata=0.90
UUID=988ecde9:cc998c1b:c50949cb:adc77570
ARRAY /dev/md3 level=raid1 num-devices=2 metadata=0.90
UUID=e7b76f39:7c6d5185:3b70f7bd:4b6284f5

[root@servidor etc]# cat /mnt/etc/mdadm.conf 

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root

ARRAY /dev/md2 level=raid1 num-devices=2
UUID=c964f385:5b6e5d18:2d631c3c:e70b1b22
ARRAY /dev/md4 level=raid1 num-devices=2
UUID=988ecde9:cc998c1b:c50949cb:adc77570
ARRAY /dev/md3 level=raid1 num-devices=2
UUID=e7b76f39:7c6d5185:3b70f7bd:4b6284f5
ARRAY /dev/md8 level=raid1 num-devices=2
UUID=318d1ff4:dd2b9ccc:7effe40e:a3e412fc
ARRAY /dev/md7 level=raid1 num-devices=2
UUID=0114a14e:b6a45be8:3a719518:7cb31784
ARRAY /dev/md6 level=raid1 num-devices=2
UUID=03254e0d:86de6727:2ff70881:773bec64
ARRAY /dev/md5 level=raid1 num-devices=2
UUID=644a897a:15ef0e34:833f147a:71729df0
ARRAY /dev/md0 level=raid1 num-devices=2
UUID=2d10ee45:1a407729:ec85deb0:7e9ea950
ARRAY /dev/md1 level=raid1 num-devices=2
UUID=ed127569:52ff37b7:da473d79:03ebe556
ARRAY /dev/md9 level=raid1 num-devices=2
UUID=1ab3d6b5:575a8a2b:dfa789fa:805a401f
[root@servidor etc]# cat /mnt/etc/mdadm.conf /tmp/init/etc/mdadm.conf | sort

ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90
UUID=392f9510:37d5d89a:d9143456:0dceb00d
ARRAY /dev/md0 level=raid1 num-devices=2
UUID=2d10ee45:1a407729:ec85deb0:7e9ea950

ARRAY /dev/md1 level=raid1 num-devices=2 metadata=0.90
UUID=ed127569:52ff37b7:da473d79:03ebe556
ARRAY /dev/md1 level=raid1 num-devices=2
UUID=ed127569:52ff37b7:da473d79:03ebe556

ARRAY /dev/md2 level=raid1 num-devices=2 metadata=0.90
UUID=c964f385:5b6e5d18:2d631c3c:e70b1b22
ARRAY /dev/md2 level=raid1 num-devices=2
UUID=c964f385:5b6e5d18:2d631c3c:e70b1b22

ARRAY /dev/md3 level=raid1 num-devices=2 metadata=0.90
UUID=e7b76f39:7c6d5185:3b70f7bd:4b6284f5
ARRAY /dev/md3 level=raid1 num-devices=2
UUID=e7b76f39:7c6d5185:3b70f7bd:4b6284f5

ARRAY /dev/md4 level=raid1 num-devices=2 metadata=0.90
UUID=988ecde9:cc998c1b:c50949cb:adc77570
ARRAY /dev/md4 level=raid1 num-devices=2
UUID=988ecde9:cc998c1b:c50949cb:adc77570

ARRAY /dev/md5 level=raid1 num-devices=2 metadata=0.90
UUID=644a897a:15ef0e34:833f147a:71729df0
ARRAY /dev/md5 level=raid1 num-devices=2
UUID=644a897a:15ef0e34:833f147a:71729df0

ARRAY /dev/md6 level=raid1 num-devices=2 metadata=0.90
UUID=03254e0d:86de6727:2ff70881:773bec64
ARRAY /dev/md6 level=raid1 num-devices=2
UUID=03254e0d:86de6727:2ff70881:773bec64

ARRAY /dev/md7 level=raid1 num-devices=2 metadata=0.90
UUID=0114a14e:b6a45be8:3a719518:7cb31784
ARRAY /dev/md7 level=raid1 num-devices=2
UUID=0114a14e:b6a45be8:3a719518:7cb31784

ARRAY /dev/md8 level=raid1 num-devices=2 metadata=0.90
UUID=318d1ff4:dd2b9ccc:7effe40e:a3e412fc
ARRAY /dev/md8 level=raid1 num-devices=2
UUID=318d1ff4:dd2b9ccc:7effe40e:a3e412fc

ARRAY /dev/md9 level=raid1 num-devices=2 metadata=0.90
UUID=1ab3d6b5:575a8a2b:dfa789fa:805a401f
ARRAY /dev/md9 level=raid1 num-devices=2
UUID=1ab3d6b5:575a8a2b:dfa789fa:805a401f




* Re: sdb failure - mdadm: no devices found for /dev/md0
From: Andy Bailey @ 2009-07-30  3:21 UTC
  To: linux-raid

Please, can anyone help out with this problem?

Can anyone think of a possible reason why the UUID changed?

No command was run that I can imagine would change the UUID.
The only thing I can see that could possibly have changed it is adding
a partition that had previously been part of another array, without
zeroing its superblock first.
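
To test that hypothesis, a quick read-only sketch that lists which
array UUID each member superblock on both disks reports:

for p in /dev/sd[ab][0-9]*; do
    echo "== $p"
    mdadm --examine "$p" 2>/dev/null | grep -i uuid || echo "   no md superblock"
done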

Also, can anyone see why the system wouldn't boot without the
original /dev/sdb?

The /dev/sdb that replaced it had no partitions created on it; it was
brand new. Could that cause a problem?

And then, why wouldn't it boot with the original /dev/sdb put back?

Any guess or hypothesis will do; I need some ideas to investigate so I
can find out what happened, and I have drawn a blank.

Or am I asking on the wrong mailing list? If so, can somebody tell me
the correct mailing list to use?

Thanks

Andy Bailey 


