Re: Checksums wrong on one disk of mirror

linux-raid.vger.kernel.org archive mirror

* Re: Checksums wrong on one disk of mirror
@ 2006-11-07 14:10 David
  2006-11-13  8:49 ` Henrik Holst
  0 siblings, 1 reply; 5+ messages in thread
From: David @ 2006-11-07 14:10 UTC
  To: linux-raid

Quoting Neil Brown <neilb@suse.de>:
> On Tuesday November 7, lists@edeca.net wrote:
>> Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is
>> not what would be expected for sda1,2,3 but is fine for sda6. All of
>> the checksums on drive sdb are correct.
>
> I'm surprised it doesn't boot then.  How are the arrays being
> assembled? A more complete kernel log would help.

Neil,

Thanks for such a quick reply.  I will post the kernel logs if the  
below is not enough information.  The old dmesg should also still be  
on the partition.

>> The state is "clean" for all partitions, working 2, active 2 and
>> failed 0. The table for sdb1,2,3 shows that the first device has been
>> removed and is no longer an active mirror.
>>
>> What is the best way to proceed here? Can I somehow sync from the
>> second disk, which appears to have the correct checksums? Is there an
>> easy way to fix this that won't involve losing the data?
>
> While booted from the live CD you should be able to
>   mdadm -AR /dev/md0 /dev/sdb1
>   mdadm /dev/md0 --add /dev/sda1

Fantastic, this works well for two of the partitions.  However the  
third has a bad sector (as reported by smartmontools) on the disk with  
the "good" superblock.  The disk cannot read the sector, so the  
syncing fails and starts over at 15.7% each time.
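
A minimal sketch for seeing where the resync stalls: recovery progress is reported in /proc/mdstat, and the percentage can be pulled out with sed. The helper name `recovery_percent` and the sample line are illustrative assumptions, not output from this machine, and the exact layout of /proc/mdstat may differ between kernel versions.

```shell
# Extract the recovery percentage from /proc/mdstat-style text on stdin.
# Hypothetical helper; prints nothing if no recovery is in progress.
recovery_percent() {
    sed -n 's/.*recovery = *\([0-9.]*\)%.*/\1/p'
}

# Illustrative sample line (made up, not real output from this system):
sample='      [===>.................]  recovery = 15.7% (1234/7890) finish=1.0min speed=100K/sec'
printf '%s\n' "$sample" | recovery_percent

# On a live system: recovery_percent < /proc/mdstat
```

If the same percentage shows up after every restart, the read error is most likely at a fixed offset on the source disk.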

Is it safe to mount that partition outside of the md, find the file,  
remove it so that the disk can remap that sector (it is shown as  
Currently_Pending in SMART right now) then resync the array?  I guess  
this will cause problems and break the mirror.  Or is the correct way  
to remove the "bad" superblock drive from the array, mount the md,  
remove the file then resync the array?

If it is possible to do either of the above, how do I stop the  
recovery?  It now starts automatically at live CD boot, repeating from  
15.7% over and over.  My knowledge of the tools is bad but I tried the  
following:

# mdadm /dev/md0 --remove /dev/sda1
and
# mdadm -f /dev/md0 --remove /dev/sda1 (no idea if the -f even makes  
sense there)

> It is very odd that the checksums are all wrong though.  Kernel
> version? mdadm version? hardware architecture?

Kernel installed from Ubuntu 6.06 sources, 2.6.15.  Machine is an x86  
Dell with two identical Maxtor DiamondMax drives on an Intel 82801  
SATA controller.

mdadm is version 1.12.  Looking at the most recently available version  
this seems incredibly out of date, but seems to be the default  
installed in Ubuntu.  Even Debian stable seems to have 1.9.  I can bug  
this with them for an update if necessary.

Is it possible that a broken init script has tried to fsck an  
individual drive instead of the md?  /etc/fstab only uses /dev/md*  
references but I'll check other scripts when (if? :) I get the system  
back up and running.

Whilst the machine is not critical and is only a new install, I'd like  
to keep fighting rather than give in if possible.

Thanks,

David


* Re: Checksums wrong on one disk of mirror
  2006-11-07 14:10 Checksums wrong on one disk of mirror David
@ 2006-11-13  8:49 ` Henrik Holst
  0 siblings, 0 replies; 5+ messages in thread
From: Henrik Holst @ 2006-11-13  8:49 UTC
  To: David; +Cc: linux-raid

David wrote:

<snip>

> mdadm is version 1.12.  Looking at the most recently available version
> this seems incredibly out of date, but seems to be the default installed
> in Ubuntu.  Even Debian stable seems to have 1.9.  I can bug this with
> them for an update if necessary.

It's already on its way. Update to the coming Debian release "Etch"
(due to become "Stable" in December 2006, if I remember correctly). In Etch
the mdadm version is v2.5.3 (7 August 2006).

Henrik Holst



* Re: Checksums wrong on one disk of mirror
@ 2006-11-08 15:55 David
  0 siblings, 0 replies; 5+ messages in thread
From: David @ 2006-11-08 15:55 UTC
  To: linux-raid

Quoting David <lists@edeca.net>:
> Or is the correct way
> to remove the "bad" superblock drive from the array, mount the md,
> remove the file then resync the array?

Common sense says this is correct.

> If it is possible to do either of the above, how do I stop the
> recovery?  It now starts automatically at live CD boot, repeating from
> 15.7% over and over.  My knowledge of the tools is bad but I tried the
> following:
>
> # mdadm /dev/md0 --remove /dev/sda1
> and
> # mdadm -f /dev/md0 --remove /dev/sda1 (no idea if the -f even makes
> sense there)

Looking at http://smartmontools.sourceforge.net/BadBlockHowTo.txt I  
tried to figure out what file was in the bad blocks but it turned out  
there wasn't one, it was just unused space.
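
The arithmetic from that HOWTO can be sketched as follows. It maps the LBA reported by smartctl to a filesystem block number, b = (L - S) * 512 / B, where L is the LBA of the bad sector, S is the starting sector of the partition and B is the filesystem block size. The function name and the numbers are hypothetical; the sketch also assumes 512-byte sectors and that the filesystem begins at the start of the partition.

```shell
# lba_to_fs_block LBA PART_START BLOCK_SIZE
# Convert a disk LBA to a filesystem block number (integer division).
# Assumes 512-byte sectors; all values below are made up for illustration.
lba_to_fs_block() {
    lba=$1
    start=$2
    bs=$3
    echo $(( (lba - start) * 512 / bs ))
}

# Hypothetical example: bad LBA 1000063, partition starts at sector 63,
# filesystem uses 4096-byte blocks.
lba_to_fs_block 1000063 63 4096   # prints 125000
```

The resulting block number can then be checked in debugfs, where `testb` reports whether the block is in use and `icheck` reports which inode owns it.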

My fix, for completeness, was this:

Force failure of the corrupt half of the mirror, using
# mdadm --manage /dev/md0 --fail /dev/sda1

Mount the other one and fill free space with zeros
# mount /dev/md0 /mnt/test
# dd if=/dev/zero of=/mnt/test/bigfile
# sync

smartctl now showed that the pending sector had been reallocated, so I  
removed the bigfile and hot added the other drive
# mdadm --manage /dev/md0 --add /dev/sda1

The recovery went fine this time and both partitions were shown as  
correct and active.  I had to fsck another md before it would boot  
correctly but the machine is now back up and working correctly.
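
The pending-sector check mentioned above can be sketched as a small filter over `smartctl -A` output. The helper name `pending_sectors`, the sample row and the device name in the comment are assumptions for illustration; the sketch relies on the usual attribute table layout with the raw value in the tenth column.

```shell
# Print the raw Current_Pending_Sector count from `smartctl -A` output on stdin.
# Assumes the usual 10-column attribute table with RAW_VALUE in column 10.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Illustrative attribute row (made up, not output from this machine):
row='197 Current_Pending_Sector  0x0008   100   100   000    Old_age   Offline      -       1'
printf '%s\n' "$row" | pending_sectors

# On a live system (device name is an assumption):
#   smartctl -A /dev/sdb | pending_sectors
```

A value of 0 after the zero-fill suggests the drive has rewritten or reallocated the sector, which is the point at which re-adding the other half of the mirror should be able to complete.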

Thanks for your help previously, it helped me along the right lines to  
start fixing this one.

David


* Checksums wrong on one disk of mirror
@ 2006-11-07 10:04 David
  2006-11-07 10:26 ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: David @ 2006-11-07 10:04 UTC
  To: linux-raid

I recently installed a server with mirrored disks using software RAID.  
Everything was working fine for a few days until a normal reboot (not  
the first).  Now the machine will not boot because it appears the  
superblock is wrong on some of the RAID devices on the first disk.

The rough layout of the disks (sda and sdb):

  sdx1 (md0) - /
  sdx2 (md1) - /var
  sdx3 (md2) - /usr
  extended partition with swap
  sdx6 (md3) - /opt

The exact error is:

  "invalid superblock checksum on sda3
  sda3 has invalid sb, not importing!"

Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is  
not what would be expected for sda1,2,3 but is fine for sda6. All of  
the checksums on drive sdb are correct.

The state is "clean" for all partitions, working 2, active 2 and  
failed 0. The table for sdb1,2,3 shows that the first device has been  
removed and is no longer an active mirror.

What is the best way to proceed here? Can I somehow sync from the  
second disk, which appears to have the correct checksums? Is there an  
easy way to fix this that won't involve losing the data?

Thanks.


* Re: Checksums wrong on one disk of mirror
  2006-11-07 10:04 David
@ 2006-11-07 10:26 ` Neil Brown
  0 siblings, 0 replies; 5+ messages in thread
From: Neil Brown @ 2006-11-07 10:26 UTC
  To: David; +Cc: linux-raid

On Tuesday November 7, lists@edeca.net wrote:
> I recently installed a server with mirrored disks using software RAID.  
> Everything was working fine for a few days until a normal reboot (not  
> the first).  Now the machine will not boot because it appears the  
> superblock is wrong on some of the RAID devices on the first disk.
> 
> The rough layout of the disks (sda and sdb):
> 
>   sdx1 (md0) - /
>   sdx2 (md1) - /var
>   sdx3 (md2) - /usr
>   extended partition with swap
>   sdx6 (md3) - /opt
> 
> The exact error is:
> 
>   "invalid superblock checksum on sda3
>   sda3 has invalid sb, not importing!"
> 
> Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is  
> not what would be expected for sda1,2,3 but is fine for sda6. All of  
> the checksums on drive sdb are correct.

I'm surprised it doesn't boot then.  How are the arrays being
assembled? A more complete kernel log would help.


> 
> The state is "clean" for all partitions, working 2, active 2 and  
> failed 0. The table for sdb1,2,3 shows that the first device has been  
> removed and is no longer an active mirror.
> 
> What is the best way to proceed here? Can I somehow sync from the  
> second disk, which appears to have the correct checksums? Is there an  
> easy way to fix this that won't involve losing the data?

While booted from the live CD you should be able to
  mdadm -AR /dev/md0 /dev/sdb1
  mdadm /dev/md0 --add /dev/sda1

and repeat for 2 and 3.
That will cause a recovery of all arrays but you won't lose any data.
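
Spelled out for all three affected arrays, the "repeat for 2 and 3" step might look like the sketch below. It only prints the commands so they can be reviewed before running them by hand as root. The function name is hypothetical, and the mapping of mdN to partition N+1 is taken from the layout quoted above.

```shell
# Print (do not run) the recovery commands for md0..md2.
# Assumes mdN is the mirror of partition N+1 on both disks, as in the
# layout quoted above; the function name is hypothetical.
print_recovery_cmds() {
    good=$1    # disk with valid superblocks, e.g. /dev/sdb
    stale=$2   # disk to be re-added and resynced, e.g. /dev/sda
    for n in 0 1 2; do
        part=$((n + 1))
        echo "mdadm -AR /dev/md$n $good$part"
        echo "mdadm /dev/md$n --add $stale$part"
    done
}

print_recovery_cmds /dev/sdb /dev/sda
```

Printing first keeps the disk roles visible: the first argument is the source of the resync, and the second is overwritten.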

It is very odd that the checksums are all wrong though.  Kernel
version? mdadm version? hardware architecture?

NeilBrown

