* Checksums wrong on one disk of mirror
From: David @ 2006-11-07 10:04 UTC (permalink / raw)
To: linux-raid
I recently installed a server with mirrored disks using software RAID.
Everything was working fine for a few days until a normal reboot (not
the first). Now the machine will not boot because it appears the
superblock is wrong on some of the RAID devices on the first disk.
The rough layout of the disks (sda and sdb):

sdx1 (md0) - /
sdx2 (md1) - /var
sdx3 (md2) - /usr
extended partition with swap
sdx6 (md3) - /opt

The exact error is:

"invalid superblock checksum on sda3
sda3 has invalid sb, not importing!"
Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is
not what would be expected for sda1,2,3 but is fine for sda6. All of
the checksums on drive sdb are correct.
The state is "clean" for all partitions, working 2, active 2 and
failed 0. The table for sdb1,2,3 shows that the first device has been
removed and is no longer an active mirror.
What is the best way to proceed here? Can I somehow sync from the
second disk, which appears to have the correct checksums? Is there an
easy way to fix this that won't involve losing the data?
Thanks.
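
The per-partition check described above could be scripted roughly as follows. This is a minimal sketch, not a tested recovery tool: the helper name checksum_ok is made up for illustration, and the pattern assumes that `mdadm -E` prints a "Checksum : <hex> - correct" line for a valid superblock, which may vary between mdadm versions.

```shell
#!/bin/sh
# Sketch only. Reads `mdadm -E` output on stdin and succeeds when the
# Checksum line ends in "correct". The wording of that line is an
# assumption and may differ between mdadm versions.
checksum_ok() {
    grep -q 'Checksum *:.*- correct'
}

# Hypothetical usage from the live CD (needs root; not run here):
#   for p in 1 2 3 6; do
#       mdadm -E /dev/sda$p | checksum_ok || echo "sda$p: bad checksum"
#   done
```

Keeping the parser separate from the mdadm call means it can be tried on saved output before being pointed at real devices.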
* Re: Checksums wrong on one disk of mirror
From: Neil Brown @ 2006-11-07 10:26 UTC (permalink / raw)
To: David; +Cc: linux-raid
On Tuesday November 7, lists@edeca.net wrote:
> I recently installed a server with mirrored disks using software RAID.
> Everything was working fine for a few days until a normal reboot (not
> the first). Now the machine will not boot because it appears the
> superblock is wrong on some of the RAID devices on the first disk.
>
> The rough layout of the disks (sda and sdb):
>
> sdx1 (md0) - /
> sdx2 (md1) - /var
> sdx3 (md2) - /usr
> extended partition with swap
> sdx6 (md3) - /opt
>
> The exact error is:
>
> "invalid superblock checksum on sda3
> sda3 has invalid sb, not importing!"
>
> Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is
> not what would be expected for sda1,2,3 but is fine for sda6. All of
> the checksums on drive sdb are correct.
I'm surprised it doesn't boot then. How are the arrays being
assembled? A more complete kernel log would help.
>
> The state is "clean" for all partitions, working 2, active 2 and
> failed 0. The table for sdb1,2,3 shows that the first device has been
> removed and is no longer an active mirror.
>
> What is the best way to proceed here? Can I somehow sync from the
> second disk, which appears to have the correct checksums? Is there an
> easy way to fix this that won't involve losing the data?
While booted from the live CD you should be able to

   mdadm -AR /dev/md0 /dev/sdb1
   mdadm /dev/md0 --add /dev/sda1

and repeat for partitions 2 and 3 (md1 and md2).
That will cause a recovery of all arrays but you won't lose any data.
It is very odd that the checksums are all wrong though. Kernel
version? mdadm version? hardware architecture?
NeilBrown
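
The two commands above, repeated for each affected array, can be sketched as a loop. This is only an illustration under the layout given in the original post (md0=sdX1, md1=sdX2, md2=sdX3). The function name print_rebuild_cmds is hypothetical, and the script deliberately prints the commands instead of running them so they can be reviewed first.

```shell
#!/bin/sh
# Dry run: print the mdadm commands instead of executing them.
# Assumed layout from the original post: md0=sdX1, md1=sdX2, md2=sdX3.
print_rebuild_cmds() {
    good=$1    # disk whose superblocks are valid, e.g. sdb
    stale=$2   # disk to be re-added and resynced, e.g. sda
    for md in 0 1 2; do
        part=$((md + 1))
        # Assemble and start the array degraded from the good half.
        echo "mdadm -AR /dev/md${md} /dev/${good}${part}"
        # Add the other half back; this triggers a recovery.
        echo "mdadm /dev/md${md} --add /dev/${stale}${part}"
    done
}

print_rebuild_cmds sdb sda
```

Once the printed commands look right they could be run by hand, one array at a time, as root from the live CD.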
* Re: Checksums wrong on one disk of mirror
From: David @ 2006-11-07 14:10 UTC (permalink / raw)
To: linux-raid
Quoting Neil Brown <neilb@suse.de>:
> On Tuesday November 7, lists@edeca.net wrote:
>> Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is
>> not what would be expected for sda1,2,3 but is fine for sda6. All of
>> the checksums on drive sdb are correct.
>
> I'm surprised it doesn't boot then. How are the arrays being
> assembled? A more complete kernel log would help.
Neil,
Thanks for such a quick reply. I will post the kernel logs if the
below is not enough information. The old dmesg should also still be
on the partition.
>> The state is "clean" for all partitions, working 2, active 2 and
>> failed 0. The table for sdb1,2,3 shows that the first device has been
>> removed and is no longer an active mirror.
>>
>> What is the best way to proceed here? Can I somehow sync from the
>> second disk, which appears to have the correct checksums? Is there an
>> easy way to fix this that won't involve losing the data?
>
> While booted from the live CD you should be able to
> mdadm -AR /dev/md0 /dev/sdb1
> mdadm /dev/md0 --add /dev/sda1
Fantastic, this works well for two of the partitions. However the
third has a bad sector (as reported by smartmontools) on the disk with
the "good" superblock. The disk cannot read the sector, so the
syncing fails and starts over at 15.7% each time.
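
A restart like this shows up in /proc/mdstat as the recovery percentage dropping back. The following is a rough sketch for pulling that number out. The helper name recovery_percent is made up, and the pattern assumes the usual "recovery = NN.N%" progress line.

```shell
#!/bin/sh
# Sketch only. Reads /proc/mdstat-style text on stdin and prints the
# percentage from any "recovery = NN.N%" progress line.
recovery_percent() {
    sed -n 's/.*recovery *= *\([0-9.]*\)%.*/\1/p'
}

# Print the current value if the md driver is loaded; prints nothing
# when no recovery is running.
if [ -r /proc/mdstat ]; then
    recovery_percent < /proc/mdstat
fi
```

Running it every few seconds would make it obvious whether the rebuild is progressing or repeatedly failing at the same point.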
Is it safe to mount that partition outside of the md, find the file,
remove it so that the disk can remap that sector (it is shown as
Currently_Pending in SMART right now) then resync the array? I guess
this will cause problems and break the mirror. Or is the correct way
to remove the "bad" superblock drive from the array, mount the md,
remove the file then resync the array?
If it is possible to do either of the above, how do I stop the
recovery? It now starts automatically at live CD boot, repeating from
15.7% over and over. My knowledge of the tools is bad but I tried the
following:
# mdadm /dev/md0 --remove /dev/sda1
and
# mdadm -f /dev/md0 --remove /dev/sda1
(no idea if the -f even makes sense there)
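
For reference, mdadm will only remove a member that is failed or spare, so a device that is busy rebuilding generally has to be marked failed first. A minimal sketch of that sequence follows. The function name print_detach_cmd is hypothetical, and the command is printed rather than executed.

```shell
#!/bin/sh
# Dry run: prints the command rather than running it.
# The member is marked failed and then removed in one invocation, since
# --remove on its own is refused for a device that is still in use.
print_detach_cmd() {
    array=$1    # e.g. /dev/md0
    member=$2   # e.g. /dev/sda1
    echo "mdadm ${array} --fail ${member} --remove ${member}"
}

print_detach_cmd /dev/md0 /dev/sda1
```

With the rebuilding half detached, the array keeps running degraded on the remaining disk and the recovery loop stops.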
> It is very odd that the checksums are all wrong though. Kernel
> version? mdadm version? hardware architecture?
Kernel installed from Ubuntu 6.06 sources, 2.6.15. Machine is an x86
Dell with two identical Maxtor DiamondMax drives on an Intel 82801
SATA controller.
mdadm is version 1.12. Looking at the most recently available version
this seems incredibly out of date, but seems to be the default
installed in Ubuntu. Even Debian stable seems to have 1.9. I can bug
this with them for an update if necessary.
Is it possible that a broken init script has tried to fsck an
individual drive instead of the md? /etc/fstab only uses /dev/md*
references but I'll check other scripts when (if? :) I get the system
back up and running.
Whilst the machine is not critical and is only a new install, I'd like
to keep fighting rather than give in if possible.
Thanks,
David
* Re: Checksums wrong on one disk of mirror
From: David @ 2006-11-08 15:55 UTC (permalink / raw)
To: linux-raid
Quoting David <lists@edeca.net>:
> Or is the correct way
> to remove the "bad" superblock drive from the array, mount the md,
> remove the file then resync the array?
Common sense says this is correct.
> If it is possible to do either of the above, how do I stop the
> recovery? It now starts automatically at live CD boot, repeating from
> 15.7% over and over. My knowledge of the tools is bad but I tried the
> following:
>
> # mdadm /dev/md0 --remove /dev/sda1
> and
> # mdadm -f /dev/md0 --remove /dev/sda1
> (no idea if the -f even makes sense there)
Looking at http://smartmontools.sourceforge.net/BadBlockHowTo.txt I
tried to figure out what file was in the bad blocks, but it turned out
there wasn't one; it was just unused space.
My fix, for completeness, was this:

Force failure of the corrupt half of the mirror, using
# mdadm --manage /dev/md0 --fail /dev/sda1

Mount the array, now running on the other half, and fill the free
space with zeros
# mount /dev/md0 /mnt/test
# dd if=/dev/zero of=/mnt/test/bigfile
# sync

smartctl now showed that the pending sector had been reallocated, so I
removed the bigfile and hot-added the other drive
# mdadm --manage /dev/md0 --add /dev/sda1

The recovery went fine this time and both partitions were shown as
correct and active. I had to fsck another md before it would boot
correctly, but the machine is now back up and working.
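
The "pending sector had been reallocated" check above could be automated along these lines. This is a sketch under assumptions: the helper name pending_sectors is invented, and it assumes the usual ten-column attribute table printed by `smartctl -A`, with the raw value in the last column.

```shell
#!/bin/sh
# Sketch only. Reads `smartctl -A` output on stdin and prints the raw
# value of Current_Pending_Sector (assumed to be the 10th column of the
# usual ATA attribute table).
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Hypothetical usage (needs root and smartmontools; not run here):
#   smartctl -A /dev/sdb | pending_sectors
```

A result of 0 after the zero-fill would suggest the drive has dealt with the sector, at which point re-adding the other half of the mirror should be able to complete.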
Thanks for your help previously, it helped me along the right lines to
start fixing this one.
David
* Re: Checksums wrong on one disk of mirror
From: Henrik Holst @ 2006-11-13 8:49 UTC (permalink / raw)
To: David; +Cc: linux-raid
David wrote:
<snip>
> mdadm is version 1.12. Looking at the most recently available version
> this seems incredibly out of date, but seems to be the default installed
> in Ubuntu. Even Debian stable seems to have 1.9. I can bug this with
> them for an update if necessary.
It's already on its way. Update to the coming Debian release "Etch"
(due to be "Stable" in December 2006, if I remember correctly). In Etch
the mdadm version is v2.5.3 (7 August 2006).
Henrik Holst