RAID-5 data corruption

linux-raid.vger.kernel.org archive mirror

* RAID-5 data corruption
From: Oliver Martin @ 2008-03-05 23:23 UTC
  To: Linux RAID

Hello,

It seems my RAID-5 exploded last Sunday. :-(
Ext3 errors started appearing during the monthly data-check, and when I
noticed later that day, mismatch_cnt was huge, about 200,000,000. After
a reboot (or did I just restart the array? I can't remember) and another
check, it was down to 176, but the file system remained badly broken.
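
For reference, md exposes the mismatch counter and the current sync state
through sysfs. The following is a minimal sketch, assuming the array is md0
and the usual /sys/block/<array>/md/ layout; the function names
read_md_attr and mismatch_report are illustrative, not part of any tool.

```python
from pathlib import Path


def read_md_attr(array: str, attr: str, sysfs_root: str = "/sys/block") -> str:
    """Return the stripped contents of <sysfs_root>/<array>/md/<attr>."""
    return (Path(sysfs_root) / array / "md" / attr).read_text().strip()


def mismatch_report(array: str = "md0", sysfs_root: str = "/sys/block") -> dict:
    """Collect the sync state and mismatch counter for one md array.

    The array name "md0" is an assumption; substitute the real one.
    """
    return {
        "sync_action": read_md_attr(array, "sync_action", sysfs_root),
        "mismatch_cnt": int(read_md_attr(array, "mismatch_cnt", sysfs_root)),
    }
```

Calling mismatch_report() after a check has finished (sync_action back to
"idle") gives the value discussed here; the sysfs_root parameter exists only
so the helper can be pointed at a test directory.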

I suspected one of the disks was dying and reading/writing bad data, but 
it seems that's not the case: I took them out of their enclosures (I'm 
using external drives) and plugged them into my desktop to read the 
SMART values, and they look okay. Reallocated sector count was 0 on all 
three, there were no errors logged, and all passed both a SMART long 
selftest and badblocks -n. So I guess the disks are fine.

I also ran the latter (badblocks -n) with the disks back in the 
enclosures and using the same USB/Firewire ports, cables and hubs, and 
they passed again, so I guess that part is okay too.

The configuration is an LVM volume on an md array with two USB drives
and one Firewire drive. I'm not sure what caused the problem: it could
be an ext3 bug, an LVM bug, an md bug, or something in the USB or
Firewire drivers, but the huge mismatch_cnt makes me suspect it's a
rather low-level issue (md or lower). BTW, I'm using 2.6.24.3 with this
config: http://murli.34sp.com/o/raid/config-2.6.24.3

Anyway, running "e2fsck -n" with all drives in the array aborts with
"Error while iterating over blocks in inode 28327968: Illegal triply
indirect block found". When I remove one drive at a time, the result is
the same for two of the three two-disk configurations, but different for
the third: there, e2fsck at least completes, but still finds lots of
errors.

I've uploaded e2fsck and kernel logs to http://murli.34sp.com/o/raid/

My current plan is to buy some drives tomorrow to mirror the current 
state, and then see what e2fsck can recover; I also found e2salvage and 
e2extract. Are there any other tools I should look into?

I'll see if I can recover my data, but do you have any ideas what caused 
the problem in the first place?

-- 
Oliver


* Re: RAID-5 data corruption
From: Dan Williams @ 2008-03-06  1:11 UTC
  To: Oliver Martin; +Cc: Linux RAID

On Wed, Mar 5, 2008 at 4:23 PM, Oliver Martin
<oliver.martin@student.tuwien.ac.at> wrote:
>  I'll see if I can recover my data, but do you have any ideas what caused
>  the problem in the first place?
>

Did one of the disks cause the USB reset message?  It appears to
coincide with the start of the trouble, but it may be just that,
coincidence.

Mar  2 01:06:02 quassel kernel: md: using 128k window, over a total of 488383936 blocks.
Mar  2 09:03:43 quassel kernel: usb 4-5.3.2: reset high speed USB device using ehci_hcd and address 8
Mar  2 09:25:15 quassel kernel: EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #57262081: directory entry across blocks - offset=0, inode=2917738116, rec_len=11736, name_len=107
Mar  2 10:30:46 quassel kernel: md: md0: data-check done.
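
The timing argument can be checked mechanically. This is a rough sketch
that parses the syslog timestamps from the excerpt above and measures the
gap between the USB reset and the first ext3 error. The year (2008) is
supplied by hand because syslog omits it, the month parsing assumes an
English locale, and the helper names stamp and first_match are made up for
illustration.

```python
from datetime import datetime

# Log lines copied from the excerpt above (wrapped lines rejoined, the
# EXT3 entry shortened after the directory number).
LOG = [
    "Mar  2 01:06:02 quassel kernel: md: using 128k window, over a total of 488383936 blocks.",
    "Mar  2 09:03:43 quassel kernel: usb 4-5.3.2: reset high speed USB device using ehci_hcd and address 8",
    "Mar  2 09:25:15 quassel kernel: EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #57262081",
    "Mar  2 10:30:46 quassel kernel: md: md0: data-check done.",
]


def stamp(line: str, year: int = 2008) -> datetime:
    """Parse the 15-character syslog timestamp at the start of a line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle: str) -> datetime:
    """Timestamp of the first line containing `needle`."""
    return next(stamp(line) for line in lines if needle in line)


check_start = first_match(LOG, "md: using 128k window")
usb_reset = first_match(LOG, "reset high speed USB device")
first_error = first_match(LOG, "EXT3-fs error")
check_done = first_match(LOG, "data-check done")

print("reset during check:", check_start < usb_reset < check_done)
print("reset -> first ext3 error:", first_error - usb_reset)
```

On these four lines the reset falls inside the check window and precedes
the first ext3 error by 21 minutes 32 seconds. That shows ordering only; it
does not establish that the reset caused the corruption.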

The output of something like 'ls -l /sys/block/sdb/device' should tell.
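
The same lookup can be scripted. Below is a minimal sketch, assuming the
resolved sysfs path of a USB disk contains its port chain as path
components (such as 4-5/4-5.3/4-5.3.2); the helper names and the sample
path in the comment are illustrative and not taken from the affected
machine.

```python
import os
import re

# A USB port component looks like "<bus>-<port>[.<port>...]", e.g. "4-5.3.2".
# Interface components such as "4-5.3.2:1.0" are deliberately not matched.
USB_PORT = re.compile(r"^\d+-\d+(?:\.\d+)*$")


def usb_port_of(sysfs_path: str):
    """Return the deepest USB port component in a resolved sysfs path, or None."""
    ports = [part for part in sysfs_path.split("/") if USB_PORT.match(part)]
    return ports[-1] if ports else None


def usb_port_of_disk(disk: str):
    """Resolve /sys/block/<disk>/device and extract its USB port, if any."""
    return usb_port_of(os.path.realpath(f"/sys/block/{disk}/device"))


# Hypothetical resolved path for a disk behind two hubs on bus 4:
# /sys/devices/pci0000:00/0000:00:1d.7/usb4/4-5/4-5.3/4-5.3.2/4-5.3.2:1.0/host8/target8:0:0/8:0:0:0
```

Running usb_port_of_disk() for each array member and comparing the result
with the "usb 4-5.3.2" prefix in the kernel message would identify the disk
that was reset; a Firewire disk simply returns None.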

Have you run this operation on a kernel version prior to 2.6.24.3? If
so, which version?

--
Dan


* Re: RAID-5 data corruption
From: Oliver Martin @ 2008-03-06 12:01 UTC
  To: Dan Williams; +Cc: Linux RAID

Dan Williams wrote:
> 
> Did one of the disks cause the USB reset message?  It appears to
> coincide with the start of the trouble, but it may be just that,
> coincidence.

That's interesting. 5.3.2 is sde, the disk I have to remove to make
e2fsck run through. I have seen these messages before, and they never
seemed to hurt, so I ignored them: mismatch_cnt was always 0. There is
also another USB disk, not part of the RAID, which sometimes shows these
resets too, and it has no problems.

I do have other USB problems on this machine, though. It has 3 ports,
and when I set the whole thing up with 2.6.23, two of them weren't
working correctly (they sometimes disconnected randomly). I thought they
had just died, and since the third worked, I just threw a hub in and
went on.

It's of course still possible that it's a hardware issue, but back when
I used this laptop as my main machine, I never had any USB problems. It
definitely worked in Windows; I don't remember if I ever used a USB disk
in Linux on this machine. I'll re-check Windows to see if I have the
same problem there.

What's interesting as well is that I just re-checked the supposedly bad
USB ports with 2.6.24 by connecting a disk to each one and reading some
large files, and I saw the disconnect issue only once today. I haven't
been able to reproduce it since. That's strange, because I remember it
being very consistent in 2.6.23. I'll check that again too.

> Have you run this operation on a kernel version prior to 2.6.24.3? If
> so, which version?

Which operation do you mean? The array? I started out using 2.6.23.1, 
later switched to 2.6.23.14, then 2.6.24, and recently 2.6.24.3.

-- 
Oliver

