* filesystem corruption with RAID6.
From: Terje Kvernes @ 2004-09-03 12:20 UTC
To: linux-raid
howdy.
I've recently started testing RAID6 on a Promise SATA150 TX4, using
the card as a plain SATA controller and running software RAID over
the four drives connected to it. the kernel is 2.6.8.1-mm4, and the
drives are all identical (WD2500JD-00H).
I've fiddled a bit with testing the array, marking drives as faulty
and removing them, then re-adding them afterwards. there were no
complaints from the system during these trials, and everything
looked good. my md was then turned into a PV and added to a VG.
all was seemingly well. I probably created the PV while the system
was doing the initial sync of the RAID set; I am, however, unsure
whether that should cause any problems, as pvcreate didn't report
any errors from the block device.
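roughly, the sequence looked like this (the VG name below is a
placeholder, not necessarily the one I actually used):

  # fault-injection trials: mark a drive faulty, pull it, re-add it
  mdadm /dev/md0 --fail /dev/sdc1
  mdadm /dev/md0 --remove /dev/sdc1
  mdadm /dev/md0 --add /dev/sdc1
  # LVM on top of the md device; "vg0" is just an example name
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0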
I then created two LVs and copied data from the network onto one of
the LVs while a recovery was in progress (the re-adding of
/dev/sdc1), which didn't report any errors. upon copying from the
recently populated LV to the blank LV, however, I get a lot of I/O
errors while reading from the recently populated filesystem. I've
removed the LVs and tested different filesystems (ext3, reiserfs),
but the errors always show up in the same way.
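the copy test itself went more or less like this (LV names, sizes
and mount points below are made up for illustration):

  # two LVs on the RAID6-backed VG (names and sizes are examples)
  lvcreate -L 200G -n src vg0
  lvcreate -L 200G -n dst vg0
  mkfs.ext3 /dev/vg0/src
  mkfs.ext3 /dev/vg0/dst
  mount /dev/vg0/src /mnt/src
  mount /dev/vg0/dst /mnt/dst
  # populate /mnt/src from the network (done during the resync), then:
  cp -a /mnt/src/. /mnt/dst/   # reads from /mnt/src throw I/O errors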
now, this isn't exactly a good thing, especially since all I see are
I/O errors upon reading the data. I'm not quite sure what I can
provide to help anyone debug this, but I'm more than willing to help
with testing.
thanks for all the great md-work, and please CC me, I'm not on the
list.
gayomart:/# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6]
md0 : active raid6 sdc1[2] sdd1[3] sdb1[1] sda1[0]
488391808 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
gayomart:/# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Thu Sep 2 22:00:49 2004
Raid Level : raid6
Array Size : 488391808 (465.77 GiB 500.11 GB)
Device Size : 244195904 (232.88 GiB 250.06 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Fri Sep 3 14:07:43 2004
State : clean, no-errors
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
2 8 33 2 active sync /dev/sdc1
3 8 49 3 active sync /dev/sdd1
UUID : a9b70f65:e3d7bda8:a0a37b4d:4ae0aab1
Events : 0.1835
--
Terje
* Re: filesystem corruption with RAID6.
From: Jim Paris @ 2004-09-03 17:06 UTC
To: Terje Kvernes; +Cc: linux-raid, hpa
> I've recently started testing RAID6
..
> I get a lot of I/O errors while reading from the recently
> populated filesystem. I've removed the LVs, tested different
> filesystems (ext3, reiserfs) but the errors always show in the
> same way.
The RAID6 code has problems. The author, H. Peter Anvin, has verified
this corruption, but there's no fix available yet.
-jim
* RE: filesystem corruption with RAID6.
From: Guy @ 2004-09-03 17:15 UTC
To: 'Terje Kvernes', linux-raid
I can't help debug it, but you may be able to determine whether it is
RAID6 related. Can you re-do your test, this time using RAID5? The
problems should go away if they are RAID6 related.
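Something along these lines, reusing the same four drives (note this
wipes the existing array):

  mdadm --stop /dev/md0
  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  # then repeat the same PV/VG/LV setup and copy test on top of it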
Guy
* Re: filesystem corruption with RAID6.
From: H. Peter Anvin @ 2004-09-03 18:19 UTC
To: Jim Paris; +Cc: Terje Kvernes, linux-raid
Jim Paris wrote:
>> I've recently started testing RAID6
>
> ..
>
>> I get a lot of I/O errors while reading from the recently
>> populated filesystem. I've removed the LVs, tested different
>> filesystems (ext3, reiserfs) but the errors always show in the
>> same way.
>
>
> The RAID6 code has problems. The author, H. Peter Anvin, has verified
> this corruption, but there's no fix available yet.
>
> -jim
Note: the corruption Jim found, and which I have confirmed, occurs only
when doing a large number of writes to a filesystem in degraded mode. If
you see something different, please let me know; it would definitely help
in tracking this down. The bug is elusive enough that it's difficult :-/
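If anyone wants to try to reproduce it, the pattern is roughly this
(mount point and source data are just examples, and it assumes a
filesystem already sitting on the array):

  # degrade the array, then do a large number of writes to the fs
  mdadm /dev/md0 --fail /dev/sdc1
  mdadm /dev/md0 --remove /dev/sdc1
  cp -a /usr /mnt/test/          # heavy writes while degraded
  diff -r /usr /mnt/test/usr     # read back; mismatches or I/O errors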
-hpa
* Re: filesystem corruption with RAID6.
From: Terje Kvernes @ 2004-09-03 20:27 UTC
To: H. Peter Anvin; +Cc: Jim Paris, linux-raid
"H. Peter Anvin" <hpa@zytor.com> writes:
[ ... ]
> Note: the corruption Jim found, and which I have confirmed, occurs
> only when doing a large number of writes to a filesystem in degraded
> mode. If you see something different, please let me know; it would
> definitely help in tracking this down.
ah! that explains it then. I was indeed copying over data in
degraded mode.
> The bug is elusive enough that it's difficult :-/
ouch. and RAID that can't be written to in degraded mode is
somewhat impractical for production use. ;-)
is there anything I can do to help? otherwise, I'll fall back to
RAID5 and prepare the box for production.
--
Terje
* Re: filesystem corruption with RAID6.
From: H. Peter Anvin @ 2004-09-03 21:16 UTC
To: Terje Kvernes; +Cc: Jim Paris, linux-raid
Terje Kvernes wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
> [ ... ]
>
>
>>Note: the corruption Jim found, and which I have confirmed, occurs
>>only when doing a large number of writes to a filesystem in degraded
>>mode. If you see something different, please let me know; it would
>>definitely help in tracking this down.
>
>
> ah! that explains it then. I was indeed copying over data in
> degraded mode.
>
>
>>The bug is elusive enough that it's difficult :-/
>
> ouch. and RAID that can't be written to in degraded mode is
> somewhat impractical for production use. ;-)
>
> is there anything I can do to help? otherwise, I'll fall back to
> RAID5 and prepare the box for production.
>
Unless you want to try to help hunt down the bug, not much.
-hpa
* Re: filesystem corruption with RAID6.
From: Terje Kvernes @ 2004-09-04 22:43 UTC
To: H. Peter Anvin; +Cc: Jim Paris, linux-raid
"H. Peter Anvin" <hpa@zytor.com> writes:
> Terje Kvernes wrote:
>
> > ouch. and RAID that can't be written to in degraded mode is
> > somewhat impractical for production use. ;-)
> >
> > is there anything I can do to help? otherwise, I'll fall back to
> > RAID5 and prepare the box for production.
>
> Unless you want to try to help hunt down the bug, not much.
hm, so how do I help? :-)
--
Terje
* Re: filesystem corruption with RAID6.
From: H. Peter Anvin @ 2004-09-05 3:01 UTC
To: Terje Kvernes; +Cc: Jim Paris, linux-raid
Terje Kvernes wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
>
>>Terje Kvernes wrote:
>>
>>
>>>ouch. and RAID that can't be written to in degraded mode is
>>>somewhat impractical for production use. ;-)
>>>
>>>is there anything I can do to help? otherwise, I'll fall back to
>>>RAID5 and prepare the box for production.
>>
>>Unless you want to try to help hunt down the bug, not much.
>
>
> hm, so how do I help? :-)
>
Dig into the code and try to figure out what is happening. My best
guess at this point is that a block which is dirty isn't getting marked
as such, and therefore isn't getting correctly written back, but it
could also be that it tries to reconstruct a block before it actually
has all the blocks that it needs to do the reconstruction correctly.
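One crude way to narrow it down (path names below are just examples,
and it assumes a filesystem directly on /dev/md0 so LVM stays out of
the picture): write a known pattern while degraded, checksum it from
disk, then checksum it again after the resync.

  dd if=/dev/urandom of=/tmp/pattern bs=1M count=512
  md5sum /tmp/pattern                # reference checksum
  cp /tmp/pattern /mnt/test/         # write while the array is degraded
  umount /mnt/test
  mount /dev/md0 /mnt/test           # force the next read to hit disk
  md5sum /mnt/test/pattern           # compare while still degraded
  mdadm /dev/md0 --add /dev/sdc1     # re-add the drive, wait for resync
  md5sum /mnt/test/pattern           # compare again once clean

If the file reads back wrong even after the resync, the bad data made
it to disk, which points at the write-back theory; if it is only wrong
while degraded, the reconstruction path is the suspect. Not
conclusive, but it would narrow things down.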
-hpa
* Re: filesystem corruption with RAID6.
From: Terje Kvernes @ 2004-09-05 12:32 UTC
To: H. Peter Anvin; +Cc: Jim Paris, linux-raid
"H. Peter Anvin" <hpa@zytor.com> writes:
[ ... ]
> Dig into the code and try to figure out what is happening. My best
> guess at this point is that a block which is dirty isn't getting
> marked as such, and therefore isn't getting correctly written back,
> but it could also be that it tries to reconstruct a block before it
> actually has all the blocks that it needs to do the reconstruction
> correctly.
ho-hum. a good idea, but with my current knowledge of the RAID code,
I'm somewhat doubtful about what I can accomplish. I'll try to have a
look, though. again, thanks for all the work on the MD layer. ;-)
--
Terje
Thread overview:
2004-09-03 12:20 filesystem corruption with RAID6 Terje Kvernes
2004-09-03 17:06 ` Jim Paris
2004-09-03 18:19 ` H. Peter Anvin
2004-09-03 20:27 ` Terje Kvernes
2004-09-03 21:16 ` H. Peter Anvin
2004-09-04 22:43 ` Terje Kvernes
2004-09-05 3:01 ` H. Peter Anvin
2004-09-05 12:32 ` Terje Kvernes
2004-09-03 17:15 ` Guy