* Suggestion needed for fixing RAID6
@ 2010-04-22 10:09 Janos Haar
2010-04-22 15:00 ` Mikael Abrahamsson
` (2 more replies)
0 siblings, 3 replies; 48+ messages in thread
From: Janos Haar @ 2010-04-22 10:09 UTC (permalink / raw)
To: linux-raid
Hello Neil, list,
I am trying to fix a RAID6 array which has 12x1.5TB (Samsung) drives.
Currently the array has 1 missing drive, plus 3 drives which have some bad sectors!
Generally, because it is RAID6, no data is lost, since the bad sectors
are not on the same address line, but I can't rebuild the missing drive, because
the kernel drops the bad-sector drives one by one during the rebuild
process.
My question is: is there any way to force the array to keep the members in
even if they have some read errors?
Or is there a way to re-add the bad-sector drives after the kernel has dropped
them, without stopping the rebuild process?
Normally, after an 18-hour sync, at 97.9% the 3rd drive is always dropped
and the rebuild stops.
Thanks,
Janos Haar
^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Suggestion needed for fixing RAID6
  2010-04-22 10:09 Suggestion needed for fixing RAID6 Janos Haar
@ 2010-04-22 15:00 ` Mikael Abrahamsson
  2010-04-22 15:12   ` Janos Haar
  [not found] ` <4BD0AF2D.90207@stud.tu-ilmenau.de>
  2010-04-23  6:51 ` Luca Berra
  2 siblings, 1 reply; 48+ messages in thread
From: Mikael Abrahamsson @ 2010-04-22 15:00 UTC (permalink / raw)
  To: Janos Haar; +Cc: linux-raid

On Thu, 22 Apr 2010, Janos Haar wrote:

> My question is: is there any way to force the array to keep the members in
> even if they have some read errors?

What version of the kernel are you running? If you're running any reasonably
recent kernel it shouldn't kick drives on read errors but instead recreate
the data from parity. You should probably send "repair" to the md device
(echo repair > /sys/block/mdX/md/sync_action) and see if that fixes the bad
blocks. I believe this came in 2.6.15 or something like that (google it if
you're in that neighbourhood; if you're on 2.6.26 or later you should be fine).

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 48+ messages in thread
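(A minimal sketch of that scrub, assuming the array is md3 as later in this thread; substitute your own mdX:)

echo repair > /sys/block/md3/md/sync_action   # read every stripe, rewrite anything recoverable from parity
cat /proc/mdstat                              # progress of the check/repair pass
cat /sys/block/md3/md/mismatch_cnt            # mismatches found by the last check/repair pass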
* Re: Suggestion needed for fixing RAID6
  2010-04-22 15:00 ` Mikael Abrahamsson
@ 2010-04-22 15:12   ` Janos Haar
  2010-04-22 15:18     ` Mikael Abrahamsson
  0 siblings, 1 reply; 48+ messages in thread
From: Janos Haar @ 2010-04-22 15:12 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

Hi,

----- Original Message ----- From: "Mikael Abrahamsson" <swmike@swm.pp.se>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <linux-raid@vger.kernel.org>
Sent: Thursday, April 22, 2010 5:00 PM
Subject: Re: Suggestion needed for fixing RAID6

> On Thu, 22 Apr 2010, Janos Haar wrote:
>
>> My question is: is there any way to force the array to keep the members
>> in even if they have some read errors?
>
> What version of the kernel are you running? If you're running any reasonably
> recent kernel it shouldn't kick drives on read errors but instead recreate
> the data from parity. You should probably send "repair" to the md device
> (echo repair > /sys/block/mdX/md/sync_action) and see if that fixes the
> bad blocks. I believe this came in 2.6.15 or something like that (google it
> if you're in that neighbourhood; if you're on 2.6.26 or later you should
> be fine).

The kernel is 2.6.28.10.

I have just tested one of the bad-block HDDs, and the bad blocks come
periodically, like a little short scratch, and the drive can't correct them
by writing. Maybe this is why the kernel kicks it out...

But anyway, the problem is still there: I want to rebuild the missing disk
(prior to replacing the bad-blocked drives one by one), but the kernel kicks
out 2 more drives during the rebuild.

Thanks for the idea,
Janos

>
> --
> Mikael Abrahamsson email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-22 15:12 ` Janos Haar @ 2010-04-22 15:18 ` Mikael Abrahamsson 2010-04-22 16:25 ` Janos Haar 2010-04-22 16:32 ` Peter Rabbitson 0 siblings, 2 replies; 48+ messages in thread From: Mikael Abrahamsson @ 2010-04-22 15:18 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On Thu, 22 Apr 2010, Janos Haar wrote: > I am just tested one of the badblock-hdds, and the bad blocks comes > periodicaly, like a little and short scratch, and the drive can't correct > these by write. Oh, if you get write errors on the drive then you're in bigger trouble. > Maybe this is why the kernel kicsk it out... Yes, a write error to the drive is a kick:able offence. What does smartctl say about the drives? > But anyway, the problem is still here, i want to rebuild the missing disk > (prior to replace the badblocked drives one by one), but the kernel kicks out > more 2 drive during the rebuild. I don't have a good idea that assures your data, unfortunately. One way would be to dd the defective drives to working ones, but that will most likely cause you to have data loss on the defective sectors (since md has no idea that these sectors should be re-created from parity). -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6
  2010-04-22 15:18 ` Mikael Abrahamsson
@ 2010-04-22 16:25   ` Janos Haar
  2010-04-22 16:32   ` Peter Rabbitson
  1 sibling, 0 replies; 48+ messages in thread
From: Janos Haar @ 2010-04-22 16:25 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

----- Original Message ----- From: "Mikael Abrahamsson" <swmike@swm.pp.se>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <linux-raid@vger.kernel.org>
Sent: Thursday, April 22, 2010 5:18 PM
Subject: Re: Suggestion needed for fixing RAID6

> On Thu, 22 Apr 2010, Janos Haar wrote:
>
>> I have just tested one of the bad-block HDDs, and the bad blocks come
>> periodically, like a little short scratch, and the drive can't correct
>> them by writing.
>
> Oh, if you get write errors on the drive then you're in bigger trouble.

I am planning to replace all the defective drives, but first I need to rebuild
the missing part. I don't care which problem it is; the first drive has 123
unreadable sectors, and I have tried to rewrite one of them, but it doesn't
work. That drive will go to RMA, but first I need to solve the problem.

>
>> Maybe this is why the kernel kicks it out...
>
> Yes, a write error to the drive is a kick:able offence. What does smartctl
> say about the drives?

The SMART health is good (no surprise...), but the drive has some offline
uncorrectable sectors and some pending ones.

>
>> But anyway, the problem is still there: I want to rebuild the missing disk
>> (prior to replacing the bad-blocked drives one by one), but the kernel
>> kicks out 2 more drives during the rebuild.
>
> I don't have a good idea that assures your data, unfortunately. One way
> would be to dd the defective drives to working ones, but that will most
> likely cause you to have data loss on the defective sectors (since md has
> no idea that these sectors should be re-created from parity).

Exactly. This is why I am asking here. :-)
Because I don't want to put a few KB of errors onto an array which still has
all the needed information.

Any good idea?

Thanks a lot,
Janos

>
> --
> Mikael Abrahamsson email: swmike@swm.pp.se
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-22 15:18 ` Mikael Abrahamsson 2010-04-22 16:25 ` Janos Haar @ 2010-04-22 16:32 ` Peter Rabbitson 1 sibling, 0 replies; 48+ messages in thread From: Peter Rabbitson @ 2010-04-22 16:32 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: Janos Haar, linux-raid Mikael Abrahamsson wrote: > On Thu, 22 Apr 2010, Janos Haar wrote: > > I don't have a good idea that assures your data, unfortunately. One way > would be to dd the defective drives to working ones, but that will most > likely cause you to have data loss on the defective sectors (since md > has no idea that these sectors should be re-created from parity). > There was a thread[1] some time ago, where HPA confirmed that the RAID6 data is sufficient to write an algorithm which will be able to determine which sector is in fact the offending one. There wasn't any interest to incorporate this into the sync_action/repair function though :( [1] http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07327.html ^ permalink raw reply [flat|nested] 48+ messages in thread
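(Background sketch of the math behind that claim, in the notation of H. Peter Anvin's RAID-6 paper; this is only the reasoning from the linked thread, not something md's repair path implements:)

  P = \bigoplus_{i=0}^{n-1} D_i, \qquad
  Q = \bigoplus_{i=0}^{n-1} g^{i} D_i \qquad (\text{arithmetic in } \mathrm{GF}(2^8))

If exactly one data block D_z is silently corrupt, i.e. X_z is read instead of D_z, recompute P' and Q' from the blocks as read. Then

  P \oplus P' = D_z \oplus X_z, \qquad
  Q \oplus Q' = g^{z}\,(D_z \oplus X_z)
  \;\Rightarrow\; g^{z} = (Q \oplus Q') \cdot (P \oplus P')^{-1},

so z identifies the offending member, and D_z = X_z \oplus (P \oplus P') recovers its data.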
[parent not found: <4BD0AF2D.90207@stud.tu-ilmenau.de>]
* Re: Suggestion needed for fixing RAID6 [not found] ` <4BD0AF2D.90207@stud.tu-ilmenau.de> @ 2010-04-22 20:48 ` Janos Haar 0 siblings, 0 replies; 48+ messages in thread From: Janos Haar @ 2010-04-22 20:48 UTC (permalink / raw) To: st0ff; +Cc: linux-raid Hi, ----- Original Message ----- From: "Stefan /*St0fF*/ Hübner" <stefan.huebner@stud.tu-ilmenau.de> To: "Janos Haar" <janos.haar@netcenter.hu> Sent: Thursday, April 22, 2010 10:18 PM Subject: Re: Suggestion needed for fixing RAID6 > Hi Janos, > > I'd ddrescue the failing drives one by one to replacement drives. Set a > very high retry-count for this action. I know what am i doing, trust me. ;-) I have much more professional tools for this than the ddrescue, and i have the list of defective sectors as well. Now i am imaging the second of the failing drives, and this one have >1800 failing sectors. > > The logfile ddrescue creates shows the unreadable sectors afterwards. > The hard part would now be to incorporate the raid-algorithm into some > tool to just restore the missing sectors... I can do that, but it is not a good game for 15TB array or even some hundred of sectors to fix by hand.... The linux md knows how to recalculate these errors, i want to find this way....somehow... I am thinking of making RAID1 from the defective drives, and if the kernel will re-write the sectors, the copy will get it. But i don't know how to prevent the copy to read it. :-/ Thanks for your suggestions, Janos > > I hope this helps a bit. > Stefan > > Am 22.04.2010 12:09, schrieb Janos Haar: >> Hello Neil, list, >> >> I am trying to fix one RAID6 array wich have 12x1.5TB (samsung) drives. >> Actually the array have 1 missing drive, and 3 wich have some bad >> sectors! >> Genearlly because it is RAID6 there is no data lost, because the bad >> sectors are not in one address line, but i can't rebuild the missing >> drive, because the kernel drops out the bad sector-drives one by one >> during the rebuild process. >> >> My question is, there is any way, to force the array to keep the members >> in even if have some reading errors? >> Or is there a way to re-add the bad sector drives after the kernel >> dropped out without stopping the rebuild process? >> In normal way after 18 hour sync, @ 97.9% the 3rd drive is always >> dropped out and the rebuild stops. >> >> Thanks, >> Janos Haar >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-22 10:09 Suggestion needed for fixing RAID6 Janos Haar 2010-04-22 15:00 ` Mikael Abrahamsson [not found] ` <4BD0AF2D.90207@stud.tu-ilmenau.de> @ 2010-04-23 6:51 ` Luca Berra 2010-04-23 8:47 ` Janos Haar 2 siblings, 1 reply; 48+ messages in thread From: Luca Berra @ 2010-04-23 6:51 UTC (permalink / raw) To: linux-raid On Thu, Apr 22, 2010 at 12:09:08PM +0200, Janos Haar wrote: > Hello Neil, list, > > I am trying to fix one RAID6 array wich have 12x1.5TB (samsung) drives. > Actually the array have 1 missing drive, and 3 wich have some bad sectors! > Genearlly because it is RAID6 there is no data lost, because the bad > sectors are not in one address line, but i can't rebuild the missing drive, > because the kernel drops out the bad sector-drives one by one during the > rebuild process. I would seriously consider moving the data out of that array and dumping all drives from that batch, and this is gonna be painful because you must watch drives being dropped and add them back, and yes you need the resources to store the data. ddrescue won't obviously work, since it will mask read errors and turn those into data corruption the raid 1 trick wont work, as you noted another option could be using the device mapper snapshot-merge target (writable snapshot), which iirc is a 2.6.33+ feature look at http://smorgasbord.gavagai.nl/2010/03/online-merging-of-cow-volumes-with-dm-snapshot/ for hints. btw i have no clue how the scsi error will travel thru the dm layer. L. -- Luca Berra -- bluca@comedia.it Communication Media & Services S.r.l. /"\ \ / ASCII RIBBON CAMPAIGN X AGAINST HTML MAIL / \ ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-23 6:51 ` Luca Berra @ 2010-04-23 8:47 ` Janos Haar 2010-04-23 12:34 ` MRK 0 siblings, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-23 8:47 UTC (permalink / raw) To: Luca Berra; +Cc: linux-raid ----- Original Message ----- From: "Luca Berra" <bluca@comedia.it> To: <linux-raid@vger.kernel.org> Sent: Friday, April 23, 2010 8:51 AM Subject: Re: Suggestion needed for fixing RAID6 > On Thu, Apr 22, 2010 at 12:09:08PM +0200, Janos Haar wrote: >> Hello Neil, list, >> >> I am trying to fix one RAID6 array wich have 12x1.5TB (samsung) drives. >> Actually the array have 1 missing drive, and 3 wich have some bad >> sectors! >> Genearlly because it is RAID6 there is no data lost, because the bad >> sectors are not in one address line, but i can't rebuild the missing >> drive, because the kernel drops out the bad sector-drives one by one >> during the rebuild process. > > I would seriously consider moving the data out of that array and dumping > all drives from that batch, and this is gonna be painful because you > must watch drives being dropped and add them back, and yes you need the > resources to store the data. > > ddrescue won't obviously work, since it will mask read errors and turn > those into data corruption > > the raid 1 trick wont work, as you noted > > another option could be using the device mapper snapshot-merge target > (writable snapshot), which iirc is a 2.6.33+ feature > look at > http://smorgasbord.gavagai.nl/2010/03/online-merging-of-cow-volumes-with-dm-snapshot/ > for hints. > btw i have no clue how the scsi error will travel thru the dm layer. > L. ...or cowloop! :-) This is a good idea! :-) Thank you. I have another one: re-create the array (--assume-clean) with external bitmap, than drop the missing drive. Than manually manipulate the bitmap file to re-sync only the last 10% wich is good enough for me... Thanks again, Janos > > -- > Luca Berra -- bluca@comedia.it > Communication Media & Services S.r.l. > /"\ > \ / ASCII RIBBON CAMPAIGN > X AGAINST HTML MAIL > / \ > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-23 8:47 ` Janos Haar @ 2010-04-23 12:34 ` MRK 2010-04-24 19:36 ` Janos Haar 0 siblings, 1 reply; 48+ messages in thread From: MRK @ 2010-04-23 12:34 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 04/23/2010 10:47 AM, Janos Haar wrote: > > ----- Original Message ----- From: "Luca Berra" <bluca@comedia.it> > To: <linux-raid@vger.kernel.org> > Sent: Friday, April 23, 2010 8:51 AM > Subject: Re: Suggestion needed for fixing RAID6 > > >> another option could be using the device mapper snapshot-merge target >> (writable snapshot), which iirc is a 2.6.33+ feature >> look at >> http://smorgasbord.gavagai.nl/2010/03/online-merging-of-cow-volumes-with-dm-snapshot/ >> >> for hints. >> btw i have no clue how the scsi error will travel thru the dm layer. >> L. > > ...or cowloop! :-) > This is a good idea! :-) > Thank you. > > I have another one: > re-create the array (--assume-clean) with external bitmap, than drop > the missing drive. > Than manually manipulate the bitmap file to re-sync only the last 10% > wich is good enough for me... Cowloop is kinda deprecated in favour of DM, says wikipedia, and messing with the bitmap looks complicated to me. I think Luca's is a great suggestion. You can use 3 files with loop-device so to store the COW devices for the 3 disks which are faulty. So that writes go there and you can complete the resync. Then you would fail the cow devices one by one from mdadm and replicate to spares. But this will work ONLY if read errors are still be reported across the DM-snapshot thingo. Otherwise (if it e.g. returns a block of zeroes without error) you are eventually going to get data corruption when replacing drives. You can check if read errors are reported, by looking at the dmesg during the resync. If you see many "read error corrected..." it works, while if it's silent it means it hasn't received read errors which means that it doesn't work. If it doesn't work DO NOT go ahead replacing drives, or you will get data corruption. So you need an initial test which just performs a resync but *without* replicating to a spare. So I suggest you first remove all the spares from the array, then create the COW snapshots, then assemble the array, perform a resync, look at the dmesg. If it works: add the spares back, fail one drive, etc. If this technique works this would be useful for everybody, so pls keep us informed!! Thank you ^ permalink raw reply [flat|nested] 48+ messages in thread
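(A rough sketch of the setup being proposed here, modelled on the commands Janos posts further down in the thread; the file name, size and /dev/sde4 are placeholders, and one snapshot is needed per failing member:)

# sparse backing file to catch the writes md makes during the resync
truncate -s 2000G /snapshot.bin
losetup /dev/loop3 /snapshot.bin

# writable snapshot: reads come from /dev/sde4, writes land in the COW file
echo 0 $(blockdev --getsize /dev/sde4) snapshot /dev/sde4 /dev/loop3 p 8 | \
    dmsetup create cow

# repeat with its own backing file and loop device for every other
# bad-sector member, then assemble the array from the /dev/mapper/*
# devices instead of the raw partitions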
* Re: Suggestion needed for fixing RAID6 2010-04-23 12:34 ` MRK @ 2010-04-24 19:36 ` Janos Haar 2010-04-24 22:47 ` MRK 0 siblings, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-24 19:36 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: "linux-raid" <linux-raid@vger.kernel.org> Sent: Friday, April 23, 2010 2:34 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/23/2010 10:47 AM, Janos Haar wrote: >> >> ----- Original Message ----- From: "Luca Berra" <bluca@comedia.it> >> To: <linux-raid@vger.kernel.org> >> Sent: Friday, April 23, 2010 8:51 AM >> Subject: Re: Suggestion needed for fixing RAID6 >> >> >>> another option could be using the device mapper snapshot-merge target >>> (writable snapshot), which iirc is a 2.6.33+ feature >>> look at >>> http://smorgasbord.gavagai.nl/2010/03/online-merging-of-cow-volumes-with-dm-snapshot/ >>> for hints. >>> btw i have no clue how the scsi error will travel thru the dm layer. >>> L. >> >> ...or cowloop! :-) >> This is a good idea! :-) >> Thank you. >> >> I have another one: >> re-create the array (--assume-clean) with external bitmap, than drop the >> missing drive. >> Than manually manipulate the bitmap file to re-sync only the last 10% >> wich is good enough for me... > > > Cowloop is kinda deprecated in favour of DM, says wikipedia, and messing > with the bitmap looks complicated to me. Hi, I think i will like again this idea... :-D > I think Luca's is a great suggestion. You can use 3 files with loop-device > so to store the COW devices for the 3 disks which are faulty. So that > writes go there and you can complete the resync. > Then you would fail the cow devices one by one from mdadm and replicate to > spares. > > But this will work ONLY if read errors are still be reported across the > DM-snapshot thingo. Otherwise (if it e.g. returns a block of zeroes > without error) you are eventually going to get data corruption when > replacing drives. > > You can check if read errors are reported, by looking at the dmesg during > the resync. If you see many "read error corrected..." it works, while if > it's silent it means it hasn't received read errors which means that it > doesn't work. If it doesn't work DO NOT go ahead replacing drives, or you > will get data corruption. > > So you need an initial test which just performs a resync but *without* > replicating to a spare. So I suggest you first remove all the spares from > the array, then create the COW snapshots, then assemble the array, perform > a resync, look at the dmesg. If it works: add the spares back, fail one > drive, etc. > > If this technique works this would be useful for everybody, so pls keep us > informed!! Ok, i am doing it. I think i have found some interesting, what is unexpected: After 99.9% (and another 1800minute) the array is dropped the dm-snapshot structure! ata5.00: exception Emask 0x0 SAct 0x7fa1 SErr 0x0 action 0x0 ata5.00: irq_stat 0x40000008 ata5.00: cmd 60/d8:38:1d:e7:90/00:00:ae:00:00/40 tag 7 ncq 110592 in res 41/40:7a:7b:e7:90/6c:00:ae:00:00/40 Emask 0x409 (media error) <F> ata5.00: status: { DRDY ERR } ata5.00: error: { UNC } ata5.00: configured for UDMA/133 ata5: EH complete ... 
sd 4:0:0:0: [sde] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB) sd 4:0:0:0: [sde] Write Protect is off sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00 sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA ata5.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x0 ata5.00: irq_stat 0x40000008 ata5.00: cmd 60/d8:38:1d:e7:90/00:00:ae:00:00/40 tag 7 ncq 110592 in res 41/40:7a:7b:e7:90/6c:00:ae:00:00/40 Emask 0x409 (media error) <F> ata5.00: status: { DRDY ERR } ata5.00: error: { UNC } ata5.00: configured for UDMA/133 sd 4:0:0:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK sd 4:0:0:0: [sde] Sense Key : Medium Error [current] [descriptor] Descriptor sense data with sense descriptors (in hex): 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 ae 90 e7 7b sd 4:0:0:0: [sde] Add. Sense: Unrecovered read error - auto reallocate failed end_request: I/O error, dev sde, sector 2928732027 __ratelimit: 16 callbacks suppressed raid5:md3: read error not correctable (sector 2923767936 on dm-0). raid5: Disk failure on dm-0, disabling device. raid5: Operation continuing on 9 devices. md: md3: recovery done. raid5:md3: read error not correctable (sector 2923767944 on dm-0). raid5:md3: read error not correctable (sector 2923767952 on dm-0). raid5:md3: read error not correctable (sector 2923767960 on dm-0). raid5:md3: read error not correctable (sector 2923767968 on dm-0). raid5:md3: read error not correctable (sector 2923767976 on dm-0). raid5:md3: read error not correctable (sector 2923767984 on dm-0). raid5:md3: read error not correctable (sector 2923767992 on dm-0). raid5:md3: read error not correctable (sector 2923768000 on dm-0). raid5:md3: read error not correctable (sector 2923768008 on dm-0). ata5: EH complete sd 4:0:0:0: [sde] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB) sd 4:0:0:0: [sde] Write Protect is off sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00 sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA ata5.00: exception Emask 0x0 SAct 0x1e1 SErr 0x0 action 0x0 ata5.00: irq_stat 0x40000008 ata5.00: cmd 60/00:28:f5:e8:90/01:00:ae:00:00/40 tag 5 ncq 131072 in res 41/40:27:ce:e9:90/6c:00:ae:00:00/40 Emask 0x409 (media error) <F> ata5.00: status: { DRDY ERR } ata5.00: error: { UNC } ata5.00: configured for UDMA/133 ata5: EH complete sd 4:0:0:0: [sde] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB) sd 4:0:0:0: [sde] Write Protect is off sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00 sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA RAID5 conf printout: --- rd:12 wd:9 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:0, dev:dm-0 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 8, o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4 RAID5 conf printout: --- rd:12 wd:9 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 4, o:0, dev:dm-0 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 8, o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4 RAID5 conf printout: --- rd:12 wd:9 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 4, o:0, dev:dm-0 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 8, o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4 RAID5 conf printout: --- rd:12 wd:9 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 8, 
o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4

So, dm-0 was dropped only for a _READ_ error!
Kernel 2.6.28.10.

Now I am trying to do a repair resync before rebuilding the missing drive...

Cheers,
Janos

> Thank you
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-24 19:36 ` Janos Haar @ 2010-04-24 22:47 ` MRK 2010-04-25 10:00 ` Janos Haar 0 siblings, 1 reply; 48+ messages in thread From: MRK @ 2010-04-24 22:47 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 04/24/2010 09:36 PM, Janos Haar wrote: > > Ok, i am doing it. > > I think i have found some interesting, what is unexpected: > After 99.9% (and another 1800minute) the array is dropped the > dm-snapshot structure! > > ...[CUT]... > > raid5:md3: read error not correctable (sector 2923767944 on dm-0). > raid5:md3: read error not correctable (sector 2923767952 on dm-0). > raid5:md3: read error not correctable (sector 2923767960 on dm-0). > raid5:md3: read error not correctable (sector 2923767968 on dm-0). > raid5:md3: read error not correctable (sector 2923767976 on dm-0). > raid5:md3: read error not correctable (sector 2923767984 on dm-0). > raid5:md3: read error not correctable (sector 2923767992 on dm-0). > raid5:md3: read error not correctable (sector 2923768000 on dm-0). > > ...[CUT]... > > So, the dm-0 is dropped only for _READ_ error! Actually no, it is being dropped for "uncorrectable read error" which means, AFAIK, that the read error was received, then the block was recomputed from the other disks, then a rewrite of the damaged block was attempted, and such *write* failed. So it is being dropped for a *write* error. People correct me if I'm wrong. This is strange because the write should have gone to the cow device. Are you sure you did everything correctly with DM? Could you post here how you created the dm-0 device? We might ask to the DM people why it's not working maybe. Anyway there is one good news, and it's that the read error apparently does travel through the DM stack. Thanks for your work ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-24 22:47 ` MRK @ 2010-04-25 10:00 ` Janos Haar 2010-04-26 10:24 ` MRK 0 siblings, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-25 10:00 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org> Sent: Sunday, April 25, 2010 12:47 AM Subject: Re: Suggestion needed for fixing RAID6 Just a little note: The repair-sync action failed similar way too. :-( > On 04/24/2010 09:36 PM, Janos Haar wrote: >> >> Ok, i am doing it. >> >> I think i have found some interesting, what is unexpected: >> After 99.9% (and another 1800minute) the array is dropped the dm-snapshot >> structure! >> >> ...[CUT]... >> >> raid5:md3: read error not correctable (sector 2923767944 on dm-0). >> raid5:md3: read error not correctable (sector 2923767952 on dm-0). >> raid5:md3: read error not correctable (sector 2923767960 on dm-0). >> raid5:md3: read error not correctable (sector 2923767968 on dm-0). >> raid5:md3: read error not correctable (sector 2923767976 on dm-0). >> raid5:md3: read error not correctable (sector 2923767984 on dm-0). >> raid5:md3: read error not correctable (sector 2923767992 on dm-0). >> raid5:md3: read error not correctable (sector 2923768000 on dm-0). >> >> ...[CUT]... >> >> So, the dm-0 is dropped only for _READ_ error! > > Actually no, it is being dropped for "uncorrectable read error" which > means, AFAIK, that the read error was received, then the block was > recomputed from the other disks, then a rewrite of the damaged block was > attempted, and such *write* failed. So it is being dropped for a *write* > error. People correct me if I'm wrong. I think i can try: # dd_rescue -v /dev/zero -S $((2923767944 / 2))k /dev/mapper/cow -m 4k dd_rescue: (info): about to transfer 4.0 kBytes from /dev/zero to /dev/mapper/cow dd_rescue: (info): blocksizes: soft 65536, hard 512 dd_rescue: (info): starting positions: in 0.0k, out 1461883972.0k dd_rescue: (info): Logfile: (none), Maxerr: 0 dd_rescue: (info): Reverse: no , Trunc: no , interactive: no dd_rescue: (info): abort on Write errs: no , spArse write: if err dd_rescue: (info): ipos: 0.0k, opos:1461883972.0k, xferd: 0.0k errs: 0, errxfer: 0.0k, succxfer: 0.0k +curr.rate: 0kB/s, avg.rate: 0kB/s, avg.load: 0.0% Summary for /dev/zero -> /dev/mapper/cow: dd_rescue: (info): ipos: 4.0k, opos:1461883976.0k, xferd: 4.0k errs: 0, errxfer: 0.0k, succxfer: 4.0k +curr.rate: 203kB/s, avg.rate: 203kB/s, avg.load: 0.0% > > This is strange because the write should have gone to the cow device. Are > you sure you did everything correctly with DM? Could you post here how you > created the dm-0 device? echo 0 $(blockdev --getsize /dev/sde4) \ snapshot /dev/sde4 /dev/loop3 p 8 | \ dmsetup create cow ]# losetup /dev/loop3 /dev/loop3: [0901]:55091517 (/snapshot.bin) /snapshot.bin is a sparse file with 2000G seeked size. I have 3.6GB free space in / so the out of space is not an option. :-) I think this is correct. :-) But anyway, i have pre-tested it with fdisk and works. > > We might ask to the DM people why it's not working maybe. Anyway there is > one good news, and it's that the read error apparently does travel through > the DM stack. For me, this looks like md's bug not dm's problem. The "uncorrectable read error" means exactly the drive can't correct the damaged sector with ECC, and this is an unreadable sector. 
(pending in the SMART table.)
The "auto reallocate failed" on a read does not mean the sector cannot be
reallocated by rewriting it! Most drives do not do read-reallocation, only
write-reallocation. The drives which do read-reallocation do it because the
sector was hard to recover (maybe it needed extra rotations, extra
repositioning, too much time) and was remapped automatically, BUT such sectors
are NOT reported to the PC as a read error (UNC), so they must NOT appear in
the log...

I am glad if I can help to fix this bug, but please keep in mind that this
RAID array is a production system, and my customer gets more and more nervous
day by day...
I need a good solution for fixing this array so that I can safely replace the
bad drives without any data loss!

Does somebody have a good idea which does not involve copying the entire
(15TB) array?

Thanks a lot,
Janos Haar

>
> Thanks for your work
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 48+ messages in thread
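(Side note on this setup, a sketch only: the snapshot becomes invalid if its COW store fills up, and a sparse /snapshot.bin can only grow as far as the filesystem under it has free space, so both are worth watching during the resync:)

dmsetup status cow       # snapshot target reports roughly "used/total" COW sectors
du -h /snapshot.bin      # blocks actually allocated to the sparse backing file so far
df -h /                  # free space left for it to grow into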
* Re: Suggestion needed for fixing RAID6 2010-04-25 10:00 ` Janos Haar @ 2010-04-26 10:24 ` MRK 2010-04-26 12:52 ` Janos Haar 0 siblings, 1 reply; 48+ messages in thread From: MRK @ 2010-04-26 10:24 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 04/25/2010 12:00 PM, Janos Haar wrote: > > ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> > To: "Janos Haar" <janos.haar@netcenter.hu> > Cc: <linux-raid@vger.kernel.org> > Sent: Sunday, April 25, 2010 12:47 AM > Subject: Re: Suggestion needed for fixing RAID6 > > Just a little note: > > The repair-sync action failed similar way too. :-( > > >> On 04/24/2010 09:36 PM, Janos Haar wrote: >>> >>> Ok, i am doing it. >>> >>> I think i have found some interesting, what is unexpected: >>> After 99.9% (and another 1800minute) the array is dropped the >>> dm-snapshot structure! >>> >>> ...[CUT]... >>> >>> raid5:md3: read error not correctable (sector 2923767944 on dm-0). >>> raid5:md3: read error not correctable (sector 2923767952 on dm-0). >>> raid5:md3: read error not correctable (sector 2923767960 on dm-0). >>> raid5:md3: read error not correctable (sector 2923767968 on dm-0). >>> raid5:md3: read error not correctable (sector 2923767976 on dm-0). >>> raid5:md3: read error not correctable (sector 2923767984 on dm-0). >>> raid5:md3: read error not correctable (sector 2923767992 on dm-0). >>> raid5:md3: read error not correctable (sector 2923768000 on dm-0). >>> >>> ...[CUT]... >>> > Remember this exact error message: "read error not correctable" > >> >> This is strange because the write should have gone to the cow device. >> Are you sure you did everything correctly with DM? Could you post >> here how you created the dm-0 device? > > echo 0 $(blockdev --getsize /dev/sde4) \ > snapshot /dev/sde4 /dev/loop3 p 8 | \ > dmsetup create cow > Seems correct to me... > ]# losetup /dev/loop3 > /dev/loop3: [0901]:55091517 (/snapshot.bin) > This line comes BEFORE the other one, right? > /snapshot.bin is a sparse file with 2000G seeked size. > I have 3.6GB free space in / so the out of space is not an option. :-) > > [...] > >> >> We might ask to the DM people why it's not working maybe. Anyway >> there is one good news, and it's that the read error apparently does >> travel through the DM stack. > > For me, this looks like md's bug not dm's problem. > The "uncorrectable read error" means exactly the drive can't correct > the damaged sector with ECC, and this is an unreadable sector. > (pending in smart table) > The auto read reallocation failed not meas the sector is not > re-allocatable by rewriting it! > The most of the drives doesn't do read-reallocation only > write-reallocation. > > These drives wich does read reallocation, does it because the sector > was hard to re-calculate (maybe needed more rotation, more > repositioning, too much time) and moved automatically, BUT those > sectors ARE NOT reported to the pc as read-error (UNC), so must NOT > appear in the log... > No the error message really comes from MD. Can you read C code? Go into the kernel source and look this file: linux_source_dir/drivers/md/raid5.c (file raid5.c is also for raid6) search for "read error not correctable" What you see there is the reason for failure. You see the line "if (conf->mddev->degraded)" just above? I think your mistake was that you did the DM COW trick only on the last device, or anyway one device only, instead you should have done it on all 3 devices which were failing. 
It did not work for you because at the moment you got the read error on the last disk, two disks were already dropped from the array, the array was doubly degraded, and it's not possible to correct a read error if the array is degraded because you don't have enough parity information to recover the data for that sector. You should have prevented also the first two disks from dropping. Do the DM trick on all of them simultaneously, or at least on 2 of them (if you are sure only 3 disks have problems), start the array making sure it starts with all devices online i.e. nondegraded, then start the resync, and I think it will work. > I am glad if i can help to fix this but, but please keep this in mind, > this raid array is a productive system, and my customer gets more and > more nervous day by day... > I need a good solution for fixing this array to safely replace the bad > drives without any data lost! > > Somebody have any good idea wich is not copy the entire (15TB) array? I don't think there is another way. You need to make this work. Good luck ^ permalink raw reply [flat|nested] 48+ messages in thread
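(A sketch of that plan with placeholder device names, assuming the two remaining bad-sector members sit behind snapshot targets /dev/mapper/cow0 and /dev/mapper/cow1 and the healthy members are the raw partitions; the point is that the array comes up with 11 of 12 members, i.e. only singly degraded, before the rebuild of the missing disk starts:)

mdadm --assemble /dev/md3 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdf4 /dev/sdg4 \
    /dev/sdi4 /dev/sdj4 /dev/sdk4 /dev/sdl4 /dev/mapper/cow0 /dev/mapper/cow1
mdadm --detail /dev/md3          # expect 11 active devices out of 12
mdadm --add /dev/md3 /dev/sdX4   # replacement for the missing member; recovery starts
cat /proc/mdstat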
* Re: Suggestion needed for fixing RAID6 2010-04-26 10:24 ` MRK @ 2010-04-26 12:52 ` Janos Haar 2010-04-26 16:53 ` MRK 0 siblings, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-26 12:52 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org> Sent: Monday, April 26, 2010 12:24 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/25/2010 12:00 PM, Janos Haar wrote: >> >> ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> >> To: "Janos Haar" <janos.haar@netcenter.hu> >> Cc: <linux-raid@vger.kernel.org> >> Sent: Sunday, April 25, 2010 12:47 AM >> Subject: Re: Suggestion needed for fixing RAID6 >> >> Just a little note: >> >> The repair-sync action failed similar way too. :-( >> >> >>> On 04/24/2010 09:36 PM, Janos Haar wrote: >>>> >>>> Ok, i am doing it. >>>> >>>> I think i have found some interesting, what is unexpected: >>>> After 99.9% (and another 1800minute) the array is dropped the >>>> dm-snapshot structure! >>>> >>>> ...[CUT]... >>>> >>>> raid5:md3: read error not correctable (sector 2923767944 on dm-0). >>>> raid5:md3: read error not correctable (sector 2923767952 on dm-0). >>>> raid5:md3: read error not correctable (sector 2923767960 on dm-0). >>>> raid5:md3: read error not correctable (sector 2923767968 on dm-0). >>>> raid5:md3: read error not correctable (sector 2923767976 on dm-0). >>>> raid5:md3: read error not correctable (sector 2923767984 on dm-0). >>>> raid5:md3: read error not correctable (sector 2923767992 on dm-0). >>>> raid5:md3: read error not correctable (sector 2923768000 on dm-0). >>>> >>>> ...[CUT]... >>>> >> > > Remember this exact error message: "read error not correctable" > >> >>> >>> This is strange because the write should have gone to the cow device. >>> Are you sure you did everything correctly with DM? Could you post >>> here how you created the dm-0 device? >> >> echo 0 $(blockdev --getsize /dev/sde4) \ >> snapshot /dev/sde4 /dev/loop3 p 8 | \ >> dmsetup create cow >> > > Seems correct to me... > >> ]# losetup /dev/loop3 >> /dev/loop3: [0901]:55091517 (/snapshot.bin) >> > This line comes BEFORE the other one, right? > >> /snapshot.bin is a sparse file with 2000G seeked size. >> I have 3.6GB free space in / so the out of space is not an option. :-) >> >> > [...] >> >>> >>> We might ask to the DM people why it's not working maybe. Anyway >>> there is one good news, and it's that the read error apparently does >>> travel through the DM stack. >> >> For me, this looks like md's bug not dm's problem. >> The "uncorrectable read error" means exactly the drive can't correct >> the damaged sector with ECC, and this is an unreadable sector. >> (pending in smart table) >> The auto read reallocation failed not meas the sector is not >> re-allocatable by rewriting it! >> The most of the drives doesn't do read-reallocation only >> write-reallocation. >> >> These drives wich does read reallocation, does it because the sector >> was hard to re-calculate (maybe needed more rotation, more >> repositioning, too much time) and moved automatically, BUT those >> sectors ARE NOT reported to the pc as read-error (UNC), so must NOT >> appear in the log... >> > > No the error message really comes from MD. Can you read C code? 
Go into > the kernel source and look this file: > > linux_source_dir/drivers/md/raid5.c > > (file raid5.c is also for raid6) search for "read error not correctable" > > What you see there is the reason for failure. You see the line "if > (conf->mddev->degraded)" just above? I think your mistake was that you > did the DM COW trick only on the last device, or anyway one device only, > instead you should have done it on all 3 devices which were failing. > > It did not work for you because at the moment you got the read error on > the last disk, two disks were already dropped from the array, the array > was doubly degraded, and it's not possible to correct a read error if > the array is degraded because you don't have enough parity information > to recover the data for that sector. Oops, you are right! It was my mistake. Sorry, i will try it again, to support 2 drives with dm-cow. I will try it. Thanks again. Janos ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-26 12:52 ` Janos Haar @ 2010-04-26 16:53 ` MRK 2010-04-26 22:39 ` Janos Haar 2010-04-27 15:50 ` Janos Haar 0 siblings, 2 replies; 48+ messages in thread From: MRK @ 2010-04-26 16:53 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 04/26/2010 02:52 PM, Janos Haar wrote: > > Oops, you are right! > It was my mistake. > Sorry, i will try it again, to support 2 drives with dm-cow. > I will try it. Great! post here the results... the dmesg in particular. The dmesg should contain multiple lines like this "raid5:md3: read error corrected ....." then you know it worked. ^ permalink raw reply [flat|nested] 48+ messages in thread
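(A quick way to do that check while the recovery runs, nothing beyond what the thread already describes:)

watch -n 60 cat /proc/mdstat
dmesg | grep -c 'read error corrected'        # should keep climbing as bad sectors get rewritten
dmesg | grep -c 'read error not correctable'  # should stay at 0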
* Re: Suggestion needed for fixing RAID6 2010-04-26 16:53 ` MRK @ 2010-04-26 22:39 ` Janos Haar 2010-04-26 23:06 ` Michael Evans 2010-04-27 15:50 ` Janos Haar 1 sibling, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-26 22:39 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org> Sent: Monday, April 26, 2010 6:53 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/26/2010 02:52 PM, Janos Haar wrote: >> >> Oops, you are right! >> It was my mistake. >> Sorry, i will try it again, to support 2 drives with dm-cow. >> I will try it. > > Great! post here the results... the dmesg in particular. > The dmesg should contain multiple lines like this "raid5:md3: read error > corrected ....." > then you know it worked. md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[13](F) sdg4[6] sdf4[5] dm-0[14](F) sdc4[2] sdb4[1] sda4[0] 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/9] [UUU__UU_UUUU] [>....................] recovery = 1.5% (22903832/1462653888) finish=3188383.4min speed=7K/sec Khm.... :-D It is working on something or stopped with 3 missing drive? : ^ ) (I have found the cause of the 2 dm's failure. Now retry runs...) Cheers, Janos > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-26 22:39 ` Janos Haar @ 2010-04-26 23:06 ` Michael Evans [not found] ` <7cfd01cae598$419e8d20$0400a8c0@dcccs> 0 siblings, 1 reply; 48+ messages in thread From: Michael Evans @ 2010-04-26 23:06 UTC (permalink / raw) To: Janos Haar; +Cc: MRK, linux-raid On Mon, Apr 26, 2010 at 3:39 PM, Janos Haar <janos.haar@netcenter.hu> wrote: > > ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> > To: "Janos Haar" <janos.haar@netcenter.hu> > Cc: <linux-raid@vger.kernel.org> > Sent: Monday, April 26, 2010 6:53 PM > Subject: Re: Suggestion needed for fixing RAID6 > > >> On 04/26/2010 02:52 PM, Janos Haar wrote: >>> >>> Oops, you are right! >>> It was my mistake. >>> Sorry, i will try it again, to support 2 drives with dm-cow. >>> I will try it. >> >> Great! post here the results... the dmesg in particular. >> The dmesg should contain multiple lines like this "raid5:md3: read error >> corrected ....." >> then you know it worked. > > md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[13](F) > sdg4[6] sdf4[5] dm-0[14](F) sdc4[2] sdb4[1] sda4[0] > 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/9] [UUU__UU_UUUU] > [>....................] recovery = 1.5% (22903832/1462653888) > finish=3188383.4min speed=7K/sec > > Khm.... :-D > It is working on something or stopped with 3 missing drive? : ^ ) > > (I have found the cause of the 2 dm's failure. > Now retry runs...) > > Cheers, > Janos > > > > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > What is displayed there seems like it can't be correct. Please run mdadm -Evvs mdadm -Dvvs and provide the results for us. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <7cfd01cae598$419e8d20$0400a8c0@dcccs>]
* Re: Suggestion needed for fixing RAID6 [not found] ` <7cfd01cae598$419e8d20$0400a8c0@dcccs> @ 2010-04-27 0:04 ` Michael Evans 0 siblings, 0 replies; 48+ messages in thread From: Michael Evans @ 2010-04-27 0:04 UTC (permalink / raw) To: linux-raid On Mon, Apr 26, 2010 at 4:29 PM, Janos Haar <janos.haar@netcenter.hu> wrote: > > ----- Original Message ----- From: "Michael Evans" <mjevans1983@gmail.com> > To: "Janos Haar" <janos.haar@netcenter.hu> > Cc: "MRK" <mrk@shiftmail.org>; <linux-raid@vger.kernel.org> > Sent: Tuesday, April 27, 2010 1:06 AM > Subject: Re: Suggestion needed for fixing RAID6 > > >> On Mon, Apr 26, 2010 at 3:39 PM, Janos Haar <janos.haar@netcenter.hu> >> wrote: >>> >>> ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> >>> To: "Janos Haar" <janos.haar@netcenter.hu> >>> Cc: <linux-raid@vger.kernel.org> >>> Sent: Monday, April 26, 2010 6:53 PM >>> Subject: Re: Suggestion needed for fixing RAID6 >>> >>> >>>> On 04/26/2010 02:52 PM, Janos Haar wrote: >>>>> >>>>> Oops, you are right! >>>>> It was my mistake. >>>>> Sorry, i will try it again, to support 2 drives with dm-cow. >>>>> I will try it. >>>> >>>> Great! post here the results... the dmesg in particular. >>>> The dmesg should contain multiple lines like this "raid5:md3: read error >>>> corrected ....." >>>> then you know it worked. >>> >>> md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[13](F) >>> sdg4[6] sdf4[5] dm-0[14](F) sdc4[2] sdb4[1] sda4[0] >>> 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/9] [UUU__UU_UUUU] >>> [>....................] recovery = 1.5% (22903832/1462653888) >>> finish=3188383.4min speed=7K/sec >>> >>> Khm.... :-D >>> It is working on something or stopped with 3 missing drive? : ^ ) >>> >>> (I have found the cause of the 2 dm's failure. >>> Now retry runs...) >>> >>> Cheers, >>> Janos >>> >>> >>> >>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >> >> What is displayed there seems like it can't be correct. Please run >> >> mdadm -Evvs >> >> mdadm -Dvvs >> >> and provide the results for us. > > I have wrongly assigned the dm devices (cross-linked) and the sync process > is freezed. > The snapshot is grown to the maximum of space, than both failed with write > error at the same time with out of space. > The md_sync process is freezed. > (I have to push the reset.) > > I think this is correct what we can see, because the process is freezed > before exit and can't change the state to failed. > > Cheers, > Janos > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please reply to all. It sounds like you need a LOT more space. Please carefully try again. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-26 16:53 ` MRK 2010-04-26 22:39 ` Janos Haar @ 2010-04-27 15:50 ` Janos Haar 2010-04-27 23:02 ` MRK 1 sibling, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-27 15:50 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org> Sent: Monday, April 26, 2010 6:53 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/26/2010 02:52 PM, Janos Haar wrote: >> >> Oops, you are right! >> It was my mistake. >> Sorry, i will try it again, to support 2 drives with dm-cow. >> I will try it. > > Great! post here the results... the dmesg in particular. > The dmesg should contain multiple lines like this "raid5:md3: read error > corrected ....." > then you know it worked. I am affraid i am still right about that.... ... end_request: I/O error, dev sdh, sector 1667152256 raid5:md3: read error not correctable (sector 1662188168 on dm-1). raid5: Disk failure on dm-1, disabling device. raid5: Operation continuing on 10 devices. raid5:md3: read error not correctable (sector 1662188176 on dm-1). raid5:md3: read error not correctable (sector 1662188184 on dm-1). raid5:md3: read error not correctable (sector 1662188192 on dm-1). raid5:md3: read error not correctable (sector 1662188200 on dm-1). raid5:md3: read error not correctable (sector 1662188208 on dm-1). raid5:md3: read error not correctable (sector 1662188216 on dm-1). raid5:md3: read error not correctable (sector 1662188224 on dm-1). raid5:md3: read error not correctable (sector 1662188232 on dm-1). raid5:md3: read error not correctable (sector 1662188240 on dm-1). ata8: EH complete sd 7:0:0:0: [sdh] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB) ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 ata8.00: port_status 0x20200000 ata8.00: cmd 25/00:f8:f5:ba:5e/00:03:63:00:00/e0 tag 0 dma 520192 in res 51/40:00:ef:bb:5e/40:00:63:00:00/e0 Emask 0x9 (media error) ata8.00: status: { DRDY ERR } ata8.00: error: { UNC } ata8.00: configured for UDMA/133 ata8: EH complete .... .... sd 7:0:0:0: [sdh] Add. Sense: Unrecovered read error - auto reallocate failed end_request: I/O error, dev sdh, sector 1667152879 __ratelimit: 36 callbacks suppressed raid5:md3: read error not correctable (sector 1662188792 on dm-1). raid5:md3: read error not correctable (sector 1662188800 on dm-1). md: md3: recovery done. raid5:md3: read error not correctable (sector 1662188808 on dm-1). raid5:md3: read error not correctable (sector 1662188816 on dm-1). raid5:md3: read error not correctable (sector 1662188824 on dm-1). raid5:md3: read error not correctable (sector 1662188832 on dm-1). raid5:md3: read error not correctable (sector 1662188840 on dm-1). raid5:md3: read error not correctable (sector 1662188848 on dm-1). raid5:md3: read error not correctable (sector 1662188856 on dm-1). raid5:md3: read error not correctable (sector 1662188864 on dm-1). ata8: EH complete sd 7:0:0:0: [sdh] Write Protect is off sd 7:0:0:0: [sdh] Mode Sense: 00 3a 00 00 sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 ata8.00: port_status 0x20200000 .... .... 
res 51/40:00:27:c0:5e/40:00:63:00:00/e0 Emask 0x9 (media error) ata8.00: status: { DRDY ERR } ata8.00: error: { UNC } ata8.00: configured for UDMA/133 sd 7:0:0:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK sd 7:0:0:0: [sdh] Sense Key : Medium Error [current] [descriptor] Descriptor sense data with sense descriptors (in hex): 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 63 5e c0 27 sd 7:0:0:0: [sdh] Add. Sense: Unrecovered read error - auto reallocate failed end_request: I/O error, dev sdh, sector 1667153959 __ratelimit: 86 callbacks suppressed raid5:md3: read error not correctable (sector 1662189872 on dm-1). raid5:md3: read error not correctable (sector 1662189880 on dm-1). raid5:md3: read error not correctable (sector 1662189888 on dm-1). raid5:md3: read error not correctable (sector 1662189896 on dm-1). raid5:md3: read error not correctable (sector 1662189904 on dm-1). raid5:md3: read error not correctable (sector 1662189912 on dm-1). raid5:md3: read error not correctable (sector 1662189920 on dm-1). raid5:md3: read error not correctable (sector 1662189928 on dm-1). raid5:md3: read error not correctable (sector 1662189936 on dm-1). raid5:md3: read error not correctable (sector 1662189944 on dm-1). ata8: EH complete sd 7:0:0:0: [sdh] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB) sd 7:0:0:0: [sdh] Write Protect is off sd 7:0:0:0: [sdh] Mode Sense: 00 3a 00 00 sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 7:0:0:0: [sdh] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB) sd 7:0:0:0: [sdh] Write Protect is off sd 7:0:0:0: [sdh] Mode Sense: 00 3a 00 00 sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA RAID5 conf printout: --- rd:12 wd:10 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:1, dev:dm-0 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 7, o:0, dev:dm-1 disk 8, o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4 RAID5 conf printout: --- rd:12 wd:10 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:1, dev:dm-0 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 7, o:0, dev:dm-1 disk 8, o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4 RAID5 conf printout: --- rd:12 wd:10 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:1, dev:dm-0 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 7, o:0, dev:dm-1 disk 8, o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4 RAID5 conf printout: --- rd:12 wd:10 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:1, dev:dm-0 disk 5, o:1, dev:sdf4 disk 6, o:1, dev:sdg4 disk 8, o:1, dev:sdi4 disk 9, o:1, dev:sdj4 disk 10, o:1, dev:sdk4 disk 11, o:1, dev:sdl4 md: recovery of RAID array md3 md: minimum _guaranteed_ speed: 1000 KB/sec/disk. md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery. md: using 128k window, over a total of 1462653888 blocks. md: resuming recovery of md3 from checkpoint. md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[13](F) sdg4[6] sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0] 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] [UUU_UUU_UUUU] [===============>.....] 
recovery = 75.3% (1101853312/1462653888) finish=292.3min speed=20565K/sec

du -h /sna*
1.1M /snapshot2.bin
1.1M /snapshot.bin

df -h
Filesystem   Size  Used Avail Use% Mounted on
/dev/md1      19G   16G  3.5G  82% /
/dev/md0      99M   34M   60M  36% /boot
tmpfs        2.0G     0  2.0G   0% /dev/shm

This is the current state. :-(
This way, the sync will stop again at 97.9%.
Another idea? Or how do I solve this dm-snapshot thing?

I think I know how this can happen: if I am right, the sync uses the normal
block size, which on Linux is usually 4 KB, but the bad blocks are 512 bytes.
Let's take one 4K window as an example: [BGBGBBGG] (B: bad, G: good sector).
The sync reads the block; the reported state is UNC because the drive reported
UNC for some sector in this area. md recalculates the first 512-byte bad block,
because its address is the same as the 4K block's, and rewrites it. Then it
re-reads the 4K block, which is still UNC because the 3rd sector is bad.
Can this be the issue?

Thanks,
Janos

> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-27 15:50 ` Janos Haar @ 2010-04-27 23:02 ` MRK 2010-04-28 1:37 ` Neil Brown 0 siblings, 1 reply; 48+ messages in thread From: MRK @ 2010-04-27 23:02 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid, Neil Brown On 04/27/2010 05:50 PM, Janos Haar wrote: > > ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> > To: "Janos Haar" <janos.haar@netcenter.hu> > Cc: <linux-raid@vger.kernel.org> > Sent: Monday, April 26, 2010 6:53 PM > Subject: Re: Suggestion needed for fixing RAID6 > > >> On 04/26/2010 02:52 PM, Janos Haar wrote: >>> >>> Oops, you are right! >>> It was my mistake. >>> Sorry, i will try it again, to support 2 drives with dm-cow. >>> I will try it. >> >> Great! post here the results... the dmesg in particular. >> The dmesg should contain multiple lines like this "raid5:md3: read >> error corrected ....." >> then you know it worked. > > I am affraid i am still right about that.... > > ... > end_request: I/O error, dev sdh, sector 1667152256 > raid5:md3: read error not correctable (sector 1662188168 on dm-1). > raid5: Disk failure on dm-1, disabling device. > raid5: Operation continuing on 10 devices. I think I can see a problem here: You had 11 active devices over 12 when you received the read error. At 11 devices over 12 your array is singly-degraded and this should be enough for raid6 to recompute the block from parity and perform the rewrite, correcting the read-error, but instead MD declared that it's impossible to correct the error, and dropped one more device (going to doubly-degraded). I think this is an MD bug, and I think I know where it is: --- linux-2.6.33-vanilla/drivers/md/raid5.c 2010-02-24 19:52:17.000000000 +0100 +++ linux-2.6.33/drivers/md/raid5.c 2010-04-27 23:58:31.000000000 +0200 @@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc clear_bit(R5_UPTODATE, &sh->dev[i].flags); atomic_inc(&rdev->read_errors); - if (conf->mddev->degraded) + if (conf->mddev->degraded == conf->max_degraded) printk_rl(KERN_WARNING "raid5:%s: read error not correctable " "(sector %llu on %s).\n", ------------------------------------------------------ (This is just compile-tested so try at your risk) I'd like to hear what Neil thinks of this... The problem here (apart from the erroneous error message) is that if execution goes inside that "if" clause, it will eventually reach the md_error() statement some 30 lines below there, which will have the effect of dropping one further device further worsening the situation instead of recovering it, and this is not the correct behaviour in this case as far as I understand. At the current state raid6 behaves like if it was a raid5, effectively supporting only one failed disk. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-27 23:02 ` MRK @ 2010-04-28 1:37 ` Neil Brown 2010-04-28 2:02 ` Mikael Abrahamsson 2010-04-28 12:57 ` MRK 0 siblings, 2 replies; 48+ messages in thread From: Neil Brown @ 2010-04-28 1:37 UTC (permalink / raw) To: MRK; +Cc: Janos Haar, linux-raid On Wed, 28 Apr 2010 01:02:14 +0200 MRK <mrk@shiftmail.org> wrote: > On 04/27/2010 05:50 PM, Janos Haar wrote: > > > > ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> > > To: "Janos Haar" <janos.haar@netcenter.hu> > > Cc: <linux-raid@vger.kernel.org> > > Sent: Monday, April 26, 2010 6:53 PM > > Subject: Re: Suggestion needed for fixing RAID6 > > > > > >> On 04/26/2010 02:52 PM, Janos Haar wrote: > >>> > >>> Oops, you are right! > >>> It was my mistake. > >>> Sorry, i will try it again, to support 2 drives with dm-cow. > >>> I will try it. > >> > >> Great! post here the results... the dmesg in particular. > >> The dmesg should contain multiple lines like this "raid5:md3: read > >> error corrected ....." > >> then you know it worked. > > > > I am affraid i am still right about that.... > > > > ... > > end_request: I/O error, dev sdh, sector 1667152256 > > raid5:md3: read error not correctable (sector 1662188168 on dm-1). > > raid5: Disk failure on dm-1, disabling device. > > raid5: Operation continuing on 10 devices. > > I think I can see a problem here: > You had 11 active devices over 12 when you received the read error. > At 11 devices over 12 your array is singly-degraded and this should be > enough for raid6 to recompute the block from parity and perform the > rewrite, correcting the read-error, but instead MD declared that it's > impossible to correct the error, and dropped one more device (going to > doubly-degraded). > > I think this is an MD bug, and I think I know where it is: > > > --- linux-2.6.33-vanilla/drivers/md/raid5.c 2010-02-24 > 19:52:17.000000000 +0100 > +++ linux-2.6.33/drivers/md/raid5.c 2010-04-27 23:58:31.000000000 +0200 > @@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc > > clear_bit(R5_UPTODATE, &sh->dev[i].flags); > atomic_inc(&rdev->read_errors); > - if (conf->mddev->degraded) > + if (conf->mddev->degraded == conf->max_degraded) > printk_rl(KERN_WARNING > "raid5:%s: read error not correctable " > "(sector %llu on %s).\n", > > ------------------------------------------------------ > (This is just compile-tested so try at your risk) > > I'd like to hear what Neil thinks of this... I think you've found a real bug - thanks. It would make the test '>=' rather than '==' as that is safer, otherwise I agree. > - if (conf->mddev->degraded) > + if (conf->mddev->degraded >= conf->max_degraded) Thanks, NeilBrown > > The problem here (apart from the erroneous error message) is that if > execution goes inside that "if" clause, it will eventually reach the > md_error() statement some 30 lines below there, which will have the > effect of dropping one further device further worsening the situation > instead of recovering it, and this is not the correct behaviour in this > case as far as I understand. > At the current state raid6 behaves like if it was a raid5, effectively > supporting only one failed disk. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 1:37 ` Neil Brown @ 2010-04-28 2:02 ` Mikael Abrahamsson 2010-04-28 2:12 ` Neil Brown 2010-04-28 12:57 ` MRK 1 sibling, 1 reply; 48+ messages in thread From: Mikael Abrahamsson @ 2010-04-28 2:02 UTC (permalink / raw) To: Neil Brown; +Cc: MRK, Janos Haar, linux-raid On Wed, 28 Apr 2010, Neil Brown wrote: >> I think I can see a problem here: >> You had 11 active devices over 12 when you received the read error. >> At 11 devices over 12 your array is singly-degraded and this should be >> enough for raid6 to recompute the block from parity and perform the >> rewrite, correcting the read-error, but instead MD declared that it's >> impossible to correct the error, and dropped one more device (going to >> doubly-degraded). >> >> I think this is an MD bug, and I think I know where it is: >> >> >> --- linux-2.6.33-vanilla/drivers/md/raid5.c 2010-02-24 >> 19:52:17.000000000 +0100 >> +++ linux-2.6.33/drivers/md/raid5.c 2010-04-27 23:58:31.000000000 +0200 >> @@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc >> >> clear_bit(R5_UPTODATE, &sh->dev[i].flags); >> atomic_inc(&rdev->read_errors); >> - if (conf->mddev->degraded) >> + if (conf->mddev->degraded == conf->max_degraded) >> printk_rl(KERN_WARNING >> "raid5:%s: read error not correctable " >> "(sector %llu on %s).\n", >> >> ------------------------------------------------------ >> (This is just compile-tested so try at your risk) >> >> I'd like to hear what Neil thinks of this... > > I think you've found a real bug - thanks. > > It would make the test '>=' rather than '==' as that is safer, otherwise I > agree. > >> - if (conf->mddev->degraded) >> + if (conf->mddev->degraded >= conf->max_degraded) If a raid6 device handling can reach this code path, could I also point out that the message says "raid5" and that this is confusing if it's referring to a degraded raid6? -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 2:02 ` Mikael Abrahamsson @ 2010-04-28 2:12 ` Neil Brown 2010-04-28 2:30 ` Mikael Abrahamsson 0 siblings, 1 reply; 48+ messages in thread From: Neil Brown @ 2010-04-28 2:12 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: MRK, Janos Haar, linux-raid On Wed, 28 Apr 2010 04:02:39 +0200 (CEST) Mikael Abrahamsson <swmike@swm.pp.se> wrote: > On Wed, 28 Apr 2010, Neil Brown wrote: > > >> I think I can see a problem here: > >> You had 11 active devices over 12 when you received the read error. > >> At 11 devices over 12 your array is singly-degraded and this should be > >> enough for raid6 to recompute the block from parity and perform the > >> rewrite, correcting the read-error, but instead MD declared that it's > >> impossible to correct the error, and dropped one more device (going to > >> doubly-degraded). > >> > >> I think this is an MD bug, and I think I know where it is: > >> > >> > >> --- linux-2.6.33-vanilla/drivers/md/raid5.c 2010-02-24 > >> 19:52:17.000000000 +0100 > >> +++ linux-2.6.33/drivers/md/raid5.c 2010-04-27 23:58:31.000000000 +0200 > >> @@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc > >> > >> clear_bit(R5_UPTODATE, &sh->dev[i].flags); > >> atomic_inc(&rdev->read_errors); > >> - if (conf->mddev->degraded) > >> + if (conf->mddev->degraded == conf->max_degraded) > >> printk_rl(KERN_WARNING > >> "raid5:%s: read error not correctable " > >> "(sector %llu on %s).\n", > >> > >> ------------------------------------------------------ > >> (This is just compile-tested so try at your risk) > >> > >> I'd like to hear what Neil thinks of this... > > > > I think you've found a real bug - thanks. > > > > It would make the test '>=' rather than '==' as that is safer, otherwise I > > agree. > > > >> - if (conf->mddev->degraded) > >> + if (conf->mddev->degraded >= conf->max_degraded) > > If a raid6 device handling can reach this code path, could I also point > out that the message says "raid5" and that this is confusing if it's > referring to a degraded raid6? > You could.... There are lots of places that say "raid5" where it could apply to raid4 or raid6 as well. Maybe I should change them all to 'raid456'... NeilBrown ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 2:12 ` Neil Brown @ 2010-04-28 2:30 ` Mikael Abrahamsson 2010-05-03 2:29 ` Neil Brown 0 siblings, 1 reply; 48+ messages in thread From: Mikael Abrahamsson @ 2010-04-28 2:30 UTC (permalink / raw) To: Neil Brown; +Cc: MRK, Janos Haar, linux-raid On Wed, 28 Apr 2010, Neil Brown wrote: > There are lots of places that say "raid5" where it could apply to raid4 > or raid6 as well. Maybe I should change them all to 'raid456'... That sounds like a good idea, or just call it "raid:" or "raid4/5/6". Don't know where we are in the stable kernel release cycle, but it would be super if this could make it in by next cycle, this code is handling the fault scenario that made me go from raid5 to raid6 :) -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 2:30 ` Mikael Abrahamsson @ 2010-05-03 2:29 ` Neil Brown 0 siblings, 0 replies; 48+ messages in thread From: Neil Brown @ 2010-05-03 2:29 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: MRK, Janos Haar, linux-raid On Wed, 28 Apr 2010 04:30:05 +0200 (CEST) Mikael Abrahamsson <swmike@swm.pp.se> wrote: > On Wed, 28 Apr 2010, Neil Brown wrote: > > > There are lots of places that say "raid5" where it could apply to raid4 > > or raid6 as well. Maybe I should change them all to 'raid456'... > > That sounds like a good idea, or just call it "raid:" or "raid4/5/6". > > Don't know where we are in the stable kernel release cycle, but it would > be super if this could make it in by next cycle, this code is handling the > fault scenario that made me go from raid5 to raid6 :) > We are very close to release of 2.6.34. I won't submit this before 2.6.34 is released as it is not a regression and not technically a data-corruption bug. However it will go into 2.6.35-rc1 and but submitted to -stable for 2.6.34.1 and probably other -stable kernels. NeilBrown ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 1:37 ` Neil Brown 2010-04-28 2:02 ` Mikael Abrahamsson @ 2010-04-28 12:57 ` MRK 2010-04-28 13:32 ` Janos Haar 1 sibling, 1 reply; 48+ messages in thread From: MRK @ 2010-04-28 12:57 UTC (permalink / raw) To: Neil Brown; +Cc: Janos Haar, linux-raid On 04/28/2010 03:37 AM, Neil Brown wrote: >> --- linux-2.6.33-vanilla/drivers/md/raid5.c 2010-02-24 >> 19:52:17.000000000 +0100 >> +++ linux-2.6.33/drivers/md/raid5.c 2010-04-27 23:58:31.000000000 +0200 >> @@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc >> >> clear_bit(R5_UPTODATE,&sh->dev[i].flags); >> atomic_inc(&rdev->read_errors); >> - if (conf->mddev->degraded) >> + if (conf->mddev->degraded == conf->max_degraded) >> printk_rl(KERN_WARNING >> "raid5:%s: read error not correctable " >> "(sector %llu on %s).\n", >> >> ------------------------------------------------------ >> (This is just compile-tested so try at your risk) >> >> I'd like to hear what Neil thinks of this... >> > I think you've found a real bug - thanks. > > It would make the test '>=' rather than '==' as that is safer, otherwise I > agree. > > >> - if (conf->mddev->degraded) >> + if (conf->mddev->degraded>= conf->max_degraded) >> Right, agreed... > Thanks, > NeilBrown > Ok then I'll post a more official patch in a separate email shortly, thanks ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 12:57 ` MRK @ 2010-04-28 13:32 ` Janos Haar 2010-04-28 14:19 ` MRK 0 siblings, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-28 13:32 UTC (permalink / raw) To: MRK; +Cc: linux-raid, Neil Brown MRK, Neil, Please let me have one wish: Please write down my name to the kernel tree with a note i was who reported and helped to track down this. :-) Thanks. Janos Haar ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Neil Brown" <neilb@suse.de> Cc: "Janos Haar" <janos.haar@netcenter.hu>; <linux-raid@vger.kernel.org> Sent: Wednesday, April 28, 2010 2:57 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/28/2010 03:37 AM, Neil Brown wrote: >>> --- linux-2.6.33-vanilla/drivers/md/raid5.c 2010-02-24 >>> 19:52:17.000000000 +0100 >>> +++ linux-2.6.33/drivers/md/raid5.c 2010-04-27 23:58:31.000000000 >>> +0200 >>> @@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc >>> >>> clear_bit(R5_UPTODATE,&sh->dev[i].flags); >>> atomic_inc(&rdev->read_errors); >>> - if (conf->mddev->degraded) >>> + if (conf->mddev->degraded == conf->max_degraded) >>> printk_rl(KERN_WARNING >>> "raid5:%s: read error not >>> correctable " >>> "(sector %llu on %s).\n", >>> >>> ------------------------------------------------------ >>> (This is just compile-tested so try at your risk) >>> >>> I'd like to hear what Neil thinks of this... >>> >> I think you've found a real bug - thanks. >> >> It would make the test '>=' rather than '==' as that is safer, otherwise >> I >> agree. >> >> >>> - if (conf->mddev->degraded) >>> + if (conf->mddev->degraded>= conf->max_degraded) >>> > > Right, agreed... > >> Thanks, >> NeilBrown >> > > Ok then I'll post a more official patch in a separate email shortly, > thanks ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 13:32 ` Janos Haar @ 2010-04-28 14:19 ` MRK 2010-04-28 14:51 ` Janos Haar 2010-04-29 7:55 ` Janos Haar 0 siblings, 2 replies; 48+ messages in thread From: MRK @ 2010-04-28 14:19 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid, Neil Brown On 04/28/2010 03:32 PM, Janos Haar wrote: > MRK, Neil, > > Please let me have one wish: > Please write down my name to the kernel tree with a note i was who > reported and helped to track down this. :-) > > Thanks. > Janos Haar Ok I did However it would be nice if you can actually test the patch and confirm that it solves your problem, starting with the raid6 array in singly-degraded mode like you did yesterday. Then I think we can add one further line on top: Tested-by: Janos Haar <janos.haar@netcenter.hu> before Neil (hopefully) acks it. Testing is needed anyway before pushing it to mainline, I think... ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 14:19 ` MRK @ 2010-04-28 14:51 ` Janos Haar 2010-04-29 7:55 ` Janos Haar 1 sibling, 0 replies; 48+ messages in thread From: Janos Haar @ 2010-04-28 14:51 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <gabriele.trombetti@gmail.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org>; "Neil Brown" <neilb@suse.de> Sent: Wednesday, April 28, 2010 4:19 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/28/2010 03:32 PM, Janos Haar wrote: >> MRK, Neil, >> >> Please let me have one wish: >> Please write down my name to the kernel tree with a note i was who >> reported and helped to track down this. :-) >> >> Thanks. >> Janos Haar > > Ok I did > However it would be nice if you can actually test the patch and confirm > that it solves your problem, starting with the raid6 array in > singly-degraded mode like you did yesterday. Then I think we can add one > further line on top: > > Tested-by: Janos Haar <janos.haar@netcenter.hu> > > before Neil (hopefully) acks it. Testing is needed anyway before pushing > it to mainline, I think... I am allready working on...... Please give me some time.... Janos > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-28 14:19 ` MRK 2010-04-28 14:51 ` Janos Haar @ 2010-04-29 7:55 ` Janos Haar 2010-04-29 15:22 ` MRK 1 sibling, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-29 7:55 UTC (permalink / raw) To: MRK; +Cc: linux-raid md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[13](F) sdg4[6 ] sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0] 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] [UUU_UUU_UUUU] [===========>.........] recovery = 56.8% (831095108/1462653888) finish=50 19.8min speed=2096K/sec Drive dropped again with this patch! + the kernel freezed. (I will try to get more info...) Janos ----- Original Message ----- From: "MRK" <**************> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org>; "Neil Brown" <neilb@suse.de> Sent: Wednesday, April 28, 2010 4:19 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/28/2010 03:32 PM, Janos Haar wrote: >> MRK, Neil, >> >> Please let me have one wish: >> Please write down my name to the kernel tree with a note i was who >> reported and helped to track down this. :-) >> >> Thanks. >> Janos Haar > > Ok I did > However it would be nice if you can actually test the patch and confirm > that it solves your problem, starting with the raid6 array in > singly-degraded mode like you did yesterday. Then I think we can add one > further line on top: > > Tested-by: Janos Haar <janos.haar@netcenter.hu> > > before Neil (hopefully) acks it. Testing is needed anyway before pushing > it to mainline, I think... > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-29 7:55 ` Janos Haar @ 2010-04-29 15:22 ` MRK 2010-04-29 21:07 ` Janos Haar 0 siblings, 1 reply; 48+ messages in thread From: MRK @ 2010-04-29 15:22 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 04/29/2010 09:55 AM, Janos Haar wrote: > > md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] > dm-1[13](F) sdg4[6 > ] sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0] > 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] > [UUU_UUU_UUUU] > [===========>.........] recovery = 56.8% (831095108/1462653888) > finish=50 > 19.8min speed=2096K/sec > > Drive dropped again with this patch! > + the kernel freezed. > (I will try to get more info...) > > Janos Hmm too bad :-( it seems it still doesn't work, sorry for that I suppose the kernel didn't freeze immediately after disabling the drive or you wouldn't have had the chance to cat /proc/mdstat... Hence dmesg messages might have gone to /var/log/messages or something. Can you look there to see if there is any interesting message to post here? Did the COW device fill up at least a bit? Also: you know that if you disable graphics on the server ("/etc/init.d/gdm stop" or something like that) you usually can see the stack trace of the kernel panic on screen when it hangs (unless terminal was blank for powersaving, which you can disable too). You can take a photo of that one (or write it down but it will be long) to so maybe somebody can understand why it hanged. You might be even obtain the stack trace through a serial port but that will take more effort. ^ permalink raw reply [flat|nested] 48+ messages in thread
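If a serial cable is not at hand, netconsole is another hedged option for catching the oops: it streams printk output over UDP to a second machine. It assumes the kernel was built with netconsole support, and the interface, IP and MAC addresses below are placeholders, not values from this setup:

# on the failing server
echo 8 > /proc/sys/kernel/printk     # let all printk levels reach the console
modprobe netconsole netconsole=6665@192.168.0.10/eth0,6666@192.168.0.20/00:11:22:33:44:55
# on the receiving machine (netcat option syntax varies between variants)
nc -u -l -p 6666 | tee netconsole.log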
* Re: Suggestion needed for fixing RAID6 2010-04-29 15:22 ` MRK @ 2010-04-29 21:07 ` Janos Haar 2010-04-29 23:00 ` MRK 0 siblings, 1 reply; 48+ messages in thread From: Janos Haar @ 2010-04-29 21:07 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org> Sent: Thursday, April 29, 2010 5:22 PM Subject: Re: Suggestion needed for fixing RAID6 > On 04/29/2010 09:55 AM, Janos Haar wrote: >> >> md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[13](F) >> sdg4[6 >> ] sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0] >> 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] >> [UUU_UUU_UUUU] >> [===========>.........] recovery = 56.8% (831095108/1462653888) >> finish=50 >> 19.8min speed=2096K/sec >> >> Drive dropped again with this patch! >> + the kernel freezed. >> (I will try to get more info...) >> >> Janos > > Hmm too bad :-( it seems it still doesn't work, sorry for that > > I suppose the kernel didn't freeze immediately after disabling the drive > or you wouldn't have had the chance to cat /proc/mdstat... this was this command in putty.exe window: watch "cat /proc/mdstat ; du -h /snap*" I think it have crashed soon. I had no time to recognize what happened and exit from the watch. > > Hence dmesg messages might have gone to /var/log/messages or something. > Can you look there to see if there is any interesting message to post > here? Yes, i know that. The crash was not written up unfortunately. But there is some info: (some UNC reported from sdh) .... Apr 29 09:50:29 Clarus-gl2k10-2 kernel: res 51/40:00:27:c0:5e/40:00:63:00:00/e0 Emask 0x9 (media error) Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: status: { DRDY ERR } Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: error: { UNC } Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: configured for UDMA/133 Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Sense Key : Medium Error [current] [descriptor] Apr 29 09:50:29 Clarus-gl2k10-2 kernel: Descriptor sense data with sense descriptors (in hex): Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 63 5e c0 27 Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Add. Sense: Unrecovered read error - auto reallocate failed Apr 29 09:50:29 Clarus-gl2k10-2 kernel: end_request: I/O error, dev sdh, sector 1667153959 Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189872 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189880 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189888 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189896 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189904 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189912 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189920 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189928 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189936 on dm-1). 
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189944 on dm-1). Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write Protect is off Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB) Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write Protect is off Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Apr 29 13:07:39 Clarus-gl2k10-2 syslogd 1.4.1: restart. > Did the COW device fill up at least a bit? The initial size is 1.1MB, and what we wants to see is only some kbytes... I don't know exactly. Next time i will try to reduce the initial size to 16KByte. > > Also: you know that if you disable graphics on the server > ("/etc/init.d/gdm stop" or something like that) you usually can see the > stack trace of the kernel panic on screen when it hangs (unless terminal > was blank for powersaving, which you can disable too). You can take a > photo of that one (or write it down but it will be long) to so maybe > somebody can understand why it hanged. You might be even obtain the stack > trace through a serial port but that will take more effort. This pc based server have no graphic card at all. :-) (this is one of my freak ideas) And the terminal is redirected to the com1. If i really want, i can catch this with serial cable, but i think the log should be enough from the messages file. Thanks, Janos ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-29 21:07 ` Janos Haar @ 2010-04-29 23:00 ` MRK 2010-04-30 6:17 ` Janos Haar 0 siblings, 1 reply; 48+ messages in thread From: MRK @ 2010-04-29 23:00 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 04/29/2010 11:07 PM, Janos Haar wrote: > > ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> > To: "Janos Haar" <janos.haar@netcenter.hu> > Cc: <linux-raid@vger.kernel.org> > Sent: Thursday, April 29, 2010 5:22 PM > Subject: Re: Suggestion needed for fixing RAID6 > > >> On 04/29/2010 09:55 AM, Janos Haar wrote: >>> >>> md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] >>> dm-1[13](F) sdg4[6 >>> ] sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0] >>> 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] >>> [UUU_UUU_UUUU] >>> [===========>.........] recovery = 56.8% >>> (831095108/1462653888) finish=50 >>> 19.8min speed=2096K/sec >>> >>> Drive dropped again with this patch! >>> + the kernel freezed. >>> (I will try to get more info...) >>> >>> Janos >> >> Hmm too bad :-( it seems it still doesn't work, sorry for that >> >> I suppose the kernel didn't freeze immediately after disabling the >> drive or you wouldn't have had the chance to cat /proc/mdstat... > > this was this command in putty.exe window: > watch "cat /proc/mdstat ; du -h /snap*" > good idea... > I think it have crashed soon. > I had no time to recognize what happened and exit from the watch. > >> >> Hence dmesg messages might have gone to /var/log/messages or >> something. Can you look there to see if there is any interesting >> message to post here? > > Yes, i know that. > The crash was not written up unfortunately. > But there is some info: > > (some UNC reported from sdh) > .... > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: res > 51/40:00:27:c0:5e/40:00:63:00:00/e0 Emask 0x9 (media error) > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: status: { DRDY ERR } > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: error: { UNC } > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: configured for UDMA/133 > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Sense Key : > Medium Error [current] [descriptor] > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: Descriptor sense data with > sense descriptors (in hex): > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 72 03 11 04 00 00 00 > 0c 00 0a 80 00 00 00 00 00 > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 63 5e c0 27 > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Add. Sense: > Unrecovered read error - auto reallocate failed > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: end_request: I/O error, dev > sdh, sector 1667153959 > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189872 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189880 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189888 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189896 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189904 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189912 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189920 on dm-1). 
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189928 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189936 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189944 on dm-1). > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write > Protect is off > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: > enabled, read cache: enabled, doesn't support DPO or FUA > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] 2930277168 > 512-byte hardware sectors: (1.50 TB/1.36 TiB) > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write > Protect is off > Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: > enabled, read cache: enabled, doesn't support DPO or FUA > Apr 29 13:07:39 Clarus-gl2k10-2 syslogd 1.4.1: restart. Hmm what strange... I don't see the message "Disk failure on %s, disabling device" \n "Operation continuing on %d devices" in your log. In MD raid456 the ONLY place where a disk is set faulty is this (file raid5.c): ---------------------- set_bit(Faulty, &rdev->flags); printk(KERN_ALERT "raid5: Disk failure on %s, disabling device.\n" "raid5: Operation continuing on %d devices.\n", bdevname(rdev->bdev,b), conf->raid_disks - mddev->degraded); ---------------------- ( which is called by md_error() ) As you can see, just after disabling the device it prints the dmesg message. I don't understand how you could catch a cat /proc/mdstat already reporting the disk as failed, and still not seeing the message in the /var/log/messages . But you do see messages that should come chronologically after that one. The errors like: "Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189872 on dm-1)." can now (after the patch) be generated only after raid-6 is in doubly-degraded state. I don't understand how those errors could become visible before the message telling that MD is disabling the device. To make the thing more strange, if raid-6 is in doubly-degraded state it means dm-1/sdh is disabled, but if dm-1/sdh is disabled MD should not have read anything from there. I mean there shouldn't have been any read error because there shouldn't have been any read. You are sure that a) this dmesg you reported really is from your last run of the resync b) above or below the messages you report there is no "Disk failure on ..., disabling device" string? Last thing, your system might have crashed because of the sd / SATA driver (instead of that being a direct bug of MD). You see, those are the last messages before the reboot, and the message about write cache is repeated. The driver might have tried to reset the drive, maybe quickly more than once. I'm not sure... but that could be a reason. Exactly what kernel version are you running now, after applying my patch? At the moment I don't have more ideas, sorry. I hope somebody else replies. In the meanwhile you might run it through the serial cable if you have some time. Maybe you can get more dmesg stuff that couldn't make it through /var/log/messages. And you would also get the kernel panic. Actually for the dmesg I think you can try with a "watch dmesg -c" via putty. Good luck ^ permalink raw reply [flat|nested] 48+ messages in thread
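If "watch dmesg -c" proves too fragile when the box locks up, a small variant is to drain the ring buffer into a file and sync it on every pass, so the lines written before the freeze have a chance to survive the reboot. The log path is arbitrary:

while true; do
    dmesg -c >> /root/md3-debug.log   # drain and clear the kernel ring buffer
    sync                              # push the file to disk before a possible freeze
    sleep 1
done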
* Re: Suggestion needed for fixing RAID6 2010-04-29 23:00 ` MRK @ 2010-04-30 6:17 ` Janos Haar 2010-04-30 23:54 ` MRK [not found] ` <4BDB6DB6.5020306@sh iftmail.org> 0 siblings, 2 replies; 48+ messages in thread From: Janos Haar @ 2010-04-30 6:17 UTC (permalink / raw) To: MRK; +Cc: Neil Brown, linux-raid Hello, OK, MRK you are right (again). There was some line in the messages wich avoids my attention. The entire log is here: http://download.netcenter.hu/bughunt/20100430/messages The dm founds invalid my cow devices, but i don't know why at this time. My setup script looks like this: "create-cow": rm -f /snapshot.bin rm -f /snapshot2.bin dd_rescue -v /dev/zero /snapshot.bin -m 4k -S 2000G dd_rescue -v /dev/zero /snapshot2.bin -m 4k -S 2000G losetup /dev/loop3 /snapshot.bin losetup /dev/loop4 /snapshot2.bin dd if=/dev/zero of=/dev/loop3 bs=1M count=1 dd if=/dev/zero of=/dev/loop4 bs=1M count=1 echo 0 $(blockdev --getsize /dev/sde4) \ snapshot /dev/sde4 /dev/loop3 p 8 | \ dmsetup create cow echo 0 $(blockdev --getsize /dev/sdh4) \ snapshot /dev/sdh4 /dev/loop4 p 8 | \ dmsetup create cow2 Now i have the last state, and there is more space left on the disk, and the snapshots are smalls: du -h /snapshot* 1.1M /snapshot2.bin 1.1M /snapshot.bin My new kernel is the same like the old one, only diff is the md-patch. Additionally i need to note, my kernel have only one additional patch wich differs from the normal tree, this patch is the pdflush-patch. (I can set the number of pdflushd's number in the proc.) I can try again, if there is any new idea, but it would be really good to do some trick with bitmaps or set the recovery's start point or something similar, because every time i need >16 hour to get the first poit where the raid do something interesting.... Neil, Can you say something useful about this? Thanks again, Janos ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org> Sent: Friday, April 30, 2010 1:00 AM Subject: Re: Suggestion needed for fixing RAID6 > On 04/29/2010 11:07 PM, Janos Haar wrote: >> >> ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> >> To: "Janos Haar" <janos.haar@netcenter.hu> >> Cc: <linux-raid@vger.kernel.org> >> Sent: Thursday, April 29, 2010 5:22 PM >> Subject: Re: Suggestion needed for fixing RAID6 >> >> >>> On 04/29/2010 09:55 AM, Janos Haar wrote: >>>> >>>> md3 : active raid6 sdd4[12] sdl4[11] sdk4[10] sdj4[9] sdi4[8] >>>> dm-1[13](F) sdg4[6 >>>> ] sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0] >>>> 14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] >>>> [UUU_UUU_UUUU] >>>> [===========>.........] recovery = 56.8% (831095108/1462653888) >>>> finish=50 >>>> 19.8min speed=2096K/sec >>>> >>>> Drive dropped again with this patch! >>>> + the kernel freezed. >>>> (I will try to get more info...) >>>> >>>> Janos >>> >>> Hmm too bad :-( it seems it still doesn't work, sorry for that >>> >>> I suppose the kernel didn't freeze immediately after disabling the drive >>> or you wouldn't have had the chance to cat /proc/mdstat... >> >> this was this command in putty.exe window: >> watch "cat /proc/mdstat ; du -h /snap*" >> > > good idea... > >> I think it have crashed soon. >> I had no time to recognize what happened and exit from the watch. >> >>> >>> Hence dmesg messages might have gone to /var/log/messages or something. >>> Can you look there to see if there is any interesting message to post >>> here? >> >> Yes, i know that. >> The crash was not written up unfortunately. 
>> But there is some info: >> >> (some UNC reported from sdh) >> .... >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: res >> 51/40:00:27:c0:5e/40:00:63:00:00/e0 Emask 0x9 (media error) >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: status: { DRDY ERR } >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: error: { UNC } >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8.00: configured for UDMA/133 >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Result: >> hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Sense Key : >> Medium Error [current] [descriptor] >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: Descriptor sense data with sense >> descriptors (in hex): >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 72 03 11 04 00 00 00 0c >> 00 0a 80 00 00 00 00 00 >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: 63 5e c0 27 >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Add. Sense: >> Unrecovered read error - auto reallocate failed >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: end_request: I/O error, dev sdh, >> sector 1667153959 >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189872 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189880 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189888 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189896 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189904 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189912 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189920 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189928 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189936 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not >> correctable (sector 1662189944 on dm-1). >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write Protect >> is off >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: >> enabled, read cache: enabled, doesn't support DPO or FUA >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] 2930277168 >> 512-byte hardware sectors: (1.50 TB/1.36 TiB) >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write Protect >> is off >> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: sd 7:0:0:0: [sdh] Write cache: >> enabled, read cache: enabled, doesn't support DPO or FUA >> Apr 29 13:07:39 Clarus-gl2k10-2 syslogd 1.4.1: restart. > > Hmm what strange... > I don't see the message "Disk failure on %s, disabling device" \n > "Operation continuing on %d devices" in your log. > > In MD raid456 the ONLY place where a disk is set faulty is this (file > raid5.c): > > ---------------------- > set_bit(Faulty, &rdev->flags); > printk(KERN_ALERT > "raid5: Disk failure on %s, disabling device.\n" > "raid5: Operation continuing on %d devices.\n", > bdevname(rdev->bdev,b), conf->raid_disks - > mddev->degraded); > ---------------------- > ( which is called by md_error() ) > > As you can see, just after disabling the device it prints the dmesg > message. 
> I don't understand how you could catch a cat /proc/mdstat already > reporting the disk as failed, and still not seeing the message in the > /var/log/messages . > > But you do see messages that should come chronologically after that one. > The errors like: > "Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not > correctable (sector 1662189872 on dm-1)." > can now (after the patch) be generated only after raid-6 is in > doubly-degraded state. I don't understand how those errors could become > visible before the message telling that MD is disabling the device. > > To make the thing more strange, if raid-6 is in doubly-degraded state it > means dm-1/sdh is disabled, but if dm-1/sdh is disabled MD should not have > read anything from there. I mean there shouldn't have been any read error > because there shouldn't have been any read. > > You are sure that > a) this dmesg you reported really is from your last run of the resync > b) above or below the messages you report there is no "Disk failure on > ..., disabling device" string? > > Last thing, your system might have crashed because of the sd / SATA driver > (instead of that being a direct bug of MD). You see, those are the last > messages before the reboot, and the message about write cache is repeated. > The driver might have tried to reset the drive, maybe quickly more than > once. I'm not sure... but that could be a reason. > > Exactly what kernel version are you running now, after applying my patch? > > At the moment I don't have more ideas, sorry. I hope somebody else > replies. > In the meanwhile you might run it through the serial cable if you have > some time. Maybe you can get more dmesg stuff that couldn't make it > through /var/log/messages. And you would also get the kernel panic. > Actually for the dmesg I think you can try with a "watch dmesg -c" via > putty. > > Good luck ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-04-30 6:17 ` Janos Haar @ 2010-04-30 23:54 ` MRK [not found] ` <4BDB6DB6.5020306@sh iftmail.org> 1 sibling, 0 replies; 48+ messages in thread From: MRK @ 2010-04-30 23:54 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 04/30/2010 08:17 AM, Janos Haar wrote: > Hello, > > OK, MRK you are right (again). > There was some line in the messages wich avoids my attention. > The entire log is here: > http://download.netcenter.hu/bughunt/20100430/messages > Ah here we go: Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots: Invalidating snapshot: Error reading/writing. Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device. Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Operation continuing on 10 devices. Apr 29 09:50:29 Clarus-gl2k10-2 kernel: md: md3: recovery done. Firstly I'm not totally sure of how DM passed the information of the device failing to MD. There is no error message about this on MD. If it was a read error, MD should have performed the rewrite but this apparently did not happen (the error message for a failed rewrite by MD I think is "read error NOT corrected!!"). But anyway... > The dm founds invalid my cow devices, but i don't know why at this time. > I have just had a brief look ad DM code. I understand like 1% of it right now, however I am thinking that in a not-perfectly-optimized way of doing things, if you specified 8 sectors (8x512b = 4k, which you did) granularity during the creation of your cow and cow2 devices, whenever you write to the COW device, DM might do the thing in 2 steps: 1- copy 8 (or multiple of 8) sectors from the HD to the cow device, enough to cover the area to which you are writing 2- overwrite such 8 sectors with the data coming from MD. Of course this is not optimal in case you are writing exactly 8 sectors with MD, and these are aligned to the ones that DM uses (both things I think are true in your case) because DM could have skipped #1 in this case. However supposing DM is not so smart and it indeed does not skip step #1, then I think I understand why it disables the device: it's because #1 fails with read error and DM does not know how to handle the situation in that case in general. If you had written a smaller amount with MD such as 512 bytes, if step #1 fails, what do you write in the other 7 sectors around it? The right semantics is not obvious so they disable the device. Firstly you could try with 1 sector granularity instead of 8, during the creation of dm cow devices. This MIGHT work around the issue if DM is at least a bit smart. Right now it's not obvious to me where in the is code the logic for the COW copying. Maybe tomorrow I will understand this. If this doesn't work, the best thing is probably if you can write to the DM mailing list asking why it behaves like this and if they can guess a workaround. You can keep me in cc, I'm interested. > [CUT] > > echo 0 $(blockdev --getsize /dev/sde4) \ > snapshot /dev/sde4 /dev/loop3 p 8 | \ > dmsetup create cow > > echo 0 $(blockdev --getsize /dev/sdh4) \ > snapshot /dev/sdh4 /dev/loop4 p 8 | \ > dmsetup create cow2 See, you are creating it with 8 sectors granularity... try with 1. > I can try again, if there is any new idea, but it would be really good > to do some trick with bitmaps or set the recovery's start point or > something similar, because every time i need >16 hour to get the first > poit where the raid do something interesting.... 
> > Neil, > Can you say something useful about this? > I just looked into this and it seems this feature is already there. See if you have these files: /sys/block/md3/md/sync_min and sync_max Those are the starting and ending sector. But keep in mind you have to enter them in multiples of the chunk size so if your chunk is e.g. 1024k then you need to enter multiples of 2048 (sectors). Enter the value before starting the sync. Or stop the sync by entering "idle" in sync_action, then change the sync_min value, then restart the sync entering "check" in sync_action. It should work, I just tried it on my comp. Good luck ^ permalink raw reply [flat|nested] 48+ messages in thread
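For the 1-sector-granularity retry suggested above, a minimal sketch reusing the device and loop names from Janos's own create-cow script, with only the chunk-size argument changed from 8 to 1; it assumes md3 is stopped and the old cow2 mapping is no longer in use:

dmsetup remove cow2                           # drop the old 8-sector mapping
dd if=/dev/zero of=/dev/loop4 bs=1M count=1   # re-zero the COW header as in the original script
echo 0 $(blockdev --getsize /dev/sdh4) \
  snapshot /dev/sdh4 /dev/loop4 p 1 | \
  dmsetup create cow2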
[parent not found: <4BDB6DB6.5020306@shiftmail.org>]
* Re: Suggestion needed for fixing RAID6
       [not found] ` <4BDB6DB6.5020306@shiftmail.org>
@ 2010-05-01  9:37   ` Janos Haar
  2010-05-01 17:17     ` MRK
  0 siblings, 1 reply; 48+ messages in thread
From: Janos Haar @ 2010-05-01  9:37 UTC (permalink / raw)
  To: MRK; +Cc: linux-raid

Hello,

Now I have tried with a 1-sector snapshot size, and the result was the same:
first the snapshot was invalidated, then the DM device was dropped from the
raid. The next thing was this:

md3 : active raid6 sdl4[11] sdk4[10] sdj4[9] sdi4[8] dm-1[12](F) sdg4[6]
sdf4[5] dm-0[4] sdc4[2] sdb4[1] sda4[0]
      14626538880 blocks level 6, 16k chunk, algorithm 2 [12/10] [UUU_UUU_UUUU]
      [===================>.]  resync = 99.9% (1462653628/1462653888)
finish=0.0min speed=2512K/sec

The sync progress bar jumped from 58.8% to 99.9%, the speed fell, and the
counter froze at 1462653628/1462653888.
I managed to run dmesg once by hand and save the output to a file, but the
system crashed after this.
The entire story took about 1 minute.

However, the sync_min option generally solves my problem, because I can build
up the missing disk from the 90%, which is good enough for me. :-)

If somebody is interested in playing more with this system, I still have some
days for it, but I am not interested anymore in tracing the md-dm behavior in
this situation....
Additionally, I don't want to put the data at risk if it is not really
needed....

Thanks a lot,
Janos Haar

----- Original Message ----- From: "MRK" <mrk@shiftmail.org>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <linux-raid@vger.kernel.org>
Sent: Saturday, May 01, 2010 1:54 AM
Subject: Re: Suggestion needed for fixing RAID6


> On 04/30/2010 08:17 AM, Janos Haar wrote:
>> Hello,
>>
>> OK, MRK you are right (again).
>> There was some line in the messages wich avoids my attention.
>> The entire log is here:
>> http://download.netcenter.hu/bughunt/20100430/messages
>>
>
> Ah here we go:
>
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots:
> Invalidating snapshot: Error reading/writing.
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1,
> disabling device.
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Operation continuing on 10
> devices.
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: md: md3: recovery done.
>
> Firstly I'm not totally sure of how DM passed the information of the
> device failing to MD. There is no error message about this on MD. If it
> was a read error, MD should have performed the rewrite but this apparently
> did not happen (the error message for a failed rewrite by MD I think is
> "read error NOT corrected!!"). But anyway...
>
>> The dm founds invalid my cow devices, but i don't know why at this time.
>>
>
> I have just had a brief look ad DM code. I understand like 1% of it right
> now, however I am thinking that in a not-perfectly-optimized way of doing
> things, if you specified 8 sectors (8x512b = 4k, which you did)
> granularity during the creation of your cow and cow2 devices, whenever you
> write to the COW device, DM might do the thing in 2 steps:
>
> 1- copy 8 (or multiple of 8) sectors from the HD to the cow device, enough
> to cover the area to which you are writing
> 2- overwrite such 8 sectors with the data coming from MD.
>
> Of course this is not optimal in case you are writing exactly 8 sectors
> with MD, and these are aligned to the ones that DM uses (both things I
> think are true in your case) because DM could have skipped #1 in this
> case.
> However supposing DM is not so smart and it indeed does not skip step #1, > then I think I understand why it disables the device: it's because #1 > fails with read error and DM does not know how to handle the situation in > that case in general. If you had written a smaller amount with MD such as > 512 bytes, if step #1 fails, what do you write in the other 7 sectors > around it? The right semantics is not obvious so they disable the device. > > Firstly you could try with 1 sector granularity instead of 8, during the > creation of dm cow devices. This MIGHT work around the issue if DM is at > least a bit smart. Right now it's not obvious to me where in the is code > the logic for the COW copying. Maybe tomorrow I will understand this. > > If this doesn't work, the best thing is probably if you can write to the > DM mailing list asking why it behaves like this and if they can guess a > workaround. You can keep me in cc, I'm interested. > > >> [CUT] >> >> echo 0 $(blockdev --getsize /dev/sde4) \ >> snapshot /dev/sde4 /dev/loop3 p 8 | \ >> dmsetup create cow >> >> echo 0 $(blockdev --getsize /dev/sdh4) \ >> snapshot /dev/sdh4 /dev/loop4 p 8 | \ >> dmsetup create cow2 > > See, you are creating it with 8 sectors granularity... try with 1. > >> I can try again, if there is any new idea, but it would be really good to >> do some trick with bitmaps or set the recovery's start point or something >> similar, because every time i need >16 hour to get the first poit where >> the raid do something interesting.... >> >> Neil, >> Can you say something useful about this? >> > > I just looked into this and it seems this feature is already there. > See if you have these files: > /sys/block/md3/md/sync_min and sync_max > Those are the starting and ending sector. > But keep in mind you have to enter them in multiples of the chunk size so > if your chunk is e.g. 1024k then you need to enter multiples of 2048 > (sectors). > Enter the value before starting the sync. Or stop the sync by entering > "idle" in sync_action, then change the sync_min value, then restart the > sync entering "check" in sync_action. It should work, I just tried it on > my comp. > > Good luck > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 2010-05-01 9:37 ` Janos Haar @ 2010-05-01 17:17 ` MRK 2010-05-01 21:44 ` Janos Haar 0 siblings, 1 reply; 48+ messages in thread From: MRK @ 2010-05-01 17:17 UTC (permalink / raw) To: Janos Haar; +Cc: linux-raid On 05/01/2010 11:37 AM, Janos Haar wrote: > Whoever, the sync_min option generally solves my problem, becasue i > can build up the missing disk from the 90% wich is good enough for me. :-) Are you sure? How do you do that? Resyncing a specific part is easy, replicating to a spare a specific part is not. If the disk you want to replace was 100% made of parity data that would be easy, you do that with a resync after replacing the disk, maybe multiple resyncs region by region, but in your case it is not made of only parity data. Only raid3 and 4 separate parity data from actual data, raid6 instead finely interleaves them. If you are thinking about replacing a disk with a new one (full of zeroes) and then resyncing manually region by region, you will destroy your data. Because in those chunks where the new disk acts as "actual data" the parity will be recomputed based on your newly introduced zeroes, and it will overwrite the parity data you had on the good disks, making recovery impossible from that point on. You really need to do the replication to a spare as a single step, from the beginning to the end. You cannot use sync_min and sync_max for that purpose. I think... unless bitmaps really do some magic in this, flagging the newly introduced disk as more recent than parity data... but do they really do this? people correct me if I'm wrong. ^ permalink raw reply [flat|nested] 48+ messages in thread
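MRK's warning can be illustrated with a one-byte toy example: the RAID6 P block is a plain XOR of the data blocks, so recomputing it region by region over a zero-filled replacement overwrites the only information that could rebuild the old data (the Q syndrome is a different code, but the same argument applies). The byte values are arbitrary:

D1=0xA5; D2=0x3C; D3=0x5F              # data bytes on three member disks
P=$(( D1 ^ D2 ^ D3 ))                  # parity as it sits on the good disks
D3_NEW=0x00                            # D3's disk replaced by a blank drive
P_NEW=$(( D1 ^ D2 ^ D3_NEW ))          # a region-by-region resync recomputes parity from zeroes
printf 'old P=%#x new P=%#x -> old D3 (%#x) can no longer be rebuilt\n' $P $P_NEW $D3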
* Re: Suggestion needed for fixing RAID6 2010-05-01 17:17 ` MRK @ 2010-05-01 21:44 ` Janos Haar 2010-05-02 23:05 ` MRK 2010-05-03 2:17 ` Neil Brown 0 siblings, 2 replies; 48+ messages in thread From: Janos Haar @ 2010-05-01 21:44 UTC (permalink / raw) To: MRK; +Cc: linux-raid ----- Original Message ----- From: "MRK" <mrk@shiftmail.org> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-raid@vger.kernel.org> Sent: Saturday, May 01, 2010 7:17 PM Subject: Re: Suggestion needed for fixing RAID6 > On 05/01/2010 11:37 AM, Janos Haar wrote: >> Whoever, the sync_min option generally solves my problem, becasue i can >> build up the missing disk from the 90% wich is good enough for me. :-) > > Are you sure? How do you do that? > Resyncing a specific part is easy, replicating to a spare a specific part > is not. If the disk you want to replace was 100% made of parity data that > would be easy, you do that with a resync after replacing the disk, maybe > multiple resyncs region by region, but in your case it is not made of only > parity data. Only raid3 and 4 separate parity data from actual data, raid6 > instead finely interleaves them. > If you are thinking about replacing a disk with a new one (full of zeroes) > and then resyncing manually region by region, you will destroy your data. > Because in those chunks where the new disk acts as "actual data" the > parity will be recomputed based on your newly introduced zeroes, and it > will overwrite the parity data you had on the good disks, making recovery > impossible from that point on. > You really need to do the replication to a spare as a single step, from > the beginning to the end. You cannot use sync_min and sync_max for that > purpose. You are right again, or at least close. :-) I have the missing sdd4 wich is 98% correctly rebuilt allready. But you are right, because the sync_min option not works for rebuilding disks, only for resyncing. (it is too smart to do the trick for me) > I think... unless bitmaps really do some magic in this, flagging the newly > introduced disk as more recent than parity data... but do they really do > this? people correct me if I'm wrong. Bitmap manipulation should work. I think i know how to do that, but the data is more important than try it on my own. I want to wait until somebody support this. ... or somebody have another good idea? The general problem is, i have one single-degraded RAID6 + 2 badblock disk inside wich have bads in different location. The big question is how to keep the integrity or how to do the rebuild by 2 step instead of one continous? Thanks again Janos > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6
  2010-05-01 21:44 ` Janos Haar
@ 2010-05-02 23:05   ` MRK
  2010-05-03  2:17   ` Neil Brown
  1 sibling, 0 replies; 48+ messages in thread
From: MRK @ 2010-05-02 23:05 UTC (permalink / raw)
To: Janos Haar; +Cc: linux-raid

On 05/01/2010 11:44 PM, Janos Haar wrote:
>
> But you are right that the sync_min option does not work for rebuilding
> disks, only for resyncing. (It is too smart to do the trick for me.)
>
>> I think... unless bitmaps really do some magic in this, flagging the
>> newly introduced disk as more recent than parity data... but do they
>> really do this? People correct me if I'm wrong.
>
> Bitmap manipulation should work.
> I think I know how to do that, but the data is more important than
> trying it on my own.
> I want to wait until somebody supports this.
> ... or does somebody have another good idea?

Firstly: do you have any backup of your data? If not, before doing any
experiment I suggest that you back up the important stuff. This can be
done with rsync, reassembling the array every time it goes down. I suggest
you put the array in read-only mode (mdadm --readonly /dev/md3): this
should prevent resyncs from starting automatically, and AFAIR it even
prevents drives from being dropped because of read errors (but you can't
use it during resyncs or rebuilds). Resyncs are bad because they will
eventually bring down your array. Don't use DM when doing this.

Now, for the real thing: instead of experimenting with bitmaps, I suggest
you try and see whether the normal MD resync works now. If that works,
then you can do the normal rebuild.

*Please note: DM should not be needed!* I know that you have tried
resyncing with a DM COW under MD and that it doesn't work well in this
case, but in fact DM should not be needed. We pointed you to DM around
April 23rd because at that time we thought your drives were being dropped
for uncorrectable read errors, but we had guessed wrong.

The general MD philosophy is that if there is enough parity information,
drives are not dropped just for a read error. Upon a read error MD
recomputes the value of the sector from the parity information, and then
it attempts to rewrite the block in place. During this rewrite the drive
performs a reallocation, moving the block to a hidden spare region. If
this rewrite fails it means the drive is out of spare sectors; this is
considered a major failure by MD, and only at that point is the drive
dropped. We thought this was the reason in your case too, but we were
wrong: in your case it was an MD bug, the one for which I submitted the
patch. So it should work now (without DM), and I think this is the safest
thing you can try. Having a backup is always better, though.

So start the resync without DM and see if it goes through to the end
without dropping drives. You can use sync_min to cut the dead times. For
maximum safety you could first try resyncing only one chunk from the
region of the damaged sectors, so as to provoke only a minimum number of
rewrites. Set sync_min to the location of the errors, and sync_max to just
one chunk above. See what happens... If it rewrites correctly and the
drive is not dropped, then run "check" again on the same region and see if
"cat /sys/block/md3/md/mismatch_cnt" still returns zero (or the value it
was before the rewrite). If it is zero (or anyway has not changed value),
it means the block was really rewritten with the correct value: recovery
of one sector really works for RAID6 in singly-degraded state.

Then the procedure is safe, as far as I understand, and you can go ahead
on the other chunks. When all damaged sectors are reallocated, there are
no more read errors, and mismatch_cnt is still at zero, you can go ahead
and replace the defective drive.

There are a few reasons that could still make the resync fail if we are
really unlucky, but dmesg should point us in the right direction in that
case.

Also remember that the patch still needs testing... currently it is not
really tested, because DM drops the drive before MD does. We would need to
know whether RAID6 is behaving like a RAID6 now, or is still behaving like
a RAID5...

Thank you

^ permalink raw reply	[flat|nested] 48+ messages in thread
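Put as commands, the per-region test described above would look roughly
like the following. This is a sketch only: md3, sync_action and
mismatch_cnt come from the thread, while the sector window values are
placeholders you would take from the read-error addresses reported in
dmesg:

    # keep the array quiet while choosing the window
    mdadm --readonly /dev/md3

    # window around the damaged sectors (placeholder values, in sectors)
    echo 1662189312 > /sys/block/md3/md/sync_min
    echo 1662190336 > /sys/block/md3/md/sync_max

    mdadm --readwrite /dev/md3
    echo repair > /sys/block/md3/md/sync_action   # recompute from parity, rewrite in place
    cat /proc/mdstat                              # wait for the window to complete

    # verify: re-check the same window; the mismatch count should not grow
    echo check > /sys/block/md3/md/sync_action
    cat /sys/block/md3/md/mismatch_cnt
    dmesg | tail                                  # confirm no drive was dropped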
* Re: Suggestion needed for fixing RAID6
  2010-05-01 21:44 ` Janos Haar
  2010-05-02 23:05   ` MRK
@ 2010-05-03  2:17   ` Neil Brown
  2010-05-03 10:04     ` MRK
       [not found]     ` <4BDE9FB6.80309@shiftmail.org>
  1 sibling, 2 replies; 48+ messages in thread
From: Neil Brown @ 2010-05-03 2:17 UTC (permalink / raw)
To: Janos Haar; +Cc: MRK, linux-raid

On Sat, 1 May 2010 23:44:04 +0200 "Janos Haar" <janos.haar@netcenter.hu> wrote:

> The general problem is, I have one singly-degraded RAID6 + 2 bad-block
> disks inside, which have bad sectors in different locations.
> The big question is how to keep the integrity, or how to do the rebuild
> in 2 steps instead of one continuous pass?

Once you have the fix that has already been discussed in this thread, the
only other problem I can see with this situation is if attempts to write
good data over the read errors result in a write error which causes the
device to be evicted from the array.

And I think you have reported getting write errors.

The following patch should address this issue for you.
It is *not* a general-purpose fix, but a specific fix to address an issue
you are having. It might be appropriate to make this configurable via
sysfs, or possibly even to try to auto-detect the situation and not bother
writing.

Longer term I want to add support for storing a bad-block list per device
so that a write error just fails that block, not the whole device. I just
need to organise my time so that I make progress on that project.

NeilBrown

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c181438..fd73929 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3427,6 +3427,12 @@ static void handle_stripe6(struct stripe_head *sh)
 		    && !test_bit(R5_LOCKED, &dev->flags)
 		    && test_bit(R5_UPTODATE, &dev->flags)
 			) {
+#if 1
+			/* We have recovered the data, but don't
+			 * trust the device enough to write back
+			 */
+			clear_bit(R5_ReadError, &dev->flags);
+#else
 			if (!test_bit(R5_ReWrite, &dev->flags)) {
 				set_bit(R5_Wantwrite, &dev->flags);
 				set_bit(R5_ReWrite, &dev->flags);
@@ -3438,6 +3444,7 @@ static void handle_stripe6(struct stripe_head *sh)
 				set_bit(R5_LOCKED, &dev->flags);
 				s.locked++;
 			}
+#endif
 		}
 	}

^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6
  2010-05-03  2:17 ` Neil Brown
@ 2010-05-03 10:04   ` MRK
  2010-05-03 10:21     ` MRK
  2010-05-03 21:02     ` Neil Brown
  1 sibling, 2 replies; 48+ messages in thread
From: MRK @ 2010-05-03 10:04 UTC (permalink / raw)
To: Neil Brown; +Cc: Janos Haar, linux-raid

On 05/03/2010 04:17 AM, Neil Brown wrote:
> On Sat, 1 May 2010 23:44:04 +0200 "Janos Haar" <janos.haar@netcenter.hu> wrote:
>
>> The general problem is, I have one singly-degraded RAID6 + 2 bad-block
>> disks inside, which have bad sectors in different locations.
>> The big question is how to keep the integrity, or how to do the rebuild
>> in 2 steps instead of one continuous pass?
>
> Once you have the fix that has already been discussed in this thread, the
> only other problem I can see with this situation is if attempts to write
> good data over the read errors result in a write error which causes the
> device to be evicted from the array.
>
> And I think you have reported getting write errors.

His dmesg AFAIR has never reported any error of the kind "raid5:%s: read
error NOT corrected!!" (the error message you get on a failed rewrite,
AFAIU).
Up to now (after my patch) he has only tried with MD above DM-COW, and DM
was dropping the drive on read error, so I think MD didn't get any
opportunity to rewrite.

It is not clear to me what kind of error MD got from DM:

Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots: Invalidating snapshot: Error reading/writing.
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.

I don't understand from what place md_error() is called... but also in
this case it doesn't look like a rewrite error...

I think without DM COW it should probably work in his case.

Your new patch skips the rewriting and keeps the unreadable sectors,
right? So that the drive isn't dropped on rewrite...

> The following patch should address this issue for you.
> It is *not* a general-purpose fix, but a specific fix
[CUT]

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6
  2010-05-03 10:04 ` MRK
@ 2010-05-03 10:21   ` MRK
  2010-05-03 21:04     ` Neil Brown
  0 siblings, 1 reply; 48+ messages in thread
From: MRK @ 2010-05-03 10:21 UTC (permalink / raw)
To: MRK, Neil Brown; +Cc: Janos Haar, linux-raid

On 05/03/2010 12:04 PM, MRK wrote:
> On 05/03/2010 04:17 AM, Neil Brown wrote:
>> On Sat, 1 May 2010 23:44:04 +0200 "Janos Haar" <janos.haar@netcenter.hu> wrote:
>>
>>> The general problem is, I have one singly-degraded RAID6 + 2 bad-block
>>> disks inside, which have bad sectors in different locations.
>>> The big question is how to keep the integrity, or how to do the
>>> rebuild in 2 steps instead of one continuous pass?
>>
>> Once you have the fix that has already been discussed in this thread,
>> the only other problem I can see with this situation is if attempts to
>> write good data over the read errors result in a write error which
>> causes the device to be evicted from the array.
>>
>> And I think you have reported getting write errors.
>
> His dmesg AFAIR has never reported any error of the kind "raid5:%s: read
> error NOT corrected!!" (the error message you get on a failed rewrite,
> AFAIU).
> Up to now (after my patch) he has only tried with MD above DM-COW, and
> DM was dropping the drive on read error, so I think MD didn't get any
> opportunity to rewrite.
>
> It is not clear to me what kind of error MD got from DM:
>
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots: Invalidating snapshot: Error reading/writing.
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.
>
> I don't understand from what place md_error() is called...
[CUT]

Oh, and there is another issue I wanted to raise.

His last dmesg:
http://download.netcenter.hu/bughunt/20100430/messages

Much after the line:
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.

there are many lines like this:
Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189872 on dm-1).

How come MD still wants to read from a device it has disabled?
It looks like a problem to me... Does MD also scrub failed devices during
a check?

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6
  2010-05-03 10:21 ` MRK
@ 2010-05-03 21:04   ` Neil Brown
  0 siblings, 0 replies; 48+ messages in thread
From: Neil Brown @ 2010-05-03 21:04 UTC (permalink / raw)
To: MRK; +Cc: Janos Haar, linux-raid

On Mon, 03 May 2010 12:21:08 +0200 MRK <mrk@shiftmail.org> wrote:

> On 05/03/2010 12:04 PM, MRK wrote:
>> On 05/03/2010 04:17 AM, Neil Brown wrote:
>>> On Sat, 1 May 2010 23:44:04 +0200 "Janos Haar" <janos.haar@netcenter.hu> wrote:
>>>
>>>> The general problem is, I have one singly-degraded RAID6 + 2
>>>> bad-block disks inside, which have bad sectors in different
>>>> locations.
>>>> The big question is how to keep the integrity, or how to do the
>>>> rebuild in 2 steps instead of one continuous pass?
>>>
>>> Once you have the fix that has already been discussed in this thread,
>>> the only other problem I can see with this situation is if attempts to
>>> write good data over the read errors result in a write error which
>>> causes the device to be evicted from the array.
>>>
>>> And I think you have reported getting write errors.
>>
>> His dmesg AFAIR has never reported any error of the kind "raid5:%s:
>> read error NOT corrected!!" (the error message you get on a failed
>> rewrite, AFAIU).
>> Up to now (after my patch) he has only tried with MD above DM-COW, and
>> DM was dropping the drive on read error, so I think MD didn't get any
>> opportunity to rewrite.
>>
>> It is not clear to me what kind of error MD got from DM:
>>
>> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots: Invalidating snapshot: Error reading/writing.
>> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
>> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.
>>
>> I don't understand from what place md_error() is called...
>> [CUT]
>
> Oh, and there is another issue I wanted to raise.
>
> His last dmesg:
> http://download.netcenter.hu/bughunt/20100430/messages
>
> Much after the line:
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.
>
> there are many lines like this:
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5:md3: read error not correctable (sector 1662189872 on dm-1).
>
> How come MD still wants to read from a device it has disabled?
> It looks like a problem to me...

There are often many IO requests in flight at the same time. When one
returns with an error we might fail the device, but there are still lots
more that have not yet completed. As they complete we might write messages
about them - even after we have reported the device as 'failed'.

But we never initiate an IO after the device has been marked 'faulty'.

NeilBrown

> Does MD also scrub failed devices during a check?
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6
  2010-05-03 10:04 ` MRK
  2010-05-03 10:21   ` MRK
@ 2010-05-03 21:02   ` Neil Brown
  1 sibling, 0 replies; 48+ messages in thread
From: Neil Brown @ 2010-05-03 21:02 UTC (permalink / raw)
To: MRK; +Cc: Janos Haar, linux-raid

On Mon, 03 May 2010 12:04:38 +0200 MRK <mrk@shiftmail.org> wrote:

> On 05/03/2010 04:17 AM, Neil Brown wrote:
>> On Sat, 1 May 2010 23:44:04 +0200 "Janos Haar" <janos.haar@netcenter.hu> wrote:
>>
>>> The general problem is, I have one singly-degraded RAID6 + 2 bad-block
>>> disks inside, which have bad sectors in different locations.
>>> The big question is how to keep the integrity, or how to do the
>>> rebuild in 2 steps instead of one continuous pass?
>>
>> Once you have the fix that has already been discussed in this thread,
>> the only other problem I can see with this situation is if attempts to
>> write good data over the read errors result in a write error which
>> causes the device to be evicted from the array.
>>
>> And I think you have reported getting write errors.
>
> His dmesg AFAIR has never reported any error of the kind "raid5:%s: read
> error NOT corrected!!" (the error message you get on a failed rewrite,
> AFAIU).
> Up to now (after my patch) he has only tried with MD above DM-COW, and
> DM was dropping the drive on read error, so I think MD didn't get any
> opportunity to rewrite.

Hmmm... fair enough.

> It is not clear to me what kind of error MD got from DM:
>
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots: Invalidating snapshot: Error reading/writing.
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.
>
> I don't understand from what place md_error() is called...

I suspect it is from raid5_end_write_request.
It looks like we don't print any message when the re-write fails. Only if
the read after the rewrite fails.

> but also in this case it doesn't look like a rewrite error...

... so I suspect it is a rewrite error. Unless I missed something.
What message did you expect to see in the case of a re-write error?

> I think without DM COW it should probably work in his case.
>
> Your new patch skips the rewriting and keeps the unreadable sectors,
> right? So that the drive isn't dropped on rewrite...

Correct.

>> The following patch should address this issue for you.
>> It is *not* a general-purpose fix, but a specific fix
> [CUT]

NeilBrown

^ permalink raw reply	[flat|nested] 48+ messages in thread
[parent not found: <4BDE9FB6.80309@shiftmail.org>]
* Re: Suggestion needed for fixing RAID6
       [not found] ` <4BDE9FB6.80309@shiftmail.org>
@ 2010-05-03 10:20   ` Janos Haar
  2010-05-05 15:24   ` Suggestion needed for fixing RAID6 [SOLVED] Janos Haar
  1 sibling, 0 replies; 48+ messages in thread
From: Janos Haar @ 2010-05-03 10:20 UTC (permalink / raw)
To: MRK; +Cc: Neil Brown, linux-raid

----- Original Message -----
From: "MRK" <mrk@shiftmail.org>
To: "Neil Brown" <neilb@suse.de>
Cc: "Janos Haar" <janos.haar@netcenter.hu>; <linux-raid@vger.kernel.org>
Sent: Monday, May 03, 2010 12:04 PM
Subject: Re: Suggestion needed for fixing RAID6

> On 05/03/2010 04:17 AM, Neil Brown wrote:
>> On Sat, 1 May 2010 23:44:04 +0200 "Janos Haar" <janos.haar@netcenter.hu> wrote:
>>
>>> The general problem is, I have one singly-degraded RAID6 + 2 bad-block
>>> disks inside, which have bad sectors in different locations.
>>> The big question is how to keep the integrity, or how to do the
>>> rebuild in 2 steps instead of one continuous pass?
>>
>> Once you have the fix that has already been discussed in this thread,
>> the only other problem I can see with this situation is if attempts to
>> write good data over the read errors result in a write error which
>> causes the device to be evicted from the array.
>>
>> And I think you have reported getting write errors.
>
> His dmesg AFAIR has never reported any error of the kind "raid5:%s: read
> error NOT corrected!!" (the error message you get on a failed rewrite,
> AFAIU).
> Up to now (after my patch) he has only tried with MD above DM-COW, and
> DM was dropping the drive on read error, so I think MD didn't get any
> opportunity to rewrite.
>
> It is not clear to me what kind of error MD got from DM:
>
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots: Invalidating snapshot: Error reading/writing.
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.
>
> I don't understand from what place md_error() is called...
> but also in this case it doesn't look like a rewrite error...
>
> I think without DM COW it should probably work in his case.
>
> Your new patch skips the rewriting and keeps the unreadable sectors,
> right? So that the drive isn't dropped on rewrite...
>
>> The following patch should address this issue for you.
>> It is *not* a general-purpose fix, but a specific fix
> [CUT]

Just a little note:
I have 2 bad drives. One has bad sectors at 54% and more than 2500 UNC
sectors, which is too much to try to repair; this drive is really
failing...
The other has only 123 bad sectors at 99%, which is a very small scratch
on the platter, so I am now trying to fix that drive instead.

The repair-check sync process is running now; I will reply again soon...

Thanks,
Janos

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Suggestion needed for fixing RAID6 [SOLVED]
       [not found] ` <4BDE9FB6.80309@shiftmail.org>
  2010-05-03 10:20   ` Janos Haar
@ 2010-05-05 15:24   ` Janos Haar
  2010-05-05 19:27     ` MRK
  1 sibling, 1 reply; 48+ messages in thread
From: Janos Haar @ 2010-05-05 15:24 UTC (permalink / raw)
To: MRK; +Cc: linux-raid, Neil Brown

> It is not clear to me what kind of error MD got from DM:
>
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: device-mapper: snapshots: Invalidating snapshot: Error reading/writing.
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: ata8: EH complete
> Apr 29 09:50:29 Clarus-gl2k10-2 kernel: raid5: Disk failure on dm-1, disabling device.
>
> I don't understand from what place md_error() is called...
> but also in this case it doesn't look like a rewrite error...
>
> I think without DM COW it should probably work in his case.

First, sorry for the delay.
Without DM, the original behaviour-fix patch worked very well.
Neil is generally right that the drive should reallocate the bad sectors
on rewrite, but that is the ideal scenario, which unfortunately is far
from the real world...
I needed to repeat the "repair" sync method 4 times on the better HDD
(which has only 123 bad sectors) to get it readable again.
The other HDD has more than 2500 bad sectors and looks like it has no
chance of being fixed this way.

> Your new patch skips the rewriting and keeps the unreadable sectors,
> right? So that the drive isn't dropped on rewrite...
>
>> The following patch should address this issue for you.
>> It is *not* a general-purpose fix, but a specific fix
> [CUT]

Neil, I think this patch should be controllable from sysfs or /proc and be
inactive by default; it would of course be good for recovering bad cases
like mine.
There are a lot of HDD problems which can produce really uncorrectable
sectors that cannot be made good again even by rewriting...

Thanks a lot to all who helped me solve this...

And MRK, please don't forget to write in my name. :-)

Cheers,
Janos

^ permalink raw reply	[flat|nested] 48+ messages in thread
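Whether the repeated "repair" passes actually pushed a drive to remap its
weak sectors can be followed from SMART. A sketch, assuming smartmontools
is installed; /dev/sdX stands in for the member drive being treated:

    # watch the remapping-related attributes before and after each pass
    smartctl -A /dev/sdX | egrep -i 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

A shrinking Current_Pending_Sector count together with a growing
Reallocated_Sector_Ct suggests the rewrites are being absorbed by the
drive's spare area; counts that do not move at all point to sectors that
are truly uncorrectable, as described above.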
* Re: Suggestion needed for fixing RAID6 [SOLVED]
  2010-05-05 15:24 ` Suggestion needed for fixing RAID6 [SOLVED] Janos Haar
@ 2010-05-05 19:27   ` MRK
  0 siblings, 0 replies; 48+ messages in thread
From: MRK @ 2010-05-05 19:27 UTC (permalink / raw)
To: Janos Haar; +Cc: linux-raid, Neil Brown

On 05/05/2010 05:24 PM, Janos Haar wrote:
>> I think without DM COW it should probably work in his case.
>
> First, sorry for the delay.
> Without DM, the original behaviour-fix patch worked very well.

Great! OK, I have just resubmitted the patch (v2), which includes a
"Tested-by: Janos Haar <janos.haar@netcenter.hu>" line and a few fixes to
the description.

> [CUT]
> Thanks a lot to all who helped me solve this...
>
> And MRK, please don't forget to write in my name. :-)

I did it. Now it's in Neil's hands; hopefully he acks it and pushes it to
mainline.

Thanks everybody,
GAT

^ permalink raw reply	[flat|nested] 48+ messages in thread
Thread overview: 48+ messages
2010-04-22 10:09 Suggestion needed for fixing RAID6 Janos Haar
2010-04-22 15:00 ` Mikael Abrahamsson
2010-04-22 15:12 ` Janos Haar
2010-04-22 15:18 ` Mikael Abrahamsson
2010-04-22 16:25 ` Janos Haar
2010-04-22 16:32 ` Peter Rabbitson
[not found] ` <4BD0AF2D.90207@stud.tu-ilmenau.de>
2010-04-22 20:48 ` Janos Haar
2010-04-23 6:51 ` Luca Berra
2010-04-23 8:47 ` Janos Haar
2010-04-23 12:34 ` MRK
2010-04-24 19:36 ` Janos Haar
2010-04-24 22:47 ` MRK
2010-04-25 10:00 ` Janos Haar
2010-04-26 10:24 ` MRK
2010-04-26 12:52 ` Janos Haar
2010-04-26 16:53 ` MRK
2010-04-26 22:39 ` Janos Haar
2010-04-26 23:06 ` Michael Evans
[not found] ` <7cfd01cae598$419e8d20$0400a8c0@dcccs>
2010-04-27 0:04 ` Michael Evans
2010-04-27 15:50 ` Janos Haar
2010-04-27 23:02 ` MRK
2010-04-28 1:37 ` Neil Brown
2010-04-28 2:02 ` Mikael Abrahamsson
2010-04-28 2:12 ` Neil Brown
2010-04-28 2:30 ` Mikael Abrahamsson
2010-05-03 2:29 ` Neil Brown
2010-04-28 12:57 ` MRK
2010-04-28 13:32 ` Janos Haar
2010-04-28 14:19 ` MRK
2010-04-28 14:51 ` Janos Haar
2010-04-29 7:55 ` Janos Haar
2010-04-29 15:22 ` MRK
2010-04-29 21:07 ` Janos Haar
2010-04-29 23:00 ` MRK
2010-04-30 6:17 ` Janos Haar
2010-04-30 23:54 ` MRK
[not found] ` <4BDB6DB6.5020306@shiftmail.org>
2010-05-01 9:37 ` Janos Haar
2010-05-01 17:17 ` MRK
2010-05-01 21:44 ` Janos Haar
2010-05-02 23:05 ` MRK
2010-05-03 2:17 ` Neil Brown
2010-05-03 10:04 ` MRK
2010-05-03 10:21 ` MRK
2010-05-03 21:04 ` Neil Brown
2010-05-03 21:02 ` Neil Brown
[not found] ` <4BDE9FB6.80309@shiftmail.org>
2010-05-03 10:20 ` Janos Haar
2010-05-05 15:24 ` Suggestion needed for fixing RAID6 [SOLVED] Janos Haar
2010-05-05 19:27 ` MRK