* Spares and partitioning huge disks
From: maarten @ 2005-01-06 14:16 UTC
To: linux-raid
Hi
I just got my 4 new 250GB disks.  I read someone on this list advocating
that it is better to build arrays from smaller volumes, as that decreases the
chance of failure, especially a failure of two disks in a raid5 configuration.
The idea behind it was that since a drive gets kicked when a read error
occurs, the chance that a 40 GB part develops a read error is lower than for
the full-size 250 GB.  Thus, if you have 24 40GB parts, there is no fatal
two-disk failure when part sda6 and part sdc4 develop a bad sector at the
same time.  On the other hand, if the (full-size) disks sda1 and sdc1 do fail
at the same time, you're in deep shit.

I thought it was really insightful, so I would like to try that now.
(Thanks to the original poster, I don't recall your name, sorry.)
Now my two questions regarding this:

1) What is better: make 6 raid5 arrays consisting of all the 40GB partitions
and group them in an LVM set, or group them in a raid-0 set (if the latter is
even possible, that is)?  A sketch of the layout I have in mind follows
below.
2) Seeing as the 'physical' volumes are now 40 GB, I could add an older 80GB
disk partitioned into two 40GB halves, and use those two as hot-spares.
However, for that to work you'd have to be able to add the spares to _all_
raid sets, not to specific ones, if you understand what I mean.  So they
would act as 'roaming' spares, and they would get used by the first array
that needs a spare (when a failure occurs, of course).  But... is this
possible?
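
Concretely, the six arrays would be created roughly like this (just a
sketch; it assumes the four disks show up as sda-sdd and have already been
split into six ~40GB partitions each):

    # one RAID5 array per partition "slot", across all four disks
    for i in 1 2 3 4 5 6; do
        mdadm --create /dev/md$((i-1)) --level=5 --raid-devices=4 \
              /dev/sda$i /dev/sdb$i /dev/sdc$i /dev/sdd$i
    done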
Thanks for any insights!
Maarten
--
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-06 16:46 UTC
To: 'maarten', linux-raid

This is from "man mdadm":

    As well as reporting events, mdadm may move a spare drive from one
    array to another if they are in the same spare-group and if the
    destination array has a failed drive but no spares.

You can do what you want.  I have never tried.  My arrays are too different.
I don't want to waste an 18Gig spare on a 256M array.

Guy
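
For what it's worth, a spare-group setup for this scheme might look roughly
like the following in /etc/mdadm.conf (device names and the group name are
made up; the spare partition itself would first be added to one of the
arrays, and mdadm in monitor mode then moves it to whichever array in the
group loses a member):

    DEVICE /dev/sd[abcd]* /dev/hde*
    ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1 spare-group=slot
    ARRAY /dev/md1 devices=/dev/sda2,/dev/sdb2,/dev/sdc2,/dev/sdd2 spare-group=slot
    # ... and so on for md2 through md5 ...

and then something like "mdadm --monitor --scan --daemonise --mail=root"
has to be running for the spare to actually be moved.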
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-06 17:08 UTC
To: linux-raid

On Thursday 06 January 2005 17:46, Guy wrote:
> This is from "man mdadm":

Oops, really?  Sorry, I should have checked that myself.  Mea culpa.

> As well as reporting events, mdadm may move a spare drive from one
> array to another if they are in the same spare-group and if the
> destination array has a failed drive but no spares.
>
> You can do what you want.  I have never tried.  My arrays are too
> different.  I don't want to waste an 18Gig spare on a 256M array.

Same (but different) idea here: I'd hate to waste a 250GB disk on a spare. :-)

I still have one LVM issue to figure out.  The documentation says you should
set the partition type to 0x8e, but I can't find whether that applies to md
devices too, and if it does, how you set it.  By running 'fdisk /dev/mdx'?
Well, why not, I suppose, but I've never run fdisk on an md device, only
mkfs.*

Thanks Guy,
Maarten
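
(As far as I can tell, the 0x8e type only matters for partitions on plain
disks, where it acts as a hint for vgscan; an md device carries no partition
table, so fdisk is not needed at all and the whole device is handed to LVM
directly.  A sketch:

    pvcreate /dev/md0      # no partition table or 0x8e type involved
    pvdisplay /dev/md0     # sanity check

The same would go for the other five md devices.)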
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-06 17:31 UTC
To: 'maarten', linux-raid

This idea of splitting larger disks into smaller partitions, then
re-assembling them, seems odd.  But it should help with the "bad block kicks
out a disk" problem.

I have never used RAID0.  I have never used more than 1 PV with LVM on Linux.
However, if you are going to use LVM anyway, why not let LVM assemble the
disks?  I do that sort of thing all the time with HP-UX.  I create striped
mirrors using 4 or more disks.  With HP-UX, use the -D option with lvcreate.
No idea if Linux LVM can stripe.

You are making me think!  I hate that! :)  Since your 6 RAID5 arrays are on
the same 4 disks, striping them will kill performance.  The poor heads will
be going from one end to the other all the time.  You should use LINEAR if
you combine them with md.  If you use LVM, make sure it does not stripe them.
With LVM on HP-UX, the default behavior is not to stripe.

Guy
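
If the md route is taken, a linear array on top of the six RAID5 arrays
would look something like this (a sketch; the md6 name and the filesystem
are arbitrary choices):

    mdadm --create /dev/md6 --level=linear --raid-devices=6 \
          /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    mkfs.ext3 /dev/md6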
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-06 18:18 UTC
To: linux-raid

On Thursday 06 January 2005 18:31, Guy wrote:
> This idea of splitting larger disks into smaller partitions, then
> re-assembling them, seems odd.  But it should help with the "bad block
> kicks out a disk" problem.

Yes.  And I'm absolutely sure I read it on linux-raid, a couple of months
back.

> However, if you are going to use LVM anyway, why not let LVM assemble
> the disks?  I do that sort of thing all the time with HP-UX.  I create
> striped mirrors using 4 or more disks.  With HP-UX, use the -D option
> with lvcreate.  No idea if Linux LVM can stripe.

I think so.  But I am more familiar with md, so I'll still use that.  In any
case LVM's striping is akin to raid-0, whereas I will definitely use raid-5.

> You are making me think!  I hate that! :)

;-)  Terrible, isn't it.

> Since your 6 RAID5 arrays are on the same 4 disks, striping them will
> kill performance.  The poor heads will be going from one end to the
> other all the time.  You should use LINEAR if you combine them with md.
> If you use LVM, make sure it does not stripe them.  With LVM on HP-UX,
> the default behavior is not to stripe.

Exactly what I thought.  That they are on the same disks should not matter;
only when one full md set ((4-1)*40GB = 120GB) is full (or used, or whatever)
will the "access" move on to the next set of drives.  It is indeed imperative
NOT to have LVM striping (nor to use raid-0, thanks for observing that!), as
that would be totally counterproductive and would thus kill performance
(r/w head thrashing).

For all clarity, this is how it would look:

md0 : active raid5 sda1[0] sdb1[1] sdc1[2] sdd1[3]
      40000000 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
...
...
...
md5 : active raid5 sda6[0] sdb6[1] sdc6[2] sdd6[3]
      40000000 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

The LVM part is still new to me, but the goal is simply to add all PVs
/dev/md0 through /dev/md5 to one VG and carve one LV out of it, yielding...
well, a very large volume. :-)

I was planning to do this quickly tonight, but I've overlooked one essential
thing ;-|  The old server already has 220 GB of data on 4 80GB disks in
raid-5.  But I cannot connect all 8 disks at the same time, so I'll have to
'free up' another system to define the arrays and copy the data over Gbit
LAN.  I definitely don't want to lose the data!  What complicates this a bit
is that I wanted to copy the OS verbatim (it is not part of that raid-5 set,
just raid-1).  But I suppose booting a rescue CD would enable me to somehow
netcat the OS over to the new disks...  We'll see.  But for now I'm searching
my home for a spare system with SATA onboard...  :-)

Maarten
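
The LVM side of that plan would presumably look something like this (an
untested sketch; the volume group and LV names are made up, and lvcreate
only stripes if explicitly asked to with -i, so the default gives the
linear, non-thrashing layout discussed above):

    pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    vgcreate bigvg /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    vgdisplay bigvg                    # note the Free PE count
    lvcreate -l <free_PE_count> -n data bigvg   # plain linear LV, no striping
    mkfs.ext3 /dev/bigvg/data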
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-06 19:42 UTC
To: Mike Hardy, linux-raid

On Thursday 06 January 2005 19:30, Mike Hardy wrote:
> maarten wrote:
> > I was planning to do this quickly tonight, but I've overlooked one
> > essential thing ;-|  The old server already has 220 GB of data on 4
> > 80GB disks in raid-5.  But I cannot connect all 8 disks at the same
> > time, so I'll have to 'free up' another system to define the arrays
> > and copy the data over Gbit LAN.  I definitely don't want to lose the
> > data!  What complicates this a bit is that I wanted to copy the OS
> > verbatim (it is not part of that raid-5 set, just raid-1).  But I
> > suppose booting a rescue CD would enable me to somehow netcat the OS
> > over to the new disks...  We'll see.
>
> You could degrade the current raid5 set by plugging one of the new
> drives in and copying the 220GB to it directly, then you could build the
> new raid5 sets with one drive "missing" and then finally dump the data
> on the new raid5's and then hotadd the missing drive

Hey.  I knew that trick of course, but only now that you mention it do I
realize that indeed one single new disk is big enough to hold all of the old
data.  Stupid :)  I never thought of that.  Go figure.  Those disks get BIG
indeed...! :-))

Right now I took my main(*) fileserver offline, unplugged all the disks from
it and connected the new disks.  Using a RIP CDrom I partitioned them and
'mkraid'ed (mdadm not yet being on RIP) the first md device, which will hold
the OS.  As we speak the netcat session is running, so if I didn't make any
typos or thinkos, soon I will hopefully reboot to a full-fledged system.

(*) These new disks are not for my _fileserver_, but for my MythTV PVR, the
machine they will eventually end up in, which holds the 220 GB of TV &
movies.

So far, so good...

Maarten
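
Mike's trick, spelled out as a rough sketch (device names hypothetical):
build each new array with one member given as "missing", dump the data onto
the degraded arrays, and hot-add the last partition once the disk holding
the temporary copy is free again:

    # create the array degraded, keeping sdd aside to hold the copied data
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/sda1 /dev/sdb1 /dev/sdc1 missing
    # ...copy the data in, dismantle the old array, then:
    mdadm /dev/md0 --add /dev/sdd1    # md rebuilds parity onto it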
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-07 20:59 UTC
To: linux-raid

maarten <maarten@ultratux.net> wrote:
> I just got my 4 new 250GB disks.  I read someone on this list advocating
> that it is better to build arrays from smaller volumes, as that decreases
> the chance of failure, especially a failure of two disks in a raid5
> configuration.

This might be true for read errors.  However, if a whole disk dies (perhaps
because the IDE controller fails, assuming you're using IDE disks, or
because of a temperature failure or something like that) with a couple of
partitions on it, you get a lot of simultaneously failing 'disks'
(partitions), which would completely kill your RAID5, because RAID5 can IMHO
only recover from one failing device.  I'd assume such a setup would kill
you in this case, while with only 4 devices (whole 250G disks) you'd survive
it.  I'm quite sure one could get it pieced back together with more or less
expert knowledge, but I believe the complete RAID would stop processing
first.

Just to make this clear: all of this is spontaneous assumption, I have never
played with RAID5.

regards,
   Mario
--
I heard, if you play a NT-CD backwards, you get satanic messages...
That's nothing.  If you play it forwards, it installs NT.
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-07 21:57 UTC
To: 'Mario Holbe', linux-raid

His plan is to split the disks into 6 partitions.
Each of his six RAID5 arrays will only use 1 partition of each physical disk.
If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed disk.
If he gets 2 read errors, on different disks, at the same time, he has a 1/6
chance they would be in the same array (which would be bad).

Everything SHOULD work just fine. :)

His plan is to combine the 6 arrays with LVM or a linear array.

Guy
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-08 10:22 UTC
To: linux-raid

Guy <bugzilla@watkins-home.com> wrote:
> Each of his six RAID5 arrays will only use 1 partition of each physical
> disk.
> His plan is to combine the 6 arrays with LVM or a linear array.

Ah, I just missed that part, sorry & thanks :)
I agree with you then.  It's something like a RAID5+0 (analogous to RAID1+0)
and it *should* work just fine :)

regards,
   Mario
--
Independence Day: Fortunately, the alien computer operating system works
just fine with the laptop.  This proves an important point which Apple
enthusiasts have known for years.  While the evil empire of Microsoft may
dominate the computers of Earth people, more advanced life forms clearly
prefer Macs.
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 12:19 UTC
To: linux-raid

On Saturday 08 January 2005 11:22, Mario Holbe wrote:
> Guy <bugzilla@watkins-home.com> wrote:
> > Each of his six RAID5 arrays will only use 1 partition of each physical
> > disk.
> > His plan is to combine the 6 arrays with LVM or a linear array.
>
> Ah, I just missed that part, sorry & thanks :)
> I agree with you then.  It's something like a RAID5+0 (analogous to
> RAID1+0) and it *should* work just fine :)

Yes, it should.  And the array does indeed work, but I'm plagued with a host
of other -unrelated- problems now. :-(

First, I had to upgrade from kernel 2.4 to 2.6 because I lacked a driver for
my SATA card.  That was quite complicated, as there is no official support
for running 2.6 kernels on Suse 9.0.  It also entailed migrating to lvm2, as
lvm1 is not part of 2.6.  But as it turns out, the 2.6 kernel I eventually
installed somehow does not initialize my bttv cards correctly, and ALSA has
problems too.  So now I'm reverting back to a 2.4 kernel which should support
SATA, version 2.4.28, which I'm building as we speak from vanilla kernel
sources...  But that has a lot of hurdles too, and I'd still have to find an
ALSA driver for that as well.  Worse, I fear that I will have to migrate back
to lvm1 now too, so that means copying all the 200 Gig of data _again_, which
by itself takes about 10 hours... :-(

Ah, if only I had bought ATA disks instead of SATA!

I've thought about putting the disks in another system and just using them
over NFS, but that would mean another system powered up 24/7, and that gets
to be a bit much.  Reinstalling from scratch with a 9.1 / 2.6 distro is a
worse option still, as mythtv carries such a bucket full of obscure
dependencies I'd hate to install all that again.

In other words, I'm not there yet, but at least that has little or nothing to
do with lvm or md.  But this does 'suck' a lot.

Maarten
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-08 16:33 UTC
To: 'maarten', linux-raid

Maarten,
I was thinking again!

You plan on using an 80 gig disk as a spare disk: 2 40 Gig partitions.  If
both spares end up in the same RAID5 array, that would be bad!  mdadm
supports spare groups; you should create 2 groups and put 1 spare partition
in each group.  Then put 1/2 of your RAID5 arrays in each group.

Guy
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 16:58 UTC
To: linux-raid

On Saturday 08 January 2005 17:33, Guy wrote:
> You plan on using an 80 gig disk as a spare disk: 2 40 Gig partitions.  If
> both spares end up in the same RAID5 array, that would be bad!  mdadm
> supports spare groups; you should create 2 groups and put 1 spare
> partition in each group.  Then put 1/2 of your RAID5 arrays in each group.

Good point.  But I was planning to monitor it a bit, so I suppose I'd notice
that, and add another disk to remedy it.  I just decommissioned 4 80GB drives
so there's plenty where they came from. ;)

...

As it turns out I do indeed need to kill the LVM2 array and downgrade to lvm1
yet again, because 2.4.28 seems to have no support for it.  Bummer.  That
isn't too bad; the raid arrays stay active so the long resyncs will not
happen, just the incredibly slow tar-over-netcat network backup session.
Before I do this I must make sure that the old data disks are still okay...

Maarten
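
For reference, that kind of tar-over-netcat copy usually looks something like
this (host name, port and paths are made up, and netcat option syntax differs
slightly between versions):

    # on the receiving machine:
    nc -l -p 7000 | tar -C /mnt/newdata -xpf -
    # on the sending machine:
    tar -C /mnt/olddata -cpf - . | nc newbox 7000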
* Re: Spares and partitioning huge disks
From: Frank van Maarseveen @ 2005-01-08 14:52 UTC
To: Guy
Cc: 'Mario Holbe', linux-raid

On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote:
> His plan is to split the disks into 6 partitions.
> Each of his six RAID5 arrays will only use 1 partition of each physical
> disk.
> If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed
> disk.
> If he gets 2 read errors, on different disks, at the same time, he has a
> 1/6 chance they would be in the same array (which would be bad).
> His plan is to combine the 6 arrays with LVM or a linear array.

Intriguing setup.  Do you think this actually improves the reliability with
respect to disk failure compared to creating just one large RAID5 array?
For a second I thought it was a clever trick, but gut feeling tells me the
odds of losing the entire array won't change (simplified, because the
increased complexity creates room for additional errors).

--
Frank
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-08 15:50 UTC
To: linux-raid

Frank van Maarseveen <frankvm@frankvm.com> wrote:
> Intriguing setup.  Do you think this actually improves the reliability
> with respect to disk failure compared to creating just one large RAID5

Well, there is this one special case where it's a bit more robust: sector
read errors.

> me the odds of losing the entire array won't change (simplified, because
> the increased complexity creates room for additional errors).

You don't do anything else with RAID1 or 5 or whatever: you add code to
reduce the impact of a single disk failure.  You add new points of failure
to reduce the impact of other points of failure.  In this case here, you add
code (the RAID0 or LVM code, whichever you like more) to reduce the impact
of two sector read errors on two disks.  Of course the new code can contain
new points of failure.

It's as always: know the risk and decide :)

regards,
   Mario
--
<jv> Oh well, config
<jv> one actually wonders what force in the universe is holding it
<jv> and makes it working
<Beeth> chances and accidents :)
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-08 16:32 UTC
To: 'Frank van Maarseveen'
Cc: 'Mario Holbe', linux-raid

I don't recall having 2 disks with read errors at the same time, but others
on this list have.  Correctable read errors are my most common problem with
my 14 disk array.  I think this partitioning approach will help.  But as you
say, it is more complicated, which adds some risk, I believe.  You can
compute the level of reduced risk, but you can't compute the level of
increased risk.

Some added risk: a more complicated setup increases user errors.
Example: Maarten plans to have 2 spare partitions on an extra disk.  Once he
corrects the read error on the failed partition, he needs to remove the
failed partition, fail the spare and add the original partition back to the
correct array.  He has a 6 times increased risk of choosing the wrong
partition to fail or remove.  Is that a 36 times increased risk of user
error?  Of course, the level of error may be negligible, depending on who
the user is.  But it is still an increase of risk.  There was at least 1 case
on this list where someone failed or removed the wrong disk from an array,
so it does happen.

If 6 partitions is 6 times better than 1, then 36 would be 6 times better
than 6.  Is there a sweet spot?

Also, I mentioned it before: don't combine the RAID5 arrays with RAID0.
Since the RAID5 arrays are on the same set of disks, the poor disk heads
will be flapping all over the place.  Use a linear array, or LVM.

Also, Neil has an item on his wish list to handle bad blocks.  Once this is
built into md, the 6 partition idea is useless.

I test my disks every night with a tool from Seagate.  I don't think I have
had a bad block since I started using this tool each night.  The tool is
free; it is called "SeaTools Enterprise Edition".  I assume it only works
with Seagate disks.

Guy
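
A poor man's version of that nightly check, for disks without a vendor tool,
is simply to read every sector and let any errors show up in the kernel log
(a sketch; adjust the device list, and note it costs a full surface read of
each disk):

    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        dd if=$d of=/dev/null bs=1M || echo "read problem on $d"
    done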
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 17:16 UTC
To: linux-raid

On Saturday 08 January 2005 17:32, Guy wrote:
> I don't recall having 2 disks with read errors at the same time, but
> others on this list have.  Correctable read errors are my most common
> problem with my 14 disk array.  I think this partitioning approach will
> help.  But as you say, it is more complicated, which adds some risk, I
> believe.  You can compute the level of reduced risk, but you can't
> compute the level of increased risk.

True.  Especially since LVM is completely new to me.

> Some added risk: a more complicated setup increases user errors.

I have confidence in myself (knock, knock).  I triple-check every action I do
against the output of 'cat /proc/mdstat' before hitting [enter], so as not to
make thinking errors like using hdf5 instead of hde6, and similar mistakes.
I'm paranoid by nature, so that helps, too ;-)

> Example: Maarten plans to have 2 spare partitions on an extra disk.  Once
> he corrects the read error on the failed partition, he needs to remove the
> failed partition, fail the spare and add the original partition back to
> the correct array.

You must mean in the other order.  If I fail the spare first, I'm toast! ;-)

> He has a 6 times increased risk of choosing the wrong partition to fail
> or remove.  Is that a 36 times increased risk of user error?  Of course,
> the level of error may be negligible, depending on who the user is.  But
> it is still an increase of risk.

First of all you need to make everything as uniform as possible, meaning all
disks belonging to array md3 are numbered hdX6, all of md4 are hdX7, etc.
I suppose this goes without saying for most people here, but it helps a LOT.

> If 6 partitions is 6 times better than 1, then 36 would be 6 times better
> than 6.  Is there a sweet spot?

Heh.  Somewhere between 1 and 36, I'd bet. :)

> Also, Neil has an item on his wish list to handle bad blocks.  Once this
> is built into md, the 6 partition idea is useless.

I know, but I'm not going to wait for that.  For now I have limited options.
Mine has not only the benefits outlined, but also the benefit of being able
to use an older disk as a spare.  I guess having this with a spare beats
having one huge array without a spare.  Or else I'd need to buy yet another
250GB drive, and they're not really 'dirt cheap', if you know what I mean.

> I test my disks every night with a tool from Seagate.  I don't think I
> have had a bad block since I started using this tool each night.  The
> tool is free; it is called "SeaTools Enterprise Edition".  I assume it
> only works with Seagate disks.

That's interesting.  Is that an _online_ test, or do you stop the array every
night?  The latter would seem quite error-prone by itself already, and the
former... well, I don't suppose Seagate supports Linux, really.

Maarten
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-08 18:55 UTC
To: 'maarten', linux-raid

My warning about user error was not targeted at you! :)
Sorry if it seemed so.

And the order does not matter!

A:
  Remove the failed disk.
  Fail the spare.
  System is degraded.
  Add the failed/repaired disk.
  Rebuild starts.

B:
  Remove the failed disk.
  Add the failed/repaired disk.
  Fail the spare.
  System is degraded.
  Rebuild starts.

Both A and B above require the array to go degraded until the repaired disk
is rebuilt.  But with A, the longer you delay adding the repaired disk, the
longer you are degraded.  In my case, that would be less than 1 minute.  I do
fail the spare last, but it is not really much of an issue.  No toast anyway!

It would be cool if the rebuild to the repaired disk could be done before the
spare was failed or removed.  Then the array would not be degraded at all.

If I ever re-build my system, or build a new system, I hope to use RAID6.

The Seagate test is on-line.  Before I started using the Seagate tool, I used
dd.

My disks claim to be able to re-locate bad blocks on read error.  But I am
not sure if this applies to correctable errors or not.  If non-correctable
errors are re-located, what data does the drive return?  Since I don't know,
I don't use this option.  I did use this option for a while, but after
re-reading about it, I got concerned and turned it off.

This is from the readme file:

    Automatic Read Reallocation Enable (ARRE)
    -Marre on/off     enable/disable ARRE bit
        On, drive automatically relocates bad blocks detected during read
        operations.  Off, drive creates Check condition status with sense
        key of Medium Error if bad blocks are detected during read
        operations.

Guy
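
Procedure B, written out with hypothetical names (say sdb4 failed out of md3,
the spare hde5 took over and has resynced, and sdb4 has since been repaired):

    mdadm /dev/md3 --remove /dev/sdb4    # drop the failed member
    mdadm /dev/md3 --add /dev/sdb4       # re-add it; it comes back as a spare
    mdadm /dev/md3 --fail /dev/hde5      # array degrades, rebuild onto sdb4 starts
    mdadm /dev/md3 --remove /dev/hde5    # once rebuilt, return hde5 to spare duty
    mdadm /dev/md3 --add /dev/hde5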
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 19:25 UTC
To: linux-raid

On Saturday 08 January 2005 19:55, you wrote:
> My warning about user error was not targeted at you! :)
> Sorry if it seemed so.

:-)

> And the order does not matter!

Hm... yes, you're right.  But adding the disk is more prudent (or is it?)
Grr.  Now you've got ME thinking! ;-)

Normally, the minute a drive fails, it gets kicked, the spare kicks in and md
syncs this spare.  We now have a non-degraded array again.  If I then fail
the spare first, the array goes into degraded mode.  Whereas if I hot-add the
disk, it becomes a spare.  Presumably if I now fail the original spare, the
real disk will get synced again, to get the same setup as before.  But yes,
you're right; during this step it is degraded again.  Oh well...

> It would be cool if the rebuild to the repaired disk could be done before
> the spare was failed or removed.  Then the array would not be degraded at
> all.

Yes, but this would be impossible to do, since md cannot anticipate _which_
disk you're going to fail before it happens. ;)

> If I ever re-build my system, or build a new system, I hope to use RAID6.

I tried this last fall, but it didn't work out then.  See the list archives.

> The Seagate test is on-line.  Before I started using the Seagate tool, I
> used dd.

I'm not as cautious as you are.  I just pray the hot spare does what it's
supposed to do.

> My disks claim to be able to re-locate bad blocks on read error.  But I am
> not sure if this applies to correctable errors or not.  If non-correctable
> errors are re-located, what data does the drive return?  Since I don't
> know, I don't use this option.  I did use this option for a while, but
> after re-reading about it, I got concerned and turned it off.

Afaik, if a drive senses it gets more 'difficult' than usual to read a
sector, it will automatically copy it to a spare sector and reassign it.
However, I doubt the OS gets any wiser when this happens, so neither would
md.  In which cases the error gets noticed by md I don't precisely know, but
I reckon that may well be when the error is uncorrectable.  Not
_undetectable_, to quote from another thread... 8-)

> This is from the readme file:
>     Automatic Read Reallocation Enable (ARRE)
>     -Marre on/off     enable/disable ARRE bit
>         On, drive automatically relocates bad blocks detected during read
>         operations.  Off, drive creates Check condition status with sense
>         key of Medium Error if bad blocks are detected during read
>         operations.

Hm.  I would definitely ENable that option.  But what do I know.  It also
depends, I guess, on how fatal reading bad data undetected is for you.  For
me, if one of my mpegs or mp3s develops a bad sector, I can probably live
with that. :-)

Maarten
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-08 20:33 UTC
To: linux-raid

maarten <maarten@ultratux.net> wrote:
> On Saturday 08 January 2005 19:55, you wrote:
>> My disks claim to be able to re-locate bad blocks on read error.  But I
>> am not sure if this applies to correctable errors or not.  If
>> non-correctable errors are re-located, what data does the drive return?
>> Since I don't know, I
...
> Afaik, if a drive senses it gets more 'difficult' than usual to read a
> sector, it will automatically copy it to a spare sector and reassign it.
> However, I

No, this is usually not the case.  At least I don't know of IDE drives that
do so.  This is why I call it a 'sector read error'.

Each newer disk has some amount of 'spare sectors' which can be used to
relocate bad sectors.  Usually, you have two situations where you can detect
a bad sector:
1. if you write to it and this attempt fails, and
2. if you read from it and this attempt fails.

1. would require some verify operation, so I'm not sure if this is done at
all in the wild.
2. has a simple problem: if you get a read request for sector x and you
cannot read it, what data should you return?  The answer is simple: you
don't return data but an error (the read error).  Additionally, you mark the
sector as bad and relocate the next write request for that sector to some
spare sector, and further read requests then too.  However, you still have
to respond with error messages to each subsequent read request before the
first relocated write request appears.

And afaik this is what current disks do.  That's why you can just re-sync
the failed disk to the array again without any problem: because the write
request happens then, the relocation takes place, and everything's fine.

regards,
   Mario
--
The social dynamics of the net are a direct consequence of the fact that
nobody has yet developed a Remote Strangulation Protocol.  -- Larry Wall
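
This also suggests the usual manual fix when the kernel log names a specific
bad sector: overwrite just that sector, so the drive gets its write and can
reallocate.  The data in that sector is lost, but re-adding the disk to a
redundant array restores it during the rebuild.  A sketch (device and sector
number are only examples, and this is destructive to that one sector):

    # sector 123456 was reported unreadable on /dev/sdb (512-byte sectors)
    dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=123456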
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 23:01 UTC
To: linux-raid

On Saturday 08 January 2005 21:33, Mario Holbe wrote:
> maarten <maarten@ultratux.net> wrote:
> > Afaik, if a drive senses it gets more 'difficult' than usual to read a
> > sector, it will automatically copy it to a spare sector and reassign
> > it.  However, I
>
> No, this is usually not the case.  At least I don't know of IDE drives
> that do so.  This is why I call it a 'sector read error'.

Do you mean SCSI ones do?  If so, I thought the difference in firmware
intelligence between ATA and SCSI vanished long ago.

> Each newer disk has some amount of 'spare sectors' which can be used to
> relocate bad sectors.  Usually, you have two situations where you can
> detect a bad sector:
> 1. if you write to it and this attempt fails, and
> 2. if you read from it and this attempt fails.

Hm.  I'm not extremely well versed in modern drive technology, but
nevertheless: how I understood it is somewhat different, namely:

1. If you write to it and that fails, the drive will allocate a spare
   sector.  From that we [should be] able to conclude that if you get a
   write failure, the drive ran out of spare sectors.  (Is that a fact, or
   not??)

2. If you read from it, the drive's firmware will see an error and:
   2a: retry the read a couple more times, succeed, copy that to a spare
       sector and reallocate, OR
   2b: retry the read, fail miserably despite that, and (only then) signal
       a read error to the host.

I've heard for a long time that drives are much more sophisticated than
before, retrying failed reads.  They can try to read 'off-track' (off-axis)
and do other things that were impossible when stepping motors were still
used.  But that was more than 10 years ago; now they all have coil-actuated
heads.  In other words, drives don't wait till the sector is really
unreadable, they'll reallocate at the first sign of trouble (decaying signal
strength, spurious CRC errors, stuff like that).

This is also suggested by the observable behaviour of drive and OS; if a
reallocation would only occur after the fact, i.e. when the data is beyond
salvaging, then every sector reallocation would by definition lead to
corrupt data in that file.  Generally speaking (since there are so many
spare sectors) an OS would die very soon as all its files / libs / DLLs got
corrupted due to the reallocation (which is supposed to be transparent to
the host, only the drive knows).  But... I have no solid proof of this,
other than reasoning like this.

> 1. would require some verify operation, so I'm not sure if this is done
> at all in the wild.
> 2. has a simple problem: if you get a read request for sector x and you
> cannot read it, what data should you return?  The answer is simple: you
> don't return data but an error (the read error).  Additionally, you mark
> the sector as bad and relocate the next write request for that sector to
> some spare sector, and further read requests then too.  However, you
> still have to respond with error messages to each subsequent read request
> before the first relocated write request appears.
>
> And afaik this is what current disks do.  That's why you can just re-sync
> the failed disk to the array again without any problem: because the write
> request happens then, the relocation takes place, and everything's fine.

So basically what you're saying is that reallocation _only_ happens on
_writes_?  Hm.  Maybe, I don't know...  The problem with my theory is that
if it is true, then that automatically means that whenever md gets a read
error, the data is indeed gone.

Or maybe that isn't a problem, since the disk gets kicked, and afterwards
during resync the reallocation pays off.  Yeah.  That must be it. :-)

Maarten
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-09 10:10 UTC
To: linux-raid

maarten <maarten@ultratux.net> wrote:
> Do you mean SCSI ones do?  If so, I thought the difference in firmware
> intelligence between ATA and SCSI vanished long ago.

I don't think SCSI ones do so.  However, I don't know many SCSI drives, and
thus I limited my sentence to IDE drives :)

> 1. If you write to it and that fails, the drive will allocate a spare
>    sector.

As I said earlier:
>> 1. would require some verify operation, so I'm not sure if this is done
>> at all in the wild.
A verify would take time and therefore I think this is not done.  Btw: *if*
it were done, write speed to disks should be read-speed/2 or smaller, but
usually it isn't.

> From that we [should be] able to conclude that if you get a write failure,
> the drive ran out of spare sectors.  (Is that a fact, or not??)

Yes, this is a fact.

> So basically what you're saying is that reallocation _only_ happens on
> _writes_?  Hm.  Maybe, I don't know...

What I'm saying is: bad sectors are _only_ detected on reads, and
reallocations only happen on writes, yes.

> Or maybe that isn't a problem, since the disk gets kicked, and afterwards
> during resync the reallocation pays off.  Yeah.  That must be it. :-)

This is what I said, yes :)

regards,
   Mario
--
() Ascii Ribbon Campaign
/\ Support plain text e-mail
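
One way to watch this happening without vendor tools, if smartmontools is
installed, is to check the drive's reallocation counters and run its
built-in self-test (a sketch for an ATA disk; the device name is an
example):

    smartctl -A /dev/hda | egrep -i 'Reallocated_Sector|Current_Pending'
    smartctl -t long /dev/hda      # start an offline surface scan
    smartctl -l selftest /dev/hda  # check the result later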
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-09 16:23 UTC
To: 'Mario Holbe', linux-raid

Bad sectors are detected on write.  There are 5 wires going to each of the
read/write heads on my disk drives.  I think each head can read after write
in 1 pass.  My specs say it re-maps on write failure and read failure.  Both
are optional.  But, I don't know if this is normal or not.  My disks are
Seagate ST118202LC, 10,000 RPM, 18 Gig SCSI.

Guy
* Re: Spares and partitioning huge disks
From: Michael Tokarev @ 2005-01-09 16:36 UTC
To: linux-raid

Guy wrote:
> Bad sectors are detected on write.  There are 5 wires going to each of the
> read/write heads on my disk drives.  I think each head can read after
> write in 1 pass.  My specs say it re-maps on write failure and read
> failure.  Both are optional.  But, I don't know if this is normal or not.
> My disks are Seagate ST118202LC, 10,000 RPM, 18 Gig SCSI.

Bad sectors can be detected both on write and on read.  Unfortunately, most
of the time it will be a *read* error: it is quite possible for the drive to
perform a read-check after write, and the data may be ok at that time, but
not when you want to read it a month later...

I think all modern drives support bad block remapping on both read and
write.  But think about it: if there's a read error, it means the drive CAN
NOT read the "right" data for some reason (for some definition of "right",
anyway), i.e. the drive "knows" there's some problem with the data and it
can't completely reconstruct what was written to the block before.  In this
case, while remapping the block in question helps to avoid further errors in
this block, it does NOT help to restore the data which the drive can't read.
And it is surely not an option in this case to report that the read was
successful and pass, say, a zero-filled block to the controller... ;)

/mjt
* Re: Spares and partitioning huge disks
From: Peter T. Breuer @ 2005-01-09 17:52 UTC
To: linux-raid

Michael Tokarev <mjt@tls.msk.ru> wrote:
> I think all modern drives support bad block remapping on both read and
> write.  But think about it: if there's a read error, it means the drive
> CAN NOT read the "right" data for some reason (for some definition of
> "right", anyway), i.e. the drive "knows" there's some problem with the
> data and it can't

I really don't want RAID to fault the disk offline in this case.  I want
RAID to read from the other disk(s) instead, and rewrite the data on the
disk that gave the fail notice on that sector, and if that gives no error,
then just carry on and be happy ...

Peter
* Re: Spares and partitioning huge disks
From: Michael Tokarev @ 2005-01-09 17:59 UTC
To: linux-raid

Peter T. Breuer wrote:
> Michael Tokarev <mjt@tls.msk.ru> wrote:
>> I think all modern drives support bad block remapping on both read and
>> write.  But think about it: if there's a read error, it means the drive
>> CAN NOT read the "right" data for some reason (for some definition of
>> "right", anyway), i.e. the drive "knows" there's some problem with the
>> data and it can't
>
> I really don't want RAID to fault the disk offline in this case.  I want
> RAID to read from the other disk(s) instead, and rewrite the data on the
> disk that gave the fail notice on that sector, and if that gives no
> error, then just carry on and be happy ...

There were some patches posted to this list some time ago that try to do
just that (or a discussion... I don't remember).  Yes, the md code currently
doesn't do such things, and fails a drive after the first error; it's the
simplest way to go ;)

/mjt
* Re: Spares and partitioning huge disks 2005-01-09 17:59 ` Michael Tokarev @ 2005-01-09 18:34 ` Peter T. Breuer 2005-01-09 20:28 ` Guy 1 sibling, 0 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-09 18:34 UTC (permalink / raw) To: linux-raid Michael Tokarev <mjt@tls.msk.ru> wrote: > There where some patches posted to this list some time ago that tries to > do just that (or a discussion.. i don't remember). Yes, md code currently > doesn't do such things, and fails a drive after the first error -- it's > the simplest way to go ;) There would be two things to do (for raid1): 1) make the raid1_end_request code notice a failure on READ, but not panic, simply resubmit the i/o to another mirror (it has to count "tries") and only give up after the last try has failed. 2) hmmm .. is there a 2)? Well, maybe. Perhasp check that read errors per bio (as opposed to per request) don't fault the disk to the upper layers .. I don't think they can. And possible arrange for 2 read bios to be prepared but only one to be sent, and discard the second if the first succeeds, or try it if the first fails. Actually, looking at raid1_end_request, it looks as though it does try again: if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { ... /* * we have only one bio on the read side */ if (uptodate) raid_end_bio_io(r1_bio); else { /* * oops, read error: */ char b[BDEVNAME_SIZE]; printk(KERN_ERR "raid1: %s: rescheduling sector %llu\n", bdevname(conf->mirrors[mirror].rdev->bdev,b), (unsigned long long)r1_bio->sector); reschedule_retry(r1_bio); } } But does reschedule_retry try a different disk? Anyway, there is maybe a mistake in this code because we decrement the number of outsanding reads in all cases: atomic_dec(&conf->mirrors[mirror].rdev->nr_pending); return 0; but if the read is retried it should not be unpended yet! Well, that depends on your logic .. I suppose that morally the request should be unpended, but not the read, which is still pending. And I seem to remember that nr_pending is to tell the raid layers if we are in use or not, so I don't think we want to unpend here. Well, reschedule_retry does try the same read again: static void reschedule_retry(r1bio_t *r1_bio) { unsigned long flags; mddev_t *mddev = r1_bio->mddev; spin_lock_irqsave(&retry_list_lock, flags); list_add(&r1_bio->retry_list, &retry_list_head); spin_unlock_irqrestore(&retry_list_lock, flags); md_wakeup_thread(mddev->thread); } So it adds the whole read request (using the master, not the bio that failed) onto a retry list. Maybe that list will be checked for nonemptiness, which solves the nr_pending problem. It looks like a separate kernel thread (raid1d) does the retries. And bless me but if it doesn't try and send the read elsewhere ... case READ: case READA: if (map(mddev, &rdev) == -1) { printk(KERN_ALERT "raid1: %s: unrecoverable I/O" " read error for block %llu\n", bdevname(bio->bi_bdev,b), (unsigned long long)r1_bio->sector); raid_end_bio_io(r1_bio); break; } Not sure what that is about (?? disk is not in array?), but the next bit is clear: printk(KERN_ERR "raid1: %s: redirecting sector %llu to" " another mirror\n", bdevname(rdev->bdev,b), (unsigned long long)r1_bio->sector); So it will try and redirect. It rewrites the target of the bio: bio->bi_bdev = rdev->bdev; .. bio->bi_sector = r1_bio->sector + rdev->data_offset; It resets the offset in case it is different for this disk. bio->bi_rw = r1_bio->cmd; Dunno why it needs to do that. It should be unchanged. generic_make_request(bio); And submit. 
break;

So it looks to me as though reads ARE redirected. It would be trivial to do a write on the failed disk too.

Peter

^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-09 17:59 ` Michael Tokarev 2005-01-09 18:34 ` Peter T. Breuer @ 2005-01-09 20:28 ` Guy 2005-01-09 20:47 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Guy @ 2005-01-09 20:28 UTC (permalink / raw) To: 'Michael Tokarev', linux-raid It is on Neil's wish list (or to do list)! Mine too! From Neil Brown: http://marc.theaimsgroup.com/?l=linux-raid&m=110055742813074&w=2 Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Michael Tokarev Sent: Sunday, January 09, 2005 1:00 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Peter T. Breuer wrote: > Michael Tokarev <mjt@tls.msk.ru> wrote: > >>I think all modern drives support bad block remapping on both read and write. >>But think about it: if there's a read error, it means the drive CAN NOT read >>the "right" data for some reason (for some definition of "right" anyway) -- >>ie, the drive "knows" there's some problem with the data and it can't > > I really don't want RAID to fault the disk offline in this case. I want > RAID to read from the other disk(s) instead, and rewrite the data on the > disk that gave the fail notice on that sector, and if that gives no error, > then just carry on and be happy ... There where some patches posted to this list some time ago that tries to do just that (or a discussion.. i don't remember). Yes, md code currently doesn't do such things, and fails a drive after the first error -- it's the simplest way to go ;) /mjt - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-09 20:28 ` Guy @ 2005-01-09 20:47 ` Peter T. Breuer 2005-01-10 7:19 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-09 20:47 UTC (permalink / raw) To: linux-raid

Guy <bugzilla@watkins-home.com> wrote:
> It is on Neil's wish list (or to do list)! Mine too!

What is? Can you please be specific?

> From Neil Brown:
>
> http://marc.theaimsgroup.com/?l=linux-raid&m=110055742813074&w=2

If you are talking about (and I am guessing, thanks to the uniform sensation of opaque experientiality that passes over me when I see the format of your posts, or the lack of it) reading from the other disk when one sector read fails on the first, that appears to be in 2.6.3 at least, as my reading of the code goes.

What Neil says in your reference (that you can let the kernel kick out a drive that has a read error, let user-space have a quick look at the drive and see if it might be a recoverable error, and then give the drive back to the kernel) is true.

As far as I can see from a quick scan of the (raid1) code, he DOES kick a disk out on read error, but also DOES RETRY the read from another disk for that sector. Currently he does that in the resync thread.

He needs a list of failed reads and only needs to kick the disk when recovery fails. At the present time it is trivial to add a write as well as a read on a retry. I can add the read accounting.

Neil's comments indicate that he is interested in doing this in a generic way. So am I, but I'll settle for "non-generic" first.

Peter

^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-09 20:47 ` Peter T. Breuer @ 2005-01-10 7:19 ` Peter T. Breuer 2005-01-10 9:05 ` Guy 2005-01-10 12:31 ` Peter T. Breuer 0 siblings, 2 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 7:19 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > DOES kick a disk out on read error, but also DOES RETRY the read from > another disk for that sector. Currently he does that in the resync > thread. > > He needs a list of failed reads and only needs to kick the disk when > recovery fails. Well, here is a patch to at least stop the array (RAID 1) being failed until all possible read sources have been exhausted for the sector in question. It's untested - I only checked that it compiles. The idea here is to modify raid1.c so that 1) in make_request, on read (as well as on write, where we already do it) we set the master bios "remaining" count to the number of viable disks in the array. That's the third of the three hunks in the patch below and is harmless unless somebody somewhere already uses the "remaining" field in the read branch. I don't see it if so. 2) in raid1_end_request, I pushed the if (!uptodate) test which faults the current disk out of the array down a few lines (past no code at all, just a branch test for READ or WRITE) and copied it into both the start of the READ and WRITE branches of the code. That shows up very badly under diff, which makes it look as though I did something else entirely. But that's all, and that is harmless. This is the first two hunks of the patch below. Diff makes it look as though I moved the branch UP, but I moved the code before the branch DOWN. After moving the faulting code into the two branches, in the READ branch ONLY I weakened the condition that faulted the disk from "if !uptodate" to "if !uptodate and there is no other source to try". That's probably harmless in itself, modulo accounting questions - there might be things like nr_pending still to tweak. This leaves things a bit unfair - "don't come any closer or the hacker gets it". The LAST disk that fails a read, in case all disks fail to read on that sector, gets ejected from the array. But which it is is random, depending on the order we try (anyone know if the rechedule_retry call is "fair" in the technical sense?). In my opinion no disk should ever be ejected from the array in these cicumstances - it's just a read error produced by the array as a whole and we have already done our bestto avoid it and can do on more. In a strong sense, as it sems to me, "error is the correct read result". I've marked what line to comment with /* PTB ... */ in the patch. --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 07:39:38 2005 @@ -354,9 +354,15 @@ /* * this branch is our 'one mirror IO has finished' event handler: */ - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + update_head_pos(mirror, r1_bio); + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { + if (!uptodate +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + && atomic_dec_and_test(&r1_bio->remaining) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + ) { /* PTB remove next line to be much fairer! 
*/ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else /* * Set R1BIO_Uptodate in our master bio, so that * we will return a good error code for to the higher @@ -368,8 +374,6 @@ */ set_bit(R1BIO_Uptodate, &r1_bio->state); - update_head_pos(mirror, r1_bio); - if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { if (!r1_bio->read_bio) BUG(); /* @@ -387,6 +391,20 @@ reschedule_retry(r1_bio); } } else { + if (!uptodate) + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + else + /* + * Set R1BIO_Uptodate in our master bio, so that + * we will return a good error code for to the higher + * levels even if IO on some other mirrored buffer fails. + * + * The 'master' represents the composite IO operation to + * user-side. So if something waits for IO, then it will + * wait for the 'master' bio. + */ + set_bit(R1BIO_Uptodate, &r1_bio->state); + if (r1_bio->read_bio) BUG(); @@ -708,6 +726,19 @@ read_bio->bi_end_io = raid1_end_request; read_bio->bi_rw = r1_bio->cmd; read_bio->bi_private = r1_bio; + +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + atomic_set(&r1_bio->remaining, 0); + /* select target devices under spinlock */ + spin_lock_irq(&conf->device_lock); + for (i = 0; i < disks; i++) { + if (conf->mirrors[i].rdev && + !conf->mirrors[i].rdev->faulty) { + atomic_inc(&r1_bio->remaining); + } + } + spin_unlock_irq(&conf->device_lock); +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ generic_make_request(read_bio); return 0; Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-10 7:19 ` Peter T. Breuer @ 2005-01-10 9:05 ` Guy 2005-01-10 9:38 ` Peter T. Breuer 2005-01-10 12:31 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Guy @ 2005-01-10 9:05 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid This confuses me! A RAID1 array does not fail on a read error, unless the read error is on the only disk. Maybe you have found a bug? Were you able to cause an array to fail by having 1 disk give a read error? Or are you just preventing a single read error from kicking a disk? I think this is what you are trying to say, if so, it has value. Based on the code below, I think you are not referring to failing the array, but failing a disk. Would be nice to then attempt to correct the read error(s). Also, log the errors. Else the array could continue to degrade until finally the same block is bad on all devices. You said if all disks get a read error the last disks is kicked. What data is returned to the user? Normally, the array would go off-line. But since you still have 1 or more disks in the array, it is a new condition. My guess is that you have not given this enough thought. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Monday, January 10, 2005 2:19 AM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > DOES kick a disk out on read error, but also DOES RETRY the read from > another disk for that sector. Currently he does that in the resync > thread. > > He needs a list of failed reads and only needs to kick the disk when > recovery fails. Well, here is a patch to at least stop the array (RAID 1) being failed until all possible read sources have been exhausted for the sector in question. It's untested - I only checked that it compiles. The idea here is to modify raid1.c so that 1) in make_request, on read (as well as on write, where we already do it) we set the master bios "remaining" count to the number of viable disks in the array. That's the third of the three hunks in the patch below and is harmless unless somebody somewhere already uses the "remaining" field in the read branch. I don't see it if so. 2) in raid1_end_request, I pushed the if (!uptodate) test which faults the current disk out of the array down a few lines (past no code at all, just a branch test for READ or WRITE) and copied it into both the start of the READ and WRITE branches of the code. That shows up very badly under diff, which makes it look as though I did something else entirely. But that's all, and that is harmless. This is the first two hunks of the patch below. Diff makes it look as though I moved the branch UP, but I moved the code before the branch DOWN. After moving the faulting code into the two branches, in the READ branch ONLY I weakened the condition that faulted the disk from "if !uptodate" to "if !uptodate and there is no other source to try". That's probably harmless in itself, modulo accounting questions - there might be things like nr_pending still to tweak. This leaves things a bit unfair - "don't come any closer or the hacker gets it". The LAST disk that fails a read, in case all disks fail to read on that sector, gets ejected from the array. But which it is is random, depending on the order we try (anyone know if the rechedule_retry call is "fair" in the technical sense?). 
In my opinion no disk should ever be ejected from the array in these cicumstances - it's just a read error produced by the array as a whole and we have already done our bestto avoid it and can do on more. In a strong sense, as it sems to me, "error is the correct read result". I've marked what line to comment with /* PTB ... */ in the patch. --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 07:39:38 2005 @@ -354,9 +354,15 @@ /* * this branch is our 'one mirror IO has finished' event handler: */ - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + update_head_pos(mirror, r1_bio); + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { + if (!uptodate +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + && atomic_dec_and_test(&r1_bio->remaining) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + ) { /* PTB remove next line to be much fairer! */ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else /* * Set R1BIO_Uptodate in our master bio, so that * we will return a good error code for to the higher @@ -368,8 +374,6 @@ */ set_bit(R1BIO_Uptodate, &r1_bio->state); - update_head_pos(mirror, r1_bio); - if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { if (!r1_bio->read_bio) BUG(); /* @@ -387,6 +391,20 @@ reschedule_retry(r1_bio); } } else { + if (!uptodate) + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + else + /* + * Set R1BIO_Uptodate in our master bio, so that + * we will return a good error code for to the higher + * levels even if IO on some other mirrored buffer fails. + * + * The 'master' represents the composite IO operation to + * user-side. So if something waits for IO, then it will + * wait for the 'master' bio. + */ + set_bit(R1BIO_Uptodate, &r1_bio->state); + if (r1_bio->read_bio) BUG(); @@ -708,6 +726,19 @@ read_bio->bi_end_io = raid1_end_request; read_bio->bi_rw = r1_bio->cmd; read_bio->bi_private = r1_bio; + +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + atomic_set(&r1_bio->remaining, 0); + /* select target devices under spinlock */ + spin_lock_irq(&conf->device_lock); + for (i = 0; i < disks; i++) { + if (conf->mirrors[i].rdev && + !conf->mirrors[i].rdev->faulty) { + atomic_inc(&r1_bio->remaining); + } + } + spin_unlock_irq(&conf->device_lock); +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ generic_make_request(read_bio); return 0; Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 9:05 ` Guy @ 2005-01-10 9:38 ` Peter T. Breuer 0 siblings, 0 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 9:38 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > Well, here is a patch to at least stop the array (RAID 1) being failed > > until all possible read sources have been exhausted for the sector in > > question. It's untested - I only checked that it compiles. > A RAID1 array does not fail on a read error, unless the read error is on the > only disk. I'm sorry, I meant "degraded", not "failed", when I wrote that summary. To clarify, the patch stops the mirror disk in question being _faulted_ out of the array when a sector read _fails_ on the disk. The read is instead retried on another disk (as is the case at present in the standard code, if I recall correctly - the patch only stops the current disk also being faulted while the retry is scheduled). In addition I pointed to what line to comment to stop any disk being ever faulted at all on a read error, which ("not faulting") in my opinion is more correct. The reasoning is that either we try all disks and succeed on one, in which case there is nothing to mention to anybody, or we succeed on none and there really is an error in that position in the array, on all disks, and that's the right thing to say. What happens on recovery is another question. There may be scattered error blocks. I would also like to submit a write to the dubious sectors, from the readable disk, once we have found it. > Maybe you have found a bug? There are bugs, but that is not one of them. If you want to check the patch, check to see if schedule_retry moves the current target of the bio to another disk in a fair way. I didn't check. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 7:19 ` Peter T. Breuer 2005-01-10 9:05 ` Guy @ 2005-01-10 12:31 ` Peter T. Breuer 2005-01-10 13:19 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 12:31 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 > +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 07:39:38 2005 > @@ -354,9 +354,15 @@ > /* > * this branch is our 'one mirror IO has finished' event handler: > */ > - if (!uptodate) > - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > - else > + update_head_pos(mirror, r1_bio); > + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { > + if (!uptodate > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + && atomic_dec_and_test(&r1_bio->remaining) > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > + ) { /* PTB remove next line to be much fairer! */ > + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > + } else Hmm ... I must be crackers at 7.39 in the morning. Surely if the bio is not uptodate but the read attampt's time is not yet up, we don't want to tell the master bio that the io was successful (the "else")! That should have read "if if", not "if and". I.e. - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + update_head_pos(mirror, r1_bio); + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { + if (!uptodate) { +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + if (atomic_dec_and_test(&r1_bio->remaining)) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + /* PTB remove next line to be much fairer! */ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else So if the bio is not uptodate we just drop through (after decrementing the count on the master) into the existing code which checks this bio uptodateness and sends a retry if it is not good. Yep. I'll send out another patch later, with rewrite on read fail too. When I've woken up. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 12:31 ` Peter T. Breuer @ 2005-01-10 13:19 ` Peter T. Breuer 2005-01-10 18:37 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 13:19 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > Hmm ... I must be crackers at 7.39 in the morning. Surely if the bio Perhaps this is more obviously correct (or less obviously incorrect). Same rationale as before. Detailed reasoning after lunch. This patch is noticably less invasive, less convoluted. See embedded comments. --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 14:05:46 2005 @@ -354,9 +354,15 @@ /* * this branch is our 'one mirror IO has finished' event handler: */ - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + if (!uptodate) { +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + /* + * Only fault disk out of array on write error, not read. + */ + if (r1_bio->cmd == WRITE) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else /* * Set R1BIO_Uptodate in our master bio, so that * we will return a good error code for to the higher @@ -375,7 +381,12 @@ /* * we have only one bio on the read side */ - if (uptodate) + if (uptodate +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + /* Give up and error if we're last */ + || atomic_dec_and_test(&r1_bio->remaining) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + ) raid_end_bio_io(r1_bio); else { /* @@ -708,6 +720,18 @@ read_bio->bi_end_io = raid1_end_request; read_bio->bi_rw = r1_bio->cmd; read_bio->bi_private = r1_bio; +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + atomic_set(&r1_bio->remaining, 0); + /* count source devices under spinlock */ + spin_lock_irq(&conf->device_lock); + for (i = 0; i < disks; i++) { + if (conf->mirrors[i].rdev && + !conf->mirrors[i].rdev->faulty) { + atomic_inc(&r1_bio->remaining); + } + } + spin_unlock_irq(&conf->device_lock); +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ generic_make_request(read_bio); return 0; ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 13:19 ` Peter T. Breuer @ 2005-01-10 18:37 ` Peter T. Breuer 2005-01-11 11:34 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 18:37 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > Hmm ... I must be crackers at 7.39 in the morning. Surely if the bio > > Perhaps this is more obviously correct (or less obviously incorrect). So I'll do the commentary for it now. The last hunk of this three hunk patch is the easiest to explain: > --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 > +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 14:05:46 2005 > @@ -708,6 +720,18 @@ > read_bio->bi_end_io = raid1_end_request; > read_bio->bi_rw = r1_bio->cmd; > read_bio->bi_private = r1_bio; > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + atomic_set(&r1_bio->remaining, 0); > + /* count source devices under spinlock */ > + spin_lock_irq(&conf->device_lock); > + for (i = 0; i < disks; i++) { > + if (conf->mirrors[i].rdev && > + !conf->mirrors[i].rdev->faulty) { > + atomic_inc(&r1_bio->remaining); > + } > + } > + spin_unlock_irq(&conf->device_lock); > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > > generic_make_request(read_bio); > return 0; > That simply adds to the raid1 make_request code in the READ branch the same stanza that appears in the WRITE branch already, namely a calculation of how many working disks there are in the array, which is put into the "remaining" field of the raid1 master bio being set up. So we put the count of valid disks in the "remaining" field during construction of a raid1 read bio. If I am off by one, I apologize. The write size code starts the count at 1 instead of 0, and I don't know why. If anyone wants to see the WRITE side equivalent, it goes: for (i = 0; i < disks; i++) { if (conf->mirrors[i].rdev && !conf->mirrors[i].rdev->faulty) { ... r1_bio->write_bios[i] = bio; } else r1_bio->write_bios[i] = NULL; } atomic_set(&r1_bio->remaining, 1); for (i = 0; i < disks; i++) { if (!r1_bio->write_bios[i]) continue; ... atomic_inc(&r1_bio->remaining); generic_make_request(mbio); } so I reckon that's equivalent, apart from the off-by-one. Explain me somebody. In the end_request code, simply, instead of erroring the current disk out of the array whenever an error happens, do it only if a WRITE is being handled. We still won't mark the request uptodate as that's in the else part of the if !uptodate, where we don't touch. That's the first hunk here. The second hunk is in the same routine, but down in the READ side of the code split, further on. We finish the request not only if we are utodate (success), but also if we are not uptodate but we are plain out of disks to try and read from (so the request will be errored since it is not marked yuptodate still). We decrement the "remaining" count in the test. > @@ -354,9 +354,15 @@ > /* > * this branch is our 'one mirror IO has finished' event handler: > */ > - if (!uptodate) > - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > - else > + if (!uptodate) { > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + /* > + * Only fault disk out of array on write error, not read. 
> + */ > + if (r1_bio->cmd == WRITE) > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > + } else > /* > * Set R1BIO_Uptodate in our master bio, so that > * we will return a good error code for to the higher > @@ -375,7 +381,12 @@ > /* > * we have only one bio on the read side > */ > - if (uptodate) > + if (uptodate > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + /* Give up and error if we're last */ > + || atomic_dec_and_test(&r1_bio->remaining) > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > + ) > raid_end_bio_io(r1_bio); > else { > /* Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 18:37 ` Peter T. Breuer @ 2005-01-11 11:34 ` Peter T. Breuer 0 siblings, 0 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-11 11:34 UTC (permalink / raw) To: linux-raid

I'm looking for a natural way to rewrite failed read sectors on raid1. Any ideas? There are several pitfalls to do with barriers in the different raid threads.

My first crude idea was to put into raid1_end_request a

sync_request(mddev, r1_bio->sector, 0);

just before raid1_end_bio_io(r1_bio) is run on a successful retried read. But this supposes that nothing in sync_request will sleep (or is that ok in end_request nowadays?). If not possible inline I will have to schedule it instead.

Another possibility is not to run raid1_end_bio_io just yet but instead convert the r1_bio we just did ok into a SPECIAL and put it on the retry queue and let raid1d treat it (by running the WRITE half of a READ-WRITE resync operation on it). I can modify raid1d to do the user's end_bio_io for us if needed.

Or I can run

sync_request_write(mddev, r1_bio);

directly (somehow I get a shiver down my spine) from the end_request.

Ideas? Advice? Derision?

OK - so to be definite, what would be wrong with

r1_bio->cmd = SPECIAL;
reschedule_retry(r1_bio);

instead of

raid_end_bio_io(r1_bio);

in raid1_end_request? This should result in the raid1d thread doing a write-half to all devices from the bio buffer that we just filled with a successful read. The question is when we get to ack the user on the read. Maybe I should clone the bio.

Peter

^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-08 19:25 ` maarten 2005-01-08 20:33 ` Mario Holbe @ 2005-01-08 23:09 ` Guy 2005-01-09 0:56 ` maarten 2005-01-13 2:05 ` Neil Brown 1 sibling, 2 replies; 95+ messages in thread From: Guy @ 2005-01-08 23:09 UTC (permalink / raw) To: 'maarten', linux-raid Maarten said: "Normally, the minute a drive fails, it gets kicked and the spare would kick in and md syncs this spare. We now have a non-degraded array again." Guy says: But, you make it seem instantaneously! The array will be degraded until the re-sync is done. In my case, that takes about 60 minutes, so 1 extra minute is insignificant. Marrten said: "Yes, but this would be impossible to do, since md cannot anticipate _which_ disk you're going to fail before it happens. ;)" Guy says: But, I could tell md which disk I want to spare. After all, I know which disk I am going to fail. Maybe even an option to mark a disk as "to be failed", which would cause it to be spared before it goes off-line. Then md could fail the disk after it has been spared. Neil, add this to the wish list! :) EMC does this on their big iron. If the system determines a disk is having too many issues (bad blocks or whatever), the system predicts a failure, the system copies the disk to a spare. That way a second failure during the re-sync would not be fatal. And a direct disk to disk copy is much faster (or easier) than a re-build from parity. This is how it was explained to me about 5 years ago. No idea if it was marketing lies or truth. But I liked the fact that my data stayed redundant while the spare was being re-built. This would not work if a drive failed, only if a drive failure was predicted. Another cool feature... the disk array then makes a support call. The disk is replaced quickly, normally before any redundancy was lost. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Saturday, January 08, 2005 2:25 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Saturday 08 January 2005 19:55, you wrote: > My warning about user error was not targeted at you! :) > Sorry if it seemed so. :-) > And the order does not matter! Hm... yes you're right. But adding the disk is more prudent (or is it?) Grr. Now you've got ME thinking ! ;-) Normally, the minute a drive fails, it gets kicked and the spare would kick in and md syncs this spare. We now have a non-degraded array again. If I then fail the spare first, the array goes into degraded mode. Whereas if I hotadd the disk, it becomes a spare. Presumably if I now fail the original spare, the real disk will get synced again, to get the same setup as before. But yes, you're right; during this step it is degraded again. Oh well... > It would be cool if the rebuild to the repaired disk could be done before > the spare was failed or removed. Then the array would not be degraded at > all. Yes, but this would be impossible to do, since md cannot anticipate _which_ disk you're going to fail before it happens. ;) > If I ever re-build my system, or build a new system, I hope to use RAID6. I tried this in last fall, but it didn't work out then. See the list archives. > The Seagate test is on-line. Before I started using the Seagate tool, I > used dd. I'm not as cautious as you are. I just pray the hot spare does what its supposed to do. > My disks claim to be able to re-locate bad blocks on read error. But I am > not sure if this is correctable errors or not. 
If not correctable errors > are re-located, what data does the drive return? Since I don't know, I > don't use this option. I did use this option for awhile, but after > re-reading about it, I got concerned and turned it off. Afaik, if a drive senses it gets more 'difficult' than usual to read a sector, it will automatically copy it to a spare sector and reassign it. However, I doubt the OS gets any wiser this happens, so neither would md. In which cases the error gets noticed by md I don't precisely know, but I reckon that may well be when the error is uncorrectible. Not _undetectable_, to quote from another thread... 8-) > This is from the readme file: > Automatic Read Reallocation Enable (ARRE) > -Marreon/off enable/disable ARRE bit > On, drive automatically relocates bad blocks detected > during read operations. Off, drive creates Check condition > status with sense key of Medium Error if bad blocks are > detected during read operations. Hm. I would definitely ENable that option. But what do I know. It also depends I guess on how fatal reading bad data undetected is for you. For me, if one of my mpegs or mp3s develops a bad sector I can probably live with that. :-) Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-08 23:09 ` Guy @ 2005-01-09 0:56 ` maarten 2005-01-13 2:05 ` Neil Brown 1 sibling, 0 replies; 95+ messages in thread From: maarten @ 2005-01-09 0:56 UTC (permalink / raw) To: linux-raid On Sunday 09 January 2005 00:09, Guy wrote: > Maarten said: > "Normally, the minute a drive fails, it gets kicked and the spare would > kick in and md syncs this spare. We now have a non-degraded array again." > > Guy says: > But, you make it seem instantaneously! The array will be degraded until > the re-sync is done. In my case, that takes about 60 minutes, so 1 extra > minute is insignificant. No, sure it is not instantaneous, far from it. Sorry if I made that impression. On my system it takes a whole lot longer than 60 minutes, more like 360 minutes. (in my other array where I use whole-disk 160 GB volumes). > Marrten said: > "Yes, but this would be impossible to do, since md cannot anticipate > _which_ > disk you're going to fail before it happens. ;)" > > Guy says: > But, I could tell md which disk I want to spare. After all, I know which > disk I am going to fail. Maybe even an option to mark a disk as "to be > failed", which would cause it to be spared before it goes off-line. Then > md could fail the disk after it has been spared. Neil, add this to the > wish list! :) Yes, that would be a smart option indeed :) It gets rid of the window where any failure would be fatal. But I suppose Neil is overworked as it is. > EMC does this on their big iron. If the system determines a disk is having > too many issues (bad blocks or whatever), the system predicts a failure, > the system copies the disk to a spare. That way a second failure during > the re-sync would not be fatal. And a direct disk to disk copy is much > faster (or easier) than a re-build from parity. This is how it was > explained to me about 5 years ago. No idea if it was marketing lies or > truth. But I liked the fact that my data stayed redundant while the spare > was being re-built. This would not work if a drive failed, only if a drive > failure was predicted. Another cool feature... the disk array then makes a > support call. The disk is replaced quickly, normally before any redundancy > was lost. Hehe. Cool. Big iron -> You indeed get what ya pay for :-)) ^ permalink raw reply [flat|nested] 95+ messages in thread
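To make the ordering being discussed above concrete, here is a minimal mdadm sketch of that cycle. The names are invented for illustration (/dev/md0 is the array, /dev/sdc1 the disk that failed and was repaired, /dev/sde1 the hot spare that took over), and the exact option spelling may vary between mdadm versions:

  mdadm /dev/md0 --remove /dev/sdc1     # drop the kicked disk; the spare has already been synced in
  mdadm /dev/md0 --add /dev/sdc1        # the repaired disk returns as the new hot spare
  # optionally migrate the data back onto sdc1 by failing the old spare;
  # the array is degraded from here until sdc1 finishes re-syncing
  mdadm /dev/md0 --fail /dev/sde1 --remove /dev/sde1
  mdadm /dev/md0 --add /dev/sde1        # sde1 goes back to being the spare

The "spare the disk before failing it" option Guy asks for would remove the degraded window opened by the last two steps.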
* RE: Spares and partitioning huge disks 2005-01-08 23:09 ` Guy 2005-01-09 0:56 ` maarten @ 2005-01-13 2:05 ` Neil Brown 2005-01-13 4:55 ` Guy 2005-01-13 9:27 ` Peter T. Breuer 1 sibling, 2 replies; 95+ messages in thread From: Neil Brown @ 2005-01-13 2:05 UTC (permalink / raw) To: Guy; +Cc: 'maarten', linux-raid

On Saturday January 8, bugzilla@watkins-home.com wrote:
>
> Guy says:
> But, I could tell md which disk I want to spare. After all, I know which
> disk I am going to fail. Maybe even an option to mark a disk as "to be
> failed", which would cause it to be spared before it goes off-line. Then md
> could fail the disk after it has been spared. Neil, add this to the wish
> list! :)

Once the "bitmap of potentially dirty blocks" is working, this could be done in user space (though there would be a small window).

- fail out the chosen drive.
- combine it with the spare in a raid1 with no superblock
- add this raid1 back into the main array.
- md will notice that it has recently been removed and will only rebuild those blocks which need to be rebuilt
- wait for the raid1 to fully sync
- fail out the drive you want to remove.

You only have a tiny window where the array is degraded, and if we were to allow an md array to block all IO requests for a time, you could make that window irrelevant.

NeilBrown

^ permalink raw reply [flat|nested] 95+ messages in thread
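A rough command-level sketch of the steps Neil lists, with invented device names (/dev/md0 is the main array, /dev/sdc1 the drive being retired, /dev/sde1 the spare). The --build form is the usual way to get an md array with no superblock, but whether this exact sequence behaves as described depends on the mdadm version and on the bitmap support Neil says is still to come:

  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1    # fail out the chosen drive
  mdadm --build /dev/md9 --level=1 --raid-devices=2 /dev/sdc1 /dev/sde1
                                                        # pair it with the spare, no superblock
  mdadm /dev/md0 --add /dev/md9                         # add the raid1 back into the main array
  # watch /proc/mdstat and wait for the raid1 (and md0) to fully sync, then
  mdadm /dev/md9 --fail /dev/sdc1 --remove /dev/sdc1    # fail out the drive you want to remove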
* RE: Spares and partitioning huge disks 2005-01-13 2:05 ` Neil Brown @ 2005-01-13 4:55 ` Guy 2005-01-13 9:27 ` Peter T. Breuer 1 sibling, 0 replies; 95+ messages in thread From: Guy @ 2005-01-13 4:55 UTC (permalink / raw) To: 'Neil Brown'; +Cc: 'maarten', linux-raid 1. Would the re-sync of the RAID5 wait for the re-sync of the RAID1, since 2 different arrays depend on the same device? 2. Will the "bitmap of potentially dirty blocks" be able to keep a disk in the array if it has bad blocks? 3. Will RAID1 be able to re-sync to another disk if the source disk has bad blocks? Even if they are un-correctable? Once the re-sync is done, then RAID5 could re-construct the missing data, and correct the RAID1 array. Ouch!, seems like a catch 22. RAID5 should go first and correct the bad blocks first, and then, any new bad blocks found during the RAID1 re-sync. But, the bitmap would need to be quad-state (synced, right is good, left is good, both are bad). Since RAID1 can have more than 2 devices, maybe 1 bit per device (synced, not synced). The more I think, the harder it gets! :) If 1, 2 and 3 above are all yes, then it seems like a usable workaround. And, in the future, maybe RAID5 arrays would be made up of RAID1 arrays with only 1 disk each. Using grow to copy a failing disk to another (RAID1), then removing the failing disk. Then shrinking the RAID1 back to 1 disk. Then there would be no window. Using this method, #1 above is irrelevant, or less relevant! Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Neil Brown Sent: Wednesday, January 12, 2005 9:06 PM To: Guy Cc: 'maarten'; linux-raid@vger.kernel.org Subject: RE: Spares and partitioning huge disks On Saturday January 8, bugzilla@watkins-home.com wrote: > > Guy says: > But, I could tell md which disk I want to spare. After all, I know which > disk I am going to fail. Maybe even an option to mark a disk as "to be > failed", which would cause it to be spared before it goes off-line. Then md > could fail the disk after it has been spared. Neil, add this to the wish > list! :) Once the "bitmap of potentially dirty blocks" is working, this could be done in user space (though there would be a small window). - fail out the chosen drive. - combine it with the spare in a raid1 with no superblock - add this raid1 back into the main array. - md will notice that it has recently been removed and will only rebuild those blocks which need to be rebuilt - wait for the raid1 to fully sync - fail out the drive you want to remove. You only have a tiny window where the array is degraded, and if we were to allow an md array to block all IO requests for a time, you could make that window irrelevant. NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-13 2:05 ` Neil Brown 2005-01-13 4:55 ` Guy @ 2005-01-13 9:27 ` Peter T. Breuer 2005-01-13 15:53 ` Guy 1 sibling, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-13 9:27 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Saturday January 8, bugzilla@watkins-home.com wrote: > > > > Guy says: > > But, I could tell md which disk I want to spare. After all, I know which > > disk I am going to fail. Maybe even an option to mark a disk as "to be > > failed", which would cause it to be spared before it goes off-line. Then md > > could fail the disk after it has been spared. Neil, add this to the wish > > list! :) > > Once the "bitmap of potentially dirty blocks" is working, this could > be done in user space (though there would be a small window). > > - fail out the chosen drive. > - combine it with the spare in a raid1 with no superblock > - add this raid1 back into the main array. > - md will notice that it has recently been removed and will only > rebuild those blocks which need to be rebuilt > - raid for the raid1 to fully sync > - fail out the drive you want to remove. I don't really understand what this is all about, but I recall that when I was writing FR5 one of the things I wanted as an objective was to be able to REPLACE one of the disks in the array efficiently because currently there's no real way that doesn't take you through a degraded array, since you have to add the replacement as a spare, then fail one of the existing disks. What I wanted was to allow the replacement to be added in and synced up in the background. Is that what you are talking about? I don't recall if I actually did it or merely planned to do it, but I recall considering it (and that should logically imply that I probably did something about it). > You only have a tiny window where the array is degraded, and it we > were to allow an md array to block all IO requests for a time, you > could make that window irrelevant. Well, I don't see where there's any window in which its degraded. If one triggers a sync after adding in the spare and marking it as failed then the spare will get a copy from the rest and new writes will also go to it, no? Ahh .. I now recall that maybe I did this in practice for RAID5 simply by running RAID5 over individual RAID1s already in degraded mode. To "replace" any of the disks one adds a mirror component to one of the degraded RAID1s, waits till it syncs up, then fails and removes the original component. Hey presto - replacement without degradation. Presumably that also works for RAID1. I.e. you run RAID1 over several RAID1s already in degraded mode. To replace one of the disks you simply add in the replacement to one of the "degraded" RAID1s. When it's synced you fail out the original component. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
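A minimal sketch of the layered setup Peter describes, with invented device names. "missing" is mdadm's placeholder for an absent member, so each raid1 starts out deliberately one-sided:

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 missing
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb1 missing
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdc1 missing
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdd1 missing
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/md1 /dev/md2 /dev/md3 /dev/md4

  # replacing the disk behind md3 without ever degrading md0:
  mdadm /dev/md3 --add /dev/sde1                      # the mirror syncs in the background
  # wait for the sync to finish, then
  mdadm /dev/md3 --fail /dev/sdc1 --remove /dev/sdc1  # md0 never sees a missing member

The cost is an extra md layer on every request, and, as the following messages discuss, a bad block on the source disk still leaves a hole that only the upper RAID5's redundancy could fill.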
* RE: Spares and partitioning huge disks 2005-01-13 9:27 ` Peter T. Breuer @ 2005-01-13 15:53 ` Guy 2005-01-13 17:16 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-13 15:53 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Peter said: "Well, I don't see where there's any window in which its degraded." These are the steps that cause the window (see "Original Message" for full details): 1. fail out the chosen drive. (array is now degraded) 2. combine it with the spare in a raid1 with no superblock (re-synce starts) 3. add this raid1 back into the main array. (The main array is now in-sync other than any changes that occurred since you failed the disk in step 1) Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Thursday, January 13, 2005 4:28 AM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Saturday January 8, bugzilla@watkins-home.com wrote: > > > > Guy says: > > But, I could tell md which disk I want to spare. After all, I know which > > disk I am going to fail. Maybe even an option to mark a disk as "to be > > failed", which would cause it to be spared before it goes off-line. Then md > > could fail the disk after it has been spared. Neil, add this to the wish > > list! :) > > Once the "bitmap of potentially dirty blocks" is working, this could > be done in user space (though there would be a small window). > > - fail out the chosen drive. > - combine it with the spare in a raid1 with no superblock > - add this raid1 back into the main array. > - md will notice that it has recently been removed and will only > rebuild those blocks which need to be rebuilt > - raid for the raid1 to fully sync > - fail out the drive you want to remove. I don't really understand what this is all about, but I recall that when I was writing FR5 one of the things I wanted as an objective was to be able to REPLACE one of the disks in the array efficiently because currently there's no real way that doesn't take you through a degraded array, since you have to add the replacement as a spare, then fail one of the existing disks. What I wanted was to allow the replacement to be added in and synced up in the background. Is that what you are talking about? I don't recall if I actually did it or merely planned to do it, but I recall considering it (and that should logically imply that I probably did something about it). > You only have a tiny window where the array is degraded, and it we > were to allow an md array to block all IO requests for a time, you > could make that window irrelevant. Well, I don't see where there's any window in which its degraded. If one triggers a sync after adding in the spare and marking it as failed then the spare will get a copy from the rest and new writes will also go to it, no? Ahh .. I now recall that maybe I did this in practice for RAID5 simply by running RAID5 over individual RAID1s already in degraded mode. To "replace" any of the disks one adds a mirror component to one of the degraded RAID1s, waits till it syncs up, then fails and removes the original component. Hey presto - replacement without degradation. Presumably that also works for RAID1. I.e. you run RAID1 over several RAID1s already in degraded mode. To replace one of the disks you simply add in the replacement to one of the "degraded" RAID1s. When it's synced you fail out the original component. 
Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-13 15:53 ` Guy @ 2005-01-13 17:16 ` Peter T. Breuer 2005-01-13 20:40 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-13 17:16 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > Peter said: > "Well, I don't see where there's any window in which its degraded." > > These are the steps that cause the window (see "Original Message" for full > details): > > 1. fail out the chosen drive. (array is now degraded) I would suggest "don't do that then". Start with an array of degraded RAID1s, as I suggested, and add in an extra disk to one of the raid1s, wait till it syncs, then remove the original component. Instant new (degraded) RAID1 in the place of the old, and the array above none the wiser. > 2. combine it with the spare in a raid1 with no superblock (re-synce starts) Why "no superblock"? Oh well - let's leave it as a mystery. > 3. add this raid1 back into the main array. (The main array is now in-sync > other than any changes that occurred since you failed the disk in step 1) Well, if you have an array of arrays it seems that the main array must have been degraded too, but I don't see where you took the subarray out of it in the sequence above (in order to add it back in now). The problem pointed out is that if the disk you are going to swap out is faulty, there's no way of copying from it perfectly. The read patch I posted a few days ago will help, but it won't paper over real sector errors - it may allow the copy to processd, however (I'll have to check what happens during a sync). So one has to substitute using data from the redundant parts of the array above (in the array-of-arrays solution). But there's no communication at present :(. Well, 1) if one were to use bitmaps, I would suggest that in the case of an array of arrays that the bitmap be shared between an array and its subarrays - do we really care in which disk a problem is? No - we know we just have to try and find some good data and correct a problem in that block and we can go searching for the details if and when we need. 2) I don't see any problem in, even without a bitmap, simply augmenting the repair strategy (which you people don't have yet, heh) for read errors to including getting the data from the array above if we are in a subarray, not just using our own redundancy. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-13 17:16 ` Peter T. Breuer @ 2005-01-13 20:40 ` Guy 2005-01-13 23:32 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-13 20:40 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Maybe you missed my post from yesterday. http://marc.theaimsgroup.com/?l=linux-raid&m=110559211400459&w=2 No superblock was to prevent overwriting data on the failing component of the top RAID5 array. If you build the top array with degraded RAID1 arrays, then use a super block for the RAID1 arrays. Also, so all of the RAID1 arrays don't seem degraded, configure them with only 1 device. Grow them to 2 devices when needed. Then shrink them back to 1 when done. The RAID1 idea will not work since a bad block will take out the RAID1. But there are more issues, see the above URL. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Thursday, January 13, 2005 12:17 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Guy <bugzilla@watkins-home.com> wrote: > Peter said: > "Well, I don't see where there's any window in which its degraded." > > These are the steps that cause the window (see "Original Message" for full > details): > > 1. fail out the chosen drive. (array is now degraded) I would suggest "don't do that then". Start with an array of degraded RAID1s, as I suggested, and add in an extra disk to one of the raid1s, wait till it syncs, then remove the original component. Instant new (degraded) RAID1 in the place of the old, and the array above none the wiser. > 2. combine it with the spare in a raid1 with no superblock (re-synce starts) Why "no superblock"? Oh well - let's leave it as a mystery. > 3. add this raid1 back into the main array. (The main array is now in-sync > other than any changes that occurred since you failed the disk in step 1) Well, if you have an array of arrays it seems that the main array must have been degraded too, but I don't see where you took the subarray out of it in the sequence above (in order to add it back in now). The problem pointed out is that if the disk you are going to swap out is faulty, there's no way of copying from it perfectly. The read patch I posted a few days ago will help, but it won't paper over real sector errors - it may allow the copy to processd, however (I'll have to check what happens during a sync). So one has to substitute using data from the redundant parts of the array above (in the array-of-arrays solution). But there's no communication at present :(. Well, 1) if one were to use bitmaps, I would suggest that in the case of an array of arrays that the bitmap be shared between an array and its subarrays - do we really care in which disk a problem is? No - we know we just have to try and find some good data and correct a problem in that block and we can go searching for the details if and when we need. 2) I don't see any problem in, even without a bitmap, simply augmenting the repair strategy (which you people don't have yet, heh) for read errors to including getting the data from the array above if we are in a subarray, not just using our own redundancy. Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
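Guy's single-member variant would look roughly like this, again with invented names, and only if --grow can change the member count of a raid1 at all; Peter's caveat applies, since this needs newer md/mdadm code than the 2.6.3 he quotes, and some versions insist on --force for the one-member steps:

  mdadm --create /dev/md3 --level=1 --raid-devices=1 --force /dev/sdc1   # one-member raid1, not reported degraded
  # ... later, to swap the disk out:
  mdadm /dev/md3 --add /dev/sde1                    # joins as a spare
  mdadm --grow /dev/md3 --raid-devices=2            # spare becomes a member and syncs
  # after the sync:
  mdadm /dev/md3 --fail /dev/sdc1 --remove /dev/sdc1
  mdadm --grow /dev/md3 --raid-devices=1            # shrink back to a single member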
* Re: Spares and partitioning huge disks 2005-01-13 20:40 ` Guy @ 2005-01-13 23:32 ` Peter T. Breuer 2005-01-14 2:43 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-13 23:32 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > Maybe you missed my post from yesterday. Possibly I did - I certainly don't read everything and not everything gets to me. Or maye I saw it and could not decipher it. I don't know! > http://marc.theaimsgroup.com/?l=linux-raid&m=110559211400459&w=2 > No superblock was to prevent overwriting data on the failing component of You say that no superblock in one of the raid1 subarray's disks stops overwriting data on a top raid5 array? This really sounds like double dutch! And if it does (What? How? Why?), so what? > the top RAID5 array. If you build the top array with degraded RAID1 arrays, > then use a super block for the RAID1 arrays. There possibly a missing verb in that sentence. Or maybe not. It is hard to tell. Hmmmmmmmm .......... nope, I really can't see where that sentence is trying to go. Let's suppose it really does have the form of a computer language IF THEN, so the conditional test would be "you build the top array with degraded RAID1 arrays". Well, I can interpret that to say that I build the top array (which is RAID5) OF degraded RAID1 arrays, and then that would match what I suggested. OK. So that would mean "IF you do what you suggest THEN ...". Then what? Then "an imperative". An imperative? Why should I obey an imperative? Well, what does this imperative say anyway? It says "use a super block for the RAID1 arrays". OK, that could be "use RAID1 arrays with superblocks". That is "do the normal thing with RAID1". So the whole sentence says "IF you do what you suggest THEN do the normal thing with RAID1". OK - I agree. You are trying to say "IF I do things my way THEN I don't have to do anything strange at the RAID1 level". OK? Whew! I can see why I skipped whatever you said before if that is what it takes to decipher it! Bt then the entire sentence says nothing strange or exciting. > Also, so all of the RAID1 arrays don't seem degraded, configure them with > only 1 device. Whaaaaaaat? Oh no, I give up - I really can't parse this. Hang on - maybe the tenses are wrong. Maybe you are trying to say "you don't have to configure the RAID1 arrays as having 1 good disk and 1 failed disk". Well, I disagree. Correct me if I am wrong but as far as I know you cannot change the number of disks in a raid array. I'd be happy to learn you can, but for all I know if you start with n disks comprised of m good and p failed disks, then n = m + p and the total can never change. In my 2.6.3 codebase for raid there is only one point where conf->raid_disks is changed, and it is in the "run" routine, where it is set once and for all and never changed. > Grow them to 2 devices when needed. Then shrink them back > to 1 when done. If it were possible I'd be happy to hear of it. Maybe it is possible - but it would be in a newer codebase than the 2.6.3 code I have running here. If so, why this convoluted way of saying that? > The RAID1 idea will not work since a bad block will take out the RAID1. But Uh - yes it will work, no a bad block will not "take out the RAID1", whatever you mean by that. I presume you mean that a bad block in the disk being read will mess up the raid1 subarray. No it won't - it will just prevent a block being copied. I see nothing terrible in that. 
The result will be that everything else but that block will be copied. If you like we can even arrange that the missing block be corrected THEN at that moment from data available in the superarray, but I don't see that as necessary. Why? Well,

1) because now that you have told me that the disk you want to swap out is bad, then the top level array has morally lost its redundancy already! So just take the disk out and replace it - you won't be degrading the top array any more than it already is while you do this.

2) oh - but you say that yes we are losing redundancy on the good sectors of the disk. Oh? And which are those? Well let's just go ahead with the RAID1 sync and whenever we hit an unreadable sector then launch a request to the top level array for a read from the rest of the RAID5 for that sector only.

3) oh, but you don't like to do that during resync? Shrug .. then mark the place on a bitmap and after the resync has finished as best you are able then launch an extra cleanup phase from the RAID5 to cover the blocks marked bad in the bitmap (one can do this periodically anyway).

I don't see that one really needs to do 3 in place of 2, but I am experimenting. Anyway, the point is that your "it will not work" is wrong.

> there are more issues, see the above URL.

Does it contain more of the same?

Peter

- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-13 23:32 ` Peter T. Breuer @ 2005-01-14 2:43 ` Guy 0 siblings, 0 replies; 95+ messages in thread From: Guy @ 2005-01-14 2:43 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Well, I give up! It seems we don't talk the same language. That is ok with me. Bye! -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Thursday, January 13, 2005 6:33 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Guy <bugzilla@watkins-home.com> wrote: > Maybe you missed my post from yesterday. Possibly I did - I certainly don't read everything and not everything gets to me. Or maye I saw it and could not decipher it. I don't know! > http://marc.theaimsgroup.com/?l=linux-raid&m=110559211400459&w=2 > No superblock was to prevent overwriting data on the failing component of You say that no superblock in one of the raid1 subarray's disks stops overwriting data on a top raid5 array? This really sounds like double dutch! And if it does (What? How? Why?), so what? > the top RAID5 array. If you build the top array with degraded RAID1 arrays, > then use a super block for the RAID1 arrays. There possibly a missing verb in that sentence. Or maybe not. It is hard to tell. Hmmmmmmmm .......... nope, I really can't see where that sentence is trying to go. Let's suppose it really does have the form of a computer language IF THEN, so the conditional test would be "you build the top array with degraded RAID1 arrays". Well, I can interpret that to say that I build the top array (which is RAID5) OF degraded RAID1 arrays, and then that would match what I suggested. OK. So that would mean "IF you do what you suggest THEN ...". Then what? Then "an imperative". An imperative? Why should I obey an imperative? Well, what does this imperative say anyway? It says "use a super block for the RAID1 arrays". OK, that could be "use RAID1 arrays with superblocks". That is "do the normal thing with RAID1". So the whole sentence says "IF you do what you suggest THEN do the normal thing with RAID1". OK - I agree. You are trying to say "IF I do things my way THEN I don't have to do anything strange at the RAID1 level". OK? Whew! I can see why I skipped whatever you said before if that is what it takes to decipher it! Bt then the entire sentence says nothing strange or exciting. > Also, so all of the RAID1 arrays don't seem degraded, configure them with > only 1 device. Whaaaaaaat? Oh no, I give up - I really can't parse this. Hang on - maybe the tenses are wrong. Maybe you are trying to say "you don't have to configure the RAID1 arrays as having 1 good disk and 1 failed disk". Well, I disagree. Correct me if I am wrong but as far as I know you cannot change the number of disks in a raid array. I'd be happy to learn you can, but for all I know if you start with n disks comprised of m good and p failed disks, then n = m + p and the total can never change. In my 2.6.3 codebase for raid there is only one point where conf->raid_disks is changed, and it is in the "run" routine, where it is set once and for all and never changed. > Grow them to 2 devices when needed. Then shrink them back > to 1 when done. If it were possible I'd be happy to hear of it. Maybe it is possible - but it would be in a newer codebase than the 2.6.3 code I have running here. If so, why this convoluted way of saying that? > The RAID1 idea will not work since a bad block will take out the RAID1. 
But Uh - yes it will work, no a bad block will not "take out the RAID1", whatever you mean by that. I presume you mean that a bad block in the disk being read will mess up the raid1 subarray. No it won't - it will just prevent a block being copied. I see nothing terrible in that. The result will be that everything else but that block will be copied. If you like we can even arrange that the missing block be corrected THEN at that moment from data available in the superarray, but I don't see that as necessary. Why? Well, 1) because now that you have told me that the disk you want to swap out is bad, then the top level array has morally lost its redundancy already! So just take the disk out and replace it - you won't be degrading the top array any more than it already is while you do this. 2) oh - but you say that yes we are losing redundancy on the good sectors of the disk. Oh? And which are those? Well let's just go ahead with the RAID1 sync and whenever we hit an unreadable sector then launch a request to the top level array for a read from the rest of the RAID5 for that sector only. 3) oh, but you don't like to do that during resync? Shrug .. then mark the place on a bitmap and after the resync has finished as best you are able then launch an extra cleanup phase from the RAID5 to cover the blocks marked bad in the bitmap (one can do this periodically anyway). I don't see that one really needs to do 3 in place of 2, but I am experimenting. Anyway, the point is that your "it will not work" is wrong. > there are more issues, see the above URL. Does it contain more of the same? Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
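[Editorial note: a minimal sketch of the layered scheme being argued about here, namely a RAID5 whose members are each a one-disk RAID1 with an empty slot, so a suspect disk can be mirrored onto its replacement before being pulled. All device and md names are invented, and this is an illustration rather than a tested recipe for the 2.4/2.6.3 code discussed above.]

mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda1 missing
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdb1 missing
mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sdc1 missing
mdadm --create /dev/md13 --level=1 --raid-devices=2 /dev/sdd1 missing
mdadm --create /dev/md20 --level=5 --raid-devices=4 /dev/md10 /dev/md11 /dev/md12 /dev/md13

# later, to migrate a suspect /dev/sda1 onto a new /dev/sde1 (names hypothetical):
mdadm /dev/md10 --add /dev/sde1     # the RAID1 copies the old disk onto the new one
# wait for the resync to finish (cat /proc/mdstat), then drop the old disk:
mdadm /dev/md10 --fail /dev/sda1
mdadm /dev/md10 --remove /dev/sda1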
* Re: Spares and partitioning huge disks 2005-01-08 14:52 ` Frank van Maarseveen 2005-01-08 15:50 ` Mario Holbe 2005-01-08 16:32 ` Guy @ 2005-01-08 16:49 ` maarten 2005-01-08 19:01 ` maarten 2005-01-09 19:33 ` Frank van Maarseveen 2 siblings, 2 replies; 95+ messages in thread From: maarten @ 2005-01-08 16:49 UTC (permalink / raw) To: linux-raid On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > > His plan is to split the disks into 6 partitions. > > Each of his six RAID5 arrays will only use 1 partition of each physical > > disk. > > If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed > > disk. If he gets 2 read errors, on different disks, at the same time, he > > has a 1/6 chance they would be in the same array (which would be bad). > > His plan is to combine the 6 arrays with LVM or a linear array. > > Intriguing setup. Do you think this actually improves the reliability > with respect to disk failure compared to creating just one large RAID5 > array? Yes. But I get no credits; someone else here invented the idea. > For one second I thought it's a clever trick but gut feeling tells > me the odds of losing the entire array won't change (simplified -- > because the increased complexity creates room for additional errors). No. It is somewhat more complex, true, but no different than making, for example, 6 md arrays for six different mountpoints. And I just add all six together in an LVM. The idea behind it is that not all errors with md are fatal. In the case of a non-fatal error, just re-adding the disk might solve it since the drive then will remap the bad sector. However, IF during that resync one other drive has a read error, it gets kicked too and the array dies. The chances of that happening are not very small; during resync all of the other drives get read in whole, so that is much more intensive than normal operation. So at the precise moment you really can't afford to get a read error, the chances of getting one are greater than ever(!). By dividing the physical disk in smaller parts one decreases the chance of a second disk with a bad sector being on the same array. You could have 3 or even 4 disks with bad sectors without losing the array, provided you're lucky and they all are on different parts of the drive platters (precisely: in different arrays). This is in theory of course, you'd be stupid to leave an array degraded and let chance decide which one breaks next... ;-) Besides this, the resync time in case of a fault decreases by a factor 6 too as an added bonus. I don't know about you but over here resyncing a 250GB disk takes the better part of the day. (To be honest, that was a slow system) Now it is certain that you'll strike a compromise between the added complexity and the benefits of this setup, so you choose an arbitrary amount of md arrays to define. For me six seemed okay, there is no need to go overboard and define real small arrays like 10 GB ones (24 of them). ;-) Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
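[Editorial note: to make the split-into-several-arrays layout concrete, here is a hedged sketch of how it might be built with mdadm and LVM. The device names, array numbers and sizes below are placeholders, not the poster's actual configuration (his real listing follows in the next message).]

# one RAID5 per 40GB partition slice, striped across the four disks:
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5
# ...repeat for md1..md5 using partitions 6..10 of each disk...

# then glue the six arrays together with LVM:
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
vgcreate bigvg /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
lvcreate -L 650G -n store bigvg     # or -l <number of free extents>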
* Re: Spares and partitioning huge disks 2005-01-08 16:49 ` maarten @ 2005-01-08 19:01 ` maarten 2005-01-10 16:34 ` maarten 2005-01-09 19:33 ` Frank van Maarseveen 1 sibling, 1 reply; 95+ messages in thread From: maarten @ 2005-01-08 19:01 UTC (permalink / raw) To: linux-raid On Saturday 08 January 2005 17:49, maarten wrote: > On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > > > His plan is to split the disks into 6 partitions. > > > Each of his six RAID5 arrays will only use 1 partition of each physical > > > disk. > > > If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed > > > disk. If he gets 2 read errors, on different disks, at the same time, > > > he has a 1/6 chance they would be in the same array (which would be > > > bad). His plan is to combine the 6 arrays with LVM or a linear array. > > > > Intriguing setup. Do you think this actually improves the reliability > > with respect to disk failure compared to creating just one large RAID5 > > array? As the system is now online again, busy copying, I can show the exact config: dozer:~ # fdisk -l /dev/hde Disk /dev/hde: 250.0 GB, 250059350016 bytes 255 heads, 63 sectors/track, 30401 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/hde1 1 268 2152678+ fd Linux raid autodetect /dev/hde2 269 331 506047+ fd Linux raid autodetect /dev/hde3 332 575 1959930 fd Linux raid autodetect /dev/hde4 576 30401 239577345 5 Extended /dev/hde5 576 5439 39070048+ fd Linux raid autodetect /dev/hde6 5440 10303 39070048+ fd Linux raid autodetect /dev/hde7 10304 15167 39070048+ fd Linux raid autodetect /dev/hde8 15168 20031 39070048+ fd Linux raid autodetect /dev/hde9 20032 24895 39070048+ fd Linux raid autodetect /dev/hde10 24896 29759 39070048+ fd Linux raid autodetect /dev/hde11 29760 30401 5156833+ 83 Linux dozer:~ # cat /proc/mdstat Personalities : [linear] [raid0] [raid1] [raid5] [multipath] read_ahead 1024 sectors md1 : active raid1 hdg2[1] hde2[0] 505920 blocks [2/2] [UU] md0 : active raid1 hdg1[1] hde1[0] 2152576 blocks [4/2] [UU__] md2 : active raid1 sdb2[0] sda2[1] 505920 blocks [2/2] [UU] md3 : active raid5 sdb5[2] sda5[3] hdg5[1] hde5[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md4 : active raid5 sdb6[2] sda6[3] hdg6[1] hde6[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md5 : active raid5 sdb7[2] sda7[3] hdg7[1] hde7[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md6 : active raid5 sdb8[2] sda8[3] hdg8[1] hde8[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md7 : active raid5 sdb9[2] sda9[3] hdg9[1] hde9[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md8 : active raid5 sdb10[2] sda10[3] hdg10[1] hde10[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] # Where md0 is "/" (temporary degraded), and md1 and md2 are swap. # The md3 through md8 are the big arrays that are part of LVM. dozer:~ # pvscan pvscan -- reading all physical volumes (this may take a while...) 
pvscan -- ACTIVE PV "/dev/md3" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md4" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md5" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md6" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md7" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md8" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- total: 6 [670.68 GB] / in use: 6 [670.68 GB] / in no VG: 0 [0] dozer:~ # vgdisplay --- Volume group --- VG Name lvm_video VG Access read/write VG Status available/resizable VG # 0 MAX LV 256 Cur LV 1 Open LV 1 MAX LV Size 2 TB Max PV 256 Cur PV 6 Act PV 6 VG Size 670.31 GB PE Size 32 MB Total PE 21450 Alloc PE / Size 21450 / 670.31 GB Free PE / Size 0 / 0 VG UUID F0EF61-uu4P-cnCq-6oQ6-CO5n-NE9g-5xjdTE dozer:~ # df Filesystem 1K-blocks Used Available Use% Mounted on /dev/md0 1953344 1877604 75740 97% / /dev/lvm_video/mythtv 702742528 42549352 660193176 7% /mnt/store # As of yet there are no spares. This is a todo, the most important thing is to get the app back in working state now. I'll probably make a /usr md device in future from hdX3, as "/" is completely full. This was because of legacy constraints, migrating drives... Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
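[Editorial note: on the spares to-do above, mdadm's monitor mode can move a spare between arrays that share a spare-group, which is what a "roaming" spare for all six arrays would need. A sketch of the mdadm.conf entries and commands involved, with the spare device name invented:]

# in /etc/mdadm.conf (or /etc/mdadm/mdadm.conf on Debian):
ARRAY /dev/md3 super-minor=3 spare-group=video
ARRAY /dev/md4 super-minor=4 spare-group=video
ARRAY /dev/md5 super-minor=5 spare-group=video
ARRAY /dev/md6 super-minor=6 spare-group=video
ARRAY /dev/md7 super-minor=7 spare-group=video
ARRAY /dev/md8 super-minor=8 spare-group=video
MAILADDR root

# park the spare partition in one array; the monitor moves it to whichever
# array in the group loses a member (spare device name is hypothetical):
mdadm /dev/md3 --add /dev/hdi5
mdadm --monitor --scan --daemonise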
* Re: Spares and partitioning huge disks 2005-01-08 19:01 ` maarten @ 2005-01-10 16:34 ` maarten 2005-01-10 16:36 ` Gordon Henderson ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: maarten @ 2005-01-10 16:34 UTC (permalink / raw) To: linux-raid On Saturday 08 January 2005 20:01, maarten wrote: > On Saturday 08 January 2005 17:49, maarten wrote: > > On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > > > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > As the system is now online again, busy copying, I can show the exact > config: > Well all about the array is done and working fine up to now. Except one thing that I didn't anticipate: The application that's supposed to run has some problems under -I suppose- the new 2.4.28 kernel. I've had two panics / oopses in syslog already, and the process then is unkillable, so a reboot is in order. But I think that's bttv related, not the I/O layer. In any case I suffered through two lengthy raid resyncs already... ;-| So I've been shopping around for a *big* servercase today so I can put all disks (these 5, plus 6 from the current fileserver) in one big tower. I'll then use that over NFS and can revert back to my older working kernel. I've chosen a Chieftec case, as can be seen here http://www.chieftec.com/products/Workcolor/CA-01.htm and here in detail http://www.chieftec.com/products/Workcolor/NewBA.htm Nice drive cages, eh ? :-) P.S.: I get this filling up my logs. Should I be worried about that ? Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 16:34 ` maarten @ 2005-01-10 16:36 ` Gordon Henderson 2005-01-10 17:10 ` maarten 2005-01-10 17:13 ` Spares and partitioning huge disks Guy 2005-01-11 10:09 ` Spares and partitioning huge disks KELEMEN Peter 2 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-10 16:36 UTC (permalink / raw) To: maarten; +Cc: linux-raid On Mon, 10 Jan 2005, maarten wrote: > P.S.: I get this filling up my logs. Should I be worried about that ? > Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 As I understand it, the "fix" is to comment it out in the kernel sources and compile & install a new kernel... It seems to be an artifact of LVM - the only times I've seen lots of these are when I experimented with LVM... (incidentally I had some instability with the occasional panic with LVM, so I dumped it for that particular application, and the same hardware & kernel has been solid since) Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
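[Editorial note: if anyone does take the comment-it-out route, the message is printed by the raid5 driver itself; something along these lines should locate the printk in a 2.4 source tree, assuming it is unpacked in /usr/src/linux:]

grep -n "switching cache buffer size" /usr/src/linux/drivers/md/raid5.c

Commenting out that printk and rebuilding removes the log noise; the cache switching itself still happens.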
* Re: Spares and partitioning huge disks 2005-01-10 16:36 ` Gordon Henderson @ 2005-01-10 17:10 ` maarten 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks 0 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-10 17:10 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 17:36, Gordon Henderson wrote: > On Mon, 10 Jan 2005, maarten wrote: > > P.S.: I get this filling up my logs. Should I be worried about that ? > > Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> > > 4096 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, > > 4096 --> 512 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer > > size, 512 --> 4096 Jan 10 11:30:36 dozer kernel: raid5: switching cache > > buffer size, 4096 --> 512 > > As I understand it, the "fix" is to comment it out in the kernel sources > and compile & install a new kernel... Ehm...? > It seems to be an artifact of LVM - then only times I've seen lots of > these are when I experimented with LVM... (incidentally I had some > instability with the occasional panic with LVM, so dumped it for that > particular application, and same hardware & Kernel has been solid since) I'm certain I saw it before, when I didn't use LVM at all. Maybe the kernel scans for LVM at boot, but LVM was not in initrd for sure. But is it dangerous or detrimental to performance (other than that it logs way too much) ? Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-10 17:10 ` maarten @ 2005-01-16 16:19 ` Mitchell Laks 2005-01-16 17:53 ` Gordon Henderson ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: Mitchell Laks @ 2005-01-16 16:19 UTC (permalink / raw) To: maarten; +Cc: linux-raid Hi, I have 4 questions. 1) Maarten, where did you buy the big Chieftec chassis (CA-01B I think) and what did you pay for it? I have been using an antec sx1000 chassis and yours looks better and bigger. 2) Also, what are reasonable resync times for your big raid5 arrays? I had a resync time of two days by accident recently for 4x 250 hard drives because I did not have dma enabled. That is solved, but I had switched to raid1 in the interim and now I am curious what others are used to. 3) Also, I have a module driver question. I use an asus K8V-X motherboard. It has sata and parallel ide channels. I use the sata for my system and use the parallel for data storage on ide raid. I am combining the 2 motherboard IDE cable channels with highpoint rocket133 cards to provide 2 more ide ata channels. I installed debian and it defaulted to using the hpt366 modules for the rocket133 controllers. I suspect (correct me if I am wrong) that the hpt302 on the highpoint website is the RIGHT module to use (I notice, for instance, when I compare the hdparm settings, that the western digital drives on the motherboard ide channels have more advanced dma settings "turned on" than those on the rocket133 controllers; perhaps this is because it is using the 'incorrect' hpt366 module?). Of course I would prefer to use the hpt302 module (after I compile it...). So how do I ensure that the system will use the hpt302 over the hpt366 that it seems to be choosing? If I 1) compile the module hpt302 from source 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory 3) put a line hpt302 in the /etc/modules file (maybe at the top?) 4) put a line hpt302 at the top of the file /etc/mkinitrd/modules 5) run mkinitrd to generate the new initrd.img will this ensure that the module hpt302 is loaded in preference to the hpt366 module? 4) Maarten mentioned that he had a problem with 2 different drives on the same channel for raid5. What was the problem exactly with that?
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks @ 2005-01-16 17:53 ` Gordon Henderson 2005-01-16 18:22 ` Maarten 2005-01-16 19:39 ` Guy 2 siblings, 0 replies; 95+ messages in thread From: Gordon Henderson @ 2005-01-16 17:53 UTC (permalink / raw) To: linux-raid On Sun, 16 Jan 2005, Mitchell Laks wrote: > 3) Also, i have a module driver question. > I use a asus K8V-X motherboard. It has sata and parallel ide channels. I use > the sata for my system and use the parallel for data storage on ide raid. > I am using combining the 2 motherboard IDE cable channels with highpoint > rocket133 cards to provide 2 more ide ata channels. > > I installed debian and it defaulted to using the hpt366 modules for the > rocket133 controllers. I've just been down this road myself... Debian Woody, kernels 2.4 and 2.6 and a Highpoint rocket133 controller... > I suspect (correct me if I am wrong) that the hpt302 on the highpoint website > is the RIGHT module to use (I notice for instance that when I compare the > hdparm settings on the western digital drives on the motherboard ide channels > are set with more advanced dma settings "turned on" than on the rocket133 > controllers. Perhaps this is because it is using the 'incorrect hpt366' > module? Using the module off their web site worked for me with kernel 2.4.28 - but it turned my IDE drives into SCSI drives! No real issue, but the smart drive termperature program stopped working... The driver wouldn't compile with 2.6.10, but the hpt366 driver did work under 2.6.10 and seems to work very well - and the drives still look like IDE drives and hddtemp still works. > Of course I would prefer to use the hpt302 module (after i compile > it...). So Would you? The 366 driver in 2.6.10 recognises that it's a 302 card and seems to work well... dmesg output: HPT302: IDE controller at PCI slot 0000:00:0a.0 HPT302: chipset revision 1 HPT37X: using 33MHz PCI clock HPT302: 100% native mode on irq 18 ide2: BM-DMA at 0x9800-0x9807, BIOS settings: hde:DMA, hdf:pio ide3: BM-DMA at 0x9808-0x980f, BIOS settings: hdg:DMA, hdh:pio Probing IDE interface ide2... hde: Maxtor 6Y080L0, ATA DISK drive ide2 at 0xb000-0xb007,0xa802 on irq 18 Probing IDE interface ide3... hdg: Maxtor 6Y080L0, ATA DISK drive ide3 at 0xa400-0xa407,0xa002 on irq 18 I have it compiled into the kernel here too - not a module. (personal choice, I never have modules unless I can avoid it) > how do I get to insure that the system will use the hpt302 over the hpt366 > that it seems to be chosing. If I > 1) compile the module hpt302 from source > 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory > 3) put a line hpt302 in the /etc/modules file (maybe at the top?) > 4) put a line hpt302 at the top of the file /etc/mkinitd/modules. > 5) run mkinitrd to generate the new initrd.img The easiest way would be to compile a custom kernel yourself. Just leave out the Highpoint drivers and then compile and load the hpt302 module at boot time by listing it in the /etc/modules file. > 4) Maarten mentioned that he had a problem with 2 different drives on > the same channel for raid5. What was the problem exactly with that. 
It's possible that a failing IDE drive will crowbar the bus and take out the other drive with it - not necessarily damage any data on the drive, but prevent it being seen by the OS. I've experienced this myself. In a RAID-5 situation, you'd lose 2 drives which would not be a good thing... Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks 2005-01-16 17:53 ` Gordon Henderson @ 2005-01-16 18:22 ` Maarten 2005-01-16 19:39 ` Guy 2 siblings, 0 replies; 95+ messages in thread From: Maarten @ 2005-01-16 18:22 UTC (permalink / raw) To: linux-raid On Sunday 16 January 2005 17:19, Mitchell Laks wrote: > HI, > I have 3 questions. > > 1) Maarten, Where did you buy the big chieftec chasis (CA-01B i think) and > what did you pay for it? I have been using antec sx1000 chasis and yours > looks better and bigger. I paid 120 euro I think. The BA-01B is a bit cheaper but exactly the same except for the missing side window. I bought it at some local shop in the netherlands. However I went and bought a more powerful and ultrasilent Tagan 480 W PSU to replace the 360W chieftec one. That Tagan PSU was more expensive than the chieftec case+psu together. The case pleases me, but in all fairness the Antec cases are better in respect to details, this case has some mild sharp edges which I did not ever find with Antec. But that is a minor detail( to me). Of course it's not as bad as with noname cheap case brands. Overall I thought this case deserved a 5/5 for design, a 5/5 for ingenuity, and a 4/5 for craftsmanship. (Incidentally the Tagan PSU deserves a full 5/5 too) The shop will ship in Holland, but not abroad AFAIK. And that would be cost-prohibitive anyway, this case is really a big sucker. > 2) Also what are reasonable resync times for your big raid5 arrays? > I had resync time or two days by accident recently for 4x 250 hard drives > because i did not have dma enabled. that is solved, but i had switched to > raid1 in the interm and now i am curious what others are used to. Not sure. At first I built the array on a lowly old celeron-500 and the resync time of each of the 6 arrays was IIRC 50 minutes, so about 5 hours all told. With the new case I also installed a much faster board, an athlon 1400, so resync now is at (about) 20 minutes for each array, but I admit I did not take notes there. The other big array, consisting of whole-drive 160GB disks (5-1)x160GB=640GB did a resync in a little over 2 hours I think. Less than three, at any rate. > 3) Also, i have a module driver question. > I use a asus K8V-X motherboard. It has sata and parallel ide channels. I > use the sata for my system and use the parallel for data storage on ide > raid. I am using combining the 2 motherboard IDE cable channels with > highpoint rocket133 cards to provide 2 more ide ata channels. I myself now have in use: The VIA onboard ATA channels One Promise SATA TX2 150 Two noname SIL / silicon image SATA controllers One Promise ATA Tx133 The onboard VIA SATA controller is left unused. I may use it later but it gave me some problems in the past so I went for the simplest solution now. > I installed debian and it defaulted to using the hpt366 modules for the > rocket133 controllers. > I suspect (correct me if I am wrong) that the hpt302 on the highpoint > website is the RIGHT module to use (I notice for instance that when I > compare the hdparm settings on the western digital drives on the > motherboard ide channels are set with more advanced dma settings "turned > on" than on the rocket133 controllers. Perhaps this is because it is > using the 'incorrect hpt366' module? 
I once had a mainboard with HPT cntr onboard, an older version though (266?). Since then I carefully avoided highpoint as well as I could... I will not buy one unless held at gunpoint. Same as with Sony, I hate that brand. > Of course I would prefer to use the hpt302 module (after i compile it...). > So how do I get to insure that the system will use the hpt302 over the > hpt366 that it seems to be chosing. If I > 1) compile the module hpt302 from source > 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory > 3) put a line hpt302 in the /etc/modules file (maybe at the top?) > 4) put a line hpt302 at the top of the file /etc/mkinitd/modules. > 5) run mkinitrd to generate the new initrd.img > > will this insure that the module hpt302 is loaded on preference to the > hpt366 module? Sorry, I've no clue on that. Your story sounds reasonable, but why not start by getting the module compiled ? Most times, that is the hard part. If that succeeds and you can insmod it without probs, there is plenty of time to convince the kernel to load the right module I think. > 4) Maarten mentioned that he had a problem with 2 different drives on the > same channel for raid5. What was the problem exactly with that. The same problem everyone has. If an IDE drive fails, it does not -as SCSI drives tend to do- leave the electrical IDE bus in a free/useable state. So the other drive on that cable is still "good", but unreachable/dead for now. This obviously leads to a fatal 2-drive failure. It doesn't matter the second drive's failure is only temporary; the md / ide code doesn't know that. You can often restore the array manually, but this is not something that is done lightly, so really better to be avoided... Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
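[Editorial note: for the module-ordering part of the question, a hedged sketch of the usual Debian-ish steps once the out-of-tree hpt302 module builds. File names, paths and the blacklist mechanism depend on the kernel and tools in use, so treat these as placeholders rather than exact instructions:]

cp hpt302.ko /lib/modules/$(uname -r)/kernel/drivers/ide/pci/
depmod -a
echo hpt302 >> /etc/modules            # load it at boot
echo hpt302 >> /etc/mkinitrd/modules   # and in the initrd, if the disks are needed early
mkinitrd -o /boot/initrd.img-$(uname -r) $(uname -r)
# also keep the stock driver from claiming the card first, e.g. with a
# "blacklist hpt366" entry in the hotplug/modprobe blacklist your setup uses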
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks 2005-01-16 17:53 ` Gordon Henderson 2005-01-16 18:22 ` Maarten @ 2005-01-16 19:39 ` Guy 2005-01-16 20:55 ` Maarten 2 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-16 19:39 UTC (permalink / raw) To: 'Mitchell Laks'; +Cc: linux-raid If your rebuild seems too slow, make sure you increase the speed limit! Details in "man md". echo 100000 > /proc/sys/dev/raid/speed_limit_max I added this to /etc/sysctl.conf # RAID rebuild min/max speed K/Sec per device dev.raid.speed_limit_min = 1000 dev.raid.speed_limit_max = 100000 Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Mitchell Laks Sent: Sunday, January 16, 2005 11:20 AM To: maarten Cc: linux-raid@vger.kernel.org Subject: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel HI, I have 3 questions. 1) Maarten, Where did you buy the big chieftec chasis (CA-01B i think) and what did you pay for it? I have been using antec sx1000 chasis and yours looks better and bigger. 2) Also what are reasonable resync times for your big raid5 arrays? I had resync time or two days by accident recently for 4x 250 hard drives because i did not have dma enabled. that is solved, but i had switched to raid1 in the interm and now i am curious what others are used to. 3) Also, i have a module driver question. I use a asus K8V-X motherboard. It has sata and parallel ide channels. I use the sata for my system and use the parallel for data storage on ide raid. I am using combining the 2 motherboard IDE cable channels with highpoint rocket133 cards to provide 2 more ide ata channels. I installed debian and it defaulted to using the hpt366 modules for the rocket133 controllers. I suspect (correct me if I am wrong) that the hpt302 on the highpoint website is the RIGHT module to use (I notice for instance that when I compare the hdparm settings on the western digital drives on the motherboard ide channels are set with more advanced dma settings "turned on" than on the rocket133 controllers. Perhaps this is because it is using the 'incorrect hpt366' module? Of course I would prefer to use the hpt302 module (after i compile it...). So how do I get to insure that the system will use the hpt302 over the hpt366 that it seems to be chosing. If I 1) compile the module hpt302 from source 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory 3) put a line hpt302 in the /etc/modules file (maybe at the top?) 4) put a line hpt302 at the top of the file /etc/mkinitd/modules. 5) run mkinitrd to generate the new initrd.img will this insure that the module hpt302 is loaded on preference to the hpt366 module? 4) Maarten mentioned that he had a problem with 2 different drives on the same channel for raid5. What was the problem exactly with that. - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 19:39 ` Guy @ 2005-01-16 20:55 ` Maarten 2005-01-16 21:58 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Maarten @ 2005-01-16 20:55 UTC (permalink / raw) To: linux-raid On Sunday 16 January 2005 20:39, Guy wrote: > If your rebuild seems too slow, make sure you increase the speed limit! > Details in "man md". > > echo 100000 > /proc/sys/dev/raid/speed_limit_max Hi Guy, You always say that, but that never helps me (since my distro already has 100000 as default). Are there even distros that have this set too low ? Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 20:55 ` Maarten @ 2005-01-16 21:58 ` Guy 0 siblings, 0 replies; 95+ messages in thread From: Guy @ 2005-01-16 21:58 UTC (permalink / raw) To: 'Maarten', linux-raid Yes, RedHat 9 defaults to much less, 10,000 I think. I assumed it was the md default. Maybe a RedHat 9 issue. I just looked at the man page for md. It says "The default is 100,000.". I did upgrade to Kernel 2.4.28 a few weeks ago. I guess the default was changed in a newer version of md. My /etc/sysctl.conf has a date of Dec 12, 2003. So, whatever kernel I had over 1 year ago had a default of 10,000, or so. Anyway, it has helped some people in the past. :) I guess it depends on the kernel/md version. I guess a default of no limit would be nice. But no support for that, yet! Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Maarten Sent: Sunday, January 16, 2005 3:56 PM To: linux-raid@vger.kernel.org Subject: Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel On Sunday 16 January 2005 20:39, Guy wrote: > If your rebuild seems too slow, make sure you increase the speed limit! > Details in "man md". > > echo 100000 > /proc/sys/dev/raid/speed_limit_max Hi Guy, You always say that, but that never helps me (since my distro already has 100000 as default). Are there even distros that have this set too low ? Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
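[Editorial note: to see which default a given kernel/distro actually ships, the live values can simply be read back:]

sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# or: cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max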
* RE: Spares and partitioning huge disks 2005-01-10 16:34 ` maarten 2005-01-10 16:36 ` Gordon Henderson @ 2005-01-10 17:13 ` Guy 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy ` (3 more replies) 2005-01-11 10:09 ` Spares and partitioning huge disks KELEMEN Peter 2 siblings, 4 replies; 95+ messages in thread From: Guy @ 2005-01-10 17:13 UTC (permalink / raw) To: 'maarten', linux-raid In my log files, which go back to Dec 12 I have 4 of these: raid5: switching cache buffer size, 4096 --> 1024 And 2 of these: raid5: switching cache buffer size, 1024 --> 4096 So, it would concern me! The message is from RAID5, not LVM. I base this on "raid5:" in the log entry. :) Guy I found this from Neil: "You will probably also see a message in the kernel logs like: raid5: switching cache buffer size, 4096 --> 1024 The raid5 stripe cache must match the request size used by any client. It is PAGE_SIZE at start up, but changes whenever is sees a request of a difference size. Reading from /dev/mdX uses a request size of 1K. Most filesystems use a request size of 4k. So, when you do the 'dd', the cache size changes and you get a small performance drop because of this. If you make a filesystem on the array and then mount it, it will probably switch back to 4k requests and resync should speed up. NeilBrown" -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 10, 2005 11:34 AM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Saturday 08 January 2005 20:01, maarten wrote: > On Saturday 08 January 2005 17:49, maarten wrote: > > On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > > > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > As the system is now online again, busy copying, I can show the exact > config: > Well all about the array is done and working fine up to now. Except one thing that I didn't anticipate: The application that's supposed to run has some problems under -I suppose- the new 2.4.28 kernel. I've had two panics / oopses in syslog already, and the process then is unkillable, so a reboot is in order. But I think that's bttv related, not the I/O layer. In any case I suffered through two lengthy raid resyncs already... ;-| So I've been shopping around for a *big* servercase today so I can put all disks (these 5, plus 6 from the current fileserver) in one big tower. I'll then use that over NFS and can revert back to my older working kernel. I've chosen a Chieftec case, as can be seen here http://www.chieftec.com/products/Workcolor/CA-01.htm and here in detail http://www.chieftec.com/products/Workcolor/NewBA.htm Nice drive cages, eh ? :-) P.S.: I get this filling up my logs. Should I be worried about that ? Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
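[Editorial note: Neil's explanation above is easy to reproduce on purpose. A raw read of the md device uses a different request size than the mounted filesystem, so the stripe cache flips and the switch is logged. A small test, with an example md device name; the read itself is harmless:]

dd if=/dev/md3 of=/dev/null bs=1k count=1024
dmesg | grep "switching cache buffer size" | tail -2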
* hard disk re-locates bad block on read. 2005-01-10 17:13 ` Spares and partitioning huge disks Guy @ 2005-01-10 17:35 ` Guy 2005-01-11 14:34 ` Tom Coughlan 2005-01-10 18:24 ` Spares and partitioning huge disks maarten ` (2 subsequent siblings) 3 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-10 17:35 UTC (permalink / raw) To: linux-raid My disks have the option to relocate bad blocks on read error. I was concerned that bogus data would be returned to the OS. They say CRC errors return corrupt data to the OS! I hope not! So it seems CRC errors and unreadable blocks both are corrupt or lost. But the OS does not know. So, I will leave this option turned off. Guy I sent this to Seagate: With ARRE (Automatic Read Reallocation Enable) turned on. Does it relocate blocks that can't be read, or blocks that had correctable read problems? Or both? If it re-locates un-readable blocks, then what data does it return to the OS? Thanks, Guy ================================================================== Guy, If the block is bad at a hardware level then it is reallocated and a spare is used in it's place. In a bad block the data is lost, the sparing of the block is transparent to the operating system. Blocks with correctable read problems are one's with corrupt data at the OS level. Jimmie P. Seagate Technical Support ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: hard disk re-locates bad block on read. 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy @ 2005-01-11 14:34 ` Tom Coughlan 2005-01-11 22:43 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Tom Coughlan @ 2005-01-11 14:34 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Mon, 2005-01-10 at 12:35, Guy wrote: > My disks have the option to relocate bad blocks on read error. > I was concerned that bogus data would be returned to the OS. > > They say CRC errors return corrupt data to the OS! I hope not! > So it seems CRC errors and unreadable blocks both are corrupt or lost. > But the OS does not know. > So, I will leave this option turned off. > > Guy > > I sent this to Seagate: > With ARRE (Automatic Read Reallocation Enable) turned on. Does it relocate > blocks that can't be read, or blocks that had correctable read problems? > Or both? FWIW, the SCSI standard has been clear on this point for many years: "An ARRE bit of one indicates that the device server shall enable automatic reallocation of defective data blocks during read operations. ... The automatic reallocation shall then be performed only if the device server successfully recovers the data. The recovered data shall be placed in the reallocated block." (SBC-2) Blocks that can not be read are not relocated. The read command simply returns an error to the OS. > > If it re-locates un-readable blocks, then what data does it return to the > OS? > > Thanks, > Guy > > ================================================================== > > Guy, > If the block is bad at a hardware level then it is reallocated and a spare > is used in it's place. In a bad block the data is lost, the sparing of the > block is transparent to the operating system. Blocks with correctable read > problems are one's with corrupt data at the OS level. > > Jimmie P. > Seagate Technical Support > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
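[Editorial note: for anyone wanting to inspect or change the bit on their own drives, ARRE lives in the SCSI read-write error recovery mode page. A tool such as sdparm can usually address it by name, assuming your sdparm build knows the ARRE acronym; the older scsiinfo/sginfo tools expose the same mode page:]

sdparm --get=ARRE /dev/sda
sdparm --set=ARRE=1 --save /dev/sda    # --save makes it persist across power cycles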
* RE: hard disk re-locates bad block on read. 2005-01-11 14:34 ` Tom Coughlan @ 2005-01-11 22:43 ` Guy 2005-01-12 13:51 ` Tom Coughlan 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-11 22:43 UTC (permalink / raw) To: 'Tom Coughlan'; +Cc: linux-raid Good, your description is what I had assumed at first. But when I re-read the drive specs, it was vague, so I set ARRE back to 0. So, it should be a good thing to set it to 1, correct? Do you agree that Seagate's email is wrong? Or am I just reading it wrong? I did not realize ARRE was a standard. I thought it was a Seagate thing. Thanks, Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Tom Coughlan Sent: Tuesday, January 11, 2005 9:35 AM To: Guy Cc: linux-raid@vger.kernel.org Subject: Re: hard disk re-locates bad block on read. On Mon, 2005-01-10 at 12:35, Guy wrote: > My disks have the option to relocate bad blocks on read error. > I was concerned that bogus data would be returned to the OS. > > They say CRC errors return corrupt data to the OS! I hope not! > So it seems CRC errors and unreadable blocks both are corrupt or lost. > But the OS does not know. > So, I will leave this option turned off. > > Guy > > I sent this to Seagate: > With ARRE (Automatic Read Reallocation Enable) turned on. Does it relocate > blocks that can't be read, or blocks that had correctable read problems? > Or both? FWIW, the SCSI standard has been clear on this point for many years: "An ARRE bit of one indicates that the device server shall enable automatic reallocation of defective data blocks during read operations. ... The automatic reallocation shall then be performed only if the device server successfully recovers the data. The recovered data shall be placed in the reallocated block." (SBC-2) Blocks that can not be read are not relocated. The read command simply returns an error to the OS. > > If it re-locates un-readable blocks, then what data does it return to the > OS? > > Thanks, > Guy > > ================================================================== > > Guy, > If the block is bad at a hardware level then it is reallocated and a spare > is used in it's place. In a bad block the data is lost, the sparing of the > block is transparent to the operating system. Blocks with correctable read > problems are one's with corrupt data at the OS level. > > Jimmie P. > Seagate Technical Support > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: hard disk re-locates bad block on read. 2005-01-11 22:43 ` Guy @ 2005-01-12 13:51 ` Tom Coughlan 0 siblings, 0 replies; 95+ messages in thread From: Tom Coughlan @ 2005-01-12 13:51 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Tue, 2005-01-11 at 17:43, Guy wrote: > Good, your description is what I had assumed at first. But when I > re-read > the drive specs, it was vague, so I set ARRE back to 0. > > So, it should be a good thing to set it to 1, correct? I would. > Do you agree that Seagate's email is wrong? Or am I just reading it > wrong? I can't figure out what he is saying in the last sentence. I do believe that Seagate engineers are aware of the correct way to implement ARRE. I can't vouch for whether their firmware always gets it right. > I did not realize ARRE was a standard. I thought it was a Seagate > thing. > > Thanks, > Guy > > -----Original Message----- > From: linux-raid-owner@vger.kernel.org > [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Tom Coughlan > Sent: Tuesday, January 11, 2005 9:35 AM > To: Guy > Cc: linux-raid@vger.kernel.org > Subject: Re: hard disk re-locates bad block on read. > > On Mon, 2005-01-10 at 12:35, Guy wrote: > > My disks have the option to relocate bad blocks on read error. > > I was concerned that bogus data would be returned to the OS. > > > > They say CRC errors return corrupt data to the OS! I hope not! > > So it seems CRC errors and unreadable blocks both are corrupt or lost. > > But the OS does not know. > > So, I will leave this option turned off. > > > > Guy > > > > I sent this to Seagate: > > With ARRE (Automatic Read Reallocation Enable) turned on. Does it > relocate > > blocks that can't be read, or blocks that had correctable read problems? > > Or both? > > FWIW, the SCSI standard has been clear on this point for many years: > > "An ARRE bit of one indicates that the device server shall enable > automatic reallocation of defective data blocks during read operations. > ... The automatic reallocation shall then be performed only if the > device server successfully recovers the data. The recovered data shall > be placed in the reallocated block." (SBC-2) > > Blocks that can not be read are not relocated. The read command simply > returns an error to the OS. > > > > > If it re-locates un-readable blocks, then what data does it return to the > > OS? > > > > Thanks, > > Guy > > > > ================================================================== > > > > Guy, > > If the block is bad at a hardware level then it is reallocated and a > spare > > is used in it's place. In a bad block the data is lost, the sparing of the > > block is transparent to the operating system. Blocks with correctable read > > problems are one's with corrupt data at the OS level. > > > > Jimmie P. > > Seagate Technical Support > > ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 17:13 ` Spares and partitioning huge disks Guy 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy @ 2005-01-10 18:24 ` maarten 2005-01-10 20:09 ` Guy 2005-01-10 18:40 ` maarten 2005-01-12 11:41 ` RAID-6 Gordon Henderson 3 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-10 18:24 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 18:13, Guy wrote: > In my log files, which go back to Dec 12 > > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Heh. Is that all...? :-)) Now THIS is my log: dozer:/var/log # cat messages | grep "switching cache buffer size" | wc -l 55880 So that is why I'm a bit worried. Usually when my computer tells me something _every_second_ I tend to take it seriously. But maybe it's just lonely and looking for some attention. Heh. ;) > I found this from Neil: > "You will probably also see a message in the kernel logs like: > raid5: switching cache buffer size, 4096 --> 1024 > > The raid5 stripe cache must match the request size used by any client. > It is PAGE_SIZE at start up, but changes whenever is sees a request of a > difference size. > Reading from /dev/mdX uses a request size of 1K. > Most filesystems use a request size of 4k. > > So, when you do the 'dd', the cache size changes and you get a small > performance drop because of this. > If you make a filesystem on the array and then mount it, it will probably > switch back to 4k requests and resync should speed up. Okay. So with as many switches as I see, it would be likely that something either accesses the md device concurrently with the FS, or that the FS does this constant switching by itself. Now my FS is XFS, maybe that filesystem has this behaviour ? Anyone having a raid-5 with XFS on top can confirm this ? I usually use Reiserfs, but it seems that XFS is particularly good / fast with big files, whereas reiserfs excels with small files, that is why I use it here. As far as I know there are no accesses that bypass the FS; no Oracle, no cat, no dd. Only LVM and XFS (but it did this before LVM too). Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-10 18:24 ` Spares and partitioning huge disks maarten @ 2005-01-10 20:09 ` Guy 2005-01-10 21:21 ` maarten 2005-01-11 1:04 ` maarten 0 siblings, 2 replies; 95+ messages in thread From: Guy @ 2005-01-10 20:09 UTC (permalink / raw) To: 'maarten', linux-raid I know the log files are very annoying, but... I wonder if all that switching is causing md to void its cache? It may be a performance problem for md to change the strip cache size so often. I use ext3 filesystems. No problems with performance (that I know of). I have never tried any others. Do you think the performance difference of the various filesystems would affect your PVR? Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 10, 2005 1:25 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Monday 10 January 2005 18:13, Guy wrote: > In my log files, which go back to Dec 12 > > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Heh. Is that all...? :-)) Now THIS is my log: dozer:/var/log # cat messages | grep "switching cache buffer size" | wc -l 55880 So that is why I'm a bit worried. Usually when my computer tells me something _every_second_ I tend to take it seriously. But maybe it's just lonely and looking for some attention. Heh. ;) > I found this from Neil: > "You will probably also see a message in the kernel logs like: > raid5: switching cache buffer size, 4096 --> 1024 > > The raid5 stripe cache must match the request size used by any client. > It is PAGE_SIZE at start up, but changes whenever is sees a request of a > difference size. > Reading from /dev/mdX uses a request size of 1K. > Most filesystems use a request size of 4k. > > So, when you do the 'dd', the cache size changes and you get a small > performance drop because of this. > If you make a filesystem on the array and then mount it, it will probably > switch back to 4k requests and resync should speed up. Okay. So with as many switches as I see, it would be likely that something either accesses the md device concurrently with the FS, or that the FS does this constant switching by itself. Now my FS is XFS, maybe that filesystem has this behaviour ? Anyone having a raid-5 with XFS on top can confirm this ? I usually use Reiserfs, but it seems that XFS is particularly good / fast with big files, whereas reiserfs excels with small files, that is why I use it here. As far as I know there are no accesses that bypass the FS; no Oracle, no cat, no dd. Only LVM and XFS (but it did this before LVM too). Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 20:09 ` Guy @ 2005-01-10 21:21 ` maarten 2005-01-11 1:04 ` maarten 1 sibling, 0 replies; 95+ messages in thread From: maarten @ 2005-01-10 21:21 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 21:09, Guy wrote: > I use ext3 filesystems. No problems with performance (that I know of). > I have never tried any others. Reportedly, deleting a big (>2GB) file under XFS is way faster than under ext3, but I never verified this myself. I barely ever ran ext3, I went with my distros' default, which was Reiserfs. Deleting huge files is a special case though, so it does not make much sense to benchmark or tune for that. But in this special case it matters. > Do you think the performance difference of the various filesystems would > affect your PVR? Ehm, no. Well, not the raid-5 overhead at least. The FS helps the GUI being more 'snappy'. It has been reported that ext3 takes a real long time to delete huge files (up to several seconds or more) (unconfirmed by me). This is what a copy of a 5GB file and subsequent delete does on my system: -rw-r--r-- 1 root root 4909077650 Jan 10 04:43 file dozer:/mnt/store # time cp file away real 3m15.778s user 0m0.640s sys 0m45.230s dozer:/mnt/store # time rm away real 0m0.237s user 0m0.000s sys 0m0.020s (This was while the machine was idle, no recordings going on) But the machine IS starved for CPU and bus bandwidth; the cpu should be almost fully pegged with recording up to two simultaneous channels, compressing realtime(!) in mpeg4 from two cheap bttv cards (at 480x480 rez, 2600 Kbps). Therefore it is the fastest CPU I could afford back in last spring, an Athlon XP 2600. The machine is also overclocked from 200 FSB to 233, yielding a 38 MHz PCI bus. Strangely enough, despite there being 5 PCI cards, amongst which two disk I/O controllers, this seems to work just fine. It's been tested in-depth by recording shows daily for months and it crashes rarely ( meaning < once a month which is not too bad, considering ). Maybe the bigass Zalman copper CPU cooler and the 12cm fan hovering above it help there, too ;-) At the beginning it ran off a single 160GB disk, so when I switched to raid-5 I was very afraid that either the extra CPU load, the extra IRQ load or the bus bandwidth would saturate, thus killing performance. You see, the PCI bus is fairly loaded too, since not only does it have to handle the various ATA controllers, but two uncompressed videostreams from the TV cards as well. So all in all, the overhead of raid seems insignificant to me, or the code is very well optimized indeed :-) Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 20:09 ` Guy 2005-01-10 21:21 ` maarten @ 2005-01-11 1:04 ` maarten 1 sibling, 0 replies; 95+ messages in thread From: maarten @ 2005-01-11 1:04 UTC (permalink / raw) To: linux-raid [I'm resending this since I never saw it make the list] On Monday 10 January 2005 21:09, Guy wrote: > I use ext3 filesystems. No problems with performance (that I know of). > I have never tried any others. Reportedly, deleting a big (>2GB) file under XFS is way faster than under ext3, but I never verified this myself. I barely ever ran ext3, I went with my distros' default, which was Reiserfs. Deleting huge files is a special case though, so it does not make much sense to benchmark or tune for that. But in this special case it matters. > Do you think the performance difference of the various filesystems would > affect your PVR? Ehm, no. Well, not the raid-5 overhead at least. The FS helps the GUI being more 'snappy'. It has been reported that ext3 takes a real long time to delete huge files (up to several seconds or more) (unconfirmed by me). This is what a copy of a 5GB file and subsequent delete does on my system: -rw-r--r-- 1 root root 4909077650 Jan 10 04:43 file dozer:/mnt/store # time cp file away real 3m15.778s user 0m0.640s sys 0m45.230s dozer:/mnt/store # time rm away real 0m0.237s user 0m0.000s sys 0m0.020s (This was while the machine was idle, no recordings going on) But the machine IS starved for CPU and bus bandwidth; the cpu should be almost fully pegged with recording up to two simultaneous channels, compressing realtime(!) in mpeg4 from two cheap bttv cards (at 480x480 rez, 2600 Kbps). Therefore it is the fastest CPU I could afford back in last spring, an Athlon XP 2600. The machine is also overclocked from 200 FSB to 233, yielding a 38 MHz PCI bus. Strangely enough, despite there being 5 PCI cards, amongst which two disk I/O controllers, this seems to work just fine. It's been tested in-depth by recording shows daily for months and it crashes rarely ( meaning < once a month which is not too bad, considering ). Maybe the bigass Zalman copper CPU cooler and the 12cm fan hovering above it help there, too ;-) At the beginning it ran off a single 160GB disk, so when I switched to raid-5 I was very afraid that either the extra CPU load, the extra IRQ load or the bus bandwidth would saturate, thus killing performance. You see, the PCI bus is fairly loaded too, since not only does it have to handle the various ATA controllers, but two uncompressed videostreams from the TV cards as well. So all in all, the overhead of raid seems insignificant to me, or the code is very well optimized indeed :-) Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 17:13 ` Spares and partitioning huge disks Guy 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy 2005-01-10 18:24 ` Spares and partitioning huge disks maarten @ 2005-01-10 18:40 ` maarten 2005-01-10 19:41 ` Guy 2005-01-12 11:41 ` RAID-6 Gordon Henderson 3 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-10 18:40 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 18:13, Guy wrote: > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Another thing that only strikes me now that I'm actually counting them: I have _exactly_ 24 of those messages per minute. At any moment, at any time, idle or not. (Although a PVR is never really 'idle', but nevertheless) (well, not really _every_ time, but out of ten samples nine were 24 and only one was 36) And this one may be particularly interesting: Jan 9 02:47:19 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 0 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Switching from buffer size 0 ? Wtf ? Another thing to note is that my switches are between 4096 and 512, not between 4k and 1k as Neil's reply would indicate being normal. But I don't consider this bit really important. Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
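[Editorial note: a quick way to check the "24 per minute" figure, assuming the standard syslog timestamp layout shown in the quoted log lines, is to group the matches by minute:]

grep "switching cache buffer size" /var/log/messages | cut -c1-12 | uniq -c | tail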
* RE: Spares and partitioning huge disks 2005-01-10 18:40 ` maarten @ 2005-01-10 19:41 ` Guy 0 siblings, 0 replies; 95+ messages in thread From: Guy @ 2005-01-10 19:41 UTC (permalink / raw) To: 'maarten', linux-raid You been doing zero length IOs again? :) How many zero length IOs can you do in a second? -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 10, 2005 1:41 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Monday 10 January 2005 18:13, Guy wrote: > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Another thing that only strikes me now that I'm actually counting them: I have _exactly_ 24 of those messages per minute. At any moment, at any time, idle or not. (Although a PVR is never really 'idle', but nevertheless) (well, not really _every_ time, but out of ten samples nine were 24 and only one was 36) And this one may be particularly interesting: Jan 9 02:47:19 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 0 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Switching from buffer size 0 ? Wtf ? Another thing to note is that my switches are between 4096 and 512, not between 4k and 1k as Neil's reply would indicate being normal. But I don't consider this bit really important. Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* RAID-6 ... 2005-01-10 17:13 ` Spares and partitioning huge disks Guy ` (2 preceding siblings ...) 2005-01-10 18:40 ` maarten @ 2005-01-12 11:41 ` Gordon Henderson 2005-01-13 2:11 ` RAID-6 Neil Brown 3 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-12 11:41 UTC (permalink / raw) To: linux-raid Until now I haven't really paid too much attention to the RAID-6 stuff, but I have an application which needs to be as resilient to disk failures as possible. So other than what's at: ftp://ftp.kernel.org/pub/linux/kernel/people/hpa/ and the archives of this list (which I'm re-reading now), can anyone give me a quick heads-up about it? Specifically I'm still buried in the dark days of 2.4.27/28 - are there recent patches against 2.4? If RAID-6 isn't viable for me right now, what I'm planning is as follows: Put 8 x 250GB SATA drives in the system, and arrange them in 4 pairs of RAID-1 units. Assemble the 4 RAID-1 units into a RAID-5. Big waste of disk space, but that's not really important for this application, and disk is cheap (relatively). So I'll end up with just over 700GB of usable storage, with the potential of surviving a minimum of any 3 disks failing, and possibly 4 or 5, depending on just where they fail (although disks would be replaced way before it got to that stage!) Certainly any 2 can fail, and if it were 2 in the same RAID-1 unit (which would cause the RAID-5 to become degraded) and I were desperate, I could move a disk and deliberately fail another RAID-1 to recover the RAID-5 ... In the absence of RAID-6, would anyone do it differently? Note: I'm relatively new to mdadm, but can see it's the way of the future (especially after I had to use it in anger recently to recover from a 2-disk failure in an old 8-disk RAID-5 array), and I'm looking at the spare-groups part of it all and wondering if that might be an alternative, but I'd like to avoid the possibility of the array failing catastrophically during a re-build if at all possible. Cheers, Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
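For concreteness, a minimal mdadm sketch of the layered layout Gordon describes, RAID-1 pairs assembled into a RAID-5 (all device names below are examples, not his actual drives):

  # four RAID-1 pairs from the eight SATA drives
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdg1 /dev/sdh1
  # ...then a RAID-5 across the four mirrors
  mdadm --create /dev/md4 --level=5 --raid-devices=4 /dev/md0 /dev/md1 /dev/md2 /dev/md3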
* Re: RAID-6 ... 2005-01-12 11:41 ` RAID-6 Gordon Henderson @ 2005-01-13 2:11 ` Neil Brown 2005-01-15 16:12 ` RAID-6 Gordon Henderson 0 siblings, 1 reply; 95+ messages in thread From: Neil Brown @ 2005-01-13 2:11 UTC (permalink / raw) To: Gordon Henderson; +Cc: linux-raid On Wednesday January 12, gordon@drogon.net wrote: > > Until now I haven't really paid too much attention to the RAID-6 stuff, > but I have an application which needs to be as resilient to disk failures > as possible. > > So other than what's at: > > ftp://ftp.kernel.org/pub/linux/kernel/people/hpa/ > > and the archives of this list (which I'm re-reading now), can anyone give > me a quick heads-up about it? > > Specifically I'm still buried in the dark days of 2.4.27/28 - are there > recent patches against 2.4? There is no current support for raid6 in any 2.4 kernel and I am not aware of anyone planning such support. Assume it is 2.6 only. > > If RAID-6 isn't viable for me right now, what I'm planning is as follows: > > Put 8 x 250GB SATA drives in the system, and arrange them in 4 pairs of > RAID-1 units. > > Assemble the 4 RAID-1 units into a RAID-5. Sounds reasonable. NeilBrown ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: RAID-6 ... 2005-01-13 2:11 ` RAID-6 Neil Brown @ 2005-01-15 16:12 ` Gordon Henderson 2005-01-17 8:04 ` RAID-6 Turbo Fredriksson 0 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-15 16:12 UTC (permalink / raw) To: linux-raid On Thu, 13 Jan 2005, Neil Brown wrote: > There is no current support for raid6 in any 2.4 kernel and I am not > aware of anyone planning such support. Assume it is 2.6 only. How "real-life" tested is RAID-6 so-far? Anyone using it in anger on a production server? I've spent the past day or 2 getting to grips with kernel 2.6.10 and mdadm 1.8.0 on a test system running Debian Woody, on a rather clunky old test PC - Asus XG-DLS, Twin Xeon PIII/500's on-board IDE with 2 x old 4GB drives attached, it also has on-board Adaptec SCSI with 2 x 18GB drives - one on an 8-bit bus, and a Highpoint HPT302 card with 2 modern 80 GB drives (Maxtor, I know, but keep them cool and so-far so good...) Performance isn't exactly stellar (one PCI bus!) but I did squeeze 60MB/sec out of a RAID-0 off the 2 new drives... Clunky by todays standards, but 5.5 years ago when it was new, it rocked! Anyway, so far so good. I've constructed a RAID-6 system with a 4GB partition on 5 of the drives, and done some tests & what not. The tests I've done involve creating the array, putting a filesystem on it (ext3), writing a bigfile of zeros, checksumming it, failing a drive, adding it back in, failing another, failing a 2nd, adding them in, failing another before the 2nd drive finished resyncing, etc. all the time writing a file & checksumming it between unmounting & re-mounting it. There was a script posted round about July last year which I used to get some ideas from. So-far so good, no corruption, but it doesn't doesn't mean anything in the real-world. So who's using RAID-6 for real? Can it be considered more or less stable than RAID-5? Should I stick to my RAID-5 on-top of RAID-1 pairs? Or should I just take a chance with RAID-6? (And nearly 6 years ago when I started using Linux s/w RAID I said this to myself, but stuck with it and haven't had a problem I could pin down to software... So who knows!) Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
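A stripped-down sketch of the kind of fail/re-add exercise Gordon describes, assuming a throwaway test array that is already mkfs'ed and mounted (device names and mount point are placeholders; do not point this at real data):

  MD=/dev/md0
  MNT=/mnt/r6test
  for disk in /dev/sdb1 /dev/sdc1 /dev/sdd1; do
      dd if=/dev/zero of=$MNT/bigfile bs=1M count=1024
      md5sum $MNT/bigfile > /tmp/sum.before
      mdadm $MD --fail $disk --remove $disk
      mdadm $MD --add $disk
      sleep 5
      while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 30; done
      umount $MNT && mount $MD $MNT
      md5sum -c /tmp/sum.before || echo "MISMATCH after failing $disk"
  done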
* Re: RAID-6 ... 2005-01-15 16:12 ` RAID-6 Gordon Henderson @ 2005-01-17 8:04 ` Turbo Fredriksson 0 siblings, 0 replies; 95+ messages in thread From: Turbo Fredriksson @ 2005-01-17 8:04 UTC (permalink / raw) To: linux-raid >>>>> "Gordon" == Gordon Henderson <gordon@drogon.net> writes: Gordon> On Thu, 13 Jan 2005, Neil Brown wrote: >> There is no current support for raid6 in any 2.4 kernel and I >> am not aware of anyone planning such support. Assume it is 2.6 >> only. Gordon> How "real-life" tested is RAID-6 so-far? Anyone using it Gordon> in anger on a production server? I've been using it for a couple of months (3 or 4 if I'm not mistaken) on my SPARC64 (Sun Blade 1000 - 2xSPARC III/750) with (four+two)*9Gb disks, which gives me 16Gb of disk space. With this setup (I had a few 9Gb disks that I couldn't/wouldn't use for anything else) four (4!) disks can fail without it mattering... It has worked flawlessly even though the disks are OLD - 'smartctl' shows that almost all of the disks have had more than 28000 hours 'uptime' (i.e. 'powered on'). That's more than 3 years (POWERED ON mind you!). Granted, I've been 'fortunate' (?!) to have had NO disk crashes etc, but I did simulate a few when I set the system up and it worked just fine... Kernel 2.6.8.1 (with a couple of patches to get it to boot/work on SPARC64). If it works this well on a SPARC64 (which the kernel has problems with), then it should work just FINE on an ia32... Gordon> Can it be considered more or less stable than RAID-5? For me (NOTE!!) I'd say "just as stable". But naturally this (should/could) depend on the exact kernel version in use... If a kernel version works this/that well, stick with it... Gordon> Should I stick to my RAID-5 on-top of RAID-1 pairs? In theory, that would be "more secure/safe" since both RAID5 and RAID1 are better tested, but... -- Soviet genetic SEAL Team 6 FSF nitrate Honduras $400 million in gold bullion Albanian Kennedy Ft. Meade DES fissionable Uzi quiche kibo [See http://www.aclu.org/echelonwatch/index.html for more about this] ^ permalink raw reply [flat|nested] 95+ messages in thread
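For anyone wanting to check their own drives' hours the same way, a small loop over smartctl does it; the device names are examples and the attribute name may vary between drive firmwares:

  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
      printf '%s: ' $d
      smartctl -A $d | awk '/Power_On_Hours/ {print $NF " hours"}'
  done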
* Re: Spares and partitioning huge disks 2005-01-10 16:34 ` maarten 2005-01-10 16:36 ` Gordon Henderson 2005-01-10 17:13 ` Spares and partitioning huge disks Guy @ 2005-01-11 10:09 ` KELEMEN Peter 2 siblings, 0 replies; 95+ messages in thread From: KELEMEN Peter @ 2005-01-11 10:09 UTC (permalink / raw) To: linux-raid * Maarten (maarten@ultratux.net) [20050110 17:34]: > P.S.: I get this filling up my logs. Should I be worried about that ? > Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 You have an internal XFS log on the RAID device and it is accessed in sector units by default, md is reporting the changes (harmless). Best workaround is to instruct your filesystem to use 4K sectors: mkfs.xfs -s size=4k HTH, Peter -- .+'''+. .+'''+. .+'''+. .+'''+. .+'' Kelemen Péter / \ / \ Peter.Kelemen@cern.ch .+' `+...+' `+...+' `+...+' `+...+' - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-08 16:49 ` maarten 2005-01-08 19:01 ` maarten @ 2005-01-09 19:33 ` Frank van Maarseveen 2005-01-09 21:26 ` maarten 1 sibling, 1 reply; 95+ messages in thread From: Frank van Maarseveen @ 2005-01-09 19:33 UTC (permalink / raw) To: linux-raid On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > > For one second I thought it's a clever trick but gut feeling tells > > me the odds of losing the entire array won't change (simplified -- > > because the increased complexity creates room for additional errors). > > No. It is somewhat more complex, true, but no different than making, for Got it. > However, IF during that > resync one other drive has a read error, it gets kicked too and the array > dies. The chances of that happening are not very small; Ouch! never considered this. So, RAID5 will actually decrease reliability in a significant number of cases because: - >1 read errors can cause a total break-down whereas it used to cause only a few userland I/O errors, disruptive but not foobar. - disk replacement is quite risky. This is totally unexpected to me but it should have been obvious: there's no bad block list in MD so if we would postpone I/O errors during reconstruction then 1: it might cause silent data corruption when I/O error unexpectedly disappears. 2: we might silently loose redundancy in a number of places. I think RAID6 but especially RAID1 is safer. A small side note on disk behavior: If it becomes possible to do block remapping at any level (MD, DM/LVM, FS) then we might not want to write to sectors with read errors at all but just remap the corresponding blocks by software as long as we have free blocks: save disk-internal spare sectors so the disk firmware can pre-emptively remap degraded but ECC correctable sectors upon read. -- Frank ^ permalink raw reply [flat|nested] 95+ messages in thread
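A back-of-envelope illustration of why the rebuild case hurts (the numbers are illustrative assumptions, not measurements): with a vendor-quoted unrecoverable read error rate of one bit in 10^14, re-reading three surviving 250GB members end to end already gives a non-trivial chance of hitting at least one bad read.

  awk 'BEGIN {
      n = 3;                # surviving disks that must be read in full
      bits = 250e9 * 8;     # bits per disk (assumed 250GB members)
      ber = 1e-14;          # assumed unrecoverable bit error rate
      p = 1 - exp(n * bits * log(1 - ber));
      printf "P(at least one read error during rebuild) ~ %.1f%%\n", p * 100
  }'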
* Re: Spares and partitioning huge disks 2005-01-09 19:33 ` Frank van Maarseveen @ 2005-01-09 21:26 ` maarten 2005-01-09 22:29 ` Frank van Maarseveen ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: maarten @ 2005-01-09 21:26 UTC (permalink / raw) To: linux-raid On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > However, IF during that > > resync one other drive has a read error, it gets kicked too and the array > > dies. The chances of that happening are not very small; > > Ouch! never considered this. So, RAID5 will actually decrease reliability > in a significant number of cases because: > - >1 read errors can cause a total break-down whereas it used > to cause only a few userland I/O errors, disruptive but not foobar. Well, yes and no. You can decide to do a full backup in case you hadn't, prior to changing drives. And if it is _just_ a bad sector, you can 'assemble --force' yielding what you would've had in a non-raid setup; some file somewhere that's got corrupted. No big deal, ie. the same trouble as was caused without raid-5. > - disk replacement is quite risky. This is totally unexpected to me > but it should have been obvious: there's no bad block list in MD > so if we would postpone I/O errors during reconstruction then > 1: it might cause silent data corruption when I/O error > unexpectedly disappears. > 2: we might silently loose redundancy in a number of places. Not sure if I understood all of that, but I think you're saying that md _could_ disregard read errors _when_already_running_in_degraded_mode_ so as to preserve the array at all cost. Hum. That choice should be left to the user if it happens, he probably knows best what to choose in the circumstances. No really, what would be best is that md made a difference between total media failure and sector failure. If one sector is bad on one drive [and it gets kicked therefore] it should be possible when a further read error occurs on other media, to try and read the missing sector data from the kicked drive, who may well have the data there waiting, intact and all. Don't know how hard that is really, but one could maybe think of pushing a disk in an intermediate state between "failed" and "good" like "in_disgrace" what signals to the end user "Don't remove this disk as yet; we may still need it, but add and resync a spare at your earliest convenience as we're running in degraded mode as of now". Hmm. Complicated stuff. :-) This kind of error will get more and more predominant with growing media and decreasing disk quality. Statistically there is not a huge chance of getting a read failure on a 18GB scsi disk, but on a cheap(ish) 500 GB ATA disk that is an entrirely different ballpark. > I think RAID6 but especially RAID1 is safer. Well, duh :) At the expense of buying everything twice, sure it's safer :)) > A small side note on disk behavior: > If it becomes possible to do block remapping at any level (MD, DM/LVM, > FS) then we might not want to write to sectors with read errors at all > but just remap the corresponding blocks by software as long as we have > free blocks: save disk-internal spare sectors so the disk firmware can > pre-emptively remap degraded but ECC correctable sectors upon read. Well I dunno. In ancient times, the OS was charged with remapping bad sectors back when disk drives had no intelligence. Now we delegated that task to the disk. I'm not sure reverting back to the old behaviour is a smart move. 
But with raid, who knows... And as it is I don't think you get the chance to save the disk-internal spare sectors; the disk handles that transparently, so any higher layer not only cannot prevent it, but is kept completely ignorant of it happening. Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
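Spelled out, the 'assemble --force' escape hatch mentioned above is roughly this (device names are examples; expect damage where the bad sectors were, so check read-only before mounting):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  fsck -n /dev/md0      # read-only sanity check first (reiserfs has its own --check mode)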
* Re: Spares and partitioning huge disks 2005-01-09 21:26 ` maarten @ 2005-01-09 22:29 ` Frank van Maarseveen 2005-01-09 23:16 ` maarten 2005-01-09 23:20 ` Guy 2005-01-10 0:42 ` Spares and partitioning huge disks Guy 2 siblings, 1 reply; 95+ messages in thread From: Frank van Maarseveen @ 2005-01-09 22:29 UTC (permalink / raw) To: linux-raid On Sun, Jan 09, 2005 at 10:26:25PM +0100, maarten wrote: > On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > > > However, IF during that > > > resync one other drive has a read error, it gets kicked too and the array > > > dies. The chances of that happening are not very small; > > > > Ouch! never considered this. So, RAID5 will actually decrease reliability > > in a significant number of cases because: > > > - >1 read errors can cause a total break-down whereas it used > > to cause only a few userland I/O errors, disruptive but not foobar. > > Well, yes and no. You can decide to do a full backup in case you hadn't, backup (or taking snapshots) is orthogonal to this. > prior to changing drives. And if it is _just_ a bad sector, you can 'assemble > --force' yielding what you would've had in a non-raid setup; some file > somewhere that's got corrupted. No big deal, ie. the same trouble as was > caused without raid-5. I doubt that it's the same: either it wil fail totally during the reconstruction or it might fail with a silent corruption. Silent corruptions are a big deal. It won't loudly fail _and_ leave the array operational for an easy fixup later on so I think it's not the same. > > - disk replacement is quite risky. This is totally unexpected to me > > but it should have been obvious: there's no bad block list in MD > > so if we would postpone I/O errors during reconstruction then > > 1: it might cause silent data corruption when I/O error > > unexpectedly disappears. > > 2: we might silently loose redundancy in a number of places. > > Not sure if I understood all of that, but I think you're saying that md > _could_ disregard read errors _when_already_running_in_degraded_mode_ so as > to preserve the array at all cost. We can't. Imagine a 3 disk RAID5 array, one disk being replaced. While writing the new disk we get a single randon read error on one of the other two disks. Ignoring that implies either: 1: making up a phoney data block when a checksum block was hit by the error. 2: generating a garbage checksum block. RAID won't remember these events because there is no bad block list. Now suppose the array is operational again and hits a read error after some random interval. Then either it may: 1: return corrupt data without notice. 2: recalculate a block based on garbage. so, we can't ignore errors during RAID5 reconstruction and we're toast if it happens, even more toast than we would have been with a normal disk (barring the case of an entirely dead disk). If you look at the lower level then of course RAID5 has an advantage but to me it seems to vaporize when exposed to the _complexity_ of handling secondary errors during the reconstruction. -- Frank ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-09 22:29 ` Frank van Maarseveen @ 2005-01-09 23:16 ` maarten 2005-01-10 8:15 ` Frank van Maarseveen 0 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-09 23:16 UTC (permalink / raw) To: linux-raid On Sunday 09 January 2005 23:29, Frank van Maarseveen wrote: > On Sun, Jan 09, 2005 at 10:26:25PM +0100, maarten wrote: > > On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > > > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > Well, yes and no. You can decide to do a full backup in case you hadn't, > > backup (or taking snapshots) is orthogonal to this. Hm. Okay, you're right. > > prior to changing drives. And if it is _just_ a bad sector, you can > > 'assemble --force' yielding what you would've had in a non-raid setup; > > some file somewhere that's got corrupted. No big deal, ie. the same > > trouble as was caused without raid-5. > > I doubt that it's the same: either it wil fail totally during the > reconstruction or it might fail with a silent corruption. Silent > corruptions are a big deal. It won't loudly fail _and_ leave the array > operational for an easy fixup later on so I think it's not the same. I either don't understand this, or I don't agree. Assemble --force effectively disables all sanitychecks, so it just can't "fail" that. The result is therefore an array that either (A) holds a good FS with a couple of corrupted files (silent corruption) or (B) a filesystem that needs [metadata] fixing, or (C) one big mess that hardly resembles a FS. It stands to reason that in case (C) you either made a user error assembling the wrong parts or what you had wasn't a bad sector error in the first place but media failure or another type of disastrous corruption. I've been there. I suffered through a raid-5 two-disk failure, and I've got all of my data back eventually, even if some silent corruptions have happened (though I did not notice it, but that's no wonder with 500.000+ files) It is ugly, and the last resort, but that doesn't mean it can't work. > > > - disk replacement is quite risky. This is totally unexpected to me > > > but it should have been obvious: there's no bad block list in MD > > > so if we would postpone I/O errors during reconstruction then > > > 1: it might cause silent data corruption when I/O error > > > unexpectedly disappears. > > > 2: we might silently loose redundancy in a number of places. > > > > Not sure if I understood all of that, but I think you're saying that md > > _could_ disregard read errors _when_already_running_in_degraded_mode_ so > > as to preserve the array at all cost. > > We can't. Imagine a 3 disk RAID5 array, one disk being replaced. While > writing the new disk we get a single randon read error on one of the > other two disks. Ignoring that implies either: > 1: making up a phoney data block when a checksum block was hit by the > error. 2: generating a garbage checksum block. Well, yes. But some people -when confronted with the choice between losing everything or having silent corruptions- will happily accept the latter. At least you could try to find the bad file(s) by md5sum, whereas in the total failure scenario you're left with nothing. Of course that choice depends on how good and recent your backups are. For my scenario, I wholly depend on md raid to preserve my files; I will not and cannot start backing up TV shows to DLT tape or something. That is a no-no economically. 
There is just no way to backup 700GB data in a home user environment, unless you want to spend a full week to burn it onto 170 DVDs. (Or buy twice the amount of disks and leave them locked in a safe) So I certainly would opt for the "possibility of silent corruption" choice. And if I ever find the corrupted file I delete it and mark it for 'new retrieval" or some such followup. Or restore from tape where applicable. > RAID won't remember these events because there is no bad block list. Now > suppose the array is operational again and hits a read error after some > random interval. Then either it may: > 1: return corrupt data without notice. > 2: recalculate a block based on garbage. Definitely true, but we're still talking about errors on a single block, or a couple of blocks at most. The other 1000.000+ blocks are still okay. Again, it all depends on your circumstances what is worse: losing all the files including the good ones, or having silent corruptions somewhere. > so, we can't ignore errors during RAID5 reconstruction and we're toast > if it happens, even more toast than we would have been with a normal > disk (barring the case of an entirely dead disk). If you look at the > lower level then of course RAID5 has an advantage but to me it seems to > vaporize when exposed to the _complexity_ of handling secondary errors > during the reconstruction. You cut out my entire idea about leaving the 'failed' disk around to eventually being able to compensate a further block error on another media. Why ? It would _solve_ your problem, wouldn't it ? Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
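One low-tech way to make the 'find the corrupted files by md5sum' plan workable later is to keep a checksum manifest while the array is healthy, roughly like this (paths are examples):

  cd /mnt/store && find . -type f -print0 | xargs -0 md5sum > /root/store.md5
  # ...after a scary rebuild, list anything that no longer matches:
  cd /mnt/store && md5sum -c /root/store.md5 | grep -v ': OK$'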
* Re: Spares and partitioning huge disks 2005-01-09 23:16 ` maarten @ 2005-01-10 8:15 ` Frank van Maarseveen 2005-01-14 17:29 ` Dieter Stueken 0 siblings, 1 reply; 95+ messages in thread From: Frank van Maarseveen @ 2005-01-10 8:15 UTC (permalink / raw) To: linux-raid On Mon, Jan 10, 2005 at 12:16:58AM +0100, maarten wrote: > On Sunday 09 January 2005 23:29, Frank van Maarseveen wrote: > > I either don't understand this, or I don't agree. Assemble --force effectively > disables all sanitychecks, ok, wasn't sure about that. but then: > The result is > therefore an array that either (A) holds a good FS with a couple of corrupted > files (silent corruption) > So I certainly would opt for the "possibility of silent corruption" choice. > And if I ever find the corrupted file I delete it and mark it for 'new > retrieval" or some such followup. Or restore from tape where applicable. > > > so, we can't ignore errors during RAID5 reconstruction and we're toast > > if it happens, even more toast than we would have been with a normal > > disk (barring the case of an entirely dead disk). If you look at the > > lower level then of course RAID5 has an advantage but to me it seems to > > vaporize when exposed to the _complexity_ of handling secondary errors > > during the reconstruction. > > You cut out my entire idea about leaving the 'failed' disk around to > eventually being able to compensate a further block error on another media. > Why ? It would _solve_ your problem, wouldn't it ? I did not intend to cut it out but simplified the situation a bit: if you have all the RAID5 disks even with a bunch of errors spread out over all of them then yes, you basically still have the data. Nothing is lost provided there's no double fault and disks are not dead yet. But there are not many technical people I would trust for recovering from this situation. And I wouldn't trust myself without a significant coffee intake either :) -- Frank ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 8:15 ` Frank van Maarseveen @ 2005-01-14 17:29 ` Dieter Stueken 2005-01-14 17:46 ` maarten 2005-01-15 0:13 ` Michael Tokarev 0 siblings, 2 replies; 95+ messages in thread From: Dieter Stueken @ 2005-01-14 17:29 UTC (permalink / raw) To: linux-raid Frank van Maarseveen wrote: > On Mon, Jan 10, 2005 at 12:16:58AM +0100, maarten wrote: >>You cut out my entire idea about leaving the 'failed' disk around to >>eventually being able to compensate a further block error on another media. >>Why ? It would _solve_ your problem, wouldn't it ? > > I did not intend to cut it out but simplified the situation a bit: if > you have all the RAID5 disks even with a bunch of errors spread out over > all of them then yes, you basically still have the data. Nothing is > lost provided there's no double fault and disks are not dead yet. But > there are not many technical people I would trust for recovering from > this situation. And I wouldn't trust myself without a significant > coffee intake either :) I think read errors are to be handled very differently compared to disk failures. In particular the affected disk should not be kicked out incautious. If done so, you waste the real power of the RAID5 system immediately! As long, as any other part of the disk can still be read, this data must be preserved by all means. As long as only parts of a disk (even of different disks) can't be read, it is not a fatal problem, as long as the data can still be read from an other disk of the array. There is no reason to kill any disk in advance. What I'm missing is some improved concept of replacing a disk: Kicking off some disk at first and starting to resync to a spare disk thereafter is a very dangerous approach. Instead some "presync" should be possible: After a decision to replace some disk, the new (spare) disk should be prepared in advance, while all other disks are still running. After the spare disk was successfully prepared, the disk to replace may be disabled. This sounds a bit like RAID6, but it is much simpler. The complicated part may be the phase where I have one additional disk. A simple solution would be to perform a resync offline, while no write takes place. This may even be performed by a userland utility. If I want to perform the "presync" online, I have to carry out writes to both disks simultaneously, while the presync takes place. Dieter. -- Dieter Stüken, con terra GmbH, Münster stueken@conterra.de http://www.conterra.de/ (0)251-7474-501 - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
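A very crude offline version of the 'presync' Dieter describes, assuming the array can be stopped for the duration of the copy and the replacement partition is at least the same size (device names are examples; conv=noerror,sync keeps going past unreadable sectors but leaves zero-filled holes in their place on the new disk):

  mdadm --stop /dev/md0
  dd if=/dev/sdc1 of=/dev/sde1 bs=64k conv=noerror,sync   # clone the ailing member onto its replacement
  mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sde1 /dev/sdd1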
* Re: Spares and partitioning huge disks 2005-01-14 17:29 ` Dieter Stueken @ 2005-01-14 17:46 ` maarten 2005-01-14 19:14 ` Derek Piper 2005-01-15 0:13 ` Michael Tokarev 1 sibling, 1 reply; 95+ messages in thread From: maarten @ 2005-01-14 17:46 UTC (permalink / raw) To: linux-raid Mod parent "+5 Insightful". Very well though out and said, Dieter. Maarten On Friday 14 January 2005 18:29, Dieter Stueken wrote: > Frank van Maarseveen wrote: > > I did not intend to cut it out but simplified the situation a bit: if > > you have all the RAID5 disks even with a bunch of errors spread out over > > all of them then yes, you basically still have the data. Nothing is > > lost provided there's no double fault and disks are not dead yet. But > > there are not many technical people I would trust for recovering from > > this situation. And I wouldn't trust myself without a significant > > coffee intake either :) > > I think read errors are to be handled very differently compared to disk > failures. In particular the affected disk should not be kicked out > incautious. If done so, you waste the real power of the RAID5 system > immediately! As long, as any other part of the disk can still be read, > this data must be preserved by all means. As long as only parts of a disk > (even of different disks) can't be read, it is not a fatal problem, as long > as the data can still be read from an other disk of the array. There is no > reason to kill any disk in advance. > > What I'm missing is some improved concept of replacing a disk: > Kicking off some disk at first and starting to resync to a spare > disk thereafter is a very dangerous approach. Instead some "presync" > should be possible: After a decision to replace some disk, the new > (spare) disk should be prepared in advance, while all other disks are still > running. After the spare disk was successfully prepared, the disk to > replace may be disabled. > > This sounds a bit like RAID6, but it is much simpler. The complicated part > may be the phase where I have one additional disk. A simple solution would > be to perform a resync offline, while no write takes place. This may even > be performed by a userland utility. If I want to perform the "presync" > online, I have to carry out writes to both disks simultaneously, while the > presync takes place. > > Dieter. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-14 17:46 ` maarten @ 2005-01-14 19:14 ` Derek Piper 0 siblings, 0 replies; 95+ messages in thread From: Derek Piper @ 2005-01-14 19:14 UTC (permalink / raw) To: linux-raid Ah, that does sound much better I agree ... having just been bitten by the 'oh dear, I got one bit was out of place, bye bye disk' problem myself. Even if it only 'failed' a 'chunk', it would be an improvement. I'll take 64K over 60GB any day. The read for the chunk could then be calculated using parity and a notification sent upwards saying something to this effect: 'uh, hey, I'm having to regenerate data from disk N at area X on-the-fly (i.e. I'm 'degraded') but all disks are still with us and the other data is not in harms way, you might want to think about backups and possibly a new disk'. If the chunk/sector (choose how much you want to fail) can then be read again, clear the 'alert'. Of course if you get two identical chunks that miss-read, you're screwed. Probably less screwed than if it were whole disk though. Derek On Fri, 14 Jan 2005 18:46:54 +0100, maarten <maarten@ultratux.net> wrote: > > Mod parent "+5 Insightful". > > Very well though out and said, Dieter. > > Maarten > > On Friday 14 January 2005 18:29, Dieter Stueken wrote: > > Frank van Maarseveen wrote: > > > > I did not intend to cut it out but simplified the situation a bit: if > > > you have all the RAID5 disks even with a bunch of errors spread out over > > > all of them then yes, you basically still have the data. Nothing is > > > lost provided there's no double fault and disks are not dead yet. But > > > there are not many technical people I would trust for recovering from > > > this situation. And I wouldn't trust myself without a significant > > > coffee intake either :) > > > > I think read errors are to be handled very differently compared to disk > > failures. In particular the affected disk should not be kicked out > > incautious. If done so, you waste the real power of the RAID5 system > > immediately! As long, as any other part of the disk can still be read, > > this data must be preserved by all means. As long as only parts of a disk > > (even of different disks) can't be read, it is not a fatal problem, as long > > as the data can still be read from an other disk of the array. There is no > > reason to kill any disk in advance. > > > > What I'm missing is some improved concept of replacing a disk: > > Kicking off some disk at first and starting to resync to a spare > > disk thereafter is a very dangerous approach. Instead some "presync" > > should be possible: After a decision to replace some disk, the new > > (spare) disk should be prepared in advance, while all other disks are still > > running. After the spare disk was successfully prepared, the disk to > > replace may be disabled. > > > > This sounds a bit like RAID6, but it is much simpler. The complicated part > > may be the phase where I have one additional disk. A simple solution would > > be to perform a resync offline, while no write takes place. This may even > > be performed by a userland utility. If I want to perform the "presync" > > online, I have to carry out writes to both disks simultaneously, while the > > presync takes place. > > > > Dieter. 
> > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Derek Piper - derek.piper@gmail.com http://doofer.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-14 17:29 ` Dieter Stueken 2005-01-14 17:46 ` maarten @ 2005-01-15 0:13 ` Michael Tokarev 2005-01-15 9:34 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Michael Tokarev @ 2005-01-15 0:13 UTC (permalink / raw) To: linux-raid Dieter Stueken wrote: [] > I think read errors are to be handled very differently compared to disk > failures. In particular the affected disk should not be kicked out > incautious. If done so, you waste the real power of the RAID5 system > immediately! As long, as any other part of the disk can still be read, > this data must be preserved by all means. As long as only parts of a disk > (even of different disks) can't be read, it is not a fatal problem, as long > as the data can still be read from an other disk of the array. There is no > reason to kill any disk in advance. I was once successful at recovering a (quite large for the time) filesystem after multiple read errors developed on two disks running in a raid1 array (as it turned out, the chassis fan was at fault; the disks became too hot, the weather was hot too, and the two disks went bad almost at once). Raid kicked one disk out of the array after the first read error, and, thank God (or whatever), the second disk developed an error right after that, so the data was still in sync. I read everything from one disk (dd conv=noerror), noting the bad blocks, and then read the missing blocks from the second drive (dd skip=n seek=n). I'm afraid to think what would have happened if the second drive had lasted a bit longer (the filesystem was quite active). (And yes, I know it was really me who was at fault, because I didn't enable the various sensors monitoring...) What's more, I was once successful at recovering a raid5 array after a two-disk failure, but it was much more difficult... And I wasn't able to recover all the data that time, simply because I had no time to figure out how to reconstruct data using the parity blocks (I only recovered the data blocks, zeroing unreadable ones). All that to say: yes indeed, this lack of "smart error handling" is a noticeable omission in Linux software raid. There are quite a few (sometimes fatal to the data) failure scenarios that would not have happened had smart error handling been in place. /mjt ^ permalink raw reply [flat|nested] 95+ messages in thread
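For reference, the dd-level recovery Michael describes looks roughly like this; device names, image path and sector numbers are invented for illustration:

  # pull everything readable off the first disk; note the bad LBAs from the kernel log
  dd if=/dev/sda1 of=/data/recovered.img bs=512 conv=noerror,sync
  # suppose sectors 1000000-1000127 were unreadable: patch that span from the second disk
  dd if=/dev/sdb1 of=/data/recovered.img bs=512 skip=1000000 seek=1000000 count=128 conv=notrunc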
* Re: Spares and partitioning huge disks 2005-01-15 0:13 ` Michael Tokarev @ 2005-01-15 9:34 ` Peter T. Breuer 2005-01-15 9:54 ` Mikael Abrahamsson 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-15 9:34 UTC (permalink / raw) To: linux-raid Michael Tokarev <mjt@tls.msk.ru> wrote: > All that to say: yes indeed, this lack of "smart error handling" is > a noticeable omission in Linux software raid. There are quite a few > (sometimes fatal to the data) failure scenarios that would not have > happened had smart error handling been in place. I also agree that "redundancy per block" is probably a much better idea than "redundancy per disk". Probably needs a "how hot are you?" primitive, though! The read patch I posted should get you over glitches from sporadic read errors that would otherwise fault the disk, but one wants to add accounting for such things and watch them. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 9:34 ` Peter T. Breuer @ 2005-01-15 9:54 ` Mikael Abrahamsson 2005-01-15 10:31 ` Brad Campbell 2005-01-15 10:33 ` Peter T. Breuer 0 siblings, 2 replies; 95+ messages in thread From: Mikael Abrahamsson @ 2005-01-15 9:54 UTC (permalink / raw) To: linux-raid On Sat, 15 Jan 2005, Peter T. Breuer wrote: > I also agree that "redundancy per block" is probably a much better idea > than "redundancy per disk". Probably needs a "how hot are you?" > primitive, though! Would a methodology that'll do if read error then recreate the block from parity write to sector that had read error wait until write has completed flush buffers read back block from drive if block still bad fail disk log result This would give the drive a chance to relocate the block to its spare blocks it has available for just this instance? If you get a write error then the drive is obviously (?) out of spare sectors and should be rightfully failed. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 95+ messages in thread
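The 'write back over the bad sector' step can already be done by hand today, which is more or less what the proposal would automate inside md; a sketch with an invented LBA and device (this overwrites that sector, so only do it when the data is recoverable from the mirror or parity):

  BAD_LBA=1234567     # placeholder sector number taken from the kernel's error message
  dd if=/dev/zero of=/dev/sdc bs=512 seek=$BAD_LBA count=1   # rewrite; the drive should remap it
  blockdev --flushbufs /dev/sdc                              # drop any cached copy before re-reading
  dd if=/dev/sdc of=/dev/null bs=512 skip=$BAD_LBA count=1 && echo "readable again"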
* Re: Spares and partitioning huge disks 2005-01-15 9:54 ` Mikael Abrahamsson @ 2005-01-15 10:31 ` Brad Campbell 2005-01-15 11:10 ` Mikael Abrahamsson 2005-01-15 10:33 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Brad Campbell @ 2005-01-15 10:31 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: linux-raid Mikael Abrahamsson wrote: > On Sat, 15 Jan 2005, Peter T. Breuer wrote: > > >>I also agree that "redundancy per block" is probably a much better idea >>than "redundancy per disk". Probably needs a "how hot are you?" >>primitive, though! > > > Would a methodology that'll do > > if read error then > recreate the block from parity > write to sector that had read error In theory this should reallocate the bad sector. > wait until write has completed If the write fails, fail the drive as bad things are going to happen. > flush buffers > read back block from drive > if block still bad > fail disk > log result Make sure the logging is done in such a way as mdadm can send you an E-mail and say. "Hey, sda just had a bad block. Be aware. Brad ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 10:31 ` Brad Campbell @ 2005-01-15 11:10 ` Mikael Abrahamsson 0 siblings, 0 replies; 95+ messages in thread From: Mikael Abrahamsson @ 2005-01-15 11:10 UTC (permalink / raw) To: linux-raid On Sat, 15 Jan 2005, Brad Campbell wrote: > Make sure the logging is done in such a way as mdadm can send you an > E-mail and say. "Hey, sda just had a bad block. Be aware. Definitely. The 3ware daemon does this when it detects the drive did a relocation (I guess it does it via SMART) and I definitely like the way this is done. As far as I can understand, the 3ware hw raid5 will kick any drive that has read errors on it though, but I am not 100% sure of this, this is just my guess from experience with it. I have successfully re-introduced a failed drive into the raid though, so I guess it'll relocate just the way we discussed, when it actually tries to write again to the bad sector. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 9:54 ` Mikael Abrahamsson 2005-01-15 10:31 ` Brad Campbell @ 2005-01-15 10:33 ` Peter T. Breuer 2005-01-15 11:07 ` Mikael Abrahamsson 1 sibling, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-15 10:33 UTC (permalink / raw) To: linux-raid Mikael Abrahamsson <swmike@swm.pp.se> wrote: > if read error then > recreate the block from parity > write to sector that had read error > wait until write has completed > flush buffers > read back block from drive > if block still bad > fail disk > log result Well, I haven't checked the RAID5 code (which is what you seem to be thinking of), but I can tell you that the RAID1 code simply retries a failed read. Unfortunately, it also ejects the disk with the bad read from the array. So it was fairly simple to alter the RAID1 code to "don't do that then". Just remove the line that says to error the disk out, and let the retry code do its bit. One also has to add a counter so that if there is no way left of getting the data, then the read eventually does return an error to the user. Thus far no real problem. The dangerous bit is launching a rewrite of the affected block, which I think one does by placing the ultimately successful read on the queue for the raid1d thread, and changing the cmd type to "special", which should trigger the raid1d thread to do a rewrite from it. But I haven't dared test that yet. I'll revisit that patch over the weekend. Now, all that is what you summarised as recreate the block from parity write to sector that had read error and I don't see any need for much of the rest except log result In particular you seem to be trying to do things synchronously, when that's not at all necessary, or perhaps desirable. The user will get a success notice from the read when end_io is run on the originating request, and we can be doing other things at the same time. The raid code really has a copy of the original request, so we can ack the original while carrying on with other things - we just have to be careful not to lose the buffers with the read data in them (increment reference counts and so on). I'd appreciate Neil's help with that but he hasn't commented on the patch I published so far! Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 10:33 ` Peter T. Breuer @ 2005-01-15 11:07 ` Mikael Abrahamsson 0 siblings, 0 replies; 95+ messages in thread From: Mikael Abrahamsson @ 2005-01-15 11:07 UTC (permalink / raw) To: linux-raid On Sat, 15 Jan 2005, Peter T. Breuer wrote: > In particular you seem to be trying to do things synchronusly, when > that's not at all necessary, or perhaps desirable. The user will get a Well, no, I only tried to summarize what needed to be done, not the way to accomplish it in the best way. Your summary reflecting my own attempt at a summary seems better, but it seems we both agree on the merit of the concept of trying to write to a sector that has given a read error, if we can recreate the data from other sources such as mirror or parity. If this fails, fail the disk. Anyhow, log what happened. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-09 21:26 ` maarten 2005-01-09 22:29 ` Frank van Maarseveen @ 2005-01-09 23:20 ` Guy 2005-01-10 7:42 ` Gordon Henderson 2005-01-10 0:42 ` Spares and partitioning huge disks Guy 2 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-09 23:20 UTC (permalink / raw) To: 'maarten', linux-raid It was said: "> I think RAID6 but especially RAID1 is safer. Well, duh :) At the expense of buying everything twice, sure it's safer :))" Guy says: I disagree with the above. True, RAID6 can lose 2 disks without data loss. But, RAID1 and RAID5 can only lose 1 disk without data loss. If RAID1 or RAID5 had a read error during a re-sync, both would die. Now, RAID5 has more disks, so the odds are increased that a read error could occur. But you can improve those odds by partitioning the disks and creating sub arrays, then combining them. Per Maarten's plan. Of course, having a bad sector should not cause a disk to be kicked out! The RAID software should handle this. Most hardware based RAID systems can handle bad blocks. But this is another issue. Why is RAID1 preferred over RAID5? RAID1 is considered faster than RAID5. Most systems tend to read much more than they write. So, having 2 disks to read from (RAID1) can double your read rate. RAID5 tends to have better seek time in a multi threaded environment (more then 1 seek attempted concurrently). If you test with bonnie++, try 10 bonnies at the same time and note the sum of the seek times (you must add them yourself). With RAID1 it should about double, with RAID5, it depends on the number of disks in the array. Most home systems tend to only do 1 thing at a time. So, most people don't focus on seek time, they tent to focus on sequential read or write rates. In a multi user/process/thread environment, you don't do much sequential I/O, it tends to be random. But, assuming you need the extra space RAID5 yields, if you choose RAID1 instead, you would have many more disks than just 2, in a RAID10 environment you would have the improved seek rates of RAID5 (times ~2) and about double the overall read rate of RAID5. This is why some large systems tend to use RAID1 over RAID5. The largest system I worked on had over 300 disks, configured as RAID1. I think it was over kill on performance, RAID5 would have been just fine! But it was not my choice, also not my money. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Sunday, January 09, 2005 4:26 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > However, IF during that > > resync one other drive has a read error, it gets kicked too and the array > > dies. The chances of that happening are not very small; > > Ouch! never considered this. So, RAID5 will actually decrease reliability > in a significant number of cases because: > - >1 read errors can cause a total break-down whereas it used > to cause only a few userland I/O errors, disruptive but not foobar. Well, yes and no. You can decide to do a full backup in case you hadn't, prior to changing drives. And if it is _just_ a bad sector, you can 'assemble --force' yielding what you would've had in a non-raid setup; some file somewhere that's got corrupted. No big deal, ie. the same trouble as was caused without raid-5. > - disk replacement is quite risky. 
This is totally unexpected to me > but it should have been obvious: there's no bad block list in MD > so if we would postpone I/O errors during reconstruction then > 1: it might cause silent data corruption when I/O error > unexpectedly disappears. > 2: we might silently loose redundancy in a number of places. Not sure if I understood all of that, but I think you're saying that md _could_ disregard read errors _when_already_running_in_degraded_mode_ so as to preserve the array at all cost. Hum. That choice should be left to the user if it happens, he probably knows best what to choose in the circumstances. No really, what would be best is that md made a difference between total media failure and sector failure. If one sector is bad on one drive [and it gets kicked therefore] it should be possible when a further read error occurs on other media, to try and read the missing sector data from the kicked drive, who may well have the data there waiting, intact and all. Don't know how hard that is really, but one could maybe think of pushing a disk in an intermediate state between "failed" and "good" like "in_disgrace" what signals to the end user "Don't remove this disk as yet; we may still need it, but add and resync a spare at your earliest convenience as we're running in degraded mode as of now". Hmm. Complicated stuff. :-) This kind of error will get more and more predominant with growing media and decreasing disk quality. Statistically there is not a huge chance of getting a read failure on a 18GB scsi disk, but on a cheap(ish) 500 GB ATA disk that is an entrirely different ballpark. > I think RAID6 but especially RAID1 is safer. Well, duh :) At the expense of buying everything twice, sure it's safer :)) > A small side note on disk behavior: > If it becomes possible to do block remapping at any level (MD, DM/LVM, > FS) then we might not want to write to sectors with read errors at all > but just remap the corresponding blocks by software as long as we have > free blocks: save disk-internal spare sectors so the disk firmware can > pre-emptively remap degraded but ECC correctable sectors upon read. Well I dunno. In ancient times, the OS was charged with remapping bad sectors back when disk drives had no intelligence. Now we delegated that task to the disk. I'm not sure reverting back to the old behaviour is a smart move. But with raid, who knows... And as it is I don't think you get the chance to save the disk-internal spare sectors; the disk handles that transparently so any higher layer cannot only not prevent that, but is even kept completely ignorant to it happening. Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
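A rough sketch of the '10 bonnies at once' seek test Guy mentions; directory, size and user are placeholders, and the per-run seeks/sec still has to be added up by hand from the results:

  for i in $(seq 1 10); do
      mkdir -p /mnt/test/$i
      bonnie++ -d /mnt/test/$i -s 2048 -u nobody > /tmp/bonnie.$i 2>&1 &
  done
  wait
  tail -n 1 /tmp/bonnie.*    # the last line of each run is the machine-readable summary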
* RE: Spares and partitioning huge disks 2005-01-09 23:20 ` Guy @ 2005-01-10 7:42 ` Gordon Henderson 2005-01-10 9:03 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-10 7:42 UTC (permalink / raw) To: linux-raid On Sun, 9 Jan 2005, Guy wrote: > Why is RAID1 preferred over RAID5? > RAID1 is considered faster than RAID5. Most systems tend to read much more > than they write. You'd think that, wouldn't you? However, - I've recently been doing work to graph disk IO by reading /proc/partitions and feeding it into MRTG - what I saw surprised me, although it really shouldn't. Most of the systems I've been graphing over the past few weeks write all the time and rarely read -I'm putting this down to things like log files being written more or less all the time, and the active data set residing in the filesystem/buffer cache more or less all the time. (also ext3 which wants to write all the time too) However, I guess it all depends on what the server is doing - for a workstion it may well be the case that it does more reads. Have a quick look at http://lion.drogon.net/mrtg/diskIO.html This is a moderately busy web server with a couple of dozen virtual web sites and runs a MUD and several majordomo lists. Blue is writes, Green reads. Note periods of heavy read activity just after midnight when it does a backup (over the 'net to another server and it also sucks another server onto the 'archive' partition), and 2am is when it analyses the web log-files. Also note that it's swapping - this has 256MB of RAM and is due for an upgrade, but swap is keeping it all ticking away nicely. The var partition seems to sustain writes at approx. 200-300 sectors/second... Not a fantastic amount, but I found it rather surprising. (I'll put the MRTG code online for anyone who wants it in a few days and let you know) Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
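In the absence of Gordon's actual scripts, a minimal probe of the kind he describes might look like the following; the field positions assume a 2.4 kernel with block statistics in /proc/partitions, so check the layout on your own machine first (device name is an example, and MRTG expects the four output lines in this order):

  #!/bin/sh
  DEV=hda
  awk -v d=$DEV '$4 == d {print $7; print $11}' /proc/partitions   # cumulative read, write sectors
  uptime | sed 's/,.*//'
  echo $DEV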
* RE: Spares and partitioning huge disks 2005-01-10 7:42 ` Gordon Henderson @ 2005-01-10 9:03 ` Guy 2005-01-10 12:21 ` Stats... [RE: Spares and partitioning huge disks] Gordon Henderson 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-10 9:03 UTC (permalink / raw) To: 'Gordon Henderson', linux-raid You said: "Have a quick look at http://lion.drogon.net/mrtg/diskIO.html" Are you crazy! Quick look, my screen is 1600X1200. Your quick look is over twice that size! :) What is all the red? Oh, it's eye strain! :) Why do your graphs read right to left? It makes my head hurt! Well, I am surprised. I have read somewhere that about a 10 to 1 ratio is common. That's 10 reads per 1 write! Maybe you got your ins and outs reversed! If you data set is small enough to fit in the buffer cache, then you may be correct. I have worked on systems with a database of well over 1T bytes. The system had about 10T bytes total. No way for all that to fit in the buffer cache! But, I don't have any data to deny what you say. Here is my home system using iostat: # iostat -d Linux 2.4.28 (watkins-home.com) 01/10/2005 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn dev8-0 1.77 209.57 10.08 38729760 1862232 dev8-1 1.78 210.75 10.08 38948102 1862232 dev8-2 10.13 522.27 36.50 96518688 6745616 dev8-3 10.52 524.71 37.94 96970042 7011712 dev8-4 10.55 525.05 38.00 97032904 7021816 dev8-5 10.60 524.90 37.99 97004392 7021080 dev8-6 10.59 524.86 38.17 96997424 7054816 dev8-7 10.56 524.89 38.23 97002160 7064552 dev8-8 10.54 524.73 38.25 96973096 7068552 dev8-9 10.54 524.55 37.89 96940336 7001736 dev8-10 10.15 522.54 36.74 96568080 6789584 dev8-11 10.18 522.51 36.52 96562208 6749696 dev8-12 10.21 522.60 36.74 96578592 6790600 dev8-13 10.22 522.67 37.10 96592848 6856800 dev8-14 10.17 522.46 36.95 96552728 6828544 dev8-15 10.19 522.30 36.70 96523856 6782136 The first 2 are my OS disks (RAID1). The others are my 14 disk RAID5. That's about 13 to 1 on my RAID5 array. But I would not claim my system is typical. It is a home system, not a database server or fancy web server. Mostly just a samba server. My email server is a different box. Oops, I just checked my email server, it has 64 meg of RAM and only does email all the time. Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn dev3-0 1.32 2.10 19.72 7284164 68380122 That's 1 to 9. I give up! I got to mirror that system some day! Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Gordon Henderson Sent: Monday, January 10, 2005 2:42 AM To: linux-raid@vger.kernel.org Subject: RE: Spares and partitioning huge disks On Sun, 9 Jan 2005, Guy wrote: > Why is RAID1 preferred over RAID5? > RAID1 is considered faster than RAID5. Most systems tend to read much more > than they write. You'd think that, wouldn't you? However, - I've recently been doing work to graph disk IO by reading /proc/partitions and feeding it into MRTG - what I saw surprised me, although it really shouldn't. Most of the systems I've been graphing over the past few weeks write all the time and rarely read -I'm putting this down to things like log files being written more or less all the time, and the active data set residing in the filesystem/buffer cache more or less all the time. (also ext3 which wants to write all the time too) However, I guess it all depends on what the server is doing - for a workstion it may well be the case that it does more reads. 
Have a quick look at http://lion.drogon.net/mrtg/diskIO.html This is a moderately busy web server with a couple of dozen virtual web sites and runs a MUD and several majordomo lists. Blue is writes, Green reads. Note periods of heavy read activity just after midnight when it does a backup (over the 'net to another server and it also sucks another server onto the 'archive' partition), and 2am is when it analyses the web log-files. Also note that it's swapping - this has 256MB of RAM and is due for an upgrade, but swap is keeping it all ticking away nicely. The var partition seems to sustain writes at approx. 200-300 sectors/second... Not a fantastic amount, but I found it rather surprising. (I'll put the MRTG code online for anyone who wants it in a few days and let you know) Gordon - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
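A one-liner to pull the overall ratio out of the cumulative iostat counters, using the column layout shown above (the /^dev/ match fits that output's device naming):

  iostat -d | awk '/^dev/ {r += $5; w += $6} END {printf "read:write ~ %.1f : 1\n", r/w}'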
* Stats... [RE: Spares and partitioning huge disks] 2005-01-10 9:03 ` Guy @ 2005-01-10 12:21 ` Gordon Henderson 0 siblings, 0 replies; 95+ messages in thread From: Gordon Henderson @ 2005-01-10 12:21 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Mon, 10 Jan 2005, Guy wrote: > You said: > "Have a quick look at > > http://lion.drogon.net/mrtg/diskIO.html" > > Are you crazy! Quick look, my screen is 1600X1200. > Your quick look is over twice that size! :) Well, it's only a few graphs though, and you don't need to scroll horizontally if you don't want to... > What is all the red? Oh, it's eye strain! :) they are just standard MRTG graphs arranged in a grid - mainly intended for my own consumption, but I'm happy to share the code, etc. Give me a few days and I'll tidy it up. After I had a server lose a case fan and overheat I've kinda gone overboard on stats, etc. better to have them than not, I guess.... Eg. http://lion.drogon.net/mrtg/sensors.html http://lion.drogon.net/mrtg/systemStats.html and there are others, but they are even duller... > Why do your graphs read right to left? It makes my head hurt! New data comes in at the left. I read books left to right and I'm left-handed. Sue me.. > Well, I am surprised. I have read somewhere that about a 10 to 1 ratio is > common. That's 10 reads per 1 write! > > Maybe you got your ins and outs reversed! Well.. This did cross my mind! But I did some tests and check with vmstat and I'm fairly confident it's doing the right thing. It does measure sectors (or whatever comes out of /proc/partitions) so it might look more than the number of blocks you may expect the filesystem to be dealing with. If you look at http://lion.drogon.net/mrtg/diskio.hda6-day.png http://lion.drogon.net/mrtg/diskio.hdc6-day.png (small PNG images!) you'll see a green blip at about 11:30... This is the result of 2 runs of: lion:/archive# tar cf /dev/null . So this (and other tests I did when setting it up) makes me confident it's doing the right thing! > If you data set is small enough to fit in the buffer cache, then you may > be correct. I have worked on systems with a database of well over 1T > bytes. The system had about 10T bytes total. No way for all that to fit > in the buffer cache! But, I don't have any data to deny what you say. Sure - this server is only a small one with 2 disks and a couple of dozen web sites - it seems to churn out just under half a GB a day, but the log-files are written and grow all the time )-: This might be a consideration for further tuning though if it were to get worse... In any case, the original comment about reads overshadowing writes may well be true, but at the end of the day it really does depend on the application and use of the server, and I think a few people might be surprised at exactly what their servers are getting up to - especially database servers which write log-files... All my web servers seem to exhibit this behaviour though (part of my business is server hosting) - heres a screeen dump of a small(ish) company central server doing Intranet, file serving, (samba + NFS) and email for about 30 people (it's a 4-disk RAID5 configuration) http://www.drogon.net/pix.png (small 40KB image) This is just the overall totals of the 4 disks - I can't show you the rest easilly as it's behind a firewall... It seems to exhibit the same behaviour - constant writes with the exception of overnights just after midnight when it takes a snapshot of itself then dumps the snapshot to tape. The read-blip at 6AM is the locate DB update. 
(And at 8am I run a 'du' over some of the disks so I can use xdu and
point fingers at disk space hogs.)

Most of the writes are to the /var partition - log-files, and it's
probably the email server whinging about SPAM (it runs sendmail, MD &
SA). The partition that holds the user-data is a bit more balanced - the
number of reads & writes are about the same, although writes are
marginally more.

> Here is my home system using iostat:

Hm. Debian doesn't seem to have 'iostat' - must look it up...

> # iostat -d
> Linux 2.4.28 (watkins-home.com)   01/10/2005
>
> Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> dev8-0      1.77       209.57        10.08   38729760    1862232
> dev8-1      1.78       210.75        10.08   38948102    1862232
> dev8-2     10.13       522.27        36.50   96518688    6745616
> dev8-3     10.52       524.71        37.94   96970042    7011712
> dev8-4     10.55       525.05        38.00   97032904    7021816
> dev8-5     10.60       524.90        37.99   97004392    7021080
> dev8-6     10.59       524.86        38.17   96997424    7054816
> dev8-7     10.56       524.89        38.23   97002160    7064552
> dev8-8     10.54       524.73        38.25   96973096    7068552
> dev8-9     10.54       524.55        37.89   96940336    7001736
> dev8-10    10.15       522.54        36.74   96568080    6789584
> dev8-11    10.18       522.51        36.52   96562208    6749696
> dev8-12    10.21       522.60        36.74   96578592    6790600
> dev8-13    10.22       522.67        37.10   96592848    6856800
> dev8-14    10.17       522.46        36.95   96552728    6828544
> dev8-15    10.19       522.30        36.70   96523856    6782136

That's a rather busy home server - are you streaming music/video off it?
Mine (just a RAID1 system) sits with the disk spun down 95% of the
time... (noflushd) However, it's due for a rebuild when it'll turn into
a home 'media server', then it might get a little bit more lively!

> Oops, I just checked my email server, it has 64 meg of RAM and only
> does email all the time.
>
> Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> dev3-0      1.32         2.10        19.72    7284164   68380122
>
> That's 1 to 9. I give up!

Log-files... Love 'em or loathe 'em...

> I got to mirror that system some day!

Go for it, you know you want to :)

Gordon
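A small aside: the 1-to-9 figure above can be read straight off the
cumulative Blk_read/Blk_wrtn columns. A one-liner along these lines does
the division per device (the /^dev/ pattern assumes the 2.4-style device
naming shown in Guy's output):

  iostat -d | awk '/^dev/ && $6 > 0 { printf "%-10s read:write = %.2f\n", $1, $5 / $6 }'

For dev3-0 above that gives roughly 0.11, i.e. about one block read for
every nine written.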
* RE: Spares and partitioning huge disks
  2005-01-09 21:26   ` maarten
  2005-01-09 22:29     ` Frank van Maarseveen
  2005-01-09 23:20     ` Guy
@ 2005-01-10  0:42     ` Guy
  2 siblings, 0 replies; 95+ messages in thread
From: Guy @ 2005-01-10 0:42 UTC (permalink / raw)
To: 'maarten', linux-raid

I really like the "in_disgrace" idea! But not for a simple bad block.
Those should be corrected by recovering the redundant copy and
re-writing it to correct the bad block.

If you kick the disk out, but still depend on it if another disk gets a
read error, then you must maintain a list of changed blocks, or stripes.
If a block or stripe has changed, you could not read the data from the
"in_disgrace" disk, since it would not have current data. This list must
be maintained after a re-boot, or the "in_disgrace" disk must be failed
if the list is lost.

"in_disgrace" would be good for write errors (maybe the drive ran out of
spare blocks), or maybe read errors that exceed some user-defined,
per-disk threshold.

"in_disgrace" would be a good way to replace a failed disk! Assume a
disk has failed and a spare has been re-built. You now have a
replacement disk.
  Remove the failed disk.
  Add the replacement disk, which becomes a spare.
  Set the spare to "in_disgrace". :)
  System is not degraded.
  Rebuild starts to spare the "in_disgrace" disk.
  Rebuild finishes, the "in_disgrace" disk is changed to failed.

It does not change what I have said before, but the label "in_disgrace"
makes it much easier to explain!!!!!!

Guy

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten
Sent: Sunday, January 09, 2005 4:26 PM
To: linux-raid@vger.kernel.org
Subject: Re: Spares and partitioning huge disks

On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote:
> On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote:
> > However, IF during that resync one other drive has a read error, it
> > gets kicked too and the array dies. The chances of that happening
> > are not very small;
>
> Ouch! never considered this. So, RAID5 will actually decrease
> reliability in a significant number of cases because:
> - >1 read errors can cause a total break-down whereas it used
>   to cause only a few userland I/O errors, disruptive but not foobar.

Well, yes and no. You can decide to do a full backup in case you hadn't,
prior to changing drives. And if it is _just_ a bad sector, you can
'assemble --force', yielding what you would've had in a non-raid setup:
some file somewhere that's got corrupted. No big deal, i.e. the same
trouble as was caused without raid-5.

> - disk replacement is quite risky. This is totally unexpected to me
>   but it should have been obvious: there's no bad block list in MD
>   so if we would postpone I/O errors during reconstruction then
>   1: it might cause silent data corruption when an I/O error
>      unexpectedly disappears.
>   2: we might silently lose redundancy in a number of places.

Not sure if I understood all of that, but I think you're saying that md
_could_ disregard read errors _when_already_running_in_degraded_mode_ so
as to preserve the array at all cost. Hum. That choice should be left to
the user if it happens; he probably knows best what to choose in the
circumstances.

No really, what would be best is that md made a distinction between
total media failure and sector failure.
If one sector is bad on one drive [and it gets kicked out therefore], it
should be possible, when a further read error occurs on other media, to
try and read the missing sector data from the kicked drive, which may
well have the data there waiting, intact and all. Don't know how hard
that is really, but one could maybe think of pushing a disk into an
intermediate state between "failed" and "good", like "in_disgrace",
which signals to the end user: "Don't remove this disk as yet; we may
still need it, but add and resync a spare at your earliest convenience,
as we're running in degraded mode as of now".

Hmm. Complicated stuff. :-)

This kind of error will get more and more predominant with growing media
and decreasing disk quality. Statistically there is not a huge chance of
getting a read failure on an 18GB SCSI disk, but on a cheap(ish) 500GB
ATA disk that is an entirely different ballpark.

> I think RAID6 but especially RAID1 is safer.

Well, duh :)  At the expense of buying everything twice, sure it's
safer :))

> A small side note on disk behavior:
> If it becomes possible to do block remapping at any level (MD, DM/LVM,
> FS) then we might not want to write to sectors with read errors at all
> but just remap the corresponding blocks by software as long as we have
> free blocks: save disk-internal spare sectors so the disk firmware can
> pre-emptively remap degraded but ECC-correctable sectors upon read.

Well, I dunno. In ancient times the OS was charged with remapping bad
sectors, back when disk drives had no intelligence. Now we've delegated
that task to the disk. I'm not sure reverting back to the old behaviour
is a smart move. But with raid, who knows...

And as it is, I don't think you get the chance to save the disk-internal
spare sectors; the disk handles that transparently, so any higher layer
not only cannot prevent it, but is even kept completely ignorant of it
happening.

Maarten
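Neither the "in_disgrace" state nor a roaming bad-block list exists in
md as discussed here; the closest you can get with the standard tools is
sketched below, using ordinary mdadm options (the device names are
invented for the example).

  # Two members kicked by read errors, data most likely still intact:
  # force-assemble from the freshest superblocks, then copy data off.
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

  # Replacing a failed disk once a hot-spare has finished rebuilding:
  mdadm /dev/md0 --remove /dev/sdb1    # drop the dead member
  mdadm /dev/md0 --add /dev/sde1       # the new disk becomes a spare

Note that the array still passes through a degraded window while a spare
rebuilds, which is exactly the gap the "in_disgrace" proposal is trying
to close.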
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel
@ 2005-01-16 21:28 Mitchell Laks
  2005-01-16 22:49 ` Maarten
  2005-01-17 11:41 ` Gordon Henderson
  0 siblings, 2 replies; 95+ messages in thread
From: Mitchell Laks @ 2005-01-16 21:28 UTC (permalink / raw)
To: linux-raid

Thank you to Gordon, Maarten and Guy for your helpful responses. I
learned much from each of your comments.

Maarten: I paid $70 for an Antec SL450 power supply. That seems a better
price than what you are quoting (is your power supply better?). Also, I
liked the idea of the 6+3+3 slots on your box, but I don't see it for
sale in the US.

Gordon: I get the same output on the 2.6.8 sarge kernel for the hpt366
driver. I notice, running hdparm /dev/hde, that IO_support is set at the
default 16-bit, while the other hard drive on the native IDE bus,
/dev/hdb, has IO_support at 32-bit. I wondered if getting the other
driver would improve things...

Guy:
> echo 100000 > /proc/sys/dev/raid/speed_limit_max
> I added this to /etc/sysctl.conf
> # RAID rebuild min/max speed K/Sec per device
> dev.raid.speed_limit_min = 1000
> dev.raid.speed_limit_max = 100000

I notice that according to the man page the settings you describe are
the defaults. Why did you have to adjust them?

Moreover, when I cat
  /proc/sys/dev/raid/speed_limit_max  I get  200000
  /proc/sys/dev/raid/speed_limit_min  I get  1000

Interestingly, I didn't adjust them up myself... Should I adjust
speed_limit_max down to 100000? I wonder where in Debian it got adjusted
up?

Thanks
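For anyone wanting to check the same settings on their own box, the
commands below show where the values being discussed live. The numbers
are just the ones from this thread, not recommendations.

  hdparm -c /dev/hde            # show IO_support (16-bit vs 32-bit)
  hdparm -c1 /dev/hde           # enable 32-bit I/O support
  cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
  echo 100000 > /proc/sys/dev/raid/speed_limit_max   # takes effect at once
  # or persistently, via /etc/sysctl.conf:
  #   dev.raid.speed_limit_min = 1000
  #   dev.raid.speed_limit_max = 100000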
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel
  2005-01-16 21:28 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
@ 2005-01-16 22:49 ` Maarten
  2005-01-17 11:41 ` Gordon Henderson
  1 sibling, 0 replies; 95+ messages in thread
From: Maarten @ 2005-01-16 22:49 UTC (permalink / raw)
To: linux-raid

On Sunday 16 January 2005 22:28, Mitchell Laks wrote:
> Thank you to Gordon, Maarten and Guy for your helpful responses. I
> learned much from each of your comments.
>
> Maarten: I paid $70 for an Antec SL450 power supply. That seems a
> better price than what you are quoting (is your power supply better?).

Heh. At double that price, I would sure as hell hope so...!! ;-)

If the Tagan is comparable to anything from Antec, it would be to the
True480, not the SL450. The Tagan is inaudible; if there were no case
and CPU fans whirring and no mainboard LEDs lit, you couldn't tell
whether the unit was on or off. Add to that the fact that this PSU has
SO many connectors I could connect all 10 harddrives without using a
single splitter(!), that all connectors are gold plated, that it weighs
more than the average complete case, and that it is very efficient...
But judge for yourself:

  http://www.3dvelocity.com/reviews/tagan/tg480.htm

I really don't buy such extreme hardware usually. But since I put the
life of 1.4TB of raid-5 data (or 2TB of raw disk capacity) in its hands,
it seemed like a good idea at the time. Actually, it still does. :-)

Maarten
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel
  2005-01-16 21:28 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
  2005-01-16 22:49 ` Maarten
@ 2005-01-17 11:41 ` Gordon Henderson
  1 sibling, 0 replies; 95+ messages in thread
From: Gordon Henderson @ 2005-01-17 11:41 UTC (permalink / raw)
To: Mitchell Laks; +Cc: linux-raid

On Sun, 16 Jan 2005, Mitchell Laks wrote:

> Thank you to Gordon, Maarten and Guy for your helpful responses. I
> learned much from each of your comments.
>
> Gordon: I get the same output on the 2.6.8 sarge kernel for the hpt366
> driver. I notice, running hdparm /dev/hde, that IO_support is set at
> the default 16-bit, while the other hard drive on the native IDE bus,
> /dev/hdb, has IO_support at 32-bit. I wondered if getting the other
> driver would improve things...

I get the same - 16-bit. However, on that particular box I also get
16-bit for the on-board controller (it is 6 years old, though, with a
single 32-bit, 33MHz PCI bus!) On other servers with a much more modern
mobo (dual Athlon systems) I see 32-bit for the on-board controller and
16-bit for the PCI Promise controllers they have (I don't have anything
else with a Highpoint card).

I'm not really up on PCI bus architecture, but I suspect the only impact
will be a doubling of the number of transactions going over the PCI bus
- probably not really an issue unless you have lots of PCI devices on
the same bus which all need to talk to each other, or to something
external (e.g. Ethernet).

FWIW: I got a reply back from HighPoint about my question on running it
under 2.6.10:

  Thanks for your contacting us! Our current OpenSource driver doesn't
  support kernel 2.6.10.

And that was all they had to say. Ho hum.

> I notice that according to the man page the settings you describe are
> the defaults. Why did you have to adjust them?
>
> Moreover, when I cat
>   /proc/sys/dev/raid/speed_limit_max  I get  200000
>   /proc/sys/dev/raid/speed_limit_min  I get  1000
>
> Interestingly, I didn't adjust them up myself... Should I adjust
> speed_limit_max down to 100000? I wonder where in Debian it got
> adjusted up?

I don't think this is a distribution issue at all - certainly Debian
doesn't do anything with it, and nothing appears to be inserted into
/etc/sysctl.conf. 100000 seems to have been the default since at least
2.4.22 (the oldest kernel I have running s/w RAID).

Gordon
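If you do want to rule the distribution out yourself, a quick check
along these lines (paths are guesses for a Debian box of that era) will
show whether anything sets the raid sysctls at boot, and what the
running values are:

  grep -rn speed_limit /etc/sysctl.conf /etc/init.d 2>/dev/null
  sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max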
Thread overview: 95+ messages
2005-01-06 14:16 Spares and partitioning huge disks maarten
2005-01-06 16:46 ` Guy
2005-01-06 17:08 ` maarten
2005-01-06 17:31 ` Guy
2005-01-06 18:18 ` maarten
[not found] ` <41DD83DA.9040609@h3c.com>
2005-01-06 19:42 ` maarten
2005-01-07 20:59 ` Mario Holbe
2005-01-07 21:57 ` Guy
2005-01-08 10:22 ` Mario Holbe
2005-01-08 12:19 ` maarten
2005-01-08 16:33 ` Guy
2005-01-08 16:58 ` maarten
2005-01-08 14:52 ` Frank van Maarseveen
2005-01-08 15:50 ` Mario Holbe
2005-01-08 16:32 ` Guy
2005-01-08 17:16 ` maarten
2005-01-08 18:55 ` Guy
2005-01-08 19:25 ` maarten
2005-01-08 20:33 ` Mario Holbe
2005-01-08 23:01 ` maarten
2005-01-09 10:10 ` Mario Holbe
2005-01-09 16:23 ` Guy
2005-01-09 16:36 ` Michael Tokarev
2005-01-09 17:52 ` Peter T. Breuer
2005-01-09 17:59 ` Michael Tokarev
2005-01-09 18:34 ` Peter T. Breuer
2005-01-09 20:28 ` Guy
2005-01-09 20:47 ` Peter T. Breuer
2005-01-10 7:19 ` Peter T. Breuer
2005-01-10 9:05 ` Guy
2005-01-10 9:38 ` Peter T. Breuer
2005-01-10 12:31 ` Peter T. Breuer
2005-01-10 13:19 ` Peter T. Breuer
2005-01-10 18:37 ` Peter T. Breuer
2005-01-11 11:34 ` Peter T. Breuer
2005-01-08 23:09 ` Guy
2005-01-09 0:56 ` maarten
2005-01-13 2:05 ` Neil Brown
2005-01-13 4:55 ` Guy
2005-01-13 9:27 ` Peter T. Breuer
2005-01-13 15:53 ` Guy
2005-01-13 17:16 ` Peter T. Breuer
2005-01-13 20:40 ` Guy
2005-01-13 23:32 ` Peter T. Breuer
2005-01-14 2:43 ` Guy
2005-01-08 16:49 ` maarten
2005-01-08 19:01 ` maarten
2005-01-10 16:34 ` maarten
2005-01-10 16:36 ` Gordon Henderson
2005-01-10 17:10 ` maarten
2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
2005-01-16 17:53 ` Gordon Henderson
2005-01-16 18:22 ` Maarten
2005-01-16 19:39 ` Guy
2005-01-16 20:55 ` Maarten
2005-01-16 21:58 ` Guy
2005-01-10 17:13 ` Spares and partitioning huge disks Guy
2005-01-10 17:35 ` hard disk re-locates bad block on read Guy
2005-01-11 14:34 ` Tom Coughlan
2005-01-11 22:43 ` Guy
2005-01-12 13:51 ` Tom Coughlan
2005-01-10 18:24 ` Spares and partitioning huge disks maarten
2005-01-10 20:09 ` Guy
2005-01-10 21:21 ` maarten
2005-01-11 1:04 ` maarten
2005-01-10 18:40 ` maarten
2005-01-10 19:41 ` Guy
2005-01-12 11:41 ` RAID-6 Gordon Henderson
2005-01-13 2:11 ` RAID-6 Neil Brown
2005-01-15 16:12 ` RAID-6 Gordon Henderson
2005-01-17 8:04 ` RAID-6 Turbo Fredriksson
2005-01-11 10:09 ` Spares and partitioning huge disks KELEMEN Peter
2005-01-09 19:33 ` Frank van Maarseveen
2005-01-09 21:26 ` maarten
2005-01-09 22:29 ` Frank van Maarseveen
2005-01-09 23:16 ` maarten
2005-01-10 8:15 ` Frank van Maarseveen
2005-01-14 17:29 ` Dieter Stueken
2005-01-14 17:46 ` maarten
2005-01-14 19:14 ` Derek Piper
2005-01-15 0:13 ` Michael Tokarev
2005-01-15 9:34 ` Peter T. Breuer
2005-01-15 9:54 ` Mikael Abrahamsson
2005-01-15 10:31 ` Brad Campbell
2005-01-15 11:10 ` Mikael Abrahamsson
2005-01-15 10:33 ` Peter T. Breuer
2005-01-15 11:07 ` Mikael Abrahamsson
2005-01-09 23:20 ` Guy
2005-01-10 7:42 ` Gordon Henderson
2005-01-10 9:03 ` Guy
2005-01-10 12:21 ` Stats... [RE: Spares and partitioning huge disks] Gordon Henderson
2005-01-10 0:42 ` Spares and partitioning huge disks Guy
-- strict thread matches above, loose matches on Subject: below --
2005-01-16 21:28 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
2005-01-16 22:49 ` Maarten
2005-01-17 11:41 ` Gordon Henderson