* Spares and partitioning huge disks
From: maarten @ 2005-01-06 14:16 UTC
To: linux-raid
Hi
I just got my 4 new 250GB disks.  I read someone on this list advocating
that it is better to build arrays from smaller volumes, as that decreases the
chance of failure, especially a failure of two disks in a raid5 configuration.
The idea behind it was that since a drive gets kicked when a read error
occurs, the chance that a 40 GB part develops a read error is lower than for
the full-size 250 GB.  Thus, if you have 24 40GB parts, there is no fatal
two-disk failure when part sda6 and part sdc4 develop a bad sector at the
same time.  On the other hand, if the (full-size) disks sda1 and sdc1 do fail
at the same time, you're in deep shit.

I thought it was really insightful, so I would like to try that now.
(Thanks to the original poster, I don't recall your name, sorry.)
Now my two questions regarding this:

1) What is better: make 6 raid5 arrays consisting of all the 40GB partitions
and group them in an LVM set, or group them in a raid-0 set (if the latter is
even possible, that is)?  A sketch of the layout I have in mind follows
below.
2) Seeing as the 'physical' volumes are now 40 GB, I could add an older 80GB
disk partitioned into two 40GB halves, and use those two as hot-spares.
However, for that to work you'd have to be able to add the spares to _all_
raid sets, not to specific ones, if you understand what I mean.  So they
would act as 'roaming' spares, and they would get used by the first array
that needs a spare (when a failure occurs, of course).  But... is this
possible?
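
Concretely, the six arrays would be created roughly like this (just a
sketch; it assumes the four disks show up as sda-sdd and have already been
split into six ~40GB partitions each):

    # one RAID5 array per partition "slot", across all four disks
    for i in 1 2 3 4 5 6; do
        mdadm --create /dev/md$((i-1)) --level=5 --raid-devices=4 \
              /dev/sda$i /dev/sdb$i /dev/sdc$i /dev/sdd$i
    done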
Thanks for any insights!
Maarten
--
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-06 16:46 UTC
To: 'maarten', linux-raid

This is from "man mdadm":

    As well as reporting events, mdadm may move a spare drive from one
    array to another if they are in the same spare-group and if the
    destination array has a failed drive but no spares.

You can do what you want.  I have never tried.  My arrays are too different.
I don't want to waste an 18Gig spare on a 256M array.

Guy
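
For what it's worth, a spare-group setup for this scheme might look roughly
like the following in /etc/mdadm.conf (device names and the group name are
made up; the spare partition itself would first be added to one of the
arrays, and mdadm in monitor mode then moves it to whichever array in the
group loses a member):

    DEVICE /dev/sd[abcd]* /dev/hde*
    ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1 spare-group=slot
    ARRAY /dev/md1 devices=/dev/sda2,/dev/sdb2,/dev/sdc2,/dev/sdd2 spare-group=slot
    # ... and so on for md2 through md5 ...

and then something like "mdadm --monitor --scan --daemonise --mail=root"
has to be running for the spare to actually be moved.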
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-06 17:08 UTC
To: linux-raid

On Thursday 06 January 2005 17:46, Guy wrote:
> This is from "man mdadm":

Oops, really?  Sorry, I should have checked that myself.  Mea culpa.

> As well as reporting events, mdadm may move a spare drive from one
> array to another if they are in the same spare-group and if the
> destination array has a failed drive but no spares.
>
> You can do what you want.  I have never tried.  My arrays are too
> different.  I don't want to waste an 18Gig spare on a 256M array.

Same (but different) idea here: I'd hate to waste a 250GB disk on a spare. :-)

I still have one LVM issue to figure out.  The documentation says you should
set the partition type to 0x8e, but I can't find whether that applies to md
devices too, and if it does, how you set it.  By running 'fdisk /dev/mdx'?
Well, why not, I suppose, but I've never run fdisk on an md device, only
mkfs.*

Thanks Guy,
Maarten
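
(As far as I can tell, the 0x8e type only matters for partitions on plain
disks, where it acts as a hint for vgscan; an md device carries no partition
table, so fdisk is not needed at all and the whole device is handed to LVM
directly.  A sketch:

    pvcreate /dev/md0      # no partition table or 0x8e type involved
    pvdisplay /dev/md0     # sanity check

The same would go for the other five md devices.)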
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-06 17:31 UTC
To: 'maarten', linux-raid

This idea of splitting larger disks into smaller partitions, then
re-assembling them, seems odd.  But it should help with the "bad block kicks
out a disk" problem.

I have never used RAID0.  I have never used more than 1 PV with LVM on Linux.
However, if you are going to use LVM anyway, why not let LVM assemble the
disks?  I do that sort of thing all the time with HP-UX.  I create striped
mirrors using 4 or more disks.  With HP-UX, use the -D option with lvcreate.
No idea if Linux LVM can stripe.

You are making me think!  I hate that! :)  Since your 6 RAID5 arrays are on
the same 4 disks, striping them will kill performance.  The poor heads will
be going from one end to the other all the time.  You should use LINEAR if
you combine them with md.  If you use LVM, make sure it does not stripe them.
With LVM on HP-UX, the default behavior is not to stripe.

Guy
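
If the md route is taken, a linear array on top of the six RAID5 arrays
would look something like this (a sketch; the md6 name and the filesystem
are arbitrary choices):

    mdadm --create /dev/md6 --level=linear --raid-devices=6 \
          /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    mkfs.ext3 /dev/md6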
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-06 18:18 UTC
To: linux-raid

On Thursday 06 January 2005 18:31, Guy wrote:
> This idea of splitting larger disks into smaller partitions, then
> re-assembling them, seems odd.  But it should help with the "bad block
> kicks out a disk" problem.

Yes.  And I'm absolutely sure I read it on linux-raid, a couple of months
back.

> However, if you are going to use LVM anyway, why not let LVM assemble
> the disks?  I do that sort of thing all the time with HP-UX.  I create
> striped mirrors using 4 or more disks.  With HP-UX, use the -D option
> with lvcreate.  No idea if Linux LVM can stripe.

I think so.  But I am more familiar with md, so I'll still use that.  In any
case LVM's striping is akin to raid-0, whereas I will definitely use raid-5.

> You are making me think!  I hate that! :)

;-)  Terrible, isn't it.

> Since your 6 RAID5 arrays are on the same 4 disks, striping them will
> kill performance.  The poor heads will be going from one end to the
> other all the time.  You should use LINEAR if you combine them with md.
> If you use LVM, make sure it does not stripe them.  With LVM on HP-UX,
> the default behavior is not to stripe.

Exactly what I thought.  That they are on the same disks should not matter;
only when one full md set ((4-1)*40GB = 120GB) is full (or used, or whatever)
will the "access" move on to the next set of drives.  It is indeed imperative
NOT to have LVM striping (nor to use raid-0, thanks for observing that!), as
that would be totally counterproductive and would thus kill performance
(r/w head thrashing).

For all clarity, this is how it would look:

md0 : active raid5 sda1[0] sdb1[1] sdc1[2] sdd1[3]
      40000000 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
...
...
...
md5 : active raid5 sda6[0] sdb6[1] sdc6[2] sdd6[3]
      40000000 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

The LVM part is still new to me, but the goal is simply to add all PVs
/dev/md0 through /dev/md5 to one VG and carve one LV out of it, yielding...
well, a very large volume. :-)

I was planning to do this quickly tonight, but I've overlooked one essential
thing ;-|  The old server already has 220 GB of data on 4 80GB disks in
raid-5.  But I cannot connect all 8 disks at the same time, so I'll have to
'free up' another system to define the arrays and copy the data over Gbit
LAN.  I definitely don't want to lose the data!  What complicates this a bit
is that I wanted to copy the OS verbatim (it is not part of that raid-5 set,
just raid-1).  But I suppose booting a rescue CD would enable me to somehow
netcat the OS over to the new disks...  We'll see.  But for now I'm searching
my home for a spare system with SATA onboard...  :-)

Maarten
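
The LVM side of that plan would presumably look something like this (an
untested sketch; the volume group and LV names are made up, and lvcreate
only stripes if explicitly asked to with -i, so the default gives the
linear, non-thrashing layout discussed above):

    pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    vgcreate bigvg /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    vgdisplay bigvg                    # note the Free PE count
    lvcreate -l <free_PE_count> -n data bigvg   # plain linear LV, no striping
    mkfs.ext3 /dev/bigvg/data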
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-06 19:42 UTC
To: Mike Hardy, linux-raid

On Thursday 06 January 2005 19:30, Mike Hardy wrote:
> maarten wrote:
> > I was planning to do this quickly tonight, but I've overlooked one
> > essential thing ;-|  The old server already has 220 GB of data on 4
> > 80GB disks in raid-5.  But I cannot connect all 8 disks at the same
> > time, so I'll have to 'free up' another system to define the arrays
> > and copy the data over Gbit LAN.  I definitely don't want to lose the
> > data!  What complicates this a bit is that I wanted to copy the OS
> > verbatim (it is not part of that raid-5 set, just raid-1).  But I
> > suppose booting a rescue CD would enable me to somehow netcat the OS
> > over to the new disks...  We'll see.
>
> You could degrade the current raid5 set by plugging one of the new
> drives in and copying the 220GB to it directly, then you could build the
> new raid5 sets with one drive "missing" and then finally dump the data
> on the new raid5's and then hotadd the missing drive

Hey.  I knew that trick of course, but only now that you mention it do I
realize that indeed one single new disk is big enough to hold all of the old
data.  Stupid :)  I never thought of that.  Go figure.  Those disks get BIG
indeed...! :-))

Right now I took my main(*) fileserver offline, unplugged all the disks from
it and connected the new disks.  Using a RIP CDrom I partitioned them and
'mkraid'ed (mdadm not yet being on RIP) the first md device, which will hold
the OS.  As we speak the netcat session is running, so if I didn't make any
typos or thinkos, soon I will hopefully reboot to a full-fledged system.

(*) These new disks are not for my _fileserver_, but for my MythTV PVR, the
machine they will eventually end up in, which holds the 220 GB of TV &
movies.

So far, so good...

Maarten
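
Mike's trick, spelled out as a rough sketch (device names hypothetical):
build each new array with one member given as "missing", dump the data onto
the degraded arrays, and hot-add the last partition once the disk holding
the temporary copy is free again:

    # create the array degraded, keeping sdd aside to hold the copied data
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/sda1 /dev/sdb1 /dev/sdc1 missing
    # ...copy the data in, dismantle the old array, then:
    mdadm /dev/md0 --add /dev/sdd1    # md rebuilds parity onto it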
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-07 20:59 UTC
To: linux-raid

maarten <maarten@ultratux.net> wrote:
> I just got my 4 new 250GB disks.  I read someone on this list advocating
> that it is better to build arrays from smaller volumes, as that decreases
> the chance of failure, especially a failure of two disks in a raid5
> configuration.

This might be true for read errors.  However, if a whole disk dies (perhaps
because the IDE controller fails, assuming you're using IDE disks, or
because of a temperature failure or something like that) with a couple of
partitions on it, you get a lot of simultaneously failing 'disks'
(partitions), which would completely kill your RAID5, because RAID5 can IMHO
only recover from one failing device.  I'd assume such a setup would kill
you in this case, while with only 4 devices (whole 250G disks) you'd survive
it.  I'm quite sure one could get it pieced back together with more or less
expert knowledge, but I believe the complete RAID would stop processing
first.

Just to make this clear: all of this is spontaneous assumption, I have never
played with RAID5.

regards,
   Mario
--
I heard, if you play a NT-CD backwards, you get satanic messages...
That's nothing.  If you play it forwards, it installs NT.
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-07 21:57 UTC
To: 'Mario Holbe', linux-raid

His plan is to split the disks into 6 partitions.
Each of his six RAID5 arrays will only use 1 partition of each physical disk.
If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed disk.
If he gets 2 read errors, on different disks, at the same time, he has a 1/6
chance they would be in the same array (which would be bad).

Everything SHOULD work just fine. :)

His plan is to combine the 6 arrays with LVM or a linear array.

Guy
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-08 10:22 UTC
To: linux-raid

Guy <bugzilla@watkins-home.com> wrote:
> Each of his six RAID5 arrays will only use 1 partition of each physical
> disk.
> His plan is to combine the 6 arrays with LVM or a linear array.

Ah, I just missed that part, sorry & thanks :)
I agree with you then.  It's something like a RAID5+0 (analogous to RAID1+0)
and it *should* work just fine :)

regards,
   Mario
--
Independence Day: Fortunately, the alien computer operating system works
just fine with the laptop.  This proves an important point which Apple
enthusiasts have known for years.  While the evil empire of Microsoft may
dominate the computers of Earth people, more advanced life forms clearly
prefer Macs.
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 12:19 UTC
To: linux-raid

On Saturday 08 January 2005 11:22, Mario Holbe wrote:
> Guy <bugzilla@watkins-home.com> wrote:
> > Each of his six RAID5 arrays will only use 1 partition of each physical
> > disk.
> > His plan is to combine the 6 arrays with LVM or a linear array.
>
> Ah, I just missed that part, sorry & thanks :)
> I agree with you then.  It's something like a RAID5+0 (analogous to
> RAID1+0) and it *should* work just fine :)

Yes, it should.  And the array does indeed work, but I'm plagued with a host
of other -unrelated- problems now. :-(

First, I had to upgrade from kernel 2.4 to 2.6 because I lacked a driver for
my SATA card.  That was quite complicated, as there is no official support
for running 2.6 kernels on Suse 9.0.  It also entailed migrating to lvm2, as
lvm1 is not part of 2.6.  But as it turns out, the 2.6 kernel I eventually
installed somehow does not initialize my bttv cards correctly, and ALSA has
problems too.  So now I'm reverting back to a 2.4 kernel which should support
SATA, version 2.4.28, which I'm building as we speak from vanilla kernel
sources...  But that has a lot of hurdles too, and I'd still have to find an
ALSA driver for that as well.  Worse, I fear that I will have to migrate back
to lvm1 now too, so that means copying all the 200 Gig of data _again_, which
by itself takes about 10 hours... :-(

Ah, if only I had bought ATA disks instead of SATA!

I've thought about putting the disks in another system and just using them
over NFS, but that would mean another system powered up 24/7, and that gets
to be a bit much.  Reinstalling from scratch with a 9.1 / 2.6 distro is a
worse option still, as mythtv carries such a bucket full of obscure
dependencies I'd hate to install all that again.

In other words, I'm not there yet, but at least that has little or nothing to
do with lvm or md.  But this does 'suck' a lot.

Maarten
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-08 16:33 UTC
To: 'maarten', linux-raid

Maarten,
I was thinking again!

You plan on using an 80 gig disk as a spare disk: 2 40 Gig partitions.  If
both spares end up in the same RAID5 array, that would be bad!  mdadm
supports spare groups; you should create 2 groups and put 1 spare partition
in each group.  Then put 1/2 of your RAID5 arrays in each group.

Guy
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 16:58 UTC
To: linux-raid

On Saturday 08 January 2005 17:33, Guy wrote:
> You plan on using an 80 gig disk as a spare disk: 2 40 Gig partitions.  If
> both spares end up in the same RAID5 array, that would be bad!  mdadm
> supports spare groups; you should create 2 groups and put 1 spare
> partition in each group.  Then put 1/2 of your RAID5 arrays in each group.

Good point.  But I was planning to monitor it a bit, so I suppose I'd notice
that, and add another disk to remedy it.  I just decommissioned 4 80GB drives
so there's plenty where they came from. ;)

...

As it turns out I do indeed need to kill the LVM2 array and downgrade to lvm1
yet again, because 2.4.28 seems to have no support for it.  Bummer.  That
isn't too bad; the raid arrays stay active so the long resyncs will not
happen, just the incredibly slow tar-over-netcat network backup session.
Before I do this I must make sure that the old data disks are still okay...

Maarten
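
For reference, that kind of tar-over-netcat copy usually looks something like
this (host name, port and paths are made up, and netcat option syntax differs
slightly between versions):

    # on the receiving machine:
    nc -l -p 7000 | tar -C /mnt/newdata -xpf -
    # on the sending machine:
    tar -C /mnt/olddata -cpf - . | nc newbox 7000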
* Re: Spares and partitioning huge disks
From: Frank van Maarseveen @ 2005-01-08 14:52 UTC
To: Guy
Cc: 'Mario Holbe', linux-raid

On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote:
> His plan is to split the disks into 6 partitions.
> Each of his six RAID5 arrays will only use 1 partition of each physical
> disk.
> If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed
> disk.
> If he gets 2 read errors, on different disks, at the same time, he has a
> 1/6 chance they would be in the same array (which would be bad).
> His plan is to combine the 6 arrays with LVM or a linear array.

Intriguing setup.  Do you think this actually improves the reliability with
respect to disk failure compared to creating just one large RAID5 array?
For a second I thought it was a clever trick, but gut feeling tells me the
odds of losing the entire array won't change (simplified, because the
increased complexity creates room for additional errors).

--
Frank
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-08 15:50 UTC
To: linux-raid

Frank van Maarseveen <frankvm@frankvm.com> wrote:
> Intriguing setup.  Do you think this actually improves the reliability
> with respect to disk failure compared to creating just one large RAID5

Well, there is this one special case where it's a bit more robust: sector
read errors.

> me the odds of losing the entire array won't change (simplified, because
> the increased complexity creates room for additional errors).

You don't do anything else with RAID1 or 5 or whatever: you add code to
reduce the impact of a single disk failure.  You add new points of failure
to reduce the impact of other points of failure.  In this case here, you add
code (the RAID0 or LVM code, whichever you like more) to reduce the impact
of two sector read errors on two disks.  Of course the new code can contain
new points of failure.

It's as always: know the risk and decide :)

regards,
   Mario
--
<jv> Oh well, config
<jv> one actually wonders what force in the universe is holding it
<jv> and makes it working
<Beeth> chances and accidents :)
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-08 16:32 UTC
To: 'Frank van Maarseveen'
Cc: 'Mario Holbe', linux-raid

I don't recall having 2 disks with read errors at the same time, but others
on this list have.  Correctable read errors are my most common problem with
my 14 disk array.  I think this partitioning approach will help.  But as you
say, it is more complicated, which adds some risk, I believe.  You can
compute the level of reduced risk, but you can't compute the level of
increased risk.

Some added risk: a more complicated setup increases user errors.
Example: Maarten plans to have 2 spare partitions on an extra disk.  Once he
corrects the read error on the failed partition, he needs to remove the
failed partition, fail the spare and add the original partition back to the
correct array.  He has a 6 times increased risk of choosing the wrong
partition to fail or remove.  Is that a 36 times increased risk of user
error?  Of course, the level of error may be negligible, depending on who
the user is.  But it is still an increase of risk.  There was at least 1 case
on this list where someone failed or removed the wrong disk from an array,
so it does happen.

If 6 partitions is 6 times better than 1, then 36 would be 6 times better
than 6.  Is there a sweet spot?

Also, I mentioned it before: don't combine the RAID5 arrays with RAID0.
Since the RAID5 arrays are on the same set of disks, the poor disk heads
will be flapping all over the place.  Use a linear array, or LVM.

Also, Neil has an item on his wish list to handle bad blocks.  Once this is
built into md, the 6 partition idea is useless.

I test my disks every night with a tool from Seagate.  I don't think I have
had a bad block since I started using this tool each night.  The tool is
free; it is called "SeaTools Enterprise Edition".  I assume it only works
with Seagate disks.

Guy
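
A poor man's version of that nightly check, for disks without a vendor tool,
is simply to read every sector and let any errors show up in the kernel log
(a sketch; adjust the device list, and note it costs a full surface read of
each disk):

    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        dd if=$d of=/dev/null bs=1M || echo "read problem on $d"
    done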
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 17:16 UTC
To: linux-raid

On Saturday 08 January 2005 17:32, Guy wrote:
> I don't recall having 2 disks with read errors at the same time, but
> others on this list have.  Correctable read errors are my most common
> problem with my 14 disk array.  I think this partitioning approach will
> help.  But as you say, it is more complicated, which adds some risk, I
> believe.  You can compute the level of reduced risk, but you can't
> compute the level of increased risk.

True.  Especially since LVM is completely new to me.

> Some added risk: a more complicated setup increases user errors.

I have confidence in myself (knock, knock).  I triple-check every action I do
against the output of 'cat /proc/mdstat' before hitting [enter], so as not to
make thinking errors like using hdf5 instead of hde6, and similar mistakes.
I'm paranoid by nature, so that helps, too ;-)

> Example: Maarten plans to have 2 spare partitions on an extra disk.  Once
> he corrects the read error on the failed partition, he needs to remove the
> failed partition, fail the spare and add the original partition back to
> the correct array.

You must mean in the other order.  If I fail the spare first, I'm toast! ;-)

> He has a 6 times increased risk of choosing the wrong partition to fail
> or remove.  Is that a 36 times increased risk of user error?  Of course,
> the level of error may be negligible, depending on who the user is.  But
> it is still an increase of risk.

First of all you need to make everything as uniform as possible, meaning all
disks belonging to array md3 are numbered hdX6, all of md4 are hdX7, etc.
I suppose this goes without saying for most people here, but it helps a LOT.

> If 6 partitions is 6 times better than 1, then 36 would be 6 times better
> than 6.  Is there a sweet spot?

Heh.  Somewhere between 1 and 36, I'd bet. :)

> Also, Neil has an item on his wish list to handle bad blocks.  Once this
> is built into md, the 6 partition idea is useless.

I know, but I'm not going to wait for that.  For now I have limited options.
Mine has not only the benefits outlined, but also the benefit of being able
to use an older disk as a spare.  I guess having this with a spare beats
having one huge array without a spare.  Or else I'd need to buy yet another
250GB drive, and they're not really 'dirt cheap', if you know what I mean.

> I test my disks every night with a tool from Seagate.  I don't think I
> have had a bad block since I started using this tool each night.  The
> tool is free; it is called "SeaTools Enterprise Edition".  I assume it
> only works with Seagate disks.

That's interesting.  Is that an _online_ test, or do you stop the array every
night?  The latter would seem quite error-prone by itself already, and the
former... well, I don't suppose Seagate supports Linux, really.

Maarten
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-08 18:55 UTC
To: 'maarten', linux-raid

My warning about user error was not targeted at you! :)
Sorry if it seemed so.

And the order does not matter!

A:
  Remove the failed disk.
  Fail the spare.
  System is degraded.
  Add the failed/repaired disk.
  Rebuild starts.

B:
  Remove the failed disk.
  Add the failed/repaired disk.
  Fail the spare.
  System is degraded.
  Rebuild starts.

Both A and B above require the array to go degraded until the repaired disk
is rebuilt.  But with A, the longer you delay adding the repaired disk, the
longer you are degraded.  In my case, that would be less than 1 minute.  I do
fail the spare last, but it is not really much of an issue.  No toast anyway!

It would be cool if the rebuild to the repaired disk could be done before the
spare was failed or removed.  Then the array would not be degraded at all.

If I ever re-build my system, or build a new system, I hope to use RAID6.

The Seagate test is on-line.  Before I started using the Seagate tool, I used
dd.

My disks claim to be able to re-locate bad blocks on read error.  But I am
not sure if this applies to correctable errors or not.  If non-correctable
errors are re-located, what data does the drive return?  Since I don't know,
I don't use this option.  I did use this option for a while, but after
re-reading about it, I got concerned and turned it off.

This is from the readme file:

    Automatic Read Reallocation Enable (ARRE)
    -Marre on/off     enable/disable ARRE bit
        On, drive automatically relocates bad blocks detected during read
        operations.  Off, drive creates Check condition status with sense
        key of Medium Error if bad blocks are detected during read
        operations.

Guy
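
Procedure B, written out with hypothetical names (say sdb4 failed out of md3,
the spare hde5 took over and has resynced, and sdb4 has since been repaired):

    mdadm /dev/md3 --remove /dev/sdb4    # drop the failed member
    mdadm /dev/md3 --add /dev/sdb4       # re-add it; it comes back as a spare
    mdadm /dev/md3 --fail /dev/hde5      # array degrades, rebuild onto sdb4 starts
    mdadm /dev/md3 --remove /dev/hde5    # once rebuilt, return hde5 to spare duty
    mdadm /dev/md3 --add /dev/hde5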
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 19:25 UTC
To: linux-raid

On Saturday 08 January 2005 19:55, you wrote:
> My warning about user error was not targeted at you! :)
> Sorry if it seemed so.

:-)

> And the order does not matter!

Hm... yes, you're right.  But adding the disk is more prudent (or is it?)
Grr.  Now you've got ME thinking! ;-)

Normally, the minute a drive fails, it gets kicked, the spare kicks in and md
syncs this spare.  We now have a non-degraded array again.  If I then fail
the spare first, the array goes into degraded mode.  Whereas if I hot-add the
disk, it becomes a spare.  Presumably if I now fail the original spare, the
real disk will get synced again, to get the same setup as before.  But yes,
you're right; during this step it is degraded again.  Oh well...

> It would be cool if the rebuild to the repaired disk could be done before
> the spare was failed or removed.  Then the array would not be degraded at
> all.

Yes, but this would be impossible to do, since md cannot anticipate _which_
disk you're going to fail before it happens. ;)

> If I ever re-build my system, or build a new system, I hope to use RAID6.

I tried this last fall, but it didn't work out then.  See the list archives.

> The Seagate test is on-line.  Before I started using the Seagate tool, I
> used dd.

I'm not as cautious as you are.  I just pray the hot spare does what it's
supposed to do.

> My disks claim to be able to re-locate bad blocks on read error.  But I am
> not sure if this applies to correctable errors or not.  If non-correctable
> errors are re-located, what data does the drive return?  Since I don't
> know, I don't use this option.  I did use this option for a while, but
> after re-reading about it, I got concerned and turned it off.

Afaik, if a drive senses it gets more 'difficult' than usual to read a
sector, it will automatically copy it to a spare sector and reassign it.
However, I doubt the OS gets any wiser when this happens, so neither would
md.  In which cases the error gets noticed by md I don't precisely know, but
I reckon that may well be when the error is uncorrectable.  Not
_undetectable_, to quote from another thread... 8-)

> This is from the readme file:
>     Automatic Read Reallocation Enable (ARRE)
>     -Marre on/off     enable/disable ARRE bit
>         On, drive automatically relocates bad blocks detected during read
>         operations.  Off, drive creates Check condition status with sense
>         key of Medium Error if bad blocks are detected during read
>         operations.

Hm.  I would definitely ENable that option.  But what do I know.  It also
depends, I guess, on how fatal reading bad data undetected is for you.  For
me, if one of my mpegs or mp3s develops a bad sector, I can probably live
with that. :-)

Maarten
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-08 20:33 UTC
To: linux-raid

maarten <maarten@ultratux.net> wrote:
> On Saturday 08 January 2005 19:55, you wrote:
>> My disks claim to be able to re-locate bad blocks on read error.  But I
>> am not sure if this applies to correctable errors or not.  If
>> non-correctable errors are re-located, what data does the drive return?
>> Since I don't know, I
...
> Afaik, if a drive senses it gets more 'difficult' than usual to read a
> sector, it will automatically copy it to a spare sector and reassign it.
> However, I

No, this is usually not the case.  At least I don't know of IDE drives that
do so.  This is why I call it a 'sector read error'.

Each newer disk has some amount of 'spare sectors' which can be used to
relocate bad sectors.  Usually, you have two situations where you can detect
a bad sector:
1. if you write to it and this attempt fails, and
2. if you read from it and this attempt fails.

1. would require some verify operation, so I'm not sure if this is done at
all in the wild.
2. has a simple problem: if you get a read request for sector x and you
cannot read it, what data should you return?  The answer is simple: you
don't return data but an error (the read error).  Additionally, you mark the
sector as bad and relocate the next write request for that sector to some
spare sector, and further read requests then too.  However, you still have
to respond with error messages to each subsequent read request before the
first relocated write request appears.

And afaik this is what current disks do.  That's why you can just re-sync
the failed disk to the array again without any problem: because the write
request happens then, the relocation takes place, and everything's fine.

regards,
   Mario
--
The social dynamics of the net are a direct consequence of the fact that
nobody has yet developed a Remote Strangulation Protocol.  -- Larry Wall
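
This also suggests the usual manual fix when the kernel log names a specific
bad sector: overwrite just that sector, so the drive gets its write and can
reallocate.  The data in that sector is lost, but re-adding the disk to a
redundant array restores it during the rebuild.  A sketch (device and sector
number are only examples, and this is destructive to that one sector):

    # sector 123456 was reported unreadable on /dev/sdb (512-byte sectors)
    dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=123456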
* Re: Spares and partitioning huge disks
From: maarten @ 2005-01-08 23:01 UTC
To: linux-raid

On Saturday 08 January 2005 21:33, Mario Holbe wrote:
> maarten <maarten@ultratux.net> wrote:
> > Afaik, if a drive senses it gets more 'difficult' than usual to read a
> > sector, it will automatically copy it to a spare sector and reassign
> > it.  However, I
>
> No, this is usually not the case.  At least I don't know of IDE drives
> that do so.  This is why I call it a 'sector read error'.

Do you mean SCSI ones do?  If so, I thought the difference in firmware
intelligence between ATA and SCSI vanished long ago.

> Each newer disk has some amount of 'spare sectors' which can be used to
> relocate bad sectors.  Usually, you have two situations where you can
> detect a bad sector:
> 1. if you write to it and this attempt fails, and
> 2. if you read from it and this attempt fails.

Hm.  I'm not extremely well versed in modern drive technology, but
nevertheless: how I understood it is somewhat different, namely:

1. If you write to it and that fails, the drive will allocate a spare
   sector.  From that we [should be] able to conclude that if you get a
   write failure, the drive ran out of spare sectors.  (Is that a fact, or
   not??)

2. If you read from it, the drive's firmware will see an error and:
   2a: retry the read a couple more times, succeed, copy that to a spare
       sector and reallocate, OR
   2b: retry the read, fail miserably despite that, and (only then) signal
       a read error to the host.

I've heard for a long time that drives are much more sophisticated than
before, retrying failed reads.  They can try to read 'off-track' (off-axis)
and do other things that were impossible when stepping motors were still
used.  But that was more than 10 years ago; now they all have coil-actuated
heads.  In other words, drives don't wait till the sector is really
unreadable, they'll reallocate at the first sign of trouble (decaying signal
strength, spurious CRC errors, stuff like that).

This is also suggested by the observable behaviour of drive and OS; if a
reallocation would only occur after the fact, i.e. when the data is beyond
salvaging, then every sector reallocation would by definition lead to
corrupt data in that file.  Generally speaking (since there are so many
spare sectors) an OS would die very soon as all its files / libs / DLLs got
corrupted due to the reallocation (which is supposed to be transparent to
the host, only the drive knows).  But... I have no solid proof of this,
other than reasoning like this.

> 1. would require some verify operation, so I'm not sure if this is done
> at all in the wild.
> 2. has a simple problem: if you get a read request for sector x and you
> cannot read it, what data should you return?  The answer is simple: you
> don't return data but an error (the read error).  Additionally, you mark
> the sector as bad and relocate the next write request for that sector to
> some spare sector, and further read requests then too.  However, you
> still have to respond with error messages to each subsequent read request
> before the first relocated write request appears.
>
> And afaik this is what current disks do.  That's why you can just re-sync
> the failed disk to the array again without any problem: because the write
> request happens then, the relocation takes place, and everything's fine.

So basically what you're saying is that reallocation _only_ happens on
_writes_?  Hm.  Maybe, I don't know...  The problem with my theory is that
if it is true, then that automatically means that whenever md gets a read
error, the data is indeed gone.

Or maybe that isn't a problem, since the disk gets kicked, and afterwards
during resync the reallocation pays off.  Yeah.  That must be it. :-)

Maarten
* Re: Spares and partitioning huge disks
From: Mario Holbe @ 2005-01-09 10:10 UTC
To: linux-raid

maarten <maarten@ultratux.net> wrote:
> Do you mean SCSI ones do?  If so, I thought the difference in firmware
> intelligence between ATA and SCSI vanished long ago.

I don't think SCSI ones do so.  However, I don't know many SCSI drives, and
thus I limited my sentence to IDE drives :)

> 1. If you write to it and that fails, the drive will allocate a spare
>    sector.

As I said earlier:
>> 1. would require some verify operation, so I'm not sure if this is done
>> at all in the wild.
A verify would take time and therefore I think this is not done.  Btw: *if*
it were done, write speed to disks should be read-speed/2 or smaller, but
usually it isn't.

> From that we [should be] able to conclude that if you get a write failure,
> the drive ran out of spare sectors.  (Is that a fact, or not??)

Yes, this is a fact.

> So basically what you're saying is that reallocation _only_ happens on
> _writes_?  Hm.  Maybe, I don't know...

What I'm saying is: bad sectors are _only_ detected on reads, and
reallocations only happen on writes, yes.

> Or maybe that isn't a problem, since the disk gets kicked, and afterwards
> during resync the reallocation pays off.  Yeah.  That must be it. :-)

This is what I said, yes :)

regards,
   Mario
--
() Ascii Ribbon Campaign
/\ Support plain text e-mail
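
One way to watch this happening without vendor tools, if smartmontools is
installed, is to check the drive's reallocation counters and run its
built-in self-test (a sketch for an ATA disk; the device name is an
example):

    smartctl -A /dev/hda | egrep -i 'Reallocated_Sector|Current_Pending'
    smartctl -t long /dev/hda      # start an offline surface scan
    smartctl -l selftest /dev/hda  # check the result later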
* RE: Spares and partitioning huge disks
From: Guy @ 2005-01-09 16:23 UTC
To: 'Mario Holbe', linux-raid

Bad sectors are detected on write.  There are 5 wires going to each of the
read/write heads on my disk drives.  I think each head can read after write
in 1 pass.  My specs say it re-maps on write failure and read failure.  Both
are optional.  But, I don't know if this is normal or not.  My disks are
Seagate ST118202LC, 10,000 RPM, 18 Gig SCSI.

Guy
* Re: Spares and partitioning huge disks
From: Michael Tokarev @ 2005-01-09 16:36 UTC
To: linux-raid

Guy wrote:
> Bad sectors are detected on write.  There are 5 wires going to each of the
> read/write heads on my disk drives.  I think each head can read after
> write in 1 pass.  My specs say it re-maps on write failure and read
> failure.  Both are optional.  But, I don't know if this is normal or not.
> My disks are Seagate ST118202LC, 10,000 RPM, 18 Gig SCSI.

Bad sectors can be detected both on write and on read.  Unfortunately, most
of the time it will be a *read* error: it is quite possible for the drive to
perform a read-check after write, and the data may be ok at that time, but
not when you want to read it a month later...

I think all modern drives support bad block remapping on both read and
write.  But think about it: if there's a read error, it means the drive CAN
NOT read the "right" data for some reason (for some definition of "right",
anyway), i.e. the drive "knows" there's some problem with the data and it
can't completely reconstruct what was written to the block before.  In this
case, while remapping the block in question helps to avoid further errors in
this block, it does NOT help to restore the data which the drive can't read.
And it is surely not an option in this case to report that the read was
successful and pass, say, a zero-filled block to the controller... ;)

/mjt
* Re: Spares and partitioning huge disks
From: Peter T. Breuer @ 2005-01-09 17:52 UTC
To: linux-raid

Michael Tokarev <mjt@tls.msk.ru> wrote:
> I think all modern drives support bad block remapping on both read and
> write.  But think about it: if there's a read error, it means the drive
> CAN NOT read the "right" data for some reason (for some definition of
> "right", anyway), i.e. the drive "knows" there's some problem with the
> data and it can't

I really don't want RAID to fault the disk offline in this case.  I want
RAID to read from the other disk(s) instead, and rewrite the data on the
disk that gave the fail notice on that sector, and if that gives no error,
then just carry on and be happy ...

Peter
* Re: Spares and partitioning huge disks
From: Michael Tokarev @ 2005-01-09 17:59 UTC
To: linux-raid

Peter T. Breuer wrote:
> Michael Tokarev <mjt@tls.msk.ru> wrote:
>> I think all modern drives support bad block remapping on both read and
>> write.  But think about it: if there's a read error, it means the drive
>> CAN NOT read the "right" data for some reason (for some definition of
>> "right", anyway), i.e. the drive "knows" there's some problem with the
>> data and it can't
>
> I really don't want RAID to fault the disk offline in this case.  I want
> RAID to read from the other disk(s) instead, and rewrite the data on the
> disk that gave the fail notice on that sector, and if that gives no
> error, then just carry on and be happy ...

There were some patches posted to this list some time ago that try to do
just that (or a discussion... I don't remember).  Yes, the md code currently
doesn't do such things, and fails a drive after the first error; it's the
simplest way to go ;)

/mjt
* Re: Spares and partitioning huge disks 2005-01-09 17:59 ` Michael Tokarev @ 2005-01-09 18:34 ` Peter T. Breuer 2005-01-09 20:28 ` Guy 1 sibling, 0 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-09 18:34 UTC (permalink / raw) To: linux-raid Michael Tokarev <mjt@tls.msk.ru> wrote: > There where some patches posted to this list some time ago that tries to > do just that (or a discussion.. i don't remember). Yes, md code currently > doesn't do such things, and fails a drive after the first error -- it's > the simplest way to go ;) There would be two things to do (for raid1): 1) make the raid1_end_request code notice a failure on READ, but not panic, simply resubmit the i/o to another mirror (it has to count "tries") and only give up after the last try has failed. 2) hmmm .. is there a 2)? Well, maybe. Perhasp check that read errors per bio (as opposed to per request) don't fault the disk to the upper layers .. I don't think they can. And possible arrange for 2 read bios to be prepared but only one to be sent, and discard the second if the first succeeds, or try it if the first fails. Actually, looking at raid1_end_request, it looks as though it does try again: if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { ... /* * we have only one bio on the read side */ if (uptodate) raid_end_bio_io(r1_bio); else { /* * oops, read error: */ char b[BDEVNAME_SIZE]; printk(KERN_ERR "raid1: %s: rescheduling sector %llu\n", bdevname(conf->mirrors[mirror].rdev->bdev,b), (unsigned long long)r1_bio->sector); reschedule_retry(r1_bio); } } But does reschedule_retry try a different disk? Anyway, there is maybe a mistake in this code because we decrement the number of outsanding reads in all cases: atomic_dec(&conf->mirrors[mirror].rdev->nr_pending); return 0; but if the read is retried it should not be unpended yet! Well, that depends on your logic .. I suppose that morally the request should be unpended, but not the read, which is still pending. And I seem to remember that nr_pending is to tell the raid layers if we are in use or not, so I don't think we want to unpend here. Well, reschedule_retry does try the same read again: static void reschedule_retry(r1bio_t *r1_bio) { unsigned long flags; mddev_t *mddev = r1_bio->mddev; spin_lock_irqsave(&retry_list_lock, flags); list_add(&r1_bio->retry_list, &retry_list_head); spin_unlock_irqrestore(&retry_list_lock, flags); md_wakeup_thread(mddev->thread); } So it adds the whole read request (using the master, not the bio that failed) onto a retry list. Maybe that list will be checked for nonemptiness, which solves the nr_pending problem. It looks like a separate kernel thread (raid1d) does the retries. And bless me but if it doesn't try and send the read elsewhere ... case READ: case READA: if (map(mddev, &rdev) == -1) { printk(KERN_ALERT "raid1: %s: unrecoverable I/O" " read error for block %llu\n", bdevname(bio->bi_bdev,b), (unsigned long long)r1_bio->sector); raid_end_bio_io(r1_bio); break; } Not sure what that is about (?? disk is not in array?), but the next bit is clear: printk(KERN_ERR "raid1: %s: redirecting sector %llu to" " another mirror\n", bdevname(rdev->bdev,b), (unsigned long long)r1_bio->sector); So it will try and redirect. It rewrites the target of the bio: bio->bi_bdev = rdev->bdev; .. bio->bi_sector = r1_bio->sector + rdev->data_offset; It resets the offset in case it is different for this disk. bio->bi_rw = r1_bio->cmd; Dunno why it needs to do that. It should be unchanged. generic_make_request(bio); And submit. 
break;

So it looks to me as though reads ARE redirected. It would be trivial to do a write on the failed disk too.

Peter

^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-09 17:59 ` Michael Tokarev 2005-01-09 18:34 ` Peter T. Breuer @ 2005-01-09 20:28 ` Guy 2005-01-09 20:47 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Guy @ 2005-01-09 20:28 UTC (permalink / raw) To: 'Michael Tokarev', linux-raid It is on Neil's wish list (or to do list)! Mine too! From Neil Brown: http://marc.theaimsgroup.com/?l=linux-raid&m=110055742813074&w=2 Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Michael Tokarev Sent: Sunday, January 09, 2005 1:00 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Peter T. Breuer wrote: > Michael Tokarev <mjt@tls.msk.ru> wrote: > >>I think all modern drives support bad block remapping on both read and write. >>But think about it: if there's a read error, it means the drive CAN NOT read >>the "right" data for some reason (for some definition of "right" anyway) -- >>ie, the drive "knows" there's some problem with the data and it can't > > I really don't want RAID to fault the disk offline in this case. I want > RAID to read from the other disk(s) instead, and rewrite the data on the > disk that gave the fail notice on that sector, and if that gives no error, > then just carry on and be happy ... There where some patches posted to this list some time ago that tries to do just that (or a discussion.. i don't remember). Yes, md code currently doesn't do such things, and fails a drive after the first error -- it's the simplest way to go ;) /mjt - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-09 20:28 ` Guy @ 2005-01-09 20:47 ` Peter T. Breuer 2005-01-10 7:19 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-09 20:47 UTC (permalink / raw) To: linux-raid

Guy <bugzilla@watkins-home.com> wrote:
> It is on Neil's wish list (or to do list)! Mine too!

What is? Can you please be specific?

> From Neil Brown:
>
> http://marc.theaimsgroup.com/?l=linux-raid&m=110055742813074&w=2

If you are talking about (and I am guessing, thanks to the uniform sensation of opaque experientiality that passes over me when I see the format of your posts, or the lack of it) reading from the other disk when one sector read fails on the first, that appears to be in 2.6.3 at least, as my reading of the code goes.

What Neil says in your reference (that you can let the kernel kick out a drive that has a read error, let user-space have a quick look at the drive and see if it might be a recoverable error, and then give the drive back to the kernel) is true.

As far as I can see from a quick scan of the (raid1) code, he DOES kick a disk out on read error, but also DOES RETRY the read from another disk for that sector. Currently he does that in the resync thread.

He needs a list of failed reads and only needs to kick the disk when recovery fails. At the present time it is trivial to add a write as well as a read on a retry. I can add the read accounting.

Neil's comments indicate that he is interested in doing this in a generic way. So am I, but I'll settle for "non-generic" first.

Peter

^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-09 20:47 ` Peter T. Breuer @ 2005-01-10 7:19 ` Peter T. Breuer 2005-01-10 9:05 ` Guy 2005-01-10 12:31 ` Peter T. Breuer 0 siblings, 2 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 7:19 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > DOES kick a disk out on read error, but also DOES RETRY the read from > another disk for that sector. Currently he does that in the resync > thread. > > He needs a list of failed reads and only needs to kick the disk when > recovery fails. Well, here is a patch to at least stop the array (RAID 1) being failed until all possible read sources have been exhausted for the sector in question. It's untested - I only checked that it compiles. The idea here is to modify raid1.c so that 1) in make_request, on read (as well as on write, where we already do it) we set the master bios "remaining" count to the number of viable disks in the array. That's the third of the three hunks in the patch below and is harmless unless somebody somewhere already uses the "remaining" field in the read branch. I don't see it if so. 2) in raid1_end_request, I pushed the if (!uptodate) test which faults the current disk out of the array down a few lines (past no code at all, just a branch test for READ or WRITE) and copied it into both the start of the READ and WRITE branches of the code. That shows up very badly under diff, which makes it look as though I did something else entirely. But that's all, and that is harmless. This is the first two hunks of the patch below. Diff makes it look as though I moved the branch UP, but I moved the code before the branch DOWN. After moving the faulting code into the two branches, in the READ branch ONLY I weakened the condition that faulted the disk from "if !uptodate" to "if !uptodate and there is no other source to try". That's probably harmless in itself, modulo accounting questions - there might be things like nr_pending still to tweak. This leaves things a bit unfair - "don't come any closer or the hacker gets it". The LAST disk that fails a read, in case all disks fail to read on that sector, gets ejected from the array. But which it is is random, depending on the order we try (anyone know if the rechedule_retry call is "fair" in the technical sense?). In my opinion no disk should ever be ejected from the array in these cicumstances - it's just a read error produced by the array as a whole and we have already done our bestto avoid it and can do on more. In a strong sense, as it sems to me, "error is the correct read result". I've marked what line to comment with /* PTB ... */ in the patch. --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 07:39:38 2005 @@ -354,9 +354,15 @@ /* * this branch is our 'one mirror IO has finished' event handler: */ - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + update_head_pos(mirror, r1_bio); + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { + if (!uptodate +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + && atomic_dec_and_test(&r1_bio->remaining) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + ) { /* PTB remove next line to be much fairer! 
*/ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else /* * Set R1BIO_Uptodate in our master bio, so that * we will return a good error code for to the higher @@ -368,8 +374,6 @@ */ set_bit(R1BIO_Uptodate, &r1_bio->state); - update_head_pos(mirror, r1_bio); - if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { if (!r1_bio->read_bio) BUG(); /* @@ -387,6 +391,20 @@ reschedule_retry(r1_bio); } } else { + if (!uptodate) + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + else + /* + * Set R1BIO_Uptodate in our master bio, so that + * we will return a good error code for to the higher + * levels even if IO on some other mirrored buffer fails. + * + * The 'master' represents the composite IO operation to + * user-side. So if something waits for IO, then it will + * wait for the 'master' bio. + */ + set_bit(R1BIO_Uptodate, &r1_bio->state); + if (r1_bio->read_bio) BUG(); @@ -708,6 +726,19 @@ read_bio->bi_end_io = raid1_end_request; read_bio->bi_rw = r1_bio->cmd; read_bio->bi_private = r1_bio; + +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + atomic_set(&r1_bio->remaining, 0); + /* select target devices under spinlock */ + spin_lock_irq(&conf->device_lock); + for (i = 0; i < disks; i++) { + if (conf->mirrors[i].rdev && + !conf->mirrors[i].rdev->faulty) { + atomic_inc(&r1_bio->remaining); + } + } + spin_unlock_irq(&conf->device_lock); +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ generic_make_request(read_bio); return 0; Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-10 7:19 ` Peter T. Breuer @ 2005-01-10 9:05 ` Guy 2005-01-10 9:38 ` Peter T. Breuer 2005-01-10 12:31 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Guy @ 2005-01-10 9:05 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid This confuses me! A RAID1 array does not fail on a read error, unless the read error is on the only disk. Maybe you have found a bug? Were you able to cause an array to fail by having 1 disk give a read error? Or are you just preventing a single read error from kicking a disk? I think this is what you are trying to say, if so, it has value. Based on the code below, I think you are not referring to failing the array, but failing a disk. Would be nice to then attempt to correct the read error(s). Also, log the errors. Else the array could continue to degrade until finally the same block is bad on all devices. You said if all disks get a read error the last disks is kicked. What data is returned to the user? Normally, the array would go off-line. But since you still have 1 or more disks in the array, it is a new condition. My guess is that you have not given this enough thought. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Monday, January 10, 2005 2:19 AM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > DOES kick a disk out on read error, but also DOES RETRY the read from > another disk for that sector. Currently he does that in the resync > thread. > > He needs a list of failed reads and only needs to kick the disk when > recovery fails. Well, here is a patch to at least stop the array (RAID 1) being failed until all possible read sources have been exhausted for the sector in question. It's untested - I only checked that it compiles. The idea here is to modify raid1.c so that 1) in make_request, on read (as well as on write, where we already do it) we set the master bios "remaining" count to the number of viable disks in the array. That's the third of the three hunks in the patch below and is harmless unless somebody somewhere already uses the "remaining" field in the read branch. I don't see it if so. 2) in raid1_end_request, I pushed the if (!uptodate) test which faults the current disk out of the array down a few lines (past no code at all, just a branch test for READ or WRITE) and copied it into both the start of the READ and WRITE branches of the code. That shows up very badly under diff, which makes it look as though I did something else entirely. But that's all, and that is harmless. This is the first two hunks of the patch below. Diff makes it look as though I moved the branch UP, but I moved the code before the branch DOWN. After moving the faulting code into the two branches, in the READ branch ONLY I weakened the condition that faulted the disk from "if !uptodate" to "if !uptodate and there is no other source to try". That's probably harmless in itself, modulo accounting questions - there might be things like nr_pending still to tweak. This leaves things a bit unfair - "don't come any closer or the hacker gets it". The LAST disk that fails a read, in case all disks fail to read on that sector, gets ejected from the array. But which it is is random, depending on the order we try (anyone know if the rechedule_retry call is "fair" in the technical sense?). 
In my opinion no disk should ever be ejected from the array in these cicumstances - it's just a read error produced by the array as a whole and we have already done our bestto avoid it and can do on more. In a strong sense, as it sems to me, "error is the correct read result". I've marked what line to comment with /* PTB ... */ in the patch. --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 07:39:38 2005 @@ -354,9 +354,15 @@ /* * this branch is our 'one mirror IO has finished' event handler: */ - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + update_head_pos(mirror, r1_bio); + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { + if (!uptodate +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + && atomic_dec_and_test(&r1_bio->remaining) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + ) { /* PTB remove next line to be much fairer! */ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else /* * Set R1BIO_Uptodate in our master bio, so that * we will return a good error code for to the higher @@ -368,8 +374,6 @@ */ set_bit(R1BIO_Uptodate, &r1_bio->state); - update_head_pos(mirror, r1_bio); - if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { if (!r1_bio->read_bio) BUG(); /* @@ -387,6 +391,20 @@ reschedule_retry(r1_bio); } } else { + if (!uptodate) + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + else + /* + * Set R1BIO_Uptodate in our master bio, so that + * we will return a good error code for to the higher + * levels even if IO on some other mirrored buffer fails. + * + * The 'master' represents the composite IO operation to + * user-side. So if something waits for IO, then it will + * wait for the 'master' bio. + */ + set_bit(R1BIO_Uptodate, &r1_bio->state); + if (r1_bio->read_bio) BUG(); @@ -708,6 +726,19 @@ read_bio->bi_end_io = raid1_end_request; read_bio->bi_rw = r1_bio->cmd; read_bio->bi_private = r1_bio; + +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + atomic_set(&r1_bio->remaining, 0); + /* select target devices under spinlock */ + spin_lock_irq(&conf->device_lock); + for (i = 0; i < disks; i++) { + if (conf->mirrors[i].rdev && + !conf->mirrors[i].rdev->faulty) { + atomic_inc(&r1_bio->remaining); + } + } + spin_unlock_irq(&conf->device_lock); +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ generic_make_request(read_bio); return 0; Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 9:05 ` Guy @ 2005-01-10 9:38 ` Peter T. Breuer 0 siblings, 0 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 9:38 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > Well, here is a patch to at least stop the array (RAID 1) being failed > > until all possible read sources have been exhausted for the sector in > > question. It's untested - I only checked that it compiles. > A RAID1 array does not fail on a read error, unless the read error is on the > only disk. I'm sorry, I meant "degraded", not "failed", when I wrote that summary. To clarify, the patch stops the mirror disk in question being _faulted_ out of the array when a sector read _fails_ on the disk. The read is instead retried on another disk (as is the case at present in the standard code, if I recall correctly - the patch only stops the current disk also being faulted while the retry is scheduled). In addition I pointed to what line to comment to stop any disk being ever faulted at all on a read error, which ("not faulting") in my opinion is more correct. The reasoning is that either we try all disks and succeed on one, in which case there is nothing to mention to anybody, or we succeed on none and there really is an error in that position in the array, on all disks, and that's the right thing to say. What happens on recovery is another question. There may be scattered error blocks. I would also like to submit a write to the dubious sectors, from the readable disk, once we have found it. > Maybe you have found a bug? There are bugs, but that is not one of them. If you want to check the patch, check to see if schedule_retry moves the current target of the bio to another disk in a fair way. I didn't check. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 7:19 ` Peter T. Breuer 2005-01-10 9:05 ` Guy @ 2005-01-10 12:31 ` Peter T. Breuer 2005-01-10 13:19 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 12:31 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 > +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 07:39:38 2005 > @@ -354,9 +354,15 @@ > /* > * this branch is our 'one mirror IO has finished' event handler: > */ > - if (!uptodate) > - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > - else > + update_head_pos(mirror, r1_bio); > + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { > + if (!uptodate > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + && atomic_dec_and_test(&r1_bio->remaining) > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > + ) { /* PTB remove next line to be much fairer! */ > + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > + } else Hmm ... I must be crackers at 7.39 in the morning. Surely if the bio is not uptodate but the read attampt's time is not yet up, we don't want to tell the master bio that the io was successful (the "else")! That should have read "if if", not "if and". I.e. - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + update_head_pos(mirror, r1_bio); + if ((r1_bio->cmd == READ) || (r1_bio->cmd == READA)) { + if (!uptodate) { +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + if (atomic_dec_and_test(&r1_bio->remaining)) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + /* PTB remove next line to be much fairer! */ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else So if the bio is not uptodate we just drop through (after decrementing the count on the master) into the existing code which checks this bio uptodateness and sends a retry if it is not good. Yep. I'll send out another patch later, with rewrite on read fail too. When I've woken up. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 12:31 ` Peter T. Breuer @ 2005-01-10 13:19 ` Peter T. Breuer 2005-01-10 18:37 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 13:19 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > Hmm ... I must be crackers at 7.39 in the morning. Surely if the bio Perhaps this is more obviously correct (or less obviously incorrect). Same rationale as before. Detailed reasoning after lunch. This patch is noticably less invasive, less convoluted. See embedded comments. --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 14:05:46 2005 @@ -354,9 +354,15 @@ /* * this branch is our 'one mirror IO has finished' event handler: */ - if (!uptodate) - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); - else + if (!uptodate) { +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + /* + * Only fault disk out of array on write error, not read. + */ + if (r1_bio->cmd == WRITE) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); + } else /* * Set R1BIO_Uptodate in our master bio, so that * we will return a good error code for to the higher @@ -375,7 +381,12 @@ /* * we have only one bio on the read side */ - if (uptodate) + if (uptodate +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + /* Give up and error if we're last */ + || atomic_dec_and_test(&r1_bio->remaining) +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ + ) raid_end_bio_io(r1_bio); else { /* @@ -708,6 +720,18 @@ read_bio->bi_end_io = raid1_end_request; read_bio->bi_rw = r1_bio->cmd; read_bio->bi_private = r1_bio; +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT + atomic_set(&r1_bio->remaining, 0); + /* count source devices under spinlock */ + spin_lock_irq(&conf->device_lock); + for (i = 0; i < disks; i++) { + if (conf->mirrors[i].rdev && + !conf->mirrors[i].rdev->faulty) { + atomic_inc(&r1_bio->remaining); + } + } + spin_unlock_irq(&conf->device_lock); +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ generic_make_request(read_bio); return 0; ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 13:19 ` Peter T. Breuer @ 2005-01-10 18:37 ` Peter T. Breuer 2005-01-11 11:34 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-10 18:37 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > Hmm ... I must be crackers at 7.39 in the morning. Surely if the bio > > Perhaps this is more obviously correct (or less obviously incorrect). So I'll do the commentary for it now. The last hunk of this three hunk patch is the easiest to explain: > --- linux-2.6.3/drivers/md/raid1.c.orig Tue Dec 28 00:39:01 2004 > +++ linux-2.6.3/drivers/md/raid1.c Mon Jan 10 14:05:46 2005 > @@ -708,6 +720,18 @@ > read_bio->bi_end_io = raid1_end_request; > read_bio->bi_rw = r1_bio->cmd; > read_bio->bi_private = r1_bio; > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + atomic_set(&r1_bio->remaining, 0); > + /* count source devices under spinlock */ > + spin_lock_irq(&conf->device_lock); > + for (i = 0; i < disks; i++) { > + if (conf->mirrors[i].rdev && > + !conf->mirrors[i].rdev->faulty) { > + atomic_inc(&r1_bio->remaining); > + } > + } > + spin_unlock_irq(&conf->device_lock); > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > > generic_make_request(read_bio); > return 0; > That simply adds to the raid1 make_request code in the READ branch the same stanza that appears in the WRITE branch already, namely a calculation of how many working disks there are in the array, which is put into the "remaining" field of the raid1 master bio being set up. So we put the count of valid disks in the "remaining" field during construction of a raid1 read bio. If I am off by one, I apologize. The write size code starts the count at 1 instead of 0, and I don't know why. If anyone wants to see the WRITE side equivalent, it goes: for (i = 0; i < disks; i++) { if (conf->mirrors[i].rdev && !conf->mirrors[i].rdev->faulty) { ... r1_bio->write_bios[i] = bio; } else r1_bio->write_bios[i] = NULL; } atomic_set(&r1_bio->remaining, 1); for (i = 0; i < disks; i++) { if (!r1_bio->write_bios[i]) continue; ... atomic_inc(&r1_bio->remaining); generic_make_request(mbio); } so I reckon that's equivalent, apart from the off-by-one. Explain me somebody. In the end_request code, simply, instead of erroring the current disk out of the array whenever an error happens, do it only if a WRITE is being handled. We still won't mark the request uptodate as that's in the else part of the if !uptodate, where we don't touch. That's the first hunk here. The second hunk is in the same routine, but down in the READ side of the code split, further on. We finish the request not only if we are utodate (success), but also if we are not uptodate but we are plain out of disks to try and read from (so the request will be errored since it is not marked yuptodate still). We decrement the "remaining" count in the test. > @@ -354,9 +354,15 @@ > /* > * this branch is our 'one mirror IO has finished' event handler: > */ > - if (!uptodate) > - md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > - else > + if (!uptodate) { > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + /* > + * Only fault disk out of array on write error, not read. 
> + */ > + if (r1_bio->cmd == WRITE) > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > + md_error(r1_bio->mddev, conf->mirrors[mirror].rdev); > + } else > /* > * Set R1BIO_Uptodate in our master bio, so that > * we will return a good error code for to the higher > @@ -375,7 +381,12 @@ > /* > * we have only one bio on the read side > */ > - if (uptodate) > + if (uptodate > +#ifndef DO_NOT_ADD_ROBUST_READ_SUPPORT > + /* Give up and error if we're last */ > + || atomic_dec_and_test(&r1_bio->remaining) > +#endif /* DO_NOT_ADD_ROBUST_READ_SUPPORT */ > + ) > raid_end_bio_io(r1_bio); > else { > /* Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 18:37 ` Peter T. Breuer @ 2005-01-11 11:34 ` Peter T. Breuer 0 siblings, 0 replies; 95+ messages in thread From: Peter T. Breuer @ 2005-01-11 11:34 UTC (permalink / raw) To: linux-raid

I'm looking for a natural way to rewrite failed read sectors on raid1. Any ideas? There are several pitfalls to do with barriers in the different raid threads.

My first crude idea was to put into raid1_end_request a

sync_request(mddev, r1_bio->sector, 0);

just before raid1_end_bio_io(r1_bio) is run on a successful retried read. But this supposes that nothing in sync_request will sleep (or is that ok in end_request nowadays?). If not possible inline I will have to schedule it instead.

Another possibility is not to run raid1_end_bio_io just yet but instead convert the r1_bio we just did ok into a SPECIAL and put it on the retry queue and let raid1d treat it (by running the WRITE half of a READ-WRITE resync operation on it). I can modify raid1d to do the user's end_bio_io for us if needed.

Or I can run

sync_request_write(mddev, r1_bio);

directly (somehow I get a shiver down my spine) from the end_request.

Ideas? Advice? Derision?

OK - so to be definite, what would be wrong with

r1_bio->cmd = SPECIAL;
reschedule_retry(r1_bio);

instead of

raid_end_bio_io(r1_bio);

in raid1_end_request? This should result in the raid1d thread doing a write-half to all devices from the bio buffer that we just filled with a successful read. The question is when we get to ack the user on the read. Maybe I should clone the bio.

Peter

^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-08 19:25 ` maarten 2005-01-08 20:33 ` Mario Holbe @ 2005-01-08 23:09 ` Guy 2005-01-09 0:56 ` maarten 2005-01-13 2:05 ` Neil Brown 1 sibling, 2 replies; 95+ messages in thread From: Guy @ 2005-01-08 23:09 UTC (permalink / raw) To: 'maarten', linux-raid Maarten said: "Normally, the minute a drive fails, it gets kicked and the spare would kick in and md syncs this spare. We now have a non-degraded array again." Guy says: But, you make it seem instantaneously! The array will be degraded until the re-sync is done. In my case, that takes about 60 minutes, so 1 extra minute is insignificant. Marrten said: "Yes, but this would be impossible to do, since md cannot anticipate _which_ disk you're going to fail before it happens. ;)" Guy says: But, I could tell md which disk I want to spare. After all, I know which disk I am going to fail. Maybe even an option to mark a disk as "to be failed", which would cause it to be spared before it goes off-line. Then md could fail the disk after it has been spared. Neil, add this to the wish list! :) EMC does this on their big iron. If the system determines a disk is having too many issues (bad blocks or whatever), the system predicts a failure, the system copies the disk to a spare. That way a second failure during the re-sync would not be fatal. And a direct disk to disk copy is much faster (or easier) than a re-build from parity. This is how it was explained to me about 5 years ago. No idea if it was marketing lies or truth. But I liked the fact that my data stayed redundant while the spare was being re-built. This would not work if a drive failed, only if a drive failure was predicted. Another cool feature... the disk array then makes a support call. The disk is replaced quickly, normally before any redundancy was lost. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Saturday, January 08, 2005 2:25 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Saturday 08 January 2005 19:55, you wrote: > My warning about user error was not targeted at you! :) > Sorry if it seemed so. :-) > And the order does not matter! Hm... yes you're right. But adding the disk is more prudent (or is it?) Grr. Now you've got ME thinking ! ;-) Normally, the minute a drive fails, it gets kicked and the spare would kick in and md syncs this spare. We now have a non-degraded array again. If I then fail the spare first, the array goes into degraded mode. Whereas if I hotadd the disk, it becomes a spare. Presumably if I now fail the original spare, the real disk will get synced again, to get the same setup as before. But yes, you're right; during this step it is degraded again. Oh well... > It would be cool if the rebuild to the repaired disk could be done before > the spare was failed or removed. Then the array would not be degraded at > all. Yes, but this would be impossible to do, since md cannot anticipate _which_ disk you're going to fail before it happens. ;) > If I ever re-build my system, or build a new system, I hope to use RAID6. I tried this in last fall, but it didn't work out then. See the list archives. > The Seagate test is on-line. Before I started using the Seagate tool, I > used dd. I'm not as cautious as you are. I just pray the hot spare does what its supposed to do. > My disks claim to be able to re-locate bad blocks on read error. But I am > not sure if this is correctable errors or not. 
If not correctable errors > are re-located, what data does the drive return? Since I don't know, I > don't use this option. I did use this option for awhile, but after > re-reading about it, I got concerned and turned it off. Afaik, if a drive senses it gets more 'difficult' than usual to read a sector, it will automatically copy it to a spare sector and reassign it. However, I doubt the OS gets any wiser this happens, so neither would md. In which cases the error gets noticed by md I don't precisely know, but I reckon that may well be when the error is uncorrectible. Not _undetectable_, to quote from another thread... 8-) > This is from the readme file: > Automatic Read Reallocation Enable (ARRE) > -Marreon/off enable/disable ARRE bit > On, drive automatically relocates bad blocks detected > during read operations. Off, drive creates Check condition > status with sense key of Medium Error if bad blocks are > detected during read operations. Hm. I would definitely ENable that option. But what do I know. It also depends I guess on how fatal reading bad data undetected is for you. For me, if one of my mpegs or mp3s develops a bad sector I can probably live with that. :-) Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-08 23:09 ` Guy @ 2005-01-09 0:56 ` maarten 2005-01-13 2:05 ` Neil Brown 1 sibling, 0 replies; 95+ messages in thread From: maarten @ 2005-01-09 0:56 UTC (permalink / raw) To: linux-raid On Sunday 09 January 2005 00:09, Guy wrote: > Maarten said: > "Normally, the minute a drive fails, it gets kicked and the spare would > kick in and md syncs this spare. We now have a non-degraded array again." > > Guy says: > But, you make it seem instantaneously! The array will be degraded until > the re-sync is done. In my case, that takes about 60 minutes, so 1 extra > minute is insignificant. No, sure it is not instantaneous, far from it. Sorry if I made that impression. On my system it takes a whole lot longer than 60 minutes, more like 360 minutes. (in my other array where I use whole-disk 160 GB volumes). > Marrten said: > "Yes, but this would be impossible to do, since md cannot anticipate > _which_ > disk you're going to fail before it happens. ;)" > > Guy says: > But, I could tell md which disk I want to spare. After all, I know which > disk I am going to fail. Maybe even an option to mark a disk as "to be > failed", which would cause it to be spared before it goes off-line. Then > md could fail the disk after it has been spared. Neil, add this to the > wish list! :) Yes, that would be a smart option indeed :) It gets rid of the window where any failure would be fatal. But I suppose Neil is overworked as it is. > EMC does this on their big iron. If the system determines a disk is having > too many issues (bad blocks or whatever), the system predicts a failure, > the system copies the disk to a spare. That way a second failure during > the re-sync would not be fatal. And a direct disk to disk copy is much > faster (or easier) than a re-build from parity. This is how it was > explained to me about 5 years ago. No idea if it was marketing lies or > truth. But I liked the fact that my data stayed redundant while the spare > was being re-built. This would not work if a drive failed, only if a drive > failure was predicted. Another cool feature... the disk array then makes a > support call. The disk is replaced quickly, normally before any redundancy > was lost. Hehe. Cool. Big iron -> You indeed get what ya pay for :-)) ^ permalink raw reply [flat|nested] 95+ messages in thread
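To make the ordering being discussed above concrete, here is a minimal mdadm sketch of that cycle. The names are invented for illustration (/dev/md0 is the array, /dev/sdc1 the disk that failed and was repaired, /dev/sde1 the hot spare that took over), and the exact option spelling may vary between mdadm versions:

  mdadm /dev/md0 --remove /dev/sdc1     # drop the kicked disk; the spare has already been synced in
  mdadm /dev/md0 --add /dev/sdc1        # the repaired disk returns as the new hot spare
  # optionally migrate the data back onto sdc1 by failing the old spare;
  # the array is degraded from here until sdc1 finishes re-syncing
  mdadm /dev/md0 --fail /dev/sde1 --remove /dev/sde1
  mdadm /dev/md0 --add /dev/sde1        # sde1 goes back to being the spare

The "spare the disk before failing it" option Guy asks for would remove the degraded window opened by the last two steps.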
* RE: Spares and partitioning huge disks 2005-01-08 23:09 ` Guy 2005-01-09 0:56 ` maarten @ 2005-01-13 2:05 ` Neil Brown 2005-01-13 4:55 ` Guy 2005-01-13 9:27 ` Peter T. Breuer 1 sibling, 2 replies; 95+ messages in thread From: Neil Brown @ 2005-01-13 2:05 UTC (permalink / raw) To: Guy; +Cc: 'maarten', linux-raid

On Saturday January 8, bugzilla@watkins-home.com wrote:
>
> Guy says:
> But, I could tell md which disk I want to spare. After all, I know which
> disk I am going to fail. Maybe even an option to mark a disk as "to be
> failed", which would cause it to be spared before it goes off-line. Then md
> could fail the disk after it has been spared. Neil, add this to the wish
> list! :)

Once the "bitmap of potentially dirty blocks" is working, this could be done in user space (though there would be a small window).

- fail out the chosen drive.
- combine it with the spare in a raid1 with no superblock
- add this raid1 back into the main array.
- md will notice that it has recently been removed and will only rebuild those blocks which need to be rebuilt
- wait for the raid1 to fully sync
- fail out the drive you want to remove.

You only have a tiny window where the array is degraded, and if we were to allow an md array to block all IO requests for a time, you could make that window irrelevant.

NeilBrown

^ permalink raw reply [flat|nested] 95+ messages in thread
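A rough command-level sketch of the steps Neil lists, with invented device names (/dev/md0 is the main array, /dev/sdc1 the drive being retired, /dev/sde1 the spare). The --build form is the usual way to get an md array with no superblock, but whether this exact sequence behaves as described depends on the mdadm version and on the bitmap support Neil says is still to come:

  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1    # fail out the chosen drive
  mdadm --build /dev/md9 --level=1 --raid-devices=2 /dev/sdc1 /dev/sde1
                                                        # pair it with the spare, no superblock
  mdadm /dev/md0 --add /dev/md9                         # add the raid1 back into the main array
  # watch /proc/mdstat and wait for the raid1 (and md0) to fully sync, then
  mdadm /dev/md9 --fail /dev/sdc1 --remove /dev/sdc1    # fail out the drive you want to remove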
* RE: Spares and partitioning huge disks 2005-01-13 2:05 ` Neil Brown @ 2005-01-13 4:55 ` Guy 2005-01-13 9:27 ` Peter T. Breuer 1 sibling, 0 replies; 95+ messages in thread From: Guy @ 2005-01-13 4:55 UTC (permalink / raw) To: 'Neil Brown'; +Cc: 'maarten', linux-raid 1. Would the re-sync of the RAID5 wait for the re-sync of the RAID1, since 2 different arrays depend on the same device? 2. Will the "bitmap of potentially dirty blocks" be able to keep a disk in the array if it has bad blocks? 3. Will RAID1 be able to re-sync to another disk if the source disk has bad blocks? Even if they are un-correctable? Once the re-sync is done, then RAID5 could re-construct the missing data, and correct the RAID1 array. Ouch!, seems like a catch 22. RAID5 should go first and correct the bad blocks first, and then, any new bad blocks found during the RAID1 re-sync. But, the bitmap would need to be quad-state (synced, right is good, left is good, both are bad). Since RAID1 can have more than 2 devices, maybe 1 bit per device (synced, not synced). The more I think, the harder it gets! :) If 1, 2 and 3 above are all yes, then it seems like a usable workaround. And, in the future, maybe RAID5 arrays would be made up of RAID1 arrays with only 1 disk each. Using grow to copy a failing disk to another (RAID1), then removing the failing disk. Then shrinking the RAID1 back to 1 disk. Then there would be no window. Using this method, #1 above is irrelevant, or less relevant! Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Neil Brown Sent: Wednesday, January 12, 2005 9:06 PM To: Guy Cc: 'maarten'; linux-raid@vger.kernel.org Subject: RE: Spares and partitioning huge disks On Saturday January 8, bugzilla@watkins-home.com wrote: > > Guy says: > But, I could tell md which disk I want to spare. After all, I know which > disk I am going to fail. Maybe even an option to mark a disk as "to be > failed", which would cause it to be spared before it goes off-line. Then md > could fail the disk after it has been spared. Neil, add this to the wish > list! :) Once the "bitmap of potentially dirty blocks" is working, this could be done in user space (though there would be a small window). - fail out the chosen drive. - combine it with the spare in a raid1 with no superblock - add this raid1 back into the main array. - md will notice that it has recently been removed and will only rebuild those blocks which need to be rebuilt - wait for the raid1 to fully sync - fail out the drive you want to remove. You only have a tiny window where the array is degraded, and if we were to allow an md array to block all IO requests for a time, you could make that window irrelevant. NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-13 2:05 ` Neil Brown 2005-01-13 4:55 ` Guy @ 2005-01-13 9:27 ` Peter T. Breuer 2005-01-13 15:53 ` Guy 1 sibling, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-13 9:27 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Saturday January 8, bugzilla@watkins-home.com wrote: > > > > Guy says: > > But, I could tell md which disk I want to spare. After all, I know which > > disk I am going to fail. Maybe even an option to mark a disk as "to be > > failed", which would cause it to be spared before it goes off-line. Then md > > could fail the disk after it has been spared. Neil, add this to the wish > > list! :) > > Once the "bitmap of potentially dirty blocks" is working, this could > be done in user space (though there would be a small window). > > - fail out the chosen drive. > - combine it with the spare in a raid1 with no superblock > - add this raid1 back into the main array. > - md will notice that it has recently been removed and will only > rebuild those blocks which need to be rebuilt > - raid for the raid1 to fully sync > - fail out the drive you want to remove. I don't really understand what this is all about, but I recall that when I was writing FR5 one of the things I wanted as an objective was to be able to REPLACE one of the disks in the array efficiently because currently there's no real way that doesn't take you through a degraded array, since you have to add the replacement as a spare, then fail one of the existing disks. What I wanted was to allow the replacement to be added in and synced up in the background. Is that what you are talking about? I don't recall if I actually did it or merely planned to do it, but I recall considering it (and that should logically imply that I probably did something about it). > You only have a tiny window where the array is degraded, and it we > were to allow an md array to block all IO requests for a time, you > could make that window irrelevant. Well, I don't see where there's any window in which its degraded. If one triggers a sync after adding in the spare and marking it as failed then the spare will get a copy from the rest and new writes will also go to it, no? Ahh .. I now recall that maybe I did this in practice for RAID5 simply by running RAID5 over individual RAID1s already in degraded mode. To "replace" any of the disks one adds a mirror component to one of the degraded RAID1s, waits till it syncs up, then fails and removes the original component. Hey presto - replacement without degradation. Presumably that also works for RAID1. I.e. you run RAID1 over several RAID1s already in degraded mode. To replace one of the disks you simply add in the replacement to one of the "degraded" RAID1s. When it's synced you fail out the original component. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
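A minimal sketch of the layered setup Peter describes, with invented device names. "missing" is mdadm's placeholder for an absent member, so each raid1 starts out deliberately one-sided:

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 missing
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb1 missing
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdc1 missing
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdd1 missing
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/md1 /dev/md2 /dev/md3 /dev/md4

  # replacing the disk behind md3 without ever degrading md0:
  mdadm /dev/md3 --add /dev/sde1                      # the mirror syncs in the background
  # wait for the sync to finish, then
  mdadm /dev/md3 --fail /dev/sdc1 --remove /dev/sdc1  # md0 never sees a missing member

The cost is an extra md layer on every request, and, as the following messages discuss, a bad block on the source disk still leaves a hole that only the upper RAID5's redundancy could fill.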
* RE: Spares and partitioning huge disks 2005-01-13 9:27 ` Peter T. Breuer @ 2005-01-13 15:53 ` Guy 2005-01-13 17:16 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-13 15:53 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Peter said: "Well, I don't see where there's any window in which its degraded." These are the steps that cause the window (see "Original Message" for full details): 1. fail out the chosen drive. (array is now degraded) 2. combine it with the spare in a raid1 with no superblock (re-synce starts) 3. add this raid1 back into the main array. (The main array is now in-sync other than any changes that occurred since you failed the disk in step 1) Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Thursday, January 13, 2005 4:28 AM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Saturday January 8, bugzilla@watkins-home.com wrote: > > > > Guy says: > > But, I could tell md which disk I want to spare. After all, I know which > > disk I am going to fail. Maybe even an option to mark a disk as "to be > > failed", which would cause it to be spared before it goes off-line. Then md > > could fail the disk after it has been spared. Neil, add this to the wish > > list! :) > > Once the "bitmap of potentially dirty blocks" is working, this could > be done in user space (though there would be a small window). > > - fail out the chosen drive. > - combine it with the spare in a raid1 with no superblock > - add this raid1 back into the main array. > - md will notice that it has recently been removed and will only > rebuild those blocks which need to be rebuilt > - raid for the raid1 to fully sync > - fail out the drive you want to remove. I don't really understand what this is all about, but I recall that when I was writing FR5 one of the things I wanted as an objective was to be able to REPLACE one of the disks in the array efficiently because currently there's no real way that doesn't take you through a degraded array, since you have to add the replacement as a spare, then fail one of the existing disks. What I wanted was to allow the replacement to be added in and synced up in the background. Is that what you are talking about? I don't recall if I actually did it or merely planned to do it, but I recall considering it (and that should logically imply that I probably did something about it). > You only have a tiny window where the array is degraded, and it we > were to allow an md array to block all IO requests for a time, you > could make that window irrelevant. Well, I don't see where there's any window in which its degraded. If one triggers a sync after adding in the spare and marking it as failed then the spare will get a copy from the rest and new writes will also go to it, no? Ahh .. I now recall that maybe I did this in practice for RAID5 simply by running RAID5 over individual RAID1s already in degraded mode. To "replace" any of the disks one adds a mirror component to one of the degraded RAID1s, waits till it syncs up, then fails and removes the original component. Hey presto - replacement without degradation. Presumably that also works for RAID1. I.e. you run RAID1 over several RAID1s already in degraded mode. To replace one of the disks you simply add in the replacement to one of the "degraded" RAID1s. When it's synced you fail out the original component. 
Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-13 15:53 ` Guy @ 2005-01-13 17:16 ` Peter T. Breuer 2005-01-13 20:40 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-13 17:16 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > Peter said: > "Well, I don't see where there's any window in which its degraded." > > These are the steps that cause the window (see "Original Message" for full > details): > > 1. fail out the chosen drive. (array is now degraded) I would suggest "don't do that then". Start with an array of degraded RAID1s, as I suggested, and add in an extra disk to one of the raid1s, wait till it syncs, then remove the original component. Instant new (degraded) RAID1 in the place of the old, and the array above none the wiser. > 2. combine it with the spare in a raid1 with no superblock (re-synce starts) Why "no superblock"? Oh well - let's leave it as a mystery. > 3. add this raid1 back into the main array. (The main array is now in-sync > other than any changes that occurred since you failed the disk in step 1) Well, if you have an array of arrays it seems that the main array must have been degraded too, but I don't see where you took the subarray out of it in the sequence above (in order to add it back in now). The problem pointed out is that if the disk you are going to swap out is faulty, there's no way of copying from it perfectly. The read patch I posted a few days ago will help, but it won't paper over real sector errors - it may allow the copy to processd, however (I'll have to check what happens during a sync). So one has to substitute using data from the redundant parts of the array above (in the array-of-arrays solution). But there's no communication at present :(. Well, 1) if one were to use bitmaps, I would suggest that in the case of an array of arrays that the bitmap be shared between an array and its subarrays - do we really care in which disk a problem is? No - we know we just have to try and find some good data and correct a problem in that block and we can go searching for the details if and when we need. 2) I don't see any problem in, even without a bitmap, simply augmenting the repair strategy (which you people don't have yet, heh) for read errors to including getting the data from the array above if we are in a subarray, not just using our own redundancy. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-13 17:16 ` Peter T. Breuer @ 2005-01-13 20:40 ` Guy 2005-01-13 23:32 ` Peter T. Breuer 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-13 20:40 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Maybe you missed my post from yesterday. http://marc.theaimsgroup.com/?l=linux-raid&m=110559211400459&w=2 No superblock was to prevent overwriting data on the failing component of the top RAID5 array. If you build the top array with degraded RAID1 arrays, then use a super block for the RAID1 arrays. Also, so all of the RAID1 arrays don't seem degraded, configure them with only 1 device. Grow them to 2 devices when needed. Then shrink them back to 1 when done. The RAID1 idea will not work since a bad block will take out the RAID1. But there are more issues, see the above URL. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Thursday, January 13, 2005 12:17 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Guy <bugzilla@watkins-home.com> wrote: > Peter said: > "Well, I don't see where there's any window in which its degraded." > > These are the steps that cause the window (see "Original Message" for full > details): > > 1. fail out the chosen drive. (array is now degraded) I would suggest "don't do that then". Start with an array of degraded RAID1s, as I suggested, and add in an extra disk to one of the raid1s, wait till it syncs, then remove the original component. Instant new (degraded) RAID1 in the place of the old, and the array above none the wiser. > 2. combine it with the spare in a raid1 with no superblock (re-synce starts) Why "no superblock"? Oh well - let's leave it as a mystery. > 3. add this raid1 back into the main array. (The main array is now in-sync > other than any changes that occurred since you failed the disk in step 1) Well, if you have an array of arrays it seems that the main array must have been degraded too, but I don't see where you took the subarray out of it in the sequence above (in order to add it back in now). The problem pointed out is that if the disk you are going to swap out is faulty, there's no way of copying from it perfectly. The read patch I posted a few days ago will help, but it won't paper over real sector errors - it may allow the copy to processd, however (I'll have to check what happens during a sync). So one has to substitute using data from the redundant parts of the array above (in the array-of-arrays solution). But there's no communication at present :(. Well, 1) if one were to use bitmaps, I would suggest that in the case of an array of arrays that the bitmap be shared between an array and its subarrays - do we really care in which disk a problem is? No - we know we just have to try and find some good data and correct a problem in that block and we can go searching for the details if and when we need. 2) I don't see any problem in, even without a bitmap, simply augmenting the repair strategy (which you people don't have yet, heh) for read errors to including getting the data from the array above if we are in a subarray, not just using our own redundancy. Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
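Guy's single-member variant would look roughly like this, again with invented names, and only if --grow can change the member count of a raid1 at all; Peter's caveat applies, since this needs newer md/mdadm code than the 2.6.3 he quotes, and some versions insist on --force for the one-member steps:

  mdadm --create /dev/md3 --level=1 --raid-devices=1 --force /dev/sdc1   # one-member raid1, not reported degraded
  # ... later, to swap the disk out:
  mdadm /dev/md3 --add /dev/sde1                    # joins as a spare
  mdadm --grow /dev/md3 --raid-devices=2            # spare becomes a member and syncs
  # after the sync:
  mdadm /dev/md3 --fail /dev/sdc1 --remove /dev/sdc1
  mdadm --grow /dev/md3 --raid-devices=1            # shrink back to a single member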
* Re: Spares and partitioning huge disks 2005-01-13 20:40 ` Guy @ 2005-01-13 23:32 ` Peter T. Breuer 2005-01-14 2:43 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-13 23:32 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > Maybe you missed my post from yesterday. Possibly I did - I certainly don't read everything and not everything gets to me. Or maye I saw it and could not decipher it. I don't know! > http://marc.theaimsgroup.com/?l=linux-raid&m=110559211400459&w=2 > No superblock was to prevent overwriting data on the failing component of You say that no superblock in one of the raid1 subarray's disks stops overwriting data on a top raid5 array? This really sounds like double dutch! And if it does (What? How? Why?), so what? > the top RAID5 array. If you build the top array with degraded RAID1 arrays, > then use a super block for the RAID1 arrays. There possibly a missing verb in that sentence. Or maybe not. It is hard to tell. Hmmmmmmmm .......... nope, I really can't see where that sentence is trying to go. Let's suppose it really does have the form of a computer language IF THEN, so the conditional test would be "you build the top array with degraded RAID1 arrays". Well, I can interpret that to say that I build the top array (which is RAID5) OF degraded RAID1 arrays, and then that would match what I suggested. OK. So that would mean "IF you do what you suggest THEN ...". Then what? Then "an imperative". An imperative? Why should I obey an imperative? Well, what does this imperative say anyway? It says "use a super block for the RAID1 arrays". OK, that could be "use RAID1 arrays with superblocks". That is "do the normal thing with RAID1". So the whole sentence says "IF you do what you suggest THEN do the normal thing with RAID1". OK - I agree. You are trying to say "IF I do things my way THEN I don't have to do anything strange at the RAID1 level". OK? Whew! I can see why I skipped whatever you said before if that is what it takes to decipher it! Bt then the entire sentence says nothing strange or exciting. > Also, so all of the RAID1 arrays don't seem degraded, configure them with > only 1 device. Whaaaaaaat? Oh no, I give up - I really can't parse this. Hang on - maybe the tenses are wrong. Maybe you are trying to say "you don't have to configure the RAID1 arrays as having 1 good disk and 1 failed disk". Well, I disagree. Correct me if I am wrong but as far as I know you cannot change the number of disks in a raid array. I'd be happy to learn you can, but for all I know if you start with n disks comprised of m good and p failed disks, then n = m + p and the total can never change. In my 2.6.3 codebase for raid there is only one point where conf->raid_disks is changed, and it is in the "run" routine, where it is set once and for all and never changed. > Grow them to 2 devices when needed. Then shrink them back > to 1 when done. If it were possible I'd be happy to hear of it. Maybe it is possible - but it would be in a newer codebase than the 2.6.3 code I have running here. If so, why this convoluted way of saying that? > The RAID1 idea will not work since a bad block will take out the RAID1. But Uh - yes it will work, no a bad block will not "take out the RAID1", whatever you mean by that. I presume you mean that a bad block in the disk being read will mess up the raid1 subarray. No it won't - it will just prevent a block being copied. I see nothing terrible in that. 
The result will be that everything else but that block will be copied. If you like we can even arrange that the missing block be corrected THEN at that moment from data available in the superarray, but I don't see that as necessary. Why? Well,

1) because now that you have told me that the disk you want to swap out is bad, then the top level array has morally lost its redundancy already! So just take the disk out and replace it - you won't be degrading the top array any more than it already is while you do this.

2) oh - but you say that yes we are losing redundancy on the good sectors of the disk. Oh? And which are those? Well let's just go ahead with the RAID1 sync and whenever we hit an unreadable sector then launch a request to the top level array for a read from the rest of the RAID5 for that sector only.

3) oh, but you don't like to do that during resync? Shrug .. then mark the place on a bitmap and after the resync has finished as best you are able then launch an extra cleanup phase from the RAID5 to cover the blocks marked bad in the bitmap (one can do this periodically anyway).

I don't see that one really needs to do 3 in place of 2, but I am experimenting. Anyway, the point is that your "it will not work" is wrong.

> there are more issues, see the above URL.

Does it contain more of the same?

Peter

- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-13 23:32 ` Peter T. Breuer @ 2005-01-14 2:43 ` Guy 0 siblings, 0 replies; 95+ messages in thread From: Guy @ 2005-01-14 2:43 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Well, I give up! It seems we don't talk the same language. That is ok with me. Bye! -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Thursday, January 13, 2005 6:33 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks Guy <bugzilla@watkins-home.com> wrote: > Maybe you missed my post from yesterday. Possibly I did - I certainly don't read everything and not everything gets to me. Or maye I saw it and could not decipher it. I don't know! > http://marc.theaimsgroup.com/?l=linux-raid&m=110559211400459&w=2 > No superblock was to prevent overwriting data on the failing component of You say that no superblock in one of the raid1 subarray's disks stops overwriting data on a top raid5 array? This really sounds like double dutch! And if it does (What? How? Why?), so what? > the top RAID5 array. If you build the top array with degraded RAID1 arrays, > then use a super block for the RAID1 arrays. There possibly a missing verb in that sentence. Or maybe not. It is hard to tell. Hmmmmmmmm .......... nope, I really can't see where that sentence is trying to go. Let's suppose it really does have the form of a computer language IF THEN, so the conditional test would be "you build the top array with degraded RAID1 arrays". Well, I can interpret that to say that I build the top array (which is RAID5) OF degraded RAID1 arrays, and then that would match what I suggested. OK. So that would mean "IF you do what you suggest THEN ...". Then what? Then "an imperative". An imperative? Why should I obey an imperative? Well, what does this imperative say anyway? It says "use a super block for the RAID1 arrays". OK, that could be "use RAID1 arrays with superblocks". That is "do the normal thing with RAID1". So the whole sentence says "IF you do what you suggest THEN do the normal thing with RAID1". OK - I agree. You are trying to say "IF I do things my way THEN I don't have to do anything strange at the RAID1 level". OK? Whew! I can see why I skipped whatever you said before if that is what it takes to decipher it! Bt then the entire sentence says nothing strange or exciting. > Also, so all of the RAID1 arrays don't seem degraded, configure them with > only 1 device. Whaaaaaaat? Oh no, I give up - I really can't parse this. Hang on - maybe the tenses are wrong. Maybe you are trying to say "you don't have to configure the RAID1 arrays as having 1 good disk and 1 failed disk". Well, I disagree. Correct me if I am wrong but as far as I know you cannot change the number of disks in a raid array. I'd be happy to learn you can, but for all I know if you start with n disks comprised of m good and p failed disks, then n = m + p and the total can never change. In my 2.6.3 codebase for raid there is only one point where conf->raid_disks is changed, and it is in the "run" routine, where it is set once and for all and never changed. > Grow them to 2 devices when needed. Then shrink them back > to 1 when done. If it were possible I'd be happy to hear of it. Maybe it is possible - but it would be in a newer codebase than the 2.6.3 code I have running here. If so, why this convoluted way of saying that? > The RAID1 idea will not work since a bad block will take out the RAID1. 
But Uh - yes it will work, no a bad block will not "take out the RAID1", whatever you mean by that. I presume you mean that a bad block in the disk being read will mess up the raid1 subarray. No it won't - it will just prevent a block being copied. I see nothing terrible in that. The result will be that everything else but that block will be copied. If you like we can even arrange that the missing block be corrected THEN at that moment from data available in the superarray, but I don't see that as necessary. Why? Well, 1) because now that you have told me that the disk you want to swap out is bad, then the top level array has morally lost its redundancy already! So just take the disk out and replace it - you won't be degrading the top array any more than it already is while you do this. 2) oh - but you say that yes we are losing redundancy on the good sectors of the disk. Oh? And which are those? Well let's just go ahead with the RAID1 sync and whenever we hit an unreadable sector then launch a request to the top level array for a read from the rest of the RAID5 for that sector only. 3) oh, but you don't like to do that during resync? Shrug .. then mark the place on a bitmap and after the resync has finished as best you are able then launch an extra cleanup phase from the RAID5 to cover the blocks marked bad in the bitmap (one can do this periodically anyway). I don't see that one really needs to do 3 in place of 2, but I am experimenting. Anyway, the point is that your "it will not work" is wrong. > there are more issues, see the above URL. Does it contain more of the same? Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
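[Editorial note: a minimal sketch of the layered scheme being argued about here, namely a RAID5 whose members are each a one-disk RAID1 with an empty slot, so a suspect disk can be mirrored onto its replacement before being pulled. All device and md names are invented, and this is an illustration rather than a tested recipe for the 2.4/2.6.3 code discussed above.]

mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda1 missing
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdb1 missing
mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sdc1 missing
mdadm --create /dev/md13 --level=1 --raid-devices=2 /dev/sdd1 missing
mdadm --create /dev/md20 --level=5 --raid-devices=4 /dev/md10 /dev/md11 /dev/md12 /dev/md13

# later, to migrate a suspect /dev/sda1 onto a new /dev/sde1 (names hypothetical):
mdadm /dev/md10 --add /dev/sde1     # the RAID1 copies the old disk onto the new one
# wait for the resync to finish (cat /proc/mdstat), then drop the old disk:
mdadm /dev/md10 --fail /dev/sda1
mdadm /dev/md10 --remove /dev/sda1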
* Re: Spares and partitioning huge disks 2005-01-08 14:52 ` Frank van Maarseveen 2005-01-08 15:50 ` Mario Holbe 2005-01-08 16:32 ` Guy @ 2005-01-08 16:49 ` maarten 2005-01-08 19:01 ` maarten 2005-01-09 19:33 ` Frank van Maarseveen 2 siblings, 2 replies; 95+ messages in thread From: maarten @ 2005-01-08 16:49 UTC (permalink / raw) To: linux-raid On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > > His plan is to split the disks into 6 partitions. > > Each of his six RAID5 arrays will only use 1 partition of each physical > > disk. > > If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed > > disk. If he gets 2 read errors, on different disks, at the same time, he > > has a 1/6 chance they would be in the same array (which would be bad). > > His plan is to combine the 6 arrays with LVM or a linear array. > > Intriguing setup. Do you think this actually improves the reliability > with respect to disk failure compared to creating just one large RAID5 > array? Yes. But I get no credits; someone else here invented the idea. > For one second I thought it's a clever trick but gut feeling tells > me the odds of losing the entire array won't change (simplified -- > because the increased complexity creates room for additional errors). No. It is somewhat more complex, true, but no different than making, for example, 6 md arrays for six different mountpoints. And I just add all six together in an LVM. The idea behind it is that not all errors with md are fatal. In the case of a non-fatal error, just re-adding the disk might solve it since the drive then will remap the bad sector. However, IF during that resync one other drive has a read error, it gets kicked too and the array dies. The chances of that happening are not very small; during resync all of the other drives get read in whole, so that is much more intensive than normal operation. So at the precise moment you really can't afford to get a read error, the chances of getting one are greater than ever(!). By dividing the physical disk in smaller parts one decreases the chance of a second disk with a bad sector being on the same array. You could have 3 or even 4 disks with bad sectors without losing the array, provided you're lucky and they all are on different parts of the drive platters (precisely: in different arrays). This is in theory of course, you'd be stupid to leave an array degraded and let chance decide which one breaks next... ;-) Besides this, the resync time in case of a fault decreases by a factor 6 too as an added bonus. I don't know about you but over here resyncing a 250GB disk takes the better part of the day. (To be honest, that was a slow system) Now it is certain that you'll strike a compromise between the added complexity and the benefits of this setup, so you choose an arbitrary amount of md arrays to define. For me six seemed okay, there is no need to go overboard and define real small arrays like 10 GB ones (24 of them). ;-) Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
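[Editorial note: to make the split-into-several-arrays layout concrete, here is a hedged sketch of how it might be built with mdadm and LVM. The device names, array numbers and sizes below are placeholders, not the poster's actual configuration (his real listing follows in the next message).]

# one RAID5 per 40GB partition slice, striped across the four disks:
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5
# ...repeat for md1..md5 using partitions 6..10 of each disk...

# then glue the six arrays together with LVM:
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
vgcreate bigvg /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
lvcreate -L 650G -n store bigvg     # or -l <number of free extents>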
* Re: Spares and partitioning huge disks 2005-01-08 16:49 ` maarten @ 2005-01-08 19:01 ` maarten 2005-01-10 16:34 ` maarten 2005-01-09 19:33 ` Frank van Maarseveen 1 sibling, 1 reply; 95+ messages in thread From: maarten @ 2005-01-08 19:01 UTC (permalink / raw) To: linux-raid On Saturday 08 January 2005 17:49, maarten wrote: > On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > > > His plan is to split the disks into 6 partitions. > > > Each of his six RAID5 arrays will only use 1 partition of each physical > > > disk. > > > If he were to lose a disk, all 6 RAID5 arrays would only see 1 failed > > > disk. If he gets 2 read errors, on different disks, at the same time, > > > he has a 1/6 chance they would be in the same array (which would be > > > bad). His plan is to combine the 6 arrays with LVM or a linear array. > > > > Intriguing setup. Do you think this actually improves the reliability > > with respect to disk failure compared to creating just one large RAID5 > > array? As the system is now online again, busy copying, I can show the exact config: dozer:~ # fdisk -l /dev/hde Disk /dev/hde: 250.0 GB, 250059350016 bytes 255 heads, 63 sectors/track, 30401 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/hde1 1 268 2152678+ fd Linux raid autodetect /dev/hde2 269 331 506047+ fd Linux raid autodetect /dev/hde3 332 575 1959930 fd Linux raid autodetect /dev/hde4 576 30401 239577345 5 Extended /dev/hde5 576 5439 39070048+ fd Linux raid autodetect /dev/hde6 5440 10303 39070048+ fd Linux raid autodetect /dev/hde7 10304 15167 39070048+ fd Linux raid autodetect /dev/hde8 15168 20031 39070048+ fd Linux raid autodetect /dev/hde9 20032 24895 39070048+ fd Linux raid autodetect /dev/hde10 24896 29759 39070048+ fd Linux raid autodetect /dev/hde11 29760 30401 5156833+ 83 Linux dozer:~ # cat /proc/mdstat Personalities : [linear] [raid0] [raid1] [raid5] [multipath] read_ahead 1024 sectors md1 : active raid1 hdg2[1] hde2[0] 505920 blocks [2/2] [UU] md0 : active raid1 hdg1[1] hde1[0] 2152576 blocks [4/2] [UU__] md2 : active raid1 sdb2[0] sda2[1] 505920 blocks [2/2] [UU] md3 : active raid5 sdb5[2] sda5[3] hdg5[1] hde5[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md4 : active raid5 sdb6[2] sda6[3] hdg6[1] hde6[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md5 : active raid5 sdb7[2] sda7[3] hdg7[1] hde7[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md6 : active raid5 sdb8[2] sda8[3] hdg8[1] hde8[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md7 : active raid5 sdb9[2] sda9[3] hdg9[1] hde9[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] md8 : active raid5 sdb10[2] sda10[3] hdg10[1] hde10[0] 117209856 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] # Where md0 is "/" (temporary degraded), and md1 and md2 are swap. # The md3 through md8 are the big arrays that are part of LVM. dozer:~ # pvscan pvscan -- reading all physical volumes (this may take a while...) 
pvscan -- ACTIVE PV "/dev/md3" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md4" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md5" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md6" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md7" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- ACTIVE PV "/dev/md8" of VG "lvm_video" [111.72 GB / 0 free] pvscan -- total: 6 [670.68 GB] / in use: 6 [670.68 GB] / in no VG: 0 [0] dozer:~ # vgdisplay --- Volume group --- VG Name lvm_video VG Access read/write VG Status available/resizable VG # 0 MAX LV 256 Cur LV 1 Open LV 1 MAX LV Size 2 TB Max PV 256 Cur PV 6 Act PV 6 VG Size 670.31 GB PE Size 32 MB Total PE 21450 Alloc PE / Size 21450 / 670.31 GB Free PE / Size 0 / 0 VG UUID F0EF61-uu4P-cnCq-6oQ6-CO5n-NE9g-5xjdTE dozer:~ # df Filesystem 1K-blocks Used Available Use% Mounted on /dev/md0 1953344 1877604 75740 97% / /dev/lvm_video/mythtv 702742528 42549352 660193176 7% /mnt/store # As of yet there are no spares. This is a todo, the most important thing is to get the app back in working state now. I'll probably make a /usr md device in future from hdX3, as "/" is completely full. This was because of legacy constraints, migrating drives... Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
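[Editorial note: on the spares to-do above, mdadm's monitor mode can move a spare between arrays that share a spare-group, which is what a "roaming" spare for all six arrays would need. A sketch of the mdadm.conf entries and commands involved, with the spare device name invented:]

# in /etc/mdadm.conf (or /etc/mdadm/mdadm.conf on Debian):
ARRAY /dev/md3 super-minor=3 spare-group=video
ARRAY /dev/md4 super-minor=4 spare-group=video
ARRAY /dev/md5 super-minor=5 spare-group=video
ARRAY /dev/md6 super-minor=6 spare-group=video
ARRAY /dev/md7 super-minor=7 spare-group=video
ARRAY /dev/md8 super-minor=8 spare-group=video
MAILADDR root

# park the spare partition in one array; the monitor moves it to whichever
# array in the group loses a member (spare device name is hypothetical):
mdadm /dev/md3 --add /dev/hdi5
mdadm --monitor --scan --daemonise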
* Re: Spares and partitioning huge disks 2005-01-08 19:01 ` maarten @ 2005-01-10 16:34 ` maarten 2005-01-10 16:36 ` Gordon Henderson ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: maarten @ 2005-01-10 16:34 UTC (permalink / raw) To: linux-raid On Saturday 08 January 2005 20:01, maarten wrote: > On Saturday 08 January 2005 17:49, maarten wrote: > > On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > > > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > As the system is now online again, busy copying, I can show the exact > config: > Well all about the array is done and working fine up to now. Except one thing that I didn't anticipate: The application that's supposed to run has some problems under -I suppose- the new 2.4.28 kernel. I've had two panics / oopses in syslog already, and the process then is unkillable, so a reboot is in order. But I think that's bttv related, not the I/O layer. In any case I suffered through two lengthy raid resyncs already... ;-| So I've been shopping around for a *big* servercase today so I can put all disks (these 5, plus 6 from the current fileserver) in one big tower. I'll then use that over NFS and can revert back to my older working kernel. I've chosen a Chieftec case, as can be seen here http://www.chieftec.com/products/Workcolor/CA-01.htm and here in detail http://www.chieftec.com/products/Workcolor/NewBA.htm Nice drive cages, eh ? :-) P.S.: I get this filling up my logs. Should I be worried about that ? Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 16:34 ` maarten @ 2005-01-10 16:36 ` Gordon Henderson 2005-01-10 17:10 ` maarten 2005-01-10 17:13 ` Spares and partitioning huge disks Guy 2005-01-11 10:09 ` Spares and partitioning huge disks KELEMEN Peter 2 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-10 16:36 UTC (permalink / raw) To: maarten; +Cc: linux-raid On Mon, 10 Jan 2005, maarten wrote: > P.S.: I get this filling up my logs. Should I be worried about that ? > Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 As I understand it, the "fix" is to comment it out in the kernel sources and compile & install a new kernel... It seems to be an artifact of LVM - the only times I've seen lots of these are when I experimented with LVM... (incidentally I had some instability with the occasional panic with LVM, so I dumped it for that particular application, and the same hardware & kernel has been solid since) Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
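[Editorial note: if anyone does take the comment-it-out route, the message is printed by the raid5 driver itself; something along these lines should locate the printk in a 2.4 source tree, assuming it is unpacked in /usr/src/linux:]

grep -n "switching cache buffer size" /usr/src/linux/drivers/md/raid5.c

Commenting out that printk and rebuilding removes the log noise; the cache switching itself still happens.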
* Re: Spares and partitioning huge disks 2005-01-10 16:36 ` Gordon Henderson @ 2005-01-10 17:10 ` maarten 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks 0 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-10 17:10 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 17:36, Gordon Henderson wrote: > On Mon, 10 Jan 2005, maarten wrote: > > P.S.: I get this filling up my logs. Should I be worried about that ? > > Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> > > 4096 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, > > 4096 --> 512 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer > > size, 512 --> 4096 Jan 10 11:30:36 dozer kernel: raid5: switching cache > > buffer size, 4096 --> 512 > > As I understand it, the "fix" is to comment it out in the kernel sources > and compile & install a new kernel... Ehm...? > It seems to be an artifact of LVM - then only times I've seen lots of > these are when I experimented with LVM... (incidentally I had some > instability with the occasional panic with LVM, so dumped it for that > particular application, and same hardware & Kernel has been solid since) I'm certain I saw it before, when I didn't use LVM at all. Maybe the kernel scans for LVM at boot, but LVM was not in initrd for sure. But is it dangerous or detrimental to performance (other than that it logs way too much) ? Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-10 17:10 ` maarten @ 2005-01-16 16:19 ` Mitchell Laks 2005-01-16 17:53 ` Gordon Henderson ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: Mitchell Laks @ 2005-01-16 16:19 UTC (permalink / raw) To: maarten; +Cc: linux-raid Hi, I have 4 questions. 1) Maarten, where did you buy the big Chieftec chassis (CA-01B I think) and what did you pay for it? I have been using an antec sx1000 chassis and yours looks better and bigger. 2) Also, what are reasonable resync times for your big raid5 arrays? I had a resync time of two days by accident recently for 4x 250 hard drives because I did not have dma enabled. That is solved, but I had switched to raid1 in the interim and now I am curious what others are used to. 3) Also, I have a module driver question. I use an asus K8V-X motherboard. It has sata and parallel ide channels. I use the sata for my system and use the parallel for data storage on ide raid. I am combining the 2 motherboard IDE cable channels with highpoint rocket133 cards to provide 2 more ide ata channels. I installed debian and it defaulted to using the hpt366 modules for the rocket133 controllers. I suspect (correct me if I am wrong) that the hpt302 on the highpoint website is the RIGHT module to use (I notice, for instance, when I compare the hdparm settings, that the western digital drives on the motherboard ide channels have more advanced dma settings "turned on" than those on the rocket133 controllers; perhaps this is because it is using the 'incorrect' hpt366 module?). Of course I would prefer to use the hpt302 module (after I compile it...). So how do I ensure that the system will use the hpt302 over the hpt366 that it seems to be choosing? If I 1) compile the module hpt302 from source 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory 3) put a line hpt302 in the /etc/modules file (maybe at the top?) 4) put a line hpt302 at the top of the file /etc/mkinitrd/modules 5) run mkinitrd to generate the new initrd.img will this ensure that the module hpt302 is loaded in preference to the hpt366 module? 4) Maarten mentioned that he had a problem with 2 different drives on the same channel for raid5. What was the problem exactly with that?
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks @ 2005-01-16 17:53 ` Gordon Henderson 2005-01-16 18:22 ` Maarten 2005-01-16 19:39 ` Guy 2 siblings, 0 replies; 95+ messages in thread From: Gordon Henderson @ 2005-01-16 17:53 UTC (permalink / raw) To: linux-raid On Sun, 16 Jan 2005, Mitchell Laks wrote: > 3) Also, i have a module driver question. > I use a asus K8V-X motherboard. It has sata and parallel ide channels. I use > the sata for my system and use the parallel for data storage on ide raid. > I am using combining the 2 motherboard IDE cable channels with highpoint > rocket133 cards to provide 2 more ide ata channels. > > I installed debian and it defaulted to using the hpt366 modules for the > rocket133 controllers. I've just been down this road myself... Debian Woody, kernels 2.4 and 2.6 and a Highpoint rocket133 controller... > I suspect (correct me if I am wrong) that the hpt302 on the highpoint website > is the RIGHT module to use (I notice for instance that when I compare the > hdparm settings on the western digital drives on the motherboard ide channels > are set with more advanced dma settings "turned on" than on the rocket133 > controllers. Perhaps this is because it is using the 'incorrect hpt366' > module? Using the module off their web site worked for me with kernel 2.4.28 - but it turned my IDE drives into SCSI drives! No real issue, but the smart drive termperature program stopped working... The driver wouldn't compile with 2.6.10, but the hpt366 driver did work under 2.6.10 and seems to work very well - and the drives still look like IDE drives and hddtemp still works. > Of course I would prefer to use the hpt302 module (after i compile > it...). So Would you? The 366 driver in 2.6.10 recognises that it's a 302 card and seems to work well... dmesg output: HPT302: IDE controller at PCI slot 0000:00:0a.0 HPT302: chipset revision 1 HPT37X: using 33MHz PCI clock HPT302: 100% native mode on irq 18 ide2: BM-DMA at 0x9800-0x9807, BIOS settings: hde:DMA, hdf:pio ide3: BM-DMA at 0x9808-0x980f, BIOS settings: hdg:DMA, hdh:pio Probing IDE interface ide2... hde: Maxtor 6Y080L0, ATA DISK drive ide2 at 0xb000-0xb007,0xa802 on irq 18 Probing IDE interface ide3... hdg: Maxtor 6Y080L0, ATA DISK drive ide3 at 0xa400-0xa407,0xa002 on irq 18 I have it compiled into the kernel here too - not a module. (personal choice, I never have modules unless I can avoid it) > how do I get to insure that the system will use the hpt302 over the hpt366 > that it seems to be chosing. If I > 1) compile the module hpt302 from source > 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory > 3) put a line hpt302 in the /etc/modules file (maybe at the top?) > 4) put a line hpt302 at the top of the file /etc/mkinitd/modules. > 5) run mkinitrd to generate the new initrd.img The easiest way would be to compile a custom kernel yourself. Just leave out the Highpoint drivers and then compile and load the hpt302 module at boot time by listing it in the /etc/modules file. > 4) Maarten mentioned that he had a problem with 2 different drives on > the same channel for raid5. What was the problem exactly with that. 
It's possible that a failing IDE drive will crowbar the bus and take out the other drive with it - not necessarily damage any data on the drive, but prevent it being seen by the OS. I've experienced this myself. In a RAID-5 situation, you'd lose 2 drives which would not be a good thing... Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks 2005-01-16 17:53 ` Gordon Henderson @ 2005-01-16 18:22 ` Maarten 2005-01-16 19:39 ` Guy 2 siblings, 0 replies; 95+ messages in thread From: Maarten @ 2005-01-16 18:22 UTC (permalink / raw) To: linux-raid On Sunday 16 January 2005 17:19, Mitchell Laks wrote: > HI, > I have 3 questions. > > 1) Maarten, Where did you buy the big chieftec chasis (CA-01B i think) and > what did you pay for it? I have been using antec sx1000 chasis and yours > looks better and bigger. I paid 120 euro I think. The BA-01B is a bit cheaper but exactly the same except for the missing side window. I bought it at some local shop in the netherlands. However I went and bought a more powerful and ultrasilent Tagan 480 W PSU to replace the 360W chieftec one. That Tagan PSU was more expensive than the chieftec case+psu together. The case pleases me, but in all fairness the Antec cases are better in respect to details, this case has some mild sharp edges which I did not ever find with Antec. But that is a minor detail( to me). Of course it's not as bad as with noname cheap case brands. Overall I thought this case deserved a 5/5 for design, a 5/5 for ingenuity, and a 4/5 for craftsmanship. (Incidentally the Tagan PSU deserves a full 5/5 too) The shop will ship in Holland, but not abroad AFAIK. And that would be cost-prohibitive anyway, this case is really a big sucker. > 2) Also what are reasonable resync times for your big raid5 arrays? > I had resync time or two days by accident recently for 4x 250 hard drives > because i did not have dma enabled. that is solved, but i had switched to > raid1 in the interm and now i am curious what others are used to. Not sure. At first I built the array on a lowly old celeron-500 and the resync time of each of the 6 arrays was IIRC 50 minutes, so about 5 hours all told. With the new case I also installed a much faster board, an athlon 1400, so resync now is at (about) 20 minutes for each array, but I admit I did not take notes there. The other big array, consisting of whole-drive 160GB disks (5-1)x160GB=640GB did a resync in a little over 2 hours I think. Less than three, at any rate. > 3) Also, i have a module driver question. > I use a asus K8V-X motherboard. It has sata and parallel ide channels. I > use the sata for my system and use the parallel for data storage on ide > raid. I am using combining the 2 motherboard IDE cable channels with > highpoint rocket133 cards to provide 2 more ide ata channels. I myself now have in use: The VIA onboard ATA channels One Promise SATA TX2 150 Two noname SIL / silicon image SATA controllers One Promise ATA Tx133 The onboard VIA SATA controller is left unused. I may use it later but it gave me some problems in the past so I went for the simplest solution now. > I installed debian and it defaulted to using the hpt366 modules for the > rocket133 controllers. > I suspect (correct me if I am wrong) that the hpt302 on the highpoint > website is the RIGHT module to use (I notice for instance that when I > compare the hdparm settings on the western digital drives on the > motherboard ide channels are set with more advanced dma settings "turned > on" than on the rocket133 controllers. Perhaps this is because it is > using the 'incorrect hpt366' module? 
I once had a mainboard with HPT cntr onboard, an older version though (266?). Since then I carefully avoided highpoint as well as I could... I will not buy one unless held at gunpoint. Same as with Sony, I hate that brand. > Of course I would prefer to use the hpt302 module (after i compile it...). > So how do I get to insure that the system will use the hpt302 over the > hpt366 that it seems to be chosing. If I > 1) compile the module hpt302 from source > 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory > 3) put a line hpt302 in the /etc/modules file (maybe at the top?) > 4) put a line hpt302 at the top of the file /etc/mkinitd/modules. > 5) run mkinitrd to generate the new initrd.img > > will this insure that the module hpt302 is loaded on preference to the > hpt366 module? Sorry, I've no clue on that. Your story sounds reasonable, but why not start by getting the module compiled ? Most times, that is the hard part. If that succeeds and you can insmod it without probs, there is plenty of time to convince the kernel to load the right module I think. > 4) Maarten mentioned that he had a problem with 2 different drives on the > same channel for raid5. What was the problem exactly with that. The same problem everyone has. If an IDE drive fails, it does not -as SCSI drives tend to do- leave the electrical IDE bus in a free/useable state. So the other drive on that cable is still "good", but unreachable/dead for now. This obviously leads to a fatal 2-drive failure. It doesn't matter the second drive's failure is only temporary; the md / ide code doesn't know that. You can often restore the array manually, but this is not something that is done lightly, so really better to be avoided... Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
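[Editorial note: for the module-ordering part of the question, a hedged sketch of the usual Debian-ish steps once the out-of-tree hpt302 module builds. File names, paths and the blacklist mechanism depend on the kernel and tools in use, so treat these as placeholders rather than exact instructions:]

cp hpt302.ko /lib/modules/$(uname -r)/kernel/drivers/ide/pci/
depmod -a
echo hpt302 >> /etc/modules            # load it at boot
echo hpt302 >> /etc/mkinitrd/modules   # and in the initrd, if the disks are needed early
mkinitrd -o /boot/initrd.img-$(uname -r) $(uname -r)
# also keep the stock driver from claiming the card first, e.g. with a
# "blacklist hpt366" entry in the hotplug/modprobe blacklist your setup uses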
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks 2005-01-16 17:53 ` Gordon Henderson 2005-01-16 18:22 ` Maarten @ 2005-01-16 19:39 ` Guy 2005-01-16 20:55 ` Maarten 2 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-16 19:39 UTC (permalink / raw) To: 'Mitchell Laks'; +Cc: linux-raid If your rebuild seems too slow, make sure you increase the speed limit! Details in "man md". echo 100000 > /proc/sys/dev/raid/speed_limit_max I added this to /etc/sysctl.conf # RAID rebuild min/max speed K/Sec per device dev.raid.speed_limit_min = 1000 dev.raid.speed_limit_max = 100000 Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Mitchell Laks Sent: Sunday, January 16, 2005 11:20 AM To: maarten Cc: linux-raid@vger.kernel.org Subject: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel HI, I have 3 questions. 1) Maarten, Where did you buy the big chieftec chasis (CA-01B i think) and what did you pay for it? I have been using antec sx1000 chasis and yours looks better and bigger. 2) Also what are reasonable resync times for your big raid5 arrays? I had resync time or two days by accident recently for 4x 250 hard drives because i did not have dma enabled. that is solved, but i had switched to raid1 in the interm and now i am curious what others are used to. 3) Also, i have a module driver question. I use a asus K8V-X motherboard. It has sata and parallel ide channels. I use the sata for my system and use the parallel for data storage on ide raid. I am using combining the 2 motherboard IDE cable channels with highpoint rocket133 cards to provide 2 more ide ata channels. I installed debian and it defaulted to using the hpt366 modules for the rocket133 controllers. I suspect (correct me if I am wrong) that the hpt302 on the highpoint website is the RIGHT module to use (I notice for instance that when I compare the hdparm settings on the western digital drives on the motherboard ide channels are set with more advanced dma settings "turned on" than on the rocket133 controllers. Perhaps this is because it is using the 'incorrect hpt366' module? Of course I would prefer to use the hpt302 module (after i compile it...). So how do I get to insure that the system will use the hpt302 over the hpt366 that it seems to be chosing. If I 1) compile the module hpt302 from source 2) dump it in the /lib/modules/2.6.9-1-386/kernel/drivers/ide/pci directory 3) put a line hpt302 in the /etc/modules file (maybe at the top?) 4) put a line hpt302 at the top of the file /etc/mkinitd/modules. 5) run mkinitrd to generate the new initrd.img will this insure that the module hpt302 is loaded on preference to the hpt366 module? 4) Maarten mentioned that he had a problem with 2 different drives on the same channel for raid5. What was the problem exactly with that. - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 19:39 ` Guy @ 2005-01-16 20:55 ` Maarten 2005-01-16 21:58 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Maarten @ 2005-01-16 20:55 UTC (permalink / raw) To: linux-raid On Sunday 16 January 2005 20:39, Guy wrote: > If your rebuild seems too slow, make sure you increase the speed limit! > Details in "man md". > > echo 100000 > /proc/sys/dev/raid/speed_limit_max Hi Guy, You always say that, but that never helps me (since my distro already has 100000 as default). Are there even distros that have this set too low ? Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel 2005-01-16 20:55 ` Maarten @ 2005-01-16 21:58 ` Guy 0 siblings, 0 replies; 95+ messages in thread From: Guy @ 2005-01-16 21:58 UTC (permalink / raw) To: 'Maarten', linux-raid Yes, RedHat 9 defaults to much less, 10,000 I think. I assumed it was the md default. Maybe a RedHat 9 issue. I just looked at the man page for md. It says "The default is 100,000.". I did upgrade to Kernel 2.4.28 a few weeks ago. I guess the default was changed in a newer version of md. My /etc/sysctl.conf has a date of Dec 12, 2003. So, whatever kernel I had over 1 year ago had a default of 10,000, or so. Anyway, it has helped some people in the past. :) I guess it depends on the kernel/md version. I guess a default of no limit would be nice. But no support for that, yet! Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Maarten Sent: Sunday, January 16, 2005 3:56 PM To: linux-raid@vger.kernel.org Subject: Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel On Sunday 16 January 2005 20:39, Guy wrote: > If your rebuild seems too slow, make sure you increase the speed limit! > Details in "man md". > > echo 100000 > /proc/sys/dev/raid/speed_limit_max Hi Guy, You always say that, but that never helps me (since my distro already has 100000 as default). Are there even distros that have this set too low ? Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
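[Editorial note: to see which default a given kernel/distro actually ships, the live values can simply be read back:]

sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# or: cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max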
* RE: Spares and partitioning huge disks 2005-01-10 16:34 ` maarten 2005-01-10 16:36 ` Gordon Henderson @ 2005-01-10 17:13 ` Guy 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy ` (3 more replies) 2005-01-11 10:09 ` Spares and partitioning huge disks KELEMEN Peter 2 siblings, 4 replies; 95+ messages in thread From: Guy @ 2005-01-10 17:13 UTC (permalink / raw) To: 'maarten', linux-raid In my log files, which go back to Dec 12 I have 4 of these: raid5: switching cache buffer size, 4096 --> 1024 And 2 of these: raid5: switching cache buffer size, 1024 --> 4096 So, it would concern me! The message is from RAID5, not LVM. I base this on "raid5:" in the log entry. :) Guy I found this from Neil: "You will probably also see a message in the kernel logs like: raid5: switching cache buffer size, 4096 --> 1024 The raid5 stripe cache must match the request size used by any client. It is PAGE_SIZE at start up, but changes whenever is sees a request of a difference size. Reading from /dev/mdX uses a request size of 1K. Most filesystems use a request size of 4k. So, when you do the 'dd', the cache size changes and you get a small performance drop because of this. If you make a filesystem on the array and then mount it, it will probably switch back to 4k requests and resync should speed up. NeilBrown" -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 10, 2005 11:34 AM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Saturday 08 January 2005 20:01, maarten wrote: > On Saturday 08 January 2005 17:49, maarten wrote: > > On Saturday 08 January 2005 15:52, Frank van Maarseveen wrote: > > > On Fri, Jan 07, 2005 at 04:57:35PM -0500, Guy wrote: > As the system is now online again, busy copying, I can show the exact > config: > Well all about the array is done and working fine up to now. Except one thing that I didn't anticipate: The application that's supposed to run has some problems under -I suppose- the new 2.4.28 kernel. I've had two panics / oopses in syslog already, and the process then is unkillable, so a reboot is in order. But I think that's bttv related, not the I/O layer. In any case I suffered through two lengthy raid resyncs already... ;-| So I've been shopping around for a *big* servercase today so I can put all disks (these 5, plus 6 from the current fileserver) in one big tower. I'll then use that over NFS and can revert back to my older working kernel. I've chosen a Chieftec case, as can be seen here http://www.chieftec.com/products/Workcolor/CA-01.htm and here in detail http://www.chieftec.com/products/Workcolor/NewBA.htm Nice drive cages, eh ? :-) P.S.: I get this filling up my logs. Should I be worried about that ? Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
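[Editorial note: Neil's explanation above is easy to reproduce on purpose. A raw read of the md device uses a different request size than the mounted filesystem, so the stripe cache flips and the switch is logged. A small test, with an example md device name; the read itself is harmless:]

dd if=/dev/md3 of=/dev/null bs=1k count=1024
dmesg | grep "switching cache buffer size" | tail -2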
* hard disk re-locates bad block on read. 2005-01-10 17:13 ` Spares and partitioning huge disks Guy @ 2005-01-10 17:35 ` Guy 2005-01-11 14:34 ` Tom Coughlan 2005-01-10 18:24 ` Spares and partitioning huge disks maarten ` (2 subsequent siblings) 3 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-10 17:35 UTC (permalink / raw) To: linux-raid My disks have the option to relocate bad blocks on read error. I was concerned that bogus data would be returned to the OS. They say CRC errors return corrupt data to the OS! I hope not! So it seems CRC errors and unreadable blocks both are corrupt or lost. But the OS does not know. So, I will leave this option turned off. Guy I sent this to Seagate: With ARRE (Automatic Read Reallocation Enable) turned on. Does it relocate blocks that can't be read, or blocks that had correctable read problems? Or both? If it re-locates un-readable blocks, then what data does it return to the OS? Thanks, Guy ================================================================== Guy, If the block is bad at a hardware level then it is reallocated and a spare is used in it's place. In a bad block the data is lost, the sparing of the block is transparent to the operating system. Blocks with correctable read problems are one's with corrupt data at the OS level. Jimmie P. Seagate Technical Support ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: hard disk re-locates bad block on read. 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy @ 2005-01-11 14:34 ` Tom Coughlan 2005-01-11 22:43 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Tom Coughlan @ 2005-01-11 14:34 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Mon, 2005-01-10 at 12:35, Guy wrote: > My disks have the option to relocate bad blocks on read error. > I was concerned that bogus data would be returned to the OS. > > They say CRC errors return corrupt data to the OS! I hope not! > So it seems CRC errors and unreadable blocks both are corrupt or lost. > But the OS does not know. > So, I will leave this option turned off. > > Guy > > I sent this to Seagate: > With ARRE (Automatic Read Reallocation Enable) turned on. Does it relocate > blocks that can't be read, or blocks that had correctable read problems? > Or both? FWIW, the SCSI standard has been clear on this point for many years: "An ARRE bit of one indicates that the device server shall enable automatic reallocation of defective data blocks during read operations. ... The automatic reallocation shall then be performed only if the device server successfully recovers the data. The recovered data shall be placed in the reallocated block." (SBC-2) Blocks that can not be read are not relocated. The read command simply returns an error to the OS. > > If it re-locates un-readable blocks, then what data does it return to the > OS? > > Thanks, > Guy > > ================================================================== > > Guy, > If the block is bad at a hardware level then it is reallocated and a spare > is used in it's place. In a bad block the data is lost, the sparing of the > block is transparent to the operating system. Blocks with correctable read > problems are one's with corrupt data at the OS level. > > Jimmie P. > Seagate Technical Support > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
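[Editorial note: for anyone wanting to inspect or change the bit on their own drives, ARRE lives in the SCSI read-write error recovery mode page. A tool such as sdparm can usually address it by name, assuming your sdparm build knows the ARRE acronym; the older scsiinfo/sginfo tools expose the same mode page:]

sdparm --get=ARRE /dev/sda
sdparm --set=ARRE=1 --save /dev/sda    # --save makes it persist across power cycles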
* RE: hard disk re-locates bad block on read. 2005-01-11 14:34 ` Tom Coughlan @ 2005-01-11 22:43 ` Guy 2005-01-12 13:51 ` Tom Coughlan 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-11 22:43 UTC (permalink / raw) To: 'Tom Coughlan'; +Cc: linux-raid Good, your description is what I had assumed at first. But when I re-read the drive specs, it was vague, so I set ARRE back to 0. So, it should be a good thing to set it to 1, correct? Do you agree that Seagate's email is wrong? Or am I just reading it wrong? I did not realize ARRE was a standard. I thought it was a Seagate thing. Thanks, Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Tom Coughlan Sent: Tuesday, January 11, 2005 9:35 AM To: Guy Cc: linux-raid@vger.kernel.org Subject: Re: hard disk re-locates bad block on read. On Mon, 2005-01-10 at 12:35, Guy wrote: > My disks have the option to relocate bad blocks on read error. > I was concerned that bogus data would be returned to the OS. > > They say CRC errors return corrupt data to the OS! I hope not! > So it seems CRC errors and unreadable blocks both are corrupt or lost. > But the OS does not know. > So, I will leave this option turned off. > > Guy > > I sent this to Seagate: > With ARRE (Automatic Read Reallocation Enable) turned on. Does it relocate > blocks that can't be read, or blocks that had correctable read problems? > Or both? FWIW, the SCSI standard has been clear on this point for many years: "An ARRE bit of one indicates that the device server shall enable automatic reallocation of defective data blocks during read operations. ... The automatic reallocation shall then be performed only if the device server successfully recovers the data. The recovered data shall be placed in the reallocated block." (SBC-2) Blocks that can not be read are not relocated. The read command simply returns an error to the OS. > > If it re-locates un-readable blocks, then what data does it return to the > OS? > > Thanks, > Guy > > ================================================================== > > Guy, > If the block is bad at a hardware level then it is reallocated and a spare > is used in it's place. In a bad block the data is lost, the sparing of the > block is transparent to the operating system. Blocks with correctable read > problems are one's with corrupt data at the OS level. > > Jimmie P. > Seagate Technical Support > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: hard disk re-locates bad block on read. 2005-01-11 22:43 ` Guy @ 2005-01-12 13:51 ` Tom Coughlan 0 siblings, 0 replies; 95+ messages in thread From: Tom Coughlan @ 2005-01-12 13:51 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Tue, 2005-01-11 at 17:43, Guy wrote: > Good, your description is what I had assumed at first. But when I > re-read > the drive specs, it was vague, so I set ARRE back to 0. > > So, it should be a good thing to set it to 1, correct? I would. > Do you agree that Seagate's email is wrong? Or am I just reading it > wrong? I can't figure out what he is saying in the last sentence. I do believe that Seagate engineers are aware of the correct way to implement ARRE. I can't vouch for whether their firmware always gets it right. > I did not realize ARRE was a standard. I thought it was a Seagate > thing. > > Thanks, > Guy > > -----Original Message----- > From: linux-raid-owner@vger.kernel.org > [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Tom Coughlan > Sent: Tuesday, January 11, 2005 9:35 AM > To: Guy > Cc: linux-raid@vger.kernel.org > Subject: Re: hard disk re-locates bad block on read. > > On Mon, 2005-01-10 at 12:35, Guy wrote: > > My disks have the option to relocate bad blocks on read error. > > I was concerned that bogus data would be returned to the OS. > > > > They say CRC errors return corrupt data to the OS! I hope not! > > So it seems CRC errors and unreadable blocks both are corrupt or lost. > > But the OS does not know. > > So, I will leave this option turned off. > > > > Guy > > > > I sent this to Seagate: > > With ARRE (Automatic Read Reallocation Enable) turned on. Does it > relocate > > blocks that can't be read, or blocks that had correctable read problems? > > Or both? > > FWIW, the SCSI standard has been clear on this point for many years: > > "An ARRE bit of one indicates that the device server shall enable > automatic reallocation of defective data blocks during read operations. > ... The automatic reallocation shall then be performed only if the > device server successfully recovers the data. The recovered data shall > be placed in the reallocated block." (SBC-2) > > Blocks that can not be read are not relocated. The read command simply > returns an error to the OS. > > > > > If it re-locates un-readable blocks, then what data does it return to the > > OS? > > > > Thanks, > > Guy > > > > ================================================================== > > > > Guy, > > If the block is bad at a hardware level then it is reallocated and a > spare > > is used in it's place. In a bad block the data is lost, the sparing of the > > block is transparent to the operating system. Blocks with correctable read > > problems are one's with corrupt data at the OS level. > > > > Jimmie P. > > Seagate Technical Support > > ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 17:13 ` Spares and partitioning huge disks Guy 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy @ 2005-01-10 18:24 ` maarten 2005-01-10 20:09 ` Guy 2005-01-10 18:40 ` maarten 2005-01-12 11:41 ` RAID-6 Gordon Henderson 3 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-10 18:24 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 18:13, Guy wrote: > In my log files, which go back to Dec 12 > > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Heh. Is that all...? :-)) Now THIS is my log: dozer:/var/log # cat messages | grep "switching cache buffer size" | wc -l 55880 So that is why I'm a bit worried. Usually when my computer tells me something _every_second_ I tend to take it seriously. But maybe it's just lonely and looking for some attention. Heh. ;) > I found this from Neil: > "You will probably also see a message in the kernel logs like: > raid5: switching cache buffer size, 4096 --> 1024 > > The raid5 stripe cache must match the request size used by any client. > It is PAGE_SIZE at start up, but changes whenever is sees a request of a > difference size. > Reading from /dev/mdX uses a request size of 1K. > Most filesystems use a request size of 4k. > > So, when you do the 'dd', the cache size changes and you get a small > performance drop because of this. > If you make a filesystem on the array and then mount it, it will probably > switch back to 4k requests and resync should speed up. Okay. So with as many switches as I see, it would be likely that something either accesses the md device concurrently with the FS, or that the FS does this constant switching by itself. Now my FS is XFS, maybe that filesystem has this behaviour ? Anyone having a raid-5 with XFS on top can confirm this ? I usually use Reiserfs, but it seems that XFS is particularly good / fast with big files, whereas reiserfs excels with small files, that is why I use it here. As far as I know there are no accesses that bypass the FS; no Oracle, no cat, no dd. Only LVM and XFS (but it did this before LVM too). Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-10 18:24 ` Spares and partitioning huge disks maarten @ 2005-01-10 20:09 ` Guy 2005-01-10 21:21 ` maarten 2005-01-11 1:04 ` maarten 0 siblings, 2 replies; 95+ messages in thread From: Guy @ 2005-01-10 20:09 UTC (permalink / raw) To: 'maarten', linux-raid I know the log files are very annoying, but... I wonder if all that switching is causing md to void its cache? It may be a performance problem for md to change the strip cache size so often. I use ext3 filesystems. No problems with performance (that I know of). I have never tried any others. Do you think the performance difference of the various filesystems would affect your PVR? Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 10, 2005 1:25 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Monday 10 January 2005 18:13, Guy wrote: > In my log files, which go back to Dec 12 > > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Heh. Is that all...? :-)) Now THIS is my log: dozer:/var/log # cat messages | grep "switching cache buffer size" | wc -l 55880 So that is why I'm a bit worried. Usually when my computer tells me something _every_second_ I tend to take it seriously. But maybe it's just lonely and looking for some attention. Heh. ;) > I found this from Neil: > "You will probably also see a message in the kernel logs like: > raid5: switching cache buffer size, 4096 --> 1024 > > The raid5 stripe cache must match the request size used by any client. > It is PAGE_SIZE at start up, but changes whenever is sees a request of a > difference size. > Reading from /dev/mdX uses a request size of 1K. > Most filesystems use a request size of 4k. > > So, when you do the 'dd', the cache size changes and you get a small > performance drop because of this. > If you make a filesystem on the array and then mount it, it will probably > switch back to 4k requests and resync should speed up. Okay. So with as many switches as I see, it would be likely that something either accesses the md device concurrently with the FS, or that the FS does this constant switching by itself. Now my FS is XFS, maybe that filesystem has this behaviour ? Anyone having a raid-5 with XFS on top can confirm this ? I usually use Reiserfs, but it seems that XFS is particularly good / fast with big files, whereas reiserfs excels with small files, that is why I use it here. As far as I know there are no accesses that bypass the FS; no Oracle, no cat, no dd. Only LVM and XFS (but it did this before LVM too). Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 20:09 ` Guy @ 2005-01-10 21:21 ` maarten 2005-01-11 1:04 ` maarten 1 sibling, 0 replies; 95+ messages in thread From: maarten @ 2005-01-10 21:21 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 21:09, Guy wrote: > I use ext3 filesystems. No problems with performance (that I know of). > I have never tried any others. Reportedly, deleting a big (>2GB) file under XFS is way faster than under ext3, but I never verified this myself. I barely ever ran ext3, I went with my distros' default, which was Reiserfs. Deleting huge files is a special case though, so it does not make much sense to benchmark or tune for that. But in this special case it matters. > Do you think the performance difference of the various filesystems would > affect your PVR? Ehm, no. Well, not the raid-5 overhead at least. The FS helps the GUI being more 'snappy'. It has been reported that ext3 takes a real long time to delete huge files (up to several seconds or more) (unconfirmed by me). This is what a copy of a 5GB file and subsequent delete does on my system: -rw-r--r-- 1 root root 4909077650 Jan 10 04:43 file dozer:/mnt/store # time cp file away real 3m15.778s user 0m0.640s sys 0m45.230s dozer:/mnt/store # time rm away real 0m0.237s user 0m0.000s sys 0m0.020s (This was while the machine was idle, no recordings going on) But the machine IS starved for CPU and bus bandwidth; the cpu should be almost fully pegged with recording up to two simultaneous channels, compressing realtime(!) in mpeg4 from two cheap bttv cards (at 480x480 rez, 2600 Kbps). Therefore it is the fastest CPU I could afford back in last spring, an Athlon XP 2600. The machine is also overclocked from 200 FSB to 233, yielding a 38 MHz PCI bus. Strangely enough, despite there being 5 PCI cards, amongst which two disk I/O controllers, this seems to work just fine. It's been tested in-depth by recording shows daily for months and it crashes rarely ( meaning < once a month which is not too bad, considering ). Maybe the bigass Zalman copper CPU cooler and the 12cm fan hovering above it help there, too ;-) At the beginning it ran off a single 160GB disk, so when I switched to raid-5 I was very afraid that either the extra CPU load, the extra IRQ load or the bus bandwidth would saturate, thus killing performance. You see, the PCI bus is fairly loaded too, since not only does it have to handle the various ATA controllers, but two uncompressed videostreams from the TV cards as well. So all in all, the overhead of raid seems insignificant to me, or the code is very well optimized indeed :-) Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 20:09 ` Guy 2005-01-10 21:21 ` maarten @ 2005-01-11 1:04 ` maarten 1 sibling, 0 replies; 95+ messages in thread From: maarten @ 2005-01-11 1:04 UTC (permalink / raw) To: linux-raid [I'm resending this since I never saw it make the list] On Monday 10 January 2005 21:09, Guy wrote: > I use ext3 filesystems. No problems with performance (that I know of). > I have never tried any others. Reportedly, deleting a big (>2GB) file under XFS is way faster than under ext3, but I never verified this myself. I barely ever ran ext3, I went with my distros' default, which was Reiserfs. Deleting huge files is a special case though, so it does not make much sense to benchmark or tune for that. But in this special case it matters. > Do you think the performance difference of the various filesystems would > affect your PVR? Ehm, no. Well, not the raid-5 overhead at least. The FS helps the GUI being more 'snappy'. It has been reported that ext3 takes a real long time to delete huge files (up to several seconds or more) (unconfirmed by me). This is what a copy of a 5GB file and subsequent delete does on my system: -rw-r--r-- 1 root root 4909077650 Jan 10 04:43 file dozer:/mnt/store # time cp file away real 3m15.778s user 0m0.640s sys 0m45.230s dozer:/mnt/store # time rm away real 0m0.237s user 0m0.000s sys 0m0.020s (This was while the machine was idle, no recordings going on) But the machine IS starved for CPU and bus bandwidth; the cpu should be almost fully pegged with recording up to two simultaneous channels, compressing realtime(!) in mpeg4 from two cheap bttv cards (at 480x480 rez, 2600 Kbps). Therefore it is the fastest CPU I could afford back in last spring, an Athlon XP 2600. The machine is also overclocked from 200 FSB to 233, yielding a 38 MHz PCI bus. Strangely enough, despite there being 5 PCI cards, amongst which two disk I/O controllers, this seems to work just fine. It's been tested in-depth by recording shows daily for months and it crashes rarely ( meaning < once a month which is not too bad, considering ). Maybe the bigass Zalman copper CPU cooler and the 12cm fan hovering above it help there, too ;-) At the beginning it ran off a single 160GB disk, so when I switched to raid-5 I was very afraid that either the extra CPU load, the extra IRQ load or the bus bandwidth would saturate, thus killing performance. You see, the PCI bus is fairly loaded too, since not only does it have to handle the various ATA controllers, but two uncompressed videostreams from the TV cards as well. So all in all, the overhead of raid seems insignificant to me, or the code is very well optimized indeed :-) Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 17:13 ` Spares and partitioning huge disks Guy 2005-01-10 17:35 ` hard disk re-locates bad block on read Guy 2005-01-10 18:24 ` Spares and partitioning huge disks maarten @ 2005-01-10 18:40 ` maarten 2005-01-10 19:41 ` Guy 2005-01-12 11:41 ` RAID-6 Gordon Henderson 3 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-10 18:40 UTC (permalink / raw) To: linux-raid On Monday 10 January 2005 18:13, Guy wrote: > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Another thing that only strikes me now that I'm actually counting them: I have _exactly_ 24 of those messages per minute. At any moment, at any time, idle or not. (Although a PVR is never really 'idle', but nevertheless) (well, not really _every_ time, but out of ten samples nine were 24 and only one was 36) And this one may be particularly interesting: Jan 9 02:47:19 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 0 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Switching from buffer size 0 ? Wtf ? Another thing to note is that my switches are between 4096 and 512, not between 4k and 1k as Neil's reply would indicate being normal. But I don't consider this bit really important. Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
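[Editorial note: a quick way to check the "24 per minute" figure, assuming the standard syslog timestamp layout shown in the quoted log lines, is to group the matches by minute:]

grep "switching cache buffer size" /var/log/messages | cut -c1-12 | uniq -c | tail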
* RE: Spares and partitioning huge disks 2005-01-10 18:40 ` maarten @ 2005-01-10 19:41 ` Guy 0 siblings, 0 replies; 95+ messages in thread From: Guy @ 2005-01-10 19:41 UTC (permalink / raw) To: 'maarten', linux-raid You been doing zero length IOs again? :) How many zero length IOs can you do in a second? -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 10, 2005 1:41 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Monday 10 January 2005 18:13, Guy wrote: > I have 4 of these: > raid5: switching cache buffer size, 4096 --> 1024 > > And 2 of these: > raid5: switching cache buffer size, 1024 --> 4096 Another thing that only strikes me now that I'm actually counting them: I have _exactly_ 24 of those messages per minute. At any moment, at any time, idle or not. (Although a PVR is never really 'idle', but nevertheless) (well, not really _every_ time, but out of ten samples nine were 24 and only one was 36) And this one may be particularly interesting: Jan 9 02:47:19 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 0 --> 4096 Jan 9 02:47:24 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 Switching from buffer size 0 ? Wtf ? Another thing to note is that my switches are between 4096 and 512, not between 4k and 1k as Neil's reply would indicate being normal. But I don't consider this bit really important. Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* RAID-6 ... 2005-01-10 17:13 ` Spares and partitioning huge disks Guy ` (2 preceding siblings ...) 2005-01-10 18:40 ` maarten @ 2005-01-12 11:41 ` Gordon Henderson 2005-01-13 2:11 ` RAID-6 Neil Brown 3 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-12 11:41 UTC (permalink / raw) To: linux-raid Until now I haven't really paid too much attention to the RAID-6 stuff, but I have an application which needs to be as resilient to disk failures as possible. So other than what's at: ftp://ftp.kernel.org/pub/linux/kernel/people/hpa/ and the archives of this list (which I'm re-reading now), can anyone give me a quick heads-up about it? Specifically I'm still buried in the dark days of 2.4.27/28 - are there recent patches against 2.4? If RAID-6 isn't viable for me right now, what I'm planning is as follows: Put 8 x 250GB SATA drives in the system, and arrange them in 4 pairs of RAID-1 units. Assemble the 4 RAID-1 units into a RAID-5. Big waste of disk space, but that's not really important for this application, and disk is cheap (relatively). So I'll end up with just over 700GB of usable storage, with the potential of surviving a minimum of any 3 disks failing, and possibly 4 or 5, depending on just where they fail (although disks would be replaced way before it got to that stage!) Certainly any 2 can fail, and if it were 2 in the same RAID-1 unit (which would cause the RAID-5 to become degraded) and I were desperate, I could move a disk and deliberately fail another RAID-1 to recover the RAID-5 ... In the absence of RAID-6, would anyone do it differently? Note: I'm relatively new to mdadm, but can see it's the way of the future (especially after I had to use it in anger recently to recover from a 2-disk failure in an old 8-disk RAID-5 array), and I'm looking at the spare-groups part of it all and wondering if that might be an alternative, but I'd like to avoid the possibility of the array failing catastrophically during a re-build if at all possible. Cheers, Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
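For concreteness, a minimal mdadm sketch of the layered layout Gordon describes, RAID-1 pairs assembled into a RAID-5 (all device names below are examples, not his actual drives):

  # four RAID-1 pairs from the eight SATA drives
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdg1 /dev/sdh1
  # ...then a RAID-5 across the four mirrors
  mdadm --create /dev/md4 --level=5 --raid-devices=4 /dev/md0 /dev/md1 /dev/md2 /dev/md3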
* Re: RAID-6 ... 2005-01-12 11:41 ` RAID-6 Gordon Henderson @ 2005-01-13 2:11 ` Neil Brown 2005-01-15 16:12 ` RAID-6 Gordon Henderson 0 siblings, 1 reply; 95+ messages in thread From: Neil Brown @ 2005-01-13 2:11 UTC (permalink / raw) To: Gordon Henderson; +Cc: linux-raid On Wednesday January 12, gordon@drogon.net wrote: > > Until now I haven't really paid too much attention to the RAID-6 stuff, > but I have an application which needs to be as resilient to disk failures > as possible. > > So other than what's at: > > ftp://ftp.kernel.org/pub/linux/kernel/people/hpa/ > > and the archives of this list (which I'm re-reading now), can anyone give > me a quick heads-up about it? > > Specifically I'm still buried in the dark days of 2.4.27/28 - are there > recent patches against 2.4? There is no current support for raid6 in any 2.4 kernel and I am not aware of anyone planning such support. Assume it is 2.6 only. > > If RAID-6 isn't viable for me right now, what I'm planning is as follows: > > Put 8 x 250GB SATA drives in the system, and arrange them in 4 pairs of > RAID-1 units. > > Assemble the 4 RAID-1 units into a RAID-5. Sounds reasonable. NeilBrown ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: RAID-6 ... 2005-01-13 2:11 ` RAID-6 Neil Brown @ 2005-01-15 16:12 ` Gordon Henderson 2005-01-17 8:04 ` RAID-6 Turbo Fredriksson 0 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-15 16:12 UTC (permalink / raw) To: linux-raid On Thu, 13 Jan 2005, Neil Brown wrote: > There is no current support for raid6 in any 2.4 kernel and I am not > aware of anyone planning such support. Assume it is 2.6 only. How "real-life" tested is RAID-6 so-far? Anyone using it in anger on a production server? I've spent the past day or 2 getting to grips with kernel 2.6.10 and mdadm 1.8.0 on a test system running Debian Woody, on a rather clunky old test PC - Asus XG-DLS, Twin Xeon PIII/500's on-board IDE with 2 x old 4GB drives attached, it also has on-board Adaptec SCSI with 2 x 18GB drives - one on an 8-bit bus, and a Highpoint HPT302 card with 2 modern 80 GB drives (Maxtor, I know, but keep them cool and so-far so good...) Performance isn't exactly stellar (one PCI bus!) but I did squeeze 60MB/sec out of a RAID-0 off the 2 new drives... Clunky by todays standards, but 5.5 years ago when it was new, it rocked! Anyway, so far so good. I've constructed a RAID-6 system with a 4GB partition on 5 of the drives, and done some tests & what not. The tests I've done involve creating the array, putting a filesystem on it (ext3), writing a bigfile of zeros, checksumming it, failing a drive, adding it back in, failing another, failing a 2nd, adding them in, failing another before the 2nd drive finished resyncing, etc. all the time writing a file & checksumming it between unmounting & re-mounting it. There was a script posted round about July last year which I used to get some ideas from. So-far so good, no corruption, but it doesn't doesn't mean anything in the real-world. So who's using RAID-6 for real? Can it be considered more or less stable than RAID-5? Should I stick to my RAID-5 on-top of RAID-1 pairs? Or should I just take a chance with RAID-6? (And nearly 6 years ago when I started using Linux s/w RAID I said this to myself, but stuck with it and haven't had a problem I could pin down to software... So who knows!) Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
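A stripped-down sketch of the kind of fail/re-add exercise Gordon describes, assuming a throwaway test array that is already mkfs'ed and mounted (device names and mount point are placeholders; do not point this at real data):

  MD=/dev/md0
  MNT=/mnt/r6test
  for disk in /dev/sdb1 /dev/sdc1 /dev/sdd1; do
      dd if=/dev/zero of=$MNT/bigfile bs=1M count=1024
      md5sum $MNT/bigfile > /tmp/sum.before
      mdadm $MD --fail $disk --remove $disk
      mdadm $MD --add $disk
      sleep 5
      while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 30; done
      umount $MNT && mount $MD $MNT
      md5sum -c /tmp/sum.before || echo "MISMATCH after failing $disk"
  done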
* Re: RAID-6 ... 2005-01-15 16:12 ` RAID-6 Gordon Henderson @ 2005-01-17 8:04 ` Turbo Fredriksson 0 siblings, 0 replies; 95+ messages in thread From: Turbo Fredriksson @ 2005-01-17 8:04 UTC (permalink / raw) To: linux-raid >>>>> "Gordon" == Gordon Henderson <gordon@drogon.net> writes: Gordon> On Thu, 13 Jan 2005, Neil Brown wrote: >> There is no current support for raid6 in any 2.4 kernel and I >> am not aware of anyone planning such support. Assume it is 2.6 >> only. Gordon> How "real-life" tested is RAID-6 so-far? Anyone using it Gordon> in anger on a production server? I've been using it for a couple of months (3 or 4 if I'm not mistaken) on my SPARC64 (Sun Blade 1000 - 2xSPARC III/750) with (four+two)*9Gb disks, which gives me 16Gb of disk space. With this setup (I had a few 9Gb disks that I couldn't/wouldn't use for anything else) four (4!) disks can fail without it mattering... It has worked flawlessly even though the disks are OLD - 'smartctl' shows that almost all of the disks have had more than 28000 hours 'uptime' (i.e. 'powered on'). That's more than 3 years (POWERED ON mind you!). Granted, I've been 'fortunate' (?!) to have had NO disk crashes etc, but I did simulate a few when I set the system up and it worked just fine... Kernel 2.6.8.1 (with a couple of patches to get it to boot/work on SPARC64). If it works this well on a SPARC64 (which the kernel has problems with), then it should work just FINE on an ia32... Gordon> Can it be considered more or less stable than RAID-5? For me (NOTE!!) I'd say "just as stable". But naturally this (should/could) depend on the exact kernel version in use... If a kernel version works this/that well, stick with it... Gordon> Should I stick to my RAID-5 on-top of RAID-1 pairs? In theory, that would be "more secure/safe" since both RAID5 and RAID1 are better tested, but... -- Soviet genetic SEAL Team 6 FSF nitrate Honduras $400 million in gold bullion Albanian Kennedy Ft. Meade DES fissionable Uzi quiche kibo [See http://www.aclu.org/echelonwatch/index.html for more about this] ^ permalink raw reply [flat|nested] 95+ messages in thread
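For anyone wanting to check their own drives' hours the same way, a small loop over smartctl does it; the device names are examples and the attribute name may vary between drive firmwares:

  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
      printf '%s: ' $d
      smartctl -A $d | awk '/Power_On_Hours/ {print $NF " hours"}'
  done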
* Re: Spares and partitioning huge disks 2005-01-10 16:34 ` maarten 2005-01-10 16:36 ` Gordon Henderson 2005-01-10 17:13 ` Spares and partitioning huge disks Guy @ 2005-01-11 10:09 ` KELEMEN Peter 2 siblings, 0 replies; 95+ messages in thread From: KELEMEN Peter @ 2005-01-11 10:09 UTC (permalink / raw) To: linux-raid * Maarten (maarten@ultratux.net) [20050110 17:34]: > P.S.: I get this filling up my logs. Should I be worried about that ? > Jan 10 11:30:32 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 > Jan 10 11:30:33 dozer kernel: raid5: switching cache buffer size, 512 --> 4096 > Jan 10 11:30:36 dozer kernel: raid5: switching cache buffer size, 4096 --> 512 You have an internal XFS log on the RAID device and it is accessed in sector units by default, md is reporting the changes (harmless). Best workaround is to instruct your filesystem to use 4K sectors: mkfs.xfs -s size=4k HTH, Peter -- .+'''+. .+'''+. .+'''+. .+'''+. .+'' Kelemen Péter / \ / \ Peter.Kelemen@cern.ch .+' `+...+' `+...+' `+...+' `+...+' - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-08 16:49 ` maarten 2005-01-08 19:01 ` maarten @ 2005-01-09 19:33 ` Frank van Maarseveen 2005-01-09 21:26 ` maarten 1 sibling, 1 reply; 95+ messages in thread From: Frank van Maarseveen @ 2005-01-09 19:33 UTC (permalink / raw) To: linux-raid On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > > For one second I thought it's a clever trick but gut feeling tells > > me the odds of losing the entire array won't change (simplified -- > > because the increased complexity creates room for additional errors). > > No. It is somewhat more complex, true, but no different than making, for Got it. > However, IF during that > resync one other drive has a read error, it gets kicked too and the array > dies. The chances of that happening are not very small; Ouch! never considered this. So, RAID5 will actually decrease reliability in a significant number of cases because: - >1 read errors can cause a total break-down whereas it used to cause only a few userland I/O errors, disruptive but not foobar. - disk replacement is quite risky. This is totally unexpected to me but it should have been obvious: there's no bad block list in MD so if we would postpone I/O errors during reconstruction then 1: it might cause silent data corruption when I/O error unexpectedly disappears. 2: we might silently loose redundancy in a number of places. I think RAID6 but especially RAID1 is safer. A small side note on disk behavior: If it becomes possible to do block remapping at any level (MD, DM/LVM, FS) then we might not want to write to sectors with read errors at all but just remap the corresponding blocks by software as long as we have free blocks: save disk-internal spare sectors so the disk firmware can pre-emptively remap degraded but ECC correctable sectors upon read. -- Frank ^ permalink raw reply [flat|nested] 95+ messages in thread
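A back-of-envelope illustration of why the rebuild case hurts (the numbers are illustrative assumptions, not measurements): with a vendor-quoted unrecoverable read error rate of one bit in 10^14, re-reading three surviving 250GB members end to end already gives a non-trivial chance of hitting at least one bad read.

  awk 'BEGIN {
      n = 3;                # surviving disks that must be read in full
      bits = 250e9 * 8;     # bits per disk (assumed 250GB members)
      ber = 1e-14;          # assumed unrecoverable bit error rate
      p = 1 - exp(n * bits * log(1 - ber));
      printf "P(at least one read error during rebuild) ~ %.1f%%\n", p * 100
  }'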
* Re: Spares and partitioning huge disks 2005-01-09 19:33 ` Frank van Maarseveen @ 2005-01-09 21:26 ` maarten 2005-01-09 22:29 ` Frank van Maarseveen ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: maarten @ 2005-01-09 21:26 UTC (permalink / raw) To: linux-raid On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > However, IF during that > > resync one other drive has a read error, it gets kicked too and the array > > dies. The chances of that happening are not very small; > > Ouch! never considered this. So, RAID5 will actually decrease reliability > in a significant number of cases because: > - >1 read errors can cause a total break-down whereas it used > to cause only a few userland I/O errors, disruptive but not foobar. Well, yes and no. You can decide to do a full backup in case you hadn't, prior to changing drives. And if it is _just_ a bad sector, you can 'assemble --force' yielding what you would've had in a non-raid setup; some file somewhere that's got corrupted. No big deal, ie. the same trouble as was caused without raid-5. > - disk replacement is quite risky. This is totally unexpected to me > but it should have been obvious: there's no bad block list in MD > so if we would postpone I/O errors during reconstruction then > 1: it might cause silent data corruption when I/O error > unexpectedly disappears. > 2: we might silently loose redundancy in a number of places. Not sure if I understood all of that, but I think you're saying that md _could_ disregard read errors _when_already_running_in_degraded_mode_ so as to preserve the array at all cost. Hum. That choice should be left to the user if it happens, he probably knows best what to choose in the circumstances. No really, what would be best is that md made a difference between total media failure and sector failure. If one sector is bad on one drive [and it gets kicked therefore] it should be possible when a further read error occurs on other media, to try and read the missing sector data from the kicked drive, who may well have the data there waiting, intact and all. Don't know how hard that is really, but one could maybe think of pushing a disk in an intermediate state between "failed" and "good" like "in_disgrace" what signals to the end user "Don't remove this disk as yet; we may still need it, but add and resync a spare at your earliest convenience as we're running in degraded mode as of now". Hmm. Complicated stuff. :-) This kind of error will get more and more predominant with growing media and decreasing disk quality. Statistically there is not a huge chance of getting a read failure on a 18GB scsi disk, but on a cheap(ish) 500 GB ATA disk that is an entrirely different ballpark. > I think RAID6 but especially RAID1 is safer. Well, duh :) At the expense of buying everything twice, sure it's safer :)) > A small side note on disk behavior: > If it becomes possible to do block remapping at any level (MD, DM/LVM, > FS) then we might not want to write to sectors with read errors at all > but just remap the corresponding blocks by software as long as we have > free blocks: save disk-internal spare sectors so the disk firmware can > pre-emptively remap degraded but ECC correctable sectors upon read. Well I dunno. In ancient times, the OS was charged with remapping bad sectors back when disk drives had no intelligence. Now we delegated that task to the disk. I'm not sure reverting back to the old behaviour is a smart move. 
But with raid, who knows... And as it is I don't think you get the chance to save the disk-internal spare sectors; the disk handles that transparently, so any higher layer not only cannot prevent it, but is kept completely ignorant of it happening. Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
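Spelled out, the 'assemble --force' escape hatch mentioned above is roughly this (device names are examples; expect damage where the bad sectors were, so check read-only before mounting):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  fsck -n /dev/md0      # read-only sanity check first (reiserfs has its own --check mode)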
* Re: Spares and partitioning huge disks 2005-01-09 21:26 ` maarten @ 2005-01-09 22:29 ` Frank van Maarseveen 2005-01-09 23:16 ` maarten 2005-01-09 23:20 ` Guy 2005-01-10 0:42 ` Spares and partitioning huge disks Guy 2 siblings, 1 reply; 95+ messages in thread From: Frank van Maarseveen @ 2005-01-09 22:29 UTC (permalink / raw) To: linux-raid On Sun, Jan 09, 2005 at 10:26:25PM +0100, maarten wrote: > On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > > > However, IF during that > > > resync one other drive has a read error, it gets kicked too and the array > > > dies. The chances of that happening are not very small; > > > > Ouch! never considered this. So, RAID5 will actually decrease reliability > > in a significant number of cases because: > > > - >1 read errors can cause a total break-down whereas it used > > to cause only a few userland I/O errors, disruptive but not foobar. > > Well, yes and no. You can decide to do a full backup in case you hadn't, backup (or taking snapshots) is orthogonal to this. > prior to changing drives. And if it is _just_ a bad sector, you can 'assemble > --force' yielding what you would've had in a non-raid setup; some file > somewhere that's got corrupted. No big deal, ie. the same trouble as was > caused without raid-5. I doubt that it's the same: either it wil fail totally during the reconstruction or it might fail with a silent corruption. Silent corruptions are a big deal. It won't loudly fail _and_ leave the array operational for an easy fixup later on so I think it's not the same. > > - disk replacement is quite risky. This is totally unexpected to me > > but it should have been obvious: there's no bad block list in MD > > so if we would postpone I/O errors during reconstruction then > > 1: it might cause silent data corruption when I/O error > > unexpectedly disappears. > > 2: we might silently loose redundancy in a number of places. > > Not sure if I understood all of that, but I think you're saying that md > _could_ disregard read errors _when_already_running_in_degraded_mode_ so as > to preserve the array at all cost. We can't. Imagine a 3 disk RAID5 array, one disk being replaced. While writing the new disk we get a single randon read error on one of the other two disks. Ignoring that implies either: 1: making up a phoney data block when a checksum block was hit by the error. 2: generating a garbage checksum block. RAID won't remember these events because there is no bad block list. Now suppose the array is operational again and hits a read error after some random interval. Then either it may: 1: return corrupt data without notice. 2: recalculate a block based on garbage. so, we can't ignore errors during RAID5 reconstruction and we're toast if it happens, even more toast than we would have been with a normal disk (barring the case of an entirely dead disk). If you look at the lower level then of course RAID5 has an advantage but to me it seems to vaporize when exposed to the _complexity_ of handling secondary errors during the reconstruction. -- Frank ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-09 22:29 ` Frank van Maarseveen @ 2005-01-09 23:16 ` maarten 2005-01-10 8:15 ` Frank van Maarseveen 0 siblings, 1 reply; 95+ messages in thread From: maarten @ 2005-01-09 23:16 UTC (permalink / raw) To: linux-raid On Sunday 09 January 2005 23:29, Frank van Maarseveen wrote: > On Sun, Jan 09, 2005 at 10:26:25PM +0100, maarten wrote: > > On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > > > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > Well, yes and no. You can decide to do a full backup in case you hadn't, > > backup (or taking snapshots) is orthogonal to this. Hm. Okay, you're right. > > prior to changing drives. And if it is _just_ a bad sector, you can > > 'assemble --force' yielding what you would've had in a non-raid setup; > > some file somewhere that's got corrupted. No big deal, ie. the same > > trouble as was caused without raid-5. > > I doubt that it's the same: either it wil fail totally during the > reconstruction or it might fail with a silent corruption. Silent > corruptions are a big deal. It won't loudly fail _and_ leave the array > operational for an easy fixup later on so I think it's not the same. I either don't understand this, or I don't agree. Assemble --force effectively disables all sanitychecks, so it just can't "fail" that. The result is therefore an array that either (A) holds a good FS with a couple of corrupted files (silent corruption) or (B) a filesystem that needs [metadata] fixing, or (C) one big mess that hardly resembles a FS. It stands to reason that in case (C) you either made a user error assembling the wrong parts or what you had wasn't a bad sector error in the first place but media failure or another type of disastrous corruption. I've been there. I suffered through a raid-5 two-disk failure, and I've got all of my data back eventually, even if some silent corruptions have happened (though I did not notice it, but that's no wonder with 500.000+ files) It is ugly, and the last resort, but that doesn't mean it can't work. > > > - disk replacement is quite risky. This is totally unexpected to me > > > but it should have been obvious: there's no bad block list in MD > > > so if we would postpone I/O errors during reconstruction then > > > 1: it might cause silent data corruption when I/O error > > > unexpectedly disappears. > > > 2: we might silently loose redundancy in a number of places. > > > > Not sure if I understood all of that, but I think you're saying that md > > _could_ disregard read errors _when_already_running_in_degraded_mode_ so > > as to preserve the array at all cost. > > We can't. Imagine a 3 disk RAID5 array, one disk being replaced. While > writing the new disk we get a single randon read error on one of the > other two disks. Ignoring that implies either: > 1: making up a phoney data block when a checksum block was hit by the > error. 2: generating a garbage checksum block. Well, yes. But some people -when confronted with the choice between losing everything or having silent corruptions- will happily accept the latter. At least you could try to find the bad file(s) by md5sum, whereas in the total failure scenario you're left with nothing. Of course that choice depends on how good and recent your backups are. For my scenario, I wholly depend on md raid to preserve my files; I will not and cannot start backing up TV shows to DLT tape or something. That is a no-no economically. 
There is just no way to backup 700GB data in a home user environment, unless you want to spend a full week to burn it onto 170 DVDs. (Or buy twice the amount of disks and leave them locked in a safe) So I certainly would opt for the "possibility of silent corruption" choice. And if I ever find the corrupted file I delete it and mark it for 'new retrieval" or some such followup. Or restore from tape where applicable. > RAID won't remember these events because there is no bad block list. Now > suppose the array is operational again and hits a read error after some > random interval. Then either it may: > 1: return corrupt data without notice. > 2: recalculate a block based on garbage. Definitely true, but we're still talking about errors on a single block, or a couple of blocks at most. The other 1000.000+ blocks are still okay. Again, it all depends on your circumstances what is worse: losing all the files including the good ones, or having silent corruptions somewhere. > so, we can't ignore errors during RAID5 reconstruction and we're toast > if it happens, even more toast than we would have been with a normal > disk (barring the case of an entirely dead disk). If you look at the > lower level then of course RAID5 has an advantage but to me it seems to > vaporize when exposed to the _complexity_ of handling secondary errors > during the reconstruction. You cut out my entire idea about leaving the 'failed' disk around to eventually being able to compensate a further block error on another media. Why ? It would _solve_ your problem, wouldn't it ? Maarten ^ permalink raw reply [flat|nested] 95+ messages in thread
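One low-tech way to make the 'find the corrupted files by md5sum' plan workable later is to keep a checksum manifest while the array is healthy, roughly like this (paths are examples):

  cd /mnt/store && find . -type f -print0 | xargs -0 md5sum > /root/store.md5
  # ...after a scary rebuild, list anything that no longer matches:
  cd /mnt/store && md5sum -c /root/store.md5 | grep -v ': OK$'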
* Re: Spares and partitioning huge disks 2005-01-09 23:16 ` maarten @ 2005-01-10 8:15 ` Frank van Maarseveen 2005-01-14 17:29 ` Dieter Stueken 0 siblings, 1 reply; 95+ messages in thread From: Frank van Maarseveen @ 2005-01-10 8:15 UTC (permalink / raw) To: linux-raid On Mon, Jan 10, 2005 at 12:16:58AM +0100, maarten wrote: > On Sunday 09 January 2005 23:29, Frank van Maarseveen wrote: > > I either don't understand this, or I don't agree. Assemble --force effectively > disables all sanitychecks, ok, wasn't sure about that. but then: > The result is > therefore an array that either (A) holds a good FS with a couple of corrupted > files (silent corruption) > So I certainly would opt for the "possibility of silent corruption" choice. > And if I ever find the corrupted file I delete it and mark it for 'new > retrieval" or some such followup. Or restore from tape where applicable. > > > so, we can't ignore errors during RAID5 reconstruction and we're toast > > if it happens, even more toast than we would have been with a normal > > disk (barring the case of an entirely dead disk). If you look at the > > lower level then of course RAID5 has an advantage but to me it seems to > > vaporize when exposed to the _complexity_ of handling secondary errors > > during the reconstruction. > > You cut out my entire idea about leaving the 'failed' disk around to > eventually being able to compensate a further block error on another media. > Why ? It would _solve_ your problem, wouldn't it ? I did not intend to cut it out but simplified the situation a bit: if you have all the RAID5 disks even with a bunch of errors spread out over all of them then yes, you basically still have the data. Nothing is lost provided there's no double fault and disks are not dead yet. But there are not many technical people I would trust for recovering from this situation. And I wouldn't trust myself without a significant coffee intake either :) -- Frank ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-10 8:15 ` Frank van Maarseveen @ 2005-01-14 17:29 ` Dieter Stueken 2005-01-14 17:46 ` maarten 2005-01-15 0:13 ` Michael Tokarev 0 siblings, 2 replies; 95+ messages in thread From: Dieter Stueken @ 2005-01-14 17:29 UTC (permalink / raw) To: linux-raid Frank van Maarseveen wrote: > On Mon, Jan 10, 2005 at 12:16:58AM +0100, maarten wrote: >>You cut out my entire idea about leaving the 'failed' disk around to >>eventually being able to compensate a further block error on another media. >>Why ? It would _solve_ your problem, wouldn't it ? > > I did not intend to cut it out but simplified the situation a bit: if > you have all the RAID5 disks even with a bunch of errors spread out over > all of them then yes, you basically still have the data. Nothing is > lost provided there's no double fault and disks are not dead yet. But > there are not many technical people I would trust for recovering from > this situation. And I wouldn't trust myself without a significant > coffee intake either :) I think read errors are to be handled very differently compared to disk failures. In particular the affected disk should not be kicked out incautious. If done so, you waste the real power of the RAID5 system immediately! As long, as any other part of the disk can still be read, this data must be preserved by all means. As long as only parts of a disk (even of different disks) can't be read, it is not a fatal problem, as long as the data can still be read from an other disk of the array. There is no reason to kill any disk in advance. What I'm missing is some improved concept of replacing a disk: Kicking off some disk at first and starting to resync to a spare disk thereafter is a very dangerous approach. Instead some "presync" should be possible: After a decision to replace some disk, the new (spare) disk should be prepared in advance, while all other disks are still running. After the spare disk was successfully prepared, the disk to replace may be disabled. This sounds a bit like RAID6, but it is much simpler. The complicated part may be the phase where I have one additional disk. A simple solution would be to perform a resync offline, while no write takes place. This may even be performed by a userland utility. If I want to perform the "presync" online, I have to carry out writes to both disks simultaneously, while the presync takes place. Dieter. -- Dieter Stüken, con terra GmbH, Münster stueken@conterra.de http://www.conterra.de/ (0)251-7474-501 - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
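A very crude offline version of the 'presync' Dieter describes, assuming the array can be stopped for the duration of the copy and the replacement partition is at least the same size (device names are examples; conv=noerror,sync keeps going past unreadable sectors but leaves zero-filled holes in their place on the new disk):

  mdadm --stop /dev/md0
  dd if=/dev/sdc1 of=/dev/sde1 bs=64k conv=noerror,sync   # clone the ailing member onto its replacement
  mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sde1 /dev/sdd1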
* Re: Spares and partitioning huge disks 2005-01-14 17:29 ` Dieter Stueken @ 2005-01-14 17:46 ` maarten 2005-01-14 19:14 ` Derek Piper 2005-01-15 0:13 ` Michael Tokarev 1 sibling, 1 reply; 95+ messages in thread From: maarten @ 2005-01-14 17:46 UTC (permalink / raw) To: linux-raid Mod parent "+5 Insightful". Very well though out and said, Dieter. Maarten On Friday 14 January 2005 18:29, Dieter Stueken wrote: > Frank van Maarseveen wrote: > > I did not intend to cut it out but simplified the situation a bit: if > > you have all the RAID5 disks even with a bunch of errors spread out over > > all of them then yes, you basically still have the data. Nothing is > > lost provided there's no double fault and disks are not dead yet. But > > there are not many technical people I would trust for recovering from > > this situation. And I wouldn't trust myself without a significant > > coffee intake either :) > > I think read errors are to be handled very differently compared to disk > failures. In particular the affected disk should not be kicked out > incautious. If done so, you waste the real power of the RAID5 system > immediately! As long, as any other part of the disk can still be read, > this data must be preserved by all means. As long as only parts of a disk > (even of different disks) can't be read, it is not a fatal problem, as long > as the data can still be read from an other disk of the array. There is no > reason to kill any disk in advance. > > What I'm missing is some improved concept of replacing a disk: > Kicking off some disk at first and starting to resync to a spare > disk thereafter is a very dangerous approach. Instead some "presync" > should be possible: After a decision to replace some disk, the new > (spare) disk should be prepared in advance, while all other disks are still > running. After the spare disk was successfully prepared, the disk to > replace may be disabled. > > This sounds a bit like RAID6, but it is much simpler. The complicated part > may be the phase where I have one additional disk. A simple solution would > be to perform a resync offline, while no write takes place. This may even > be performed by a userland utility. If I want to perform the "presync" > online, I have to carry out writes to both disks simultaneously, while the > presync takes place. > > Dieter. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-14 17:46 ` maarten @ 2005-01-14 19:14 ` Derek Piper 0 siblings, 0 replies; 95+ messages in thread From: Derek Piper @ 2005-01-14 19:14 UTC (permalink / raw) To: linux-raid Ah, that does sound much better I agree ... having just been bitten by the 'oh dear, I got one bit was out of place, bye bye disk' problem myself. Even if it only 'failed' a 'chunk', it would be an improvement. I'll take 64K over 60GB any day. The read for the chunk could then be calculated using parity and a notification sent upwards saying something to this effect: 'uh, hey, I'm having to regenerate data from disk N at area X on-the-fly (i.e. I'm 'degraded') but all disks are still with us and the other data is not in harms way, you might want to think about backups and possibly a new disk'. If the chunk/sector (choose how much you want to fail) can then be read again, clear the 'alert'. Of course if you get two identical chunks that miss-read, you're screwed. Probably less screwed than if it were whole disk though. Derek On Fri, 14 Jan 2005 18:46:54 +0100, maarten <maarten@ultratux.net> wrote: > > Mod parent "+5 Insightful". > > Very well though out and said, Dieter. > > Maarten > > On Friday 14 January 2005 18:29, Dieter Stueken wrote: > > Frank van Maarseveen wrote: > > > > I did not intend to cut it out but simplified the situation a bit: if > > > you have all the RAID5 disks even with a bunch of errors spread out over > > > all of them then yes, you basically still have the data. Nothing is > > > lost provided there's no double fault and disks are not dead yet. But > > > there are not many technical people I would trust for recovering from > > > this situation. And I wouldn't trust myself without a significant > > > coffee intake either :) > > > > I think read errors are to be handled very differently compared to disk > > failures. In particular the affected disk should not be kicked out > > incautious. If done so, you waste the real power of the RAID5 system > > immediately! As long, as any other part of the disk can still be read, > > this data must be preserved by all means. As long as only parts of a disk > > (even of different disks) can't be read, it is not a fatal problem, as long > > as the data can still be read from an other disk of the array. There is no > > reason to kill any disk in advance. > > > > What I'm missing is some improved concept of replacing a disk: > > Kicking off some disk at first and starting to resync to a spare > > disk thereafter is a very dangerous approach. Instead some "presync" > > should be possible: After a decision to replace some disk, the new > > (spare) disk should be prepared in advance, while all other disks are still > > running. After the spare disk was successfully prepared, the disk to > > replace may be disabled. > > > > This sounds a bit like RAID6, but it is much simpler. The complicated part > > may be the phase where I have one additional disk. A simple solution would > > be to perform a resync offline, while no write takes place. This may even > > be performed by a userland utility. If I want to perform the "presync" > > online, I have to carry out writes to both disks simultaneously, while the > > presync takes place. > > > > Dieter. 
> > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Derek Piper - derek.piper@gmail.com http://doofer.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-14 17:29 ` Dieter Stueken 2005-01-14 17:46 ` maarten @ 2005-01-15 0:13 ` Michael Tokarev 2005-01-15 9:34 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Michael Tokarev @ 2005-01-15 0:13 UTC (permalink / raw) To: linux-raid Dieter Stueken wrote: [] > I think read errors are to be handled very differently compared to disk > failures. In particular the affected disk should not be kicked out > incautious. If done so, you waste the real power of the RAID5 system > immediately! As long, as any other part of the disk can still be read, > this data must be preserved by all means. As long as only parts of a disk > (even of different disks) can't be read, it is not a fatal problem, as long > as the data can still be read from an other disk of the array. There is no > reason to kill any disk in advance. I was once successful at recovering a (quite large for the time) filesystem after multiple read errors developed on two disks running in a raid1 array (as it turned out, the chassis fan was at fault; the disks became too hot, the weather was hot too, and the two disks went bad almost at once). Raid kicked one disk out of the array after the first read error, and, thank God (or whatever), the second disk developed an error right after that, so the data was still in sync. I read everything from one disk (dd conv=noerror), noting the bad blocks, and then read the missing blocks from the second drive (dd skip=n seek=n). I'm afraid to think what would have happened if the second drive had lasted a bit longer (the filesystem was quite active). (And yes, I know it was really me who was at fault, because I didn't enable the various sensors monitoring...) What's more, I was once successful at recovering a raid5 array after a two-disk failure, but it was much more difficult... And I wasn't able to recover all the data that time, simply because I had no time to figure out how to reconstruct data using the parity blocks (I only recovered the data blocks, zeroing unreadable ones). All that to say: yes indeed, this lack of "smart error handling" is a noticeable omission in Linux software raid. There are quite a few (sometimes fatal to the data) failure scenarios that would not have happened had smart error handling been in place. /mjt ^ permalink raw reply [flat|nested] 95+ messages in thread
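For reference, the dd-level recovery Michael describes looks roughly like this; device names, image path and sector numbers are invented for illustration:

  # pull everything readable off the first disk; note the bad LBAs from the kernel log
  dd if=/dev/sda1 of=/data/recovered.img bs=512 conv=noerror,sync
  # suppose sectors 1000000-1000127 were unreadable: patch that span from the second disk
  dd if=/dev/sdb1 of=/data/recovered.img bs=512 skip=1000000 seek=1000000 count=128 conv=notrunc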
* Re: Spares and partitioning huge disks 2005-01-15 0:13 ` Michael Tokarev @ 2005-01-15 9:34 ` Peter T. Breuer 2005-01-15 9:54 ` Mikael Abrahamsson 0 siblings, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-15 9:34 UTC (permalink / raw) To: linux-raid Michael Tokarev <mjt@tls.msk.ru> wrote: > All that to say: yes indeed, this lack of "smart error handling" is > a noticeable omission in Linux software raid. There are quite a few > (sometimes fatal to the data) failure scenarios that would not have > happened had smart error handling been in place. I also agree that "redundancy per block" is probably a much better idea than "redundancy per disk". Probably needs a "how hot are you?" primitive, though! The read patch I posted should get you over glitches from sporadic read errors that would otherwise fault the disk, but one wants to add accounting for such things and watch them. Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 9:34 ` Peter T. Breuer @ 2005-01-15 9:54 ` Mikael Abrahamsson 2005-01-15 10:31 ` Brad Campbell 2005-01-15 10:33 ` Peter T. Breuer 0 siblings, 2 replies; 95+ messages in thread From: Mikael Abrahamsson @ 2005-01-15 9:54 UTC (permalink / raw) To: linux-raid On Sat, 15 Jan 2005, Peter T. Breuer wrote: > I also agree that "redundancy per block" is probably a much better idea > than "redundancy per disk". Probably needs a "how hot are you?" > primitive, though! Would a methodology that'll do if read error then recreate the block from parity write to sector that had read error wait until write has completed flush buffers read back block from drive if block still bad fail disk log result This would give the drive a chance to relocate the block to its spare blocks it has available for just this instance? If you get a write error then the drive is obviously (?) out of spare sectors and should be rightfully failed. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 95+ messages in thread
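The 'write back over the bad sector' step can already be done by hand today, which is more or less what the proposal would automate inside md; a sketch with an invented LBA and device (this overwrites that sector, so only do it when the data is recoverable from the mirror or parity):

  BAD_LBA=1234567     # placeholder sector number taken from the kernel's error message
  dd if=/dev/zero of=/dev/sdc bs=512 seek=$BAD_LBA count=1   # rewrite; the drive should remap it
  blockdev --flushbufs /dev/sdc                              # drop any cached copy before re-reading
  dd if=/dev/sdc of=/dev/null bs=512 skip=$BAD_LBA count=1 && echo "readable again"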
* Re: Spares and partitioning huge disks 2005-01-15 9:54 ` Mikael Abrahamsson @ 2005-01-15 10:31 ` Brad Campbell 2005-01-15 11:10 ` Mikael Abrahamsson 2005-01-15 10:33 ` Peter T. Breuer 1 sibling, 1 reply; 95+ messages in thread From: Brad Campbell @ 2005-01-15 10:31 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: linux-raid Mikael Abrahamsson wrote: > On Sat, 15 Jan 2005, Peter T. Breuer wrote: > > >>I also agree that "redundancy per block" is probably a much better idea >>than "redundancy per disk". Probably needs a "how hot are you?" >>primitive, though! > > > Would a methodology that'll do > > if read error then > recreate the block from parity > write to sector that had read error In theory this should reallocate the bad sector. > wait until write has completed If the write fails, fail the drive as bad things are going to happen. > flush buffers > read back block from drive > if block still bad > fail disk > log result Make sure the logging is done in such a way as mdadm can send you an E-mail and say. "Hey, sda just had a bad block. Be aware. Brad ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 10:31 ` Brad Campbell @ 2005-01-15 11:10 ` Mikael Abrahamsson 0 siblings, 0 replies; 95+ messages in thread From: Mikael Abrahamsson @ 2005-01-15 11:10 UTC (permalink / raw) To: linux-raid On Sat, 15 Jan 2005, Brad Campbell wrote: > Make sure the logging is done in such a way as mdadm can send you an > E-mail and say. "Hey, sda just had a bad block. Be aware. Definitely. The 3ware daemon does this when it detects the drive did a relocation (I guess it does it via SMART) and I definitely like the way this is done. As far as I can understand, the 3ware hw raid5 will kick any drive that has read errors on it though, but I am not 100% sure of this, this is just my guess from experience with it. I have successfully re-introduced a failed drive into the raid though, so I guess it'll relocate just the way we discussed, when it actually tries to write again to the bad sector. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 9:54 ` Mikael Abrahamsson 2005-01-15 10:31 ` Brad Campbell @ 2005-01-15 10:33 ` Peter T. Breuer 2005-01-15 11:07 ` Mikael Abrahamsson 1 sibling, 1 reply; 95+ messages in thread From: Peter T. Breuer @ 2005-01-15 10:33 UTC (permalink / raw) To: linux-raid Mikael Abrahamsson <swmike@swm.pp.se> wrote: > if read error then > recreate the block from parity > write to sector that had read error > wait until write has completed > flush buffers > read back block from drive > if block still bad > fail disk > log result Well, I haven't checked the RAID5 code (which is what you seem to be thinking of), but I can tell you that the RAID1 code simply retries a failed read. Unfortunately, it also ejects the disk with the bad read from the array. So it was fairly simple to alter the RAID1 code to "don't do that then". Just remove the line that says to error the disk out, and let the retry code do its bit. One also has to add a counter so that if there is no way left of getting the data, then the read eventually does return an error to the user. Thus far no real problem. The dangerous bit is launching a rewrite of the affected block, which I think one does by placing the ultimately successful read on the queue for the raid1d thread, and changing the cmd type to "special", which should trigger the raid1d thread to do a rewrite from it. But I haven't dared test that yet. I'll revisit that patch over the weekend. Now, all that is what you summarised as recreate the block from parity write to sector that had read error and I don't see any need for much of the rest except log result In particular you seem to be trying to do things synchronously, when that's not at all necessary, or perhaps desirable. The user will get a success notice from the read when end_io is run on the originating request, and we can be doing other things at the same time. The raid code really has a copy of the original request, so we can ack the original while carrying on with other things - we just have to be careful not to lose the buffers with the read data in them (increment reference counts and so on). I'd appreciate Neil's help with that but he hasn't commented on the patch I published so far! Peter ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Spares and partitioning huge disks 2005-01-15 10:33 ` Peter T. Breuer @ 2005-01-15 11:07 ` Mikael Abrahamsson 0 siblings, 0 replies; 95+ messages in thread From: Mikael Abrahamsson @ 2005-01-15 11:07 UTC (permalink / raw) To: linux-raid On Sat, 15 Jan 2005, Peter T. Breuer wrote: > In particular you seem to be trying to do things synchronusly, when > that's not at all necessary, or perhaps desirable. The user will get a Well, no, I only tried to summarize what needed to be done, not the way to accomplish it in the best way. Your summary reflecting my own attempt at a summary seems better, but it seems we both agree on the merit of the concept of trying to write to a sector that has given a read error, if we can recreate the data from other sources such as mirror or parity. If this fails, fail the disk. Anyhow, log what happened. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 95+ messages in thread
* RE: Spares and partitioning huge disks 2005-01-09 21:26 ` maarten 2005-01-09 22:29 ` Frank van Maarseveen @ 2005-01-09 23:20 ` Guy 2005-01-10 7:42 ` Gordon Henderson 2005-01-10 0:42 ` Spares and partitioning huge disks Guy 2 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-09 23:20 UTC (permalink / raw) To: 'maarten', linux-raid It was said: "> I think RAID6 but especially RAID1 is safer. Well, duh :) At the expense of buying everything twice, sure it's safer :))" Guy says: I disagree with the above. True, RAID6 can lose 2 disks without data loss. But, RAID1 and RAID5 can only lose 1 disk without data loss. If RAID1 or RAID5 had a read error during a re-sync, both would die. Now, RAID5 has more disks, so the odds are increased that a read error could occur. But you can improve those odds by partitioning the disks and creating sub arrays, then combining them. Per Maarten's plan. Of course, having a bad sector should not cause a disk to be kicked out! The RAID software should handle this. Most hardware based RAID systems can handle bad blocks. But this is another issue. Why is RAID1 preferred over RAID5? RAID1 is considered faster than RAID5. Most systems tend to read much more than they write. So, having 2 disks to read from (RAID1) can double your read rate. RAID5 tends to have better seek time in a multi threaded environment (more then 1 seek attempted concurrently). If you test with bonnie++, try 10 bonnies at the same time and note the sum of the seek times (you must add them yourself). With RAID1 it should about double, with RAID5, it depends on the number of disks in the array. Most home systems tend to only do 1 thing at a time. So, most people don't focus on seek time, they tent to focus on sequential read or write rates. In a multi user/process/thread environment, you don't do much sequential I/O, it tends to be random. But, assuming you need the extra space RAID5 yields, if you choose RAID1 instead, you would have many more disks than just 2, in a RAID10 environment you would have the improved seek rates of RAID5 (times ~2) and about double the overall read rate of RAID5. This is why some large systems tend to use RAID1 over RAID5. The largest system I worked on had over 300 disks, configured as RAID1. I think it was over kill on performance, RAID5 would have been just fine! But it was not my choice, also not my money. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Sunday, January 09, 2005 4:26 PM To: linux-raid@vger.kernel.org Subject: Re: Spares and partitioning huge disks On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote: > On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote: > > However, IF during that > > resync one other drive has a read error, it gets kicked too and the array > > dies. The chances of that happening are not very small; > > Ouch! never considered this. So, RAID5 will actually decrease reliability > in a significant number of cases because: > - >1 read errors can cause a total break-down whereas it used > to cause only a few userland I/O errors, disruptive but not foobar. Well, yes and no. You can decide to do a full backup in case you hadn't, prior to changing drives. And if it is _just_ a bad sector, you can 'assemble --force' yielding what you would've had in a non-raid setup; some file somewhere that's got corrupted. No big deal, ie. the same trouble as was caused without raid-5. > - disk replacement is quite risky. 
This is totally unexpected to me > but it should have been obvious: there's no bad block list in MD > so if we would postpone I/O errors during reconstruction then > 1: it might cause silent data corruption when I/O error > unexpectedly disappears. > 2: we might silently loose redundancy in a number of places. Not sure if I understood all of that, but I think you're saying that md _could_ disregard read errors _when_already_running_in_degraded_mode_ so as to preserve the array at all cost. Hum. That choice should be left to the user if it happens, he probably knows best what to choose in the circumstances. No really, what would be best is that md made a difference between total media failure and sector failure. If one sector is bad on one drive [and it gets kicked therefore] it should be possible when a further read error occurs on other media, to try and read the missing sector data from the kicked drive, who may well have the data there waiting, intact and all. Don't know how hard that is really, but one could maybe think of pushing a disk in an intermediate state between "failed" and "good" like "in_disgrace" what signals to the end user "Don't remove this disk as yet; we may still need it, but add and resync a spare at your earliest convenience as we're running in degraded mode as of now". Hmm. Complicated stuff. :-) This kind of error will get more and more predominant with growing media and decreasing disk quality. Statistically there is not a huge chance of getting a read failure on a 18GB scsi disk, but on a cheap(ish) 500 GB ATA disk that is an entrirely different ballpark. > I think RAID6 but especially RAID1 is safer. Well, duh :) At the expense of buying everything twice, sure it's safer :)) > A small side note on disk behavior: > If it becomes possible to do block remapping at any level (MD, DM/LVM, > FS) then we might not want to write to sectors with read errors at all > but just remap the corresponding blocks by software as long as we have > free blocks: save disk-internal spare sectors so the disk firmware can > pre-emptively remap degraded but ECC correctable sectors upon read. Well I dunno. In ancient times, the OS was charged with remapping bad sectors back when disk drives had no intelligence. Now we delegated that task to the disk. I'm not sure reverting back to the old behaviour is a smart move. But with raid, who knows... And as it is I don't think you get the chance to save the disk-internal spare sectors; the disk handles that transparently so any higher layer cannot only not prevent that, but is even kept completely ignorant to it happening. Maarten - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
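A rough sketch of the '10 bonnies at once' seek test Guy mentions; directory, size and user are placeholders, and the per-run seeks/sec still has to be added up by hand from the results:

  for i in $(seq 1 10); do
      mkdir -p /mnt/test/$i
      bonnie++ -d /mnt/test/$i -s 2048 -u nobody > /tmp/bonnie.$i 2>&1 &
  done
  wait
  tail -n 1 /tmp/bonnie.*    # the last line of each run is the machine-readable summary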
* RE: Spares and partitioning huge disks 2005-01-09 23:20 ` Guy @ 2005-01-10 7:42 ` Gordon Henderson 2005-01-10 9:03 ` Guy 0 siblings, 1 reply; 95+ messages in thread From: Gordon Henderson @ 2005-01-10 7:42 UTC (permalink / raw) To: linux-raid On Sun, 9 Jan 2005, Guy wrote: > Why is RAID1 preferred over RAID5? > RAID1 is considered faster than RAID5. Most systems tend to read much more > than they write. You'd think that, wouldn't you? However, - I've recently been doing work to graph disk IO by reading /proc/partitions and feeding it into MRTG - what I saw surprised me, although it really shouldn't. Most of the systems I've been graphing over the past few weeks write all the time and rarely read -I'm putting this down to things like log files being written more or less all the time, and the active data set residing in the filesystem/buffer cache more or less all the time. (also ext3 which wants to write all the time too) However, I guess it all depends on what the server is doing - for a workstion it may well be the case that it does more reads. Have a quick look at http://lion.drogon.net/mrtg/diskIO.html This is a moderately busy web server with a couple of dozen virtual web sites and runs a MUD and several majordomo lists. Blue is writes, Green reads. Note periods of heavy read activity just after midnight when it does a backup (over the 'net to another server and it also sucks another server onto the 'archive' partition), and 2am is when it analyses the web log-files. Also note that it's swapping - this has 256MB of RAM and is due for an upgrade, but swap is keeping it all ticking away nicely. The var partition seems to sustain writes at approx. 200-300 sectors/second... Not a fantastic amount, but I found it rather surprising. (I'll put the MRTG code online for anyone who wants it in a few days and let you know) Gordon ^ permalink raw reply [flat|nested] 95+ messages in thread
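In the absence of Gordon's actual scripts, a minimal probe of the kind he describes might look like the following; the field positions assume a 2.4 kernel with block statistics in /proc/partitions, so check the layout on your own machine first (device name is an example, and MRTG expects the four output lines in this order):

  #!/bin/sh
  DEV=hda
  awk -v d=$DEV '$4 == d {print $7; print $11}' /proc/partitions   # cumulative read, write sectors
  uptime | sed 's/,.*//'
  echo $DEV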
* RE: Spares and partitioning huge disks 2005-01-10 7:42 ` Gordon Henderson @ 2005-01-10 9:03 ` Guy 2005-01-10 12:21 ` Stats... [RE: Spares and partitioning huge disks] Gordon Henderson 0 siblings, 1 reply; 95+ messages in thread From: Guy @ 2005-01-10 9:03 UTC (permalink / raw) To: 'Gordon Henderson', linux-raid You said: "Have a quick look at http://lion.drogon.net/mrtg/diskIO.html" Are you crazy! Quick look, my screen is 1600X1200. Your quick look is over twice that size! :) What is all the red? Oh, it's eye strain! :) Why do your graphs read right to left? It makes my head hurt! Well, I am surprised. I have read somewhere that about a 10 to 1 ratio is common. That's 10 reads per 1 write! Maybe you got your ins and outs reversed! If you data set is small enough to fit in the buffer cache, then you may be correct. I have worked on systems with a database of well over 1T bytes. The system had about 10T bytes total. No way for all that to fit in the buffer cache! But, I don't have any data to deny what you say. Here is my home system using iostat: # iostat -d Linux 2.4.28 (watkins-home.com) 01/10/2005 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn dev8-0 1.77 209.57 10.08 38729760 1862232 dev8-1 1.78 210.75 10.08 38948102 1862232 dev8-2 10.13 522.27 36.50 96518688 6745616 dev8-3 10.52 524.71 37.94 96970042 7011712 dev8-4 10.55 525.05 38.00 97032904 7021816 dev8-5 10.60 524.90 37.99 97004392 7021080 dev8-6 10.59 524.86 38.17 96997424 7054816 dev8-7 10.56 524.89 38.23 97002160 7064552 dev8-8 10.54 524.73 38.25 96973096 7068552 dev8-9 10.54 524.55 37.89 96940336 7001736 dev8-10 10.15 522.54 36.74 96568080 6789584 dev8-11 10.18 522.51 36.52 96562208 6749696 dev8-12 10.21 522.60 36.74 96578592 6790600 dev8-13 10.22 522.67 37.10 96592848 6856800 dev8-14 10.17 522.46 36.95 96552728 6828544 dev8-15 10.19 522.30 36.70 96523856 6782136 The first 2 are my OS disks (RAID1). The others are my 14 disk RAID5. That's about 13 to 1 on my RAID5 array. But I would not claim my system is typical. It is a home system, not a database server or fancy web server. Mostly just a samba server. My email server is a different box. Oops, I just checked my email server, it has 64 meg of RAM and only does email all the time. Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn dev3-0 1.32 2.10 19.72 7284164 68380122 That's 1 to 9. I give up! I got to mirror that system some day! Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Gordon Henderson Sent: Monday, January 10, 2005 2:42 AM To: linux-raid@vger.kernel.org Subject: RE: Spares and partitioning huge disks On Sun, 9 Jan 2005, Guy wrote: > Why is RAID1 preferred over RAID5? > RAID1 is considered faster than RAID5. Most systems tend to read much more > than they write. You'd think that, wouldn't you? However, - I've recently been doing work to graph disk IO by reading /proc/partitions and feeding it into MRTG - what I saw surprised me, although it really shouldn't. Most of the systems I've been graphing over the past few weeks write all the time and rarely read -I'm putting this down to things like log files being written more or less all the time, and the active data set residing in the filesystem/buffer cache more or less all the time. (also ext3 which wants to write all the time too) However, I guess it all depends on what the server is doing - for a workstion it may well be the case that it does more reads. 
Have a quick look at http://lion.drogon.net/mrtg/diskIO.html This is a moderately busy web server with a couple of dozen virtual web sites and runs a MUD and several majordomo lists. Blue is writes, Green reads. Note periods of heavy read activity just after midnight when it does a backup (over the 'net to another server and it also sucks another server onto the 'archive' partition), and 2am is when it analyses the web log-files. Also note that it's swapping - this has 256MB of RAM and is due for an upgrade, but swap is keeping it all ticking away nicely. The var partition seems to sustain writes at approx. 200-300 sectors/second... Not a fantastic amount, but I found it rather surprising. (I'll put the MRTG code online for anyone who wants it in a few days and let you know) Gordon - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
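A one-liner to pull the overall ratio out of the cumulative iostat counters, using the column layout shown above (the /^dev/ match fits that output's device naming):

  iostat -d | awk '/^dev/ {r += $5; w += $6} END {printf "read:write ~ %.1f : 1\n", r/w}'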
* Stats... [RE: Spares and partitioning huge disks] 2005-01-10 9:03 ` Guy @ 2005-01-10 12:21 ` Gordon Henderson 0 siblings, 0 replies; 95+ messages in thread From: Gordon Henderson @ 2005-01-10 12:21 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Mon, 10 Jan 2005, Guy wrote: > You said: > "Have a quick look at > > http://lion.drogon.net/mrtg/diskIO.html" > > Are you crazy! Quick look, my screen is 1600X1200. > Your quick look is over twice that size! :) Well, it's only a few graphs though, and you don't need to scroll horizontally if you don't want to... > What is all the red? Oh, it's eye strain! :) they are just standard MRTG graphs arranged in a grid - mainly intended for my own consumption, but I'm happy to share the code, etc. Give me a few days and I'll tidy it up. After I had a server lose a case fan and overheat I've kinda gone overboard on stats, etc. better to have them than not, I guess.... Eg. http://lion.drogon.net/mrtg/sensors.html http://lion.drogon.net/mrtg/systemStats.html and there are others, but they are even duller... > Why do your graphs read right to left? It makes my head hurt! New data comes in at the left. I read books left to right and I'm left-handed. Sue me.. > Well, I am surprised. I have read somewhere that about a 10 to 1 ratio is > common. That's 10 reads per 1 write! > > Maybe you got your ins and outs reversed! Well.. This did cross my mind! But I did some tests and check with vmstat and I'm fairly confident it's doing the right thing. It does measure sectors (or whatever comes out of /proc/partitions) so it might look more than the number of blocks you may expect the filesystem to be dealing with. If you look at http://lion.drogon.net/mrtg/diskio.hda6-day.png http://lion.drogon.net/mrtg/diskio.hdc6-day.png (small PNG images!) you'll see a green blip at about 11:30... This is the result of 2 runs of: lion:/archive# tar cf /dev/null . So this (and other tests I did when setting it up) makes me confident it's doing the right thing! > If you data set is small enough to fit in the buffer cache, then you may > be correct. I have worked on systems with a database of well over 1T > bytes. The system had about 10T bytes total. No way for all that to fit > in the buffer cache! But, I don't have any data to deny what you say. Sure - this server is only a small one with 2 disks and a couple of dozen web sites - it seems to churn out just under half a GB a day, but the log-files are written and grow all the time )-: This might be a consideration for further tuning though if it were to get worse... In any case, the original comment about reads overshadowing writes may well be true, but at the end of the day it really does depend on the application and use of the server, and I think a few people might be surprised at exactly what their servers are getting up to - especially database servers which write log-files... All my web servers seem to exhibit this behaviour though (part of my business is server hosting) - heres a screeen dump of a small(ish) company central server doing Intranet, file serving, (samba + NFS) and email for about 30 people (it's a 4-disk RAID5 configuration) http://www.drogon.net/pix.png (small 40KB image) This is just the overall totals of the 4 disks - I can't show you the rest easilly as it's behind a firewall... It seems to exhibit the same behaviour - constant writes with the exception of overnights just after midnight when it takes a snapshot of itself then dumps the snapshot to tape. The read-blip at 6AM is the locate DB update. 
(And at 8am I run a 'du' over some of the disks so I can use xdu and
point fingers at disk space hogs.)

Most of the writes are to the /var partition - log-files, and it's
probably the email server whinging about SPAM (it runs sendmail, MD &
SA). The partition that holds the user-data is a bit more balanced - the
number of reads & writes are about the same, although writes are
marginally more.

> Here is my home system using iostat:

Hm. Debian doesn't seem to have 'iostat' - must look it up...

> # iostat -d
> Linux 2.4.28 (watkins-home.com)   01/10/2005
>
> Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> dev8-0      1.77       209.57        10.08   38729760    1862232
> dev8-1      1.78       210.75        10.08   38948102    1862232
> dev8-2     10.13       522.27        36.50   96518688    6745616
> dev8-3     10.52       524.71        37.94   96970042    7011712
> dev8-4     10.55       525.05        38.00   97032904    7021816
> dev8-5     10.60       524.90        37.99   97004392    7021080
> dev8-6     10.59       524.86        38.17   96997424    7054816
> dev8-7     10.56       524.89        38.23   97002160    7064552
> dev8-8     10.54       524.73        38.25   96973096    7068552
> dev8-9     10.54       524.55        37.89   96940336    7001736
> dev8-10    10.15       522.54        36.74   96568080    6789584
> dev8-11    10.18       522.51        36.52   96562208    6749696
> dev8-12    10.21       522.60        36.74   96578592    6790600
> dev8-13    10.22       522.67        37.10   96592848    6856800
> dev8-14    10.17       522.46        36.95   96552728    6828544
> dev8-15    10.19       522.30        36.70   96523856    6782136

That's a rather busy home server - are you streaming music/video off it?
Mine (just a RAID1 system) sits with the disk spun down 95% of the
time... (noflushd) However, it's due for a rebuild when it'll turn into
a home 'media server', then it might get a little bit more lively!

> Oops, I just checked my email server, it has 64 meg of RAM and only
> does email all the time.
>
> Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> dev3-0      1.32         2.10        19.72    7284164   68380122
>
> That's 1 to 9. I give up!

Log-files... Love 'em or loathe 'em...

> I got to mirror that system some day!

Go for it, you know you want to :)

Gordon
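A small aside: the 1-to-9 figure above can be read straight off the
cumulative Blk_read/Blk_wrtn columns. A one-liner along these lines does
the division per device (the /^dev/ pattern assumes the 2.4-style device
naming shown in Guy's output):

  iostat -d | awk '/^dev/ && $6 > 0 { printf "%-10s read:write = %.2f\n", $1, $5 / $6 }'

For dev3-0 above that gives roughly 0.11, i.e. about one block read for
every nine written.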
* RE: Spares and partitioning huge disks
  2005-01-09 21:26   ` maarten
  2005-01-09 22:29     ` Frank van Maarseveen
  2005-01-09 23:20     ` Guy
@ 2005-01-10  0:42     ` Guy
  2 siblings, 0 replies; 95+ messages in thread
From: Guy @ 2005-01-10 0:42 UTC (permalink / raw)
To: 'maarten', linux-raid

I really like the "in_disgrace" idea! But not for a simple bad block.
Those should be corrected by recovering the redundant copy and
re-writing it to correct the bad block.

If you kick the disk out, but still depend on it if another disk gets a
read error, then you must maintain a list of changed blocks, or stripes.
If a block or stripe has changed, you could not read the data from the
"in_disgrace" disk, since it would not have current data. This list must
be maintained after a re-boot, or the "in_disgrace" disk must be failed
if the list is lost.

"in_disgrace" would be good for write errors (maybe the drive ran out of
spare blocks), or maybe read errors that exceed some user-defined,
per-disk threshold.

"in_disgrace" would be a good way to replace a failed disk! Assume a
disk has failed and a spare has been re-built. You now have a
replacement disk.
  Remove the failed disk.
  Add the replacement disk, which becomes a spare.
  Set the spare to "in_disgrace". :)
  System is not degraded.
  Rebuild starts to spare the "in_disgrace" disk.
  Rebuild finishes, the "in_disgrace" disk is changed to failed.

It does not change what I have said before, but the label "in_disgrace"
makes it much easier to explain!!!!!!

Guy

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten
Sent: Sunday, January 09, 2005 4:26 PM
To: linux-raid@vger.kernel.org
Subject: Re: Spares and partitioning huge disks

On Sunday 09 January 2005 20:33, Frank van Maarseveen wrote:
> On Sat, Jan 08, 2005 at 05:49:32PM +0100, maarten wrote:
> > However, IF during that resync one other drive has a read error, it
> > gets kicked too and the array dies. The chances of that happening
> > are not very small;
>
> Ouch! never considered this. So, RAID5 will actually decrease
> reliability in a significant number of cases because:
> - >1 read errors can cause a total break-down whereas it used
>   to cause only a few userland I/O errors, disruptive but not foobar.

Well, yes and no. You can decide to do a full backup in case you hadn't,
prior to changing drives. And if it is _just_ a bad sector, you can
'assemble --force', yielding what you would've had in a non-raid setup:
some file somewhere that's got corrupted. No big deal, i.e. the same
trouble as was caused without raid-5.

> - disk replacement is quite risky. This is totally unexpected to me
>   but it should have been obvious: there's no bad block list in MD
>   so if we would postpone I/O errors during reconstruction then
>   1: it might cause silent data corruption when an I/O error
>      unexpectedly disappears.
>   2: we might silently lose redundancy in a number of places.

Not sure if I understood all of that, but I think you're saying that md
_could_ disregard read errors _when_already_running_in_degraded_mode_ so
as to preserve the array at all cost. Hum. That choice should be left to
the user if it happens; he probably knows best what to choose in the
circumstances.

No really, what would be best is that md made a distinction between
total media failure and sector failure.
If one sector is bad on one drive [and it gets kicked out therefore], it
should be possible, when a further read error occurs on other media, to
try and read the missing sector data from the kicked drive, which may
well have the data there waiting, intact and all. Don't know how hard
that is really, but one could maybe think of pushing a disk into an
intermediate state between "failed" and "good", like "in_disgrace",
which signals to the end user: "Don't remove this disk as yet; we may
still need it, but add and resync a spare at your earliest convenience,
as we're running in degraded mode as of now".

Hmm. Complicated stuff. :-)

This kind of error will get more and more predominant with growing media
and decreasing disk quality. Statistically there is not a huge chance of
getting a read failure on an 18GB SCSI disk, but on a cheap(ish) 500GB
ATA disk that is an entirely different ballpark.

> I think RAID6 but especially RAID1 is safer.

Well, duh :)  At the expense of buying everything twice, sure it's
safer :))

> A small side note on disk behavior:
> If it becomes possible to do block remapping at any level (MD, DM/LVM,
> FS) then we might not want to write to sectors with read errors at all
> but just remap the corresponding blocks by software as long as we have
> free blocks: save disk-internal spare sectors so the disk firmware can
> pre-emptively remap degraded but ECC-correctable sectors upon read.

Well, I dunno. In ancient times the OS was charged with remapping bad
sectors, back when disk drives had no intelligence. Now we've delegated
that task to the disk. I'm not sure reverting back to the old behaviour
is a smart move. But with raid, who knows...

And as it is, I don't think you get the chance to save the disk-internal
spare sectors; the disk handles that transparently, so any higher layer
not only cannot prevent it, but is even kept completely ignorant of it
happening.

Maarten
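Neither the "in_disgrace" state nor a roaming bad-block list exists in
md as discussed here; the closest you can get with the standard tools is
sketched below, using ordinary mdadm options (the device names are
invented for the example).

  # Two members kicked by read errors, data most likely still intact:
  # force-assemble from the freshest superblocks, then copy data off.
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

  # Replacing a failed disk once a hot-spare has finished rebuilding:
  mdadm /dev/md0 --remove /dev/sdb1    # drop the dead member
  mdadm /dev/md0 --add /dev/sde1       # the new disk becomes a spare

Note that the array still passes through a degraded window while a spare
rebuilds, which is exactly the gap the "in_disgrace" proposal is trying
to close.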
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel
@ 2005-01-16 21:28 Mitchell Laks
  2005-01-16 22:49 ` Maarten
  2005-01-17 11:41 ` Gordon Henderson
  0 siblings, 2 replies; 95+ messages in thread
From: Mitchell Laks @ 2005-01-16 21:28 UTC (permalink / raw)
To: linux-raid

Thank you to Gordon, Maarten and Guy for your helpful responses. I
learned much from each of your comments.

Maarten: I paid $70 for an Antec SL450 power supply. That seems a better
price than what you are quoting (is your power supply better?). Also, I
liked the idea of the 6+3+3 slots on your box, but I don't see it for
sale in the US.

Gordon: I get the same output on the 2.6.8 sarge kernel for the hpt366
driver. I notice, running hdparm /dev/hde, that IO_support is set at the
default 16-bit, while the other hard drive on the native IDE bus,
/dev/hdb, has IO_support at 32-bit. I wondered if getting the other
driver would improve things...

Guy:
> echo 100000 > /proc/sys/dev/raid/speed_limit_max
> I added this to /etc/sysctl.conf
> # RAID rebuild min/max speed K/Sec per device
> dev.raid.speed_limit_min = 1000
> dev.raid.speed_limit_max = 100000

I notice that according to the man page the settings you describe are
the defaults. Why did you have to adjust them?

Moreover, when I cat
  /proc/sys/dev/raid/speed_limit_max  I get  200000
  /proc/sys/dev/raid/speed_limit_min  I get  1000

Interestingly, I didn't adjust them up myself... Should I adjust
speed_limit_max down to 100000? I wonder where in Debian it got adjusted
up?

Thanks
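For anyone wanting to check the same settings on their own box, the
commands below show where the values being discussed live. The numbers
are just the ones from this thread, not recommendations.

  hdparm -c /dev/hde            # show IO_support (16-bit vs 32-bit)
  hdparm -c1 /dev/hde           # enable 32-bit I/O support
  cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
  echo 100000 > /proc/sys/dev/raid/speed_limit_max   # takes effect at once
  # or persistently, via /etc/sysctl.conf:
  #   dev.raid.speed_limit_min = 1000
  #   dev.raid.speed_limit_max = 100000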
* Re: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel
  2005-01-16 21:28 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
@ 2005-01-16 22:49 ` Maarten
  2005-01-17 11:41 ` Gordon Henderson
  1 sibling, 0 replies; 95+ messages in thread
From: Maarten @ 2005-01-16 22:49 UTC (permalink / raw)
To: linux-raid

On Sunday 16 January 2005 22:28, Mitchell Laks wrote:
> Thank you to Gordon, Maarten and Guy for your helpful responses. I
> learned much from each of your comments.
>
> Maarten: I paid $70 for an Antec SL450 power supply. That seems a
> better price than what you are quoting (is your power supply better?).

Heh. At double that price, I would sure as hell hope so...!! ;-)

If the Tagan is comparable to anything from Antec, it would be to the
True480, not the SL450. The Tagan is inaudible; if there were no case
and CPU fans whirring and no mainboard LEDs lit, you couldn't tell
whether the unit was on or off. Add to that the fact that this PSU has
SO many connectors I could connect all 10 harddrives without using a
single splitter(!), that all connectors are gold plated, that it weighs
more than the average complete case, and that it is very efficient...
But judge for yourself:

  http://www.3dvelocity.com/reviews/tagan/tg480.htm

I really don't buy such extreme hardware usually. But since I put the
life of 1.4TB of raid-5 data (or 2TB of raw disk capacity) in its hands,
it seemed like a good idea at the time. Actually, it still does. :-)

Maarten
* RE: 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel
  2005-01-16 21:28 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
  2005-01-16 22:49 ` Maarten
@ 2005-01-17 11:41 ` Gordon Henderson
  1 sibling, 0 replies; 95+ messages in thread
From: Gordon Henderson @ 2005-01-17 11:41 UTC (permalink / raw)
To: Mitchell Laks; +Cc: linux-raid

On Sun, 16 Jan 2005, Mitchell Laks wrote:

> Thank you to Gordon, Maarten and Guy for your helpful responses. I
> learned much from each of your comments.
>
> Gordon: I get the same output on the 2.6.8 sarge kernel for the hpt366
> driver. I notice, running hdparm /dev/hde, that IO_support is set at
> the default 16-bit, while the other hard drive on the native IDE bus,
> /dev/hdb, has IO_support at 32-bit. I wondered if getting the other
> driver would improve things...

I get the same - 16-bit. However, on that particular box I also get
16-bit for the on-board controller (it is 6 years old, though, with a
single 32-bit, 33MHz PCI bus!) On other servers with a much more modern
mobo (dual Athlon systems) I see 32-bit for the on-board controller and
16-bit for the PCI Promise controllers they have (I don't have anything
else with a Highpoint card).

I'm not really up on PCI bus architecture, but I suspect the only impact
will be a doubling of the number of transactions going over the PCI bus
- probably not really an issue unless you have lots of PCI devices on
the same bus which all need to talk to each other, or to something
external (e.g. Ethernet).

FWIW: I got a reply back from HighPoint about my question on running it
under 2.6.10:

  Thanks for your contacting us! Our current OpenSource driver doesn't
  support kernel 2.6.10.

And that was all they had to say. Ho hum.

> I notice that according to the man page the settings you describe are
> the defaults. Why did you have to adjust them?
>
> Moreover, when I cat
>   /proc/sys/dev/raid/speed_limit_max  I get  200000
>   /proc/sys/dev/raid/speed_limit_min  I get  1000
>
> Interestingly, I didn't adjust them up myself... Should I adjust
> speed_limit_max down to 100000? I wonder where in Debian it got
> adjusted up?

I don't think this is a distribution issue at all - certainly Debian
doesn't do anything with it, and nothing appears to be inserted into
/etc/sysctl.conf. 100000 seems to have been the default since at least
2.4.22 (the oldest kernel I have running s/w RAID).

Gordon
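If you do want to rule the distribution out yourself, a quick check
along these lines (paths are guesses for a Debian box of that era) will
show whether anything sets the raid sysctls at boot, and what the
running values are:

  grep -rn speed_limit /etc/sysctl.conf /etc/init.d 2>/dev/null
  sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max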
Thread overview: 95+ messages
2005-01-06 14:16 Spares and partitioning huge disks maarten
2005-01-06 16:46 ` Guy
2005-01-06 17:08 ` maarten
2005-01-06 17:31 ` Guy
2005-01-06 18:18 ` maarten
[not found] ` <41DD83DA.9040609@h3c.com>
2005-01-06 19:42 ` maarten
2005-01-07 20:59 ` Mario Holbe
2005-01-07 21:57 ` Guy
2005-01-08 10:22 ` Mario Holbe
2005-01-08 12:19 ` maarten
2005-01-08 16:33 ` Guy
2005-01-08 16:58 ` maarten
2005-01-08 14:52 ` Frank van Maarseveen
2005-01-08 15:50 ` Mario Holbe
2005-01-08 16:32 ` Guy
2005-01-08 17:16 ` maarten
2005-01-08 18:55 ` Guy
2005-01-08 19:25 ` maarten
2005-01-08 20:33 ` Mario Holbe
2005-01-08 23:01 ` maarten
2005-01-09 10:10 ` Mario Holbe
2005-01-09 16:23 ` Guy
2005-01-09 16:36 ` Michael Tokarev
2005-01-09 17:52 ` Peter T. Breuer
2005-01-09 17:59 ` Michael Tokarev
2005-01-09 18:34 ` Peter T. Breuer
2005-01-09 20:28 ` Guy
2005-01-09 20:47 ` Peter T. Breuer
2005-01-10 7:19 ` Peter T. Breuer
2005-01-10 9:05 ` Guy
2005-01-10 9:38 ` Peter T. Breuer
2005-01-10 12:31 ` Peter T. Breuer
2005-01-10 13:19 ` Peter T. Breuer
2005-01-10 18:37 ` Peter T. Breuer
2005-01-11 11:34 ` Peter T. Breuer
2005-01-08 23:09 ` Guy
2005-01-09 0:56 ` maarten
2005-01-13 2:05 ` Neil Brown
2005-01-13 4:55 ` Guy
2005-01-13 9:27 ` Peter T. Breuer
2005-01-13 15:53 ` Guy
2005-01-13 17:16 ` Peter T. Breuer
2005-01-13 20:40 ` Guy
2005-01-13 23:32 ` Peter T. Breuer
2005-01-14 2:43 ` Guy
2005-01-08 16:49 ` maarten
2005-01-08 19:01 ` maarten
2005-01-10 16:34 ` maarten
2005-01-10 16:36 ` Gordon Henderson
2005-01-10 17:10 ` maarten
2005-01-16 16:19 ` 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
2005-01-16 17:53 ` Gordon Henderson
2005-01-16 18:22 ` Maarten
2005-01-16 19:39 ` Guy
2005-01-16 20:55 ` Maarten
2005-01-16 21:58 ` Guy
2005-01-10 17:13 ` Spares and partitioning huge disks Guy
2005-01-10 17:35 ` hard disk re-locates bad block on read Guy
2005-01-11 14:34 ` Tom Coughlan
2005-01-11 22:43 ` Guy
2005-01-12 13:51 ` Tom Coughlan
2005-01-10 18:24 ` Spares and partitioning huge disks maarten
2005-01-10 20:09 ` Guy
2005-01-10 21:21 ` maarten
2005-01-11 1:04 ` maarten
2005-01-10 18:40 ` maarten
2005-01-10 19:41 ` Guy
2005-01-12 11:41 ` RAID-6 Gordon Henderson
2005-01-13 2:11 ` RAID-6 Neil Brown
2005-01-15 16:12 ` RAID-6 Gordon Henderson
2005-01-17 8:04 ` RAID-6 Turbo Fredriksson
2005-01-11 10:09 ` Spares and partitioning huge disks KELEMEN Peter
2005-01-09 19:33 ` Frank van Maarseveen
2005-01-09 21:26 ` maarten
2005-01-09 22:29 ` Frank van Maarseveen
2005-01-09 23:16 ` maarten
2005-01-10 8:15 ` Frank van Maarseveen
2005-01-14 17:29 ` Dieter Stueken
2005-01-14 17:46 ` maarten
2005-01-14 19:14 ` Derek Piper
2005-01-15 0:13 ` Michael Tokarev
2005-01-15 9:34 ` Peter T. Breuer
2005-01-15 9:54 ` Mikael Abrahamsson
2005-01-15 10:31 ` Brad Campbell
2005-01-15 11:10 ` Mikael Abrahamsson
2005-01-15 10:33 ` Peter T. Breuer
2005-01-15 11:07 ` Mikael Abrahamsson
2005-01-09 23:20 ` Guy
2005-01-10 7:42 ` Gordon Henderson
2005-01-10 9:03 ` Guy
2005-01-10 12:21 ` Stats... [RE: Spares and partitioning huge disks] Gordon Henderson
2005-01-10 0:42 ` Spares and partitioning huge disks Guy
-- strict thread matches above, loose matches on Subject: below --
2005-01-16 21:28 4 questions. Chieftec chassis case CA-01B, resync times, selecting ide driver module loading, raid5 :2 drives on same ide channel Mitchell Laks
2005-01-16 22:49 ` Maarten
2005-01-17 11:41 ` Gordon Henderson