* Sequential writing to degraded RAID6 causing a lot of reading
@ 2012-05-23 19:01 Patrik Horník
2012-05-24 4:48 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Patrik Horník @ 2012-05-23 19:01 UTC (permalink / raw)
To: linux-raid
Hello boys,
I am running some RAID6 arrays in degraded mode, one with the
left-symmetric layout and one with the left-symmetric-6 layout. I am
experiencing (potentially strange) behaviour that degrades the
performance of both arrays.
When I write a lot of data sequentially to a healthy RAID5 array, it
also internally reads a small amount of data. The arrays already hold
data, so I only write through the filesystem. I am therefore not sure
what is causing the reads - whether writing through the filesystem
sometimes skips blocks so that whole stripes are not written, or
whether timing sometimes means a whole stripe is not written at once.
Either way the ratio of reads is small and the performance is almost OK.
I can't test this with a fully healthy RAID6 array, because I don't
have one at the moment.
But when I write sequentially to a RAID6 missing one drive (again
through the filesystem), I get almost exactly the same amount of
internal reads as writes. Is this by design, and is it the expected
behaviour? Why does it behave like this? It should behave exactly like
healthy RAID5: it should detect the writing of a whole stripe and
should not read (almost) anything.
Thanks.
Patrik
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2012-05-23 19:01 Sequential writing to degraded RAID6 causing a lot of reading Patrik Horník
@ 2012-05-24 4:48 ` NeilBrown
2012-05-24 12:37 ` Patrik Horník
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-05-24 4:48 UTC (permalink / raw)
To: patrik; +Cc: linux-raid
On Wed, 23 May 2012 21:01:09 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> Hello boys,
>
> I am running some RAID6 arrays in degraded mode, one with
> left-symmetry layout and one with left-symmetry-6 layout. I am
> experiencing (potentially strange) behavior that degrades performance
> of both arrays.
>
> When I am writing sequentially a lot of data to healthy RAID5 array,
> it also reads internally a bit of data. I have data on arrays, so I
> only write through the filesystem. So I am not sure what causing the
> reads, if writing through filesystem potentially causes skipping and
> not writing whole stripes or sometimes timing causes that the whole
> stripe is not written at the same time. But anyway there is only a
> small ratio of reads and the performance is almost OK.
>
> I cant test it with full healthy RAID6 array, because I dont have any
> at the moment.
>
> But when I write sequentially to RAID6 without one drive (again
> through filesystem) I get almost exactly the same amount of internal
> reads as writes. Is it by design and is this expected behaviour? Why
> does it behave like this? It should behave exactly like healthy RAID5,
> it should detect the writing of whole stripe and should not read
> (almost) anything.
"It should behave exactly like healthy RAID5"
Why do you say that? Have you examined the code or imagined carefully how
the code would work?
I think what you meant to say was "I expect it would behave exactly like
healthy RAID5". That is a much more sensible statement. It is even correct.
It is just your expectations that are wrong :-)
(philosophical note: always avoid the word "should" except when applying it
to yourself).
Firstly, degraded RAID6 with a left-symmetric layout is quite different from
an optimal RAID5 because there are Q blocks sprinkled around and some D
blocks missing. So there will always be more work to do.
Degraded left-symmetric-6 is quite similar to optimal RAID5 as the same data
is stored in the same place - so reading should be exactly the same.
However writing is generally different and the code doesn't make any attempt
to notice and optimise cases that happen to be similar to RAID5.
A particular issue is that while RAID5 does read-modify-write when updating a
single block in an array with 5 or more devices (i.e. it reads the old data
block and the parity block, subtracts the old from parity and adds the new,
then writes both back), RAID6 does not. It always does a reconstruct-write,
so on a 6-device RAID6 it will read the other 4 data blocks, compute P and Q,
and write them out with the new data.
If it did read-modify-write it might be able to get away with reading just P,
Q, and the old data block - 3 reads instead of 4. However subtracting from
the Q block is more complicated than subtracting from the P block and has not
been implemented.
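As a rough illustration of the read counts involved (a simplified model,
not the md code itself; it assumes a non-degraded array with nothing
already cached in the stripe cache):

#include <stdio.h>

/* Simplified model - not the md implementation.
 * Reads needed to update one data block, per strategy:
 *  - read-modify-write: the old data block plus each old parity block
 *  - reconstruct-write: every other data block in the stripe          */
static int rmw_reads(int parity_disks)
{
	return 1 + parity_disks;
}

static int rcw_reads(int raid_disks, int parity_disks)
{
	return raid_disks - parity_disks - 1;
}

int main(void)
{
	/* RAID5 has 1 parity block per stripe, RAID6 has 2 (P and Q). */
	for (int disks = 5; disks <= 8; disks++)
		printf("%d disks: RAID5 rmw=%d rcw=%d, RAID6 rmw=%d rcw=%d\n",
		       disks,
		       rmw_reads(1), rcw_reads(disks, 1),
		       rmw_reads(2), rcw_reads(disks, 2));
	return 0;
}

The point it shows is that once the array is large enough, a
read-modify-write would need fewer reads than the reconstruct-write that
RAID6 always performs.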
But that might not be the issue you are hitting - it simply shows that RAID6
is different from RAID5 in important but non-obvious ways.
Yes, RAID5 and RAID6 do try to detect whole-stripe writes and write them out
without reading. This is not always possible, though.
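For what it's worth, a minimal sketch of the condition (illustrative only,
not the actual md logic): all reads can be skipped when the write covers
complete stripes, i.e. it is aligned to and a multiple of
chunk_size * data_disks:

#include <stdio.h>
#include <stdbool.h>

/* Illustrative model only - not the md implementation. */
static bool covers_whole_stripes(long long offset, long long len,
				 long chunk, int data_disks)
{
	long long stripe = (long long)chunk * data_disks;

	return len > 0 && offset % stripe == 0 && len % stripe == 0;
}

int main(void)
{
	long chunk = 64 * 1024;		/* example chunk size   */
	int data_disks = 6;		/* e.g. an 8-disk RAID6 */

	printf("aligned 384K write:   %d\n",
	       covers_whole_stripes(0, 384 * 1024, chunk, data_disks));
	printf("unaligned 384K write: %d\n",
	       covers_whole_stripes(4096, 384 * 1024, chunk, data_disks));
	return 0;
}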
Maybe you could tell us how many devices are in your arrays (which may be
important for understanding exactly what is happening), what the chunk size
is, and exactly what command you use to write "lots of data". That might
help us understand what is happening.
NeilBrown
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2012-05-24 4:48 ` NeilBrown
@ 2012-05-24 12:37 ` Patrik Horník
2012-05-25 16:07 ` Patrik Horník
2012-05-28 1:31 ` NeilBrown
0 siblings, 2 replies; 11+ messages in thread
From: Patrik Horník @ 2012-05-24 12:37 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Thu, May 24, 2012 at 6:48 AM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 23 May 2012 21:01:09 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>
>> Hello boys,
>>
>> I am running some RAID6 arrays in degraded mode, one with
>> left-symmetry layout and one with left-symmetry-6 layout. I am
>> experiencing (potentially strange) behavior that degrades performance
>> of both arrays.
>>
>> When I am writing sequentially a lot of data to healthy RAID5 array,
>> it also reads internally a bit of data. I have data on arrays, so I
>> only write through the filesystem. So I am not sure what causing the
>> reads, if writing through filesystem potentially causes skipping and
>> not writing whole stripes or sometimes timing causes that the whole
>> stripe is not written at the same time. But anyway there is only a
>> small ratio of reads and the performance is almost OK.
>>
>> I cant test it with full healthy RAID6 array, because I dont have any
>> at the moment.
>>
>> But when I write sequentially to RAID6 without one drive (again
>> through filesystem) I get almost exactly the same amount of internal
>> reads as writes. Is it by design and is this expected behaviour? Why
>> does it behave like this? It should behave exactly like healthy RAID5,
>> it should detect the writing of whole stripe and should not read
>> (almost) anything.
>
> "It should behave exactly like healthy RAID5"
>
> Why do you say that? Have you examined the code or imagined carefully how
> the code would work?
>
> I think what you meant to say "I expect it would behave exactly like healthy
> READ5". That is a much more sensible statement. It is even correct. It
> just your expectations that are wrong :-)
> (philosophical note: always avoid the word "should" except when applying it
> to yourself).
What I meant by "should" was that there is a theoretical way it can work
that way, so it should work that way... :)
I was implicitly referring to whole-stripe writes.
> Firstly, degraded RAID6 with a left-symmetric layout is quite different from
> an optimal RAID5 because there are Q blocks sprinkled around and some D
> blocks missing. So there will always be more work to do.
>
> Degraded left-symmetric-6 is quite similar to optimal RAID5 as the same data
> is stored in the same place - so reading should be exactly the same.
> However writing is generally different and the code doesn't make any attempt
> to notice and optimise cases that happen to be similar to RAID5.
Actually I have left-symmetric-6 without one of the "regular" drives, not
the one with only Qs on it, so it should be similar to degraded RAID6 with
a left-symmetric layout in this regard.
> A particular issue is that while RAID5 does read-modify-write when updating a
> single block in an array with 5 or more devices (i.e. it reads the old data
> block and the parity block, subtracts the old from parity and adds the new,
> then writes both back), RAID6 does not. It always does a reconstruct-write,
> so on a 6-device RAID6 it will read the other 4 data blocks, compute P and Q,
> and write them out with the new data.
> If it did read-modify-write it might be able to get away with reading just P,
> Q, and the old data block - 3 reads instead of 4. However subtracting from
> the Q block is more complicated that subtracting from the P block and has not
> been implemented.
OK, I did not know that. In my case I have an 8-drive RAID6 degraded to
7 drives, so it would be a plus to have it implemented the RAID5 way.
But anyway, I was thinking the whole-stripe detection should work in
this case.
> But that might not be the issue you are hitting - it simply shows that RAID6
> is different from RAID5 in important but non-obvious ways.
>
> Yes, RAID5 and RAID6 do try to detect whole-stripe write and write them out
> without reading. This is not always possible though.
> Maybe if you told us how many devices were in your arrays (which may be
> import to understand exactly what is happening), what the chunk size is, and
> exactly what command you use to write "lots of data". That might help
> understand what is happening.
The RAID5 has 5 drives, the RAID6 arrays are 7 of 8 drives, and the chunk
size is 64K. I am using the command dd if=/dev/zero of=file bs=X count=Y;
it behaves the same for bs between 64K and 1 MB. Actually the internal read
speed from every drive is slightly higher than the write speed, by about
10%. The ratio between the write speed to the array and the write speed to
an individual drive is about 5.5 - 5.7.
I have enough free space on the filesystem (ext3), so I guess I should be
hitting whole stripes most of the time.
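For reference, a back-of-the-envelope sketch of those numbers (the values
are the ones reported above; the code is only illustrative):

#include <stdio.h>

int main(void)
{
	int raid_disks = 8, parity_disks = 2;	/* degraded to 7 present */
	long chunk = 64 * 1024;			/* 64K chunk size        */
	int data_disks = raid_disks - parity_disks;

	/* One full stripe holds data_disks * chunk of file data. */
	printf("full stripe = %ldK of data\n", data_disks * chunk / 1024);

	/* If whole-stripe writes were detected (so no internal reads), the
	 * array write speed should be roughly data_disks times the
	 * per-drive write speed - close to the measured 5.5 - 5.7.       */
	printf("ideal array:drive write ratio ~ %d\n", data_disks);
	return 0;
}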
Patrik
> NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2012-05-24 12:37 ` Patrik Horník
@ 2012-05-25 16:07 ` Patrik Horník
2012-05-28 1:31 ` NeilBrown
1 sibling, 0 replies; 11+ messages in thread
From: Patrik Horník @ 2012-05-25 16:07 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
So Neil,
what do you think? Is this working as intended in this configuration?
If not, where could the problem be?
BTW, one other difference between my RAID5 and RAID6 setups that I did not
mention is that I have LVM between the filesystem and the RAID6. There is
no LVM in the RAID5 case.
Thanks.
Patrik
On Thu, May 24, 2012 at 2:37 PM, Patrik Horník <patrik@dsl.sk> wrote:
> On Thu, May 24, 2012 at 6:48 AM, NeilBrown <neilb@suse.de> wrote:
>> On Wed, 23 May 2012 21:01:09 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>>
>>> Hello boys,
>>>
>>> I am running some RAID6 arrays in degraded mode, one with
>>> left-symmetry layout and one with left-symmetry-6 layout. I am
>>> experiencing (potentially strange) behavior that degrades performance
>>> of both arrays.
>>>
>>> When I am writing sequentially a lot of data to healthy RAID5 array,
>>> it also reads internally a bit of data. I have data on arrays, so I
>>> only write through the filesystem. So I am not sure what causing the
>>> reads, if writing through filesystem potentially causes skipping and
>>> not writing whole stripes or sometimes timing causes that the whole
>>> stripe is not written at the same time. But anyway there is only a
>>> small ratio of reads and the performance is almost OK.
>>>
>>> I cant test it with full healthy RAID6 array, because I dont have any
>>> at the moment.
>>>
>>> But when I write sequentially to RAID6 without one drive (again
>>> through filesystem) I get almost exactly the same amount of internal
>>> reads as writes. Is it by design and is this expected behaviour? Why
>>> does it behave like this? It should behave exactly like healthy RAID5,
>>> it should detect the writing of whole stripe and should not read
>>> (almost) anything.
>>
>> "It should behave exactly like healthy RAID5"
>>
>> Why do you say that? Have you examined the code or imagined carefully how
>> the code would work?
>>
>> I think what you meant to say "I expect it would behave exactly like healthy
>> READ5". That is a much more sensible statement. It is even correct. It
>> just your expectations that are wrong :-)
>> (philosophical note: always avoid the word "should" except when applying it
>> to yourself).
>
> What I meant by should was there is theoretical way it can work that
> way so it should work that way... :)
>
> I was implicitly referring to whole stripe write.
>
>> Firstly, degraded RAID6 with a left-symmetric layout is quite different from
>> an optimal RAID5 because there are Q blocks sprinkled around and some D
>> blocks missing. So there will always be more work to do.
>>
>> Degraded left-symmetric-6 is quite similar to optimal RAID5 as the same data
>> is stored in the same place - so reading should be exactly the same.
>> However writing is generally different and the code doesn't make any attempt
>> to notice and optimise cases that happen to be similar to RAID5.
>
> Actually I have left-symmetric-6 without one of the "regular" drives
> not the one with only Qs on it, so it should be similar to degraded
> RAID6 with a left-symmetric in this regard.
>
>> A particular issue is that while RAID5 does read-modify-write when updating a
>> single block in an array with 5 or more devices (i.e. it reads the old data
>> block and the parity block, subtracts the old from parity and adds the new,
>> then writes both back), RAID6 does not. It always does a reconstruct-write,
>> so on a 6-device RAID6 it will read the other 4 data blocks, compute P and Q,
>> and write them out with the new data.
>> If it did read-modify-write it might be able to get away with reading just P,
>> Q, and the old data block - 3 reads instead of 4. However subtracting from
>> the Q block is more complicated that subtracting from the P block and has not
>> been implemented.
>
> OK, I did not know that. In my case I have 8 drives RAID6 degraded to
> 7 drives, so it would be plus to have it implemented the RAID5 way.
> But anyway I was thinking the whole-stripe detection should work in
> this case.
>
>> But that might not be the issue you are hitting - it simply shows that RAID6
>> is different from RAID5 in important but non-obvious ways.
>>
>> Yes, RAID5 and RAID6 do try to detect whole-stripe write and write them out
>> without reading. This is not always possible though.
>> Maybe if you told us how many devices were in your arrays (which may be
>> import to understand exactly what is happening), what the chunk size is, and
>> exactly what command you use to write "lots of data". That might help
>> understand what is happening.
>
> The RAID5 is 5 drives, the RAID6 arrays are 7 of 8 drives, chunk size
> is 64K. I am using command dd if=/dev/zero of=file bs=X count=Y, it
> behaves the same for bs between 64K to 1 MB. Actually internal read
> speed from every drive is slightly higher that write speed, about cca
> 10%. The ratio between write speed to the array and write speed to
> individual drive is cca 5.5 - 5.7.
>
> I have enough free space on filesystem (ext3) so I guess I should be
> hitting whole stripes most of the time.
>
> Patrik
>
>> NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2012-05-24 12:37 ` Patrik Horník
2012-05-25 16:07 ` Patrik Horník
@ 2012-05-28 1:31 ` NeilBrown
2014-05-15 7:04 ` Patrik Horník
1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-05-28 1:31 UTC (permalink / raw)
To: patrik; +Cc: linux-raid
On Thu, 24 May 2012 14:37:28 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> On Thu, May 24, 2012 at 6:48 AM, NeilBrown <neilb@suse.de> wrote:
> > Firstly, degraded RAID6 with a left-symmetric layout is quite different from
> > an optimal RAID5 because there are Q blocks sprinkled around and some D
> > blocks missing. So there will always be more work to do.
> >
> > Degraded left-symmetric-6 is quite similar to optimal RAID5 as the same data
> > is stored in the same place - so reading should be exactly the same.
> > However writing is generally different and the code doesn't make any attempt
> > to notice and optimise cases that happen to be similar to RAID5.
>
> Actually I have left-symmetric-6 without one of the "regular" drives
> not the one with only Qs on it, so it should be similar to degraded
> RAID6 with a left-symmetric in this regard.
Yes, it should - I had assumed wrongly ;-)
>
> > A particular issue is that while RAID5 does read-modify-write when updating a
> > single block in an array with 5 or more devices (i.e. it reads the old data
> > block and the parity block, subtracts the old from parity and adds the new,
> > then writes both back), RAID6 does not. It always does a reconstruct-write,
> > so on a 6-device RAID6 it will read the other 4 data blocks, compute P and Q,
> > and write them out with the new data.
> > If it did read-modify-write it might be able to get away with reading just P,
> > Q, and the old data block - 3 reads instead of 4. However subtracting from
> > the Q block is more complicated that subtracting from the P block and has not
> > been implemented.
>
> OK, I did not know that. In my case I have 8 drives RAID6 degraded to
> 7 drives, so it would be plus to have it implemented the RAID5 way.
> But anyway I was thinking the whole-stripe detection should work in
> this case.
>
> > But that might not be the issue you are hitting - it simply shows that RAID6
> > is different from RAID5 in important but non-obvious ways.
> >
> > Yes, RAID5 and RAID6 do try to detect whole-stripe write and write them out
> > without reading. This is not always possible though.
> > Maybe if you told us how many devices were in your arrays (which may be
> > import to understand exactly what is happening), what the chunk size is, and
> > exactly what command you use to write "lots of data". That might help
> > understand what is happening.
>
> The RAID5 is 5 drives, the RAID6 arrays are 7 of 8 drives, chunk size
> is 64K. I am using command dd if=/dev/zero of=file bs=X count=Y, it
> behaves the same for bs between 64K to 1 MB. Actually internal read
> speed from every drive is slightly higher that write speed, about cca
> 10%. The ratio between write speed to the array and write speed to
> individual drive is cca 5.5 - 5.7.
I cannot really picture how the read speed can be higher than the write
speed. The spindle doesn't speed up for reads and slow down for writes,
does it? But that's not really relevant.
A 'dd' with a large block size should be a good test. I just did a simple
experiment. With a 4-drive non-degraded RAID6 I get about a 1:100 ratio for
reads to writes for an extended write to the filesystem.
If I fail one device it becomes 1:1. Something certainly seems wrong there.
RAID5 behaves more as you would expect - many more writes than reads.
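In case anyone wants to reproduce this kind of measurement, a minimal
sketch (the device name is just an example; it assumes the usual
/sys/block/<dev>/stat layout, where field 3 is sectors read and field 7 is
sectors written):

#include <stdio.h>

static int read_stat(const char *dev, long long *rd, long long *wr)
{
	char path[128];
	long long f[7] = {0};
	FILE *fp;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	if (fscanf(fp, "%lld %lld %lld %lld %lld %lld %lld",
		   &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6]) != 7) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	*rd = f[2];	/* sectors read    */
	*wr = f[6];	/* sectors written */
	return 0;
}

int main(void)
{
	const char *dev = "sdb";	/* one array member, for example */
	long long r1, w1, r2, w2;

	if (read_stat(dev, &r1, &w1))
		return 1;
	printf("run the write test now, then press Enter\n");
	getchar();
	if (read_stat(dev, &r2, &w2))
		return 1;
	printf("%s: %lld sectors read, %lld written (reads:writes ~ 1:%.1f)\n",
	       dev, r2 - r1, w2 - w1,
	       (r2 - r1) ? (double)(w2 - w1) / (double)(r2 - r1) : 0.0);
	return 0;
}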
I've made a note to look into this when I get a chance.
Thanks for the report.
NeilBrown
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2012-05-28 1:31 ` NeilBrown
@ 2014-05-15 7:04 ` Patrik Horník
2014-05-15 7:18 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Patrik Horník @ 2014-05-15 7:04 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Hello Neil,
did you make some progress on this issue by any chance?
I am hitting the same problem again on a degraded RAID6 missing two
drives, running Debian kernel 3.13.10-1 and mdadm v3.2.5.
Thanks.
Patrik
2012-05-28 3:31 GMT+02:00 NeilBrown <neilb@suse.de>:
>
> On Thu, 24 May 2012 14:37:28 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>
> > On Thu, May 24, 2012 at 6:48 AM, NeilBrown <neilb@suse.de> wrote:
>
> > > Firstly, degraded RAID6 with a left-symmetric layout is quite different from
> > > an optimal RAID5 because there are Q blocks sprinkled around and some D
> > > blocks missing. So there will always be more work to do.
> > >
> > > Degraded left-symmetric-6 is quite similar to optimal RAID5 as the same data
> > > is stored in the same place - so reading should be exactly the same.
> > > However writing is generally different and the code doesn't make any attempt
> > > to notice and optimise cases that happen to be similar to RAID5.
> >
> > Actually I have left-symmetric-6 without one of the "regular" drives
> > not the one with only Qs on it, so it should be similar to degraded
> > RAID6 with a left-symmetric in this regard.
>
> Yes, it should - I had assumed wrongly ;-)
>
> >
> > > A particular issue is that while RAID5 does read-modify-write when updating a
> > > single block in an array with 5 or more devices (i.e. it reads the old data
> > > block and the parity block, subtracts the old from parity and adds the new,
> > > then writes both back), RAID6 does not. It always does a reconstruct-write,
> > > so on a 6-device RAID6 it will read the other 4 data blocks, compute P and Q,
> > > and write them out with the new data.
> > > If it did read-modify-write it might be able to get away with reading just P,
> > > Q, and the old data block - 3 reads instead of 4. However subtracting from
> > > the Q block is more complicated that subtracting from the P block and has not
> > > been implemented.
> >
> > OK, I did not know that. In my case I have 8 drives RAID6 degraded to
> > 7 drives, so it would be plus to have it implemented the RAID5 way.
> > But anyway I was thinking the whole-stripe detection should work in
> > this case.
> >
> > > But that might not be the issue you are hitting - it simply shows that RAID6
> > > is different from RAID5 in important but non-obvious ways.
> > >
> > > Yes, RAID5 and RAID6 do try to detect whole-stripe write and write them out
> > > without reading. This is not always possible though.
> > > Maybe if you told us how many devices were in your arrays (which may be
> > > import to understand exactly what is happening), what the chunk size is, and
> > > exactly what command you use to write "lots of data". That might help
> > > understand what is happening.
> >
> > The RAID5 is 5 drives, the RAID6 arrays are 7 of 8 drives, chunk size
> > is 64K. I am using command dd if=/dev/zero of=file bs=X count=Y, it
> > behaves the same for bs between 64K to 1 MB. Actually internal read
> > speed from every drive is slightly higher that write speed, about cca
> > 10%. The ratio between write speed to the array and write speed to
> > individual drive is cca 5.5 - 5.7.
>
> I cannot really picture how the read speed can be higher than the write
> speed. The spindle doesn't speed up for reads and slow down for writes does
> it? But that's not really relevant.
>
> A 'dd' with large block size should be a good test. I just did a simple
> experiment. With a 4-drive non-degraded RAID6 I get about a 1:100 ratio for
> reads to writes for an extended write to the filesystem.
> If I fail one device it becomes 1:1. Something certainly seems wrong there.
>
> RAID5 behaves more as you would expect - many more writes than reads.
>
> I've made a note to look into this when I get a chance.
>
> Thanks for the report.
>
> NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2014-05-15 7:04 ` Patrik Horník
@ 2014-05-15 7:18 ` NeilBrown
2014-05-15 7:50 ` Patrik Horník
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2014-05-15 7:18 UTC (permalink / raw)
To: patrik; +Cc: linux-raid
On Thu, 15 May 2014 09:04:27 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> Hello Neil,
>
> did you make some progress on this issue by any chance?
No I haven't - sorry.
After 2 years, I guess I really should.
I'll make another note for first thing next week.
NeilBrown
>
> I am hitting the same problem again on degraded RAID 6 missing two
> drives, kernel Debian 3.13.10-1, mdadm v3.2.5.
>
> Thanks.
>
> Patrik
>
> 2012-05-28 3:31 GMT+02:00 NeilBrown <neilb@suse.de>:
> >
> > On Thu, 24 May 2012 14:37:28 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> >
> > > On Thu, May 24, 2012 at 6:48 AM, NeilBrown <neilb@suse.de> wrote:
> >
> > > > Firstly, degraded RAID6 with a left-symmetric layout is quite different from
> > > > an optimal RAID5 because there are Q blocks sprinkled around and some D
> > > > blocks missing. So there will always be more work to do.
> > > >
> > > > Degraded left-symmetric-6 is quite similar to optimal RAID5 as the same data
> > > > is stored in the same place - so reading should be exactly the same.
> > > > However writing is generally different and the code doesn't make any attempt
> > > > to notice and optimise cases that happen to be similar to RAID5.
> > >
> > > Actually I have left-symmetric-6 without one of the "regular" drives
> > > not the one with only Qs on it, so it should be similar to degraded
> > > RAID6 with a left-symmetric in this regard.
> >
> > Yes, it should - I had assumed wrongly ;-)
> >
> > >
> > > > A particular issue is that while RAID5 does read-modify-write when updating a
> > > > single block in an array with 5 or more devices (i.e. it reads the old data
> > > > block and the parity block, subtracts the old from parity and adds the new,
> > > > then writes both back), RAID6 does not. It always does a reconstruct-write,
> > > > so on a 6-device RAID6 it will read the other 4 data blocks, compute P and Q,
> > > > and write them out with the new data.
> > > > If it did read-modify-write it might be able to get away with reading just P,
> > > > Q, and the old data block - 3 reads instead of 4. However subtracting from
> > > > the Q block is more complicated that subtracting from the P block and has not
> > > > been implemented.
> > >
> > > OK, I did not know that. In my case I have 8 drives RAID6 degraded to
> > > 7 drives, so it would be plus to have it implemented the RAID5 way.
> > > But anyway I was thinking the whole-stripe detection should work in
> > > this case.
> > >
> > > > But that might not be the issue you are hitting - it simply shows that RAID6
> > > > is different from RAID5 in important but non-obvious ways.
> > > >
> > > > Yes, RAID5 and RAID6 do try to detect whole-stripe write and write them out
> > > > without reading. This is not always possible though.
> > > > Maybe if you told us how many devices were in your arrays (which may be
> > > > import to understand exactly what is happening), what the chunk size is, and
> > > > exactly what command you use to write "lots of data". That might help
> > > > understand what is happening.
> > >
> > > The RAID5 is 5 drives, the RAID6 arrays are 7 of 8 drives, chunk size
> > > is 64K. I am using command dd if=/dev/zero of=file bs=X count=Y, it
> > > behaves the same for bs between 64K to 1 MB. Actually internal read
> > > speed from every drive is slightly higher that write speed, about cca
> > > 10%. The ratio between write speed to the array and write speed to
> > > individual drive is cca 5.5 - 5.7.
> >
> > I cannot really picture how the read speed can be higher than the write
> > speed. The spindle doesn't speed up for reads and slow down for writes does
> > it? But that's not really relevant.
> >
> > A 'dd' with large block size should be a good test. I just did a simple
> > experiment. With a 4-drive non-degraded RAID6 I get about a 1:100 ratio for
> > reads to writes for an extended write to the filesystem.
> > If I fail one device it becomes 1:1. Something certainly seems wrong there.
> >
> > RAID5 behaves more as you would expect - many more writes than reads.
> >
> > I've made a note to look into this when I get a chance.
> >
> > Thanks for the report.
> >
> > NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2014-05-15 7:18 ` NeilBrown
@ 2014-05-15 7:50 ` Patrik Horník
2014-05-20 5:42 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Patrik Horník @ 2014-05-15 7:50 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
OK, it seems that because of this my copy operations will not be
finished by next week... :)
BTW, this time the layout is left-symmetric, but I guess the problem is in
the whole-stripe write detection with degraded RAID6.
Patrik
2014-05-15 9:18 GMT+02:00 NeilBrown <neilb@suse.de>:
> On Thu, 15 May 2014 09:04:27 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>
>> Hello Neil,
>>
>> did you make some progress on this issue by any chance?
>
> No I haven't - sorry.
> After 2 year, I guess I really should.
>
> I'll make another note for first thing next week.
>
> NeilBrown
>
>
>>
>> I am hitting the same problem again on degraded RAID 6 missing two
>> drives, kernel Debian 3.13.10-1, mdadm v3.2.5.
>>
>> Thanks.
>>
>> Patrik
>>
>> 2012-05-28 3:31 GMT+02:00 NeilBrown <neilb@suse.de>:
>> >
>> > On Thu, 24 May 2012 14:37:28 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>> >
>> > > On Thu, May 24, 2012 at 6:48 AM, NeilBrown <neilb@suse.de> wrote:
>> >
>> > > > Firstly, degraded RAID6 with a left-symmetric layout is quite different from
>> > > > an optimal RAID5 because there are Q blocks sprinkled around and some D
>> > > > blocks missing. So there will always be more work to do.
>> > > >
>> > > > Degraded left-symmetric-6 is quite similar to optimal RAID5 as the same data
>> > > > is stored in the same place - so reading should be exactly the same.
>> > > > However writing is generally different and the code doesn't make any attempt
>> > > > to notice and optimise cases that happen to be similar to RAID5.
>> > >
>> > > Actually I have left-symmetric-6 without one of the "regular" drives
>> > > not the one with only Qs on it, so it should be similar to degraded
>> > > RAID6 with a left-symmetric in this regard.
>> >
>> > Yes, it should - I had assumed wrongly ;-)
>> >
>> > >
>> > > > A particular issue is that while RAID5 does read-modify-write when updating a
>> > > > single block in an array with 5 or more devices (i.e. it reads the old data
>> > > > block and the parity block, subtracts the old from parity and adds the new,
>> > > > then writes both back), RAID6 does not. It always does a reconstruct-write,
>> > > > so on a 6-device RAID6 it will read the other 4 data blocks, compute P and Q,
>> > > > and write them out with the new data.
>> > > > If it did read-modify-write it might be able to get away with reading just P,
>> > > > Q, and the old data block - 3 reads instead of 4. However subtracting from
>> > > > the Q block is more complicated that subtracting from the P block and has not
>> > > > been implemented.
>> > >
>> > > OK, I did not know that. In my case I have 8 drives RAID6 degraded to
>> > > 7 drives, so it would be plus to have it implemented the RAID5 way.
>> > > But anyway I was thinking the whole-stripe detection should work in
>> > > this case.
>> > >
>> > > > But that might not be the issue you are hitting - it simply shows that RAID6
>> > > > is different from RAID5 in important but non-obvious ways.
>> > > >
>> > > > Yes, RAID5 and RAID6 do try to detect whole-stripe write and write them out
>> > > > without reading. This is not always possible though.
>> > > > Maybe if you told us how many devices were in your arrays (which may be
>> > > > import to understand exactly what is happening), what the chunk size is, and
>> > > > exactly what command you use to write "lots of data". That might help
>> > > > understand what is happening.
>> > >
>> > > The RAID5 is 5 drives, the RAID6 arrays are 7 of 8 drives, chunk size
>> > > is 64K. I am using command dd if=/dev/zero of=file bs=X count=Y, it
>> > > behaves the same for bs between 64K to 1 MB. Actually internal read
>> > > speed from every drive is slightly higher that write speed, about cca
>> > > 10%. The ratio between write speed to the array and write speed to
>> > > individual drive is cca 5.5 - 5.7.
>> >
>> > I cannot really picture how the read speed can be higher than the write
>> > speed. The spindle doesn't speed up for reads and slow down for writes does
>> > it? But that's not really relevant.
>> >
>> > A 'dd' with large block size should be a good test. I just did a simple
>> > experiment. With a 4-drive non-degraded RAID6 I get about a 1:100 ratio for
>> > reads to writes for an extended write to the filesystem.
>> > If I fail one device it becomes 1:1. Something certainly seems wrong there.
>> >
>> > RAID5 behaves more as you would expect - many more writes than reads.
>> >
>> > I've made a note to look into this when I get a chance.
>> >
>> > Thanks for the report.
>> >
>> > NeilBrown
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2014-05-15 7:50 ` Patrik Horník
@ 2014-05-20 5:42 ` NeilBrown
2014-05-20 10:07 ` Patrik Horník
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2014-05-20 5:42 UTC (permalink / raw)
To: patrik; +Cc: linux-raid
On Thu, 15 May 2014 09:50:49 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> OK, it seems that because of that my copy operations will not be
> finished yet by next week... :)
>
> BTW this time layout is left-symetric but the problem I guess is in
> whole strip' write detection with degraded RAID6.
>
> Patrik
>
> 2014-05-15 9:18 GMT+02:00 NeilBrown <neilb@suse.de>:
> > On Thu, 15 May 2014 09:04:27 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> >
> >> Hello Neil,
> >>
> >> did you make some progress on this issue by any chance?
> >
> > No I haven't - sorry.
> > After 2 year, I guess I really should.
> >
> > I'll make another note for first thing next week.
Can you try the following patch and let me know if it helps?
I definitely reduced the number of reads significantly, but my measurements
(of a very simple test case) didn't show much speed-up.
This is against current mainline. If you want it against another version and
it doesn't apply easily, just ask.
Thanks,
NeilBrown
From 98c411f93391be0dbda98d43835dd9e042faa78f Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 19 May 2014 11:16:49 +1000
Subject: [PATCH] md/raid56: Don't perform reads to support writes until stripe
is ready.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
If it is found that we need to pre-read some blocks before a write
can succeed, we normally set STRIPE_DELAYED and don't actually perform
the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
However for a degraded RAID6 we currently perform the reads as soon
as we see that a write is pending. This significantly hurts
throughput.
So:
- when handle_stripe_dirtying finds a block that it wants on a device
  that is failed, set STRIPE_DELAYED, instead of doing nothing, and
- when fetch_block detects that a read might be required to satisfy a
write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
and if we would actually need to read something to complete the write.
This also helps RAID5, though less often as RAID5 supports a
read-modify-write cycle. For RAID5 the read is performed too early
only if the write is not a full 4K aligned write (i.e. not an
R5_OVERWRITE).
Also clean up a couple of horrible bits of formatting.
Reported-by: Patrik Horník <patrik@dsl.sk>
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 633e20a96b34..d67202bd9118 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -292,9 +292,12 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
BUG_ON(atomic_read(&conf->active_stripes)==0);
if (test_bit(STRIPE_HANDLE, &sh->state)) {
if (test_bit(STRIPE_DELAYED, &sh->state) &&
- !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
list_add_tail(&sh->lru, &conf->delayed_list);
- else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+ if (atomic_read(&conf->preread_active_stripes)
+ < IO_THRESHOLD)
+ md_wakeup_thread(conf->mddev->thread);
+ } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
sh->bm_seq - conf->seq_write > 0)
list_add_tail(&sh->lru, &conf->bitmap_list);
else {
@@ -2908,8 +2911,11 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
(s->failed >= 1 && fdev[0]->toread) ||
(s->failed >= 2 && fdev[1]->toread) ||
(sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
+ (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&
!test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
- (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
+ (sh->raid_conf->level == 6 && s->failed && s->to_write &&
+ s->towrite < sh->raid_conf->raid_disks - 2 &&
+ (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))))) {
/* we would like to get this block, possibly by computing it,
* otherwise read it if the backing disk is insync
*/
@@ -3115,7 +3121,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
!test_bit(R5_LOCKED, &dev->flags) &&
!(test_bit(R5_UPTODATE, &dev->flags) ||
test_bit(R5_Wantcompute, &dev->flags))) {
- if (test_bit(R5_Insync, &dev->flags)) rcw++;
+ if (test_bit(R5_Insync, &dev->flags))
+ rcw++;
else
rcw += 2*disks;
}
@@ -3136,10 +3143,10 @@ static void handle_stripe_dirtying(struct r5conf *conf,
!(test_bit(R5_UPTODATE, &dev->flags) ||
test_bit(R5_Wantcompute, &dev->flags)) &&
test_bit(R5_Insync, &dev->flags)) {
- if (
- test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
- pr_debug("Read_old block "
- "%d for r-m-w\n", i);
+ if (test_bit(STRIPE_PREREAD_ACTIVE,
+ &sh->state)) {
+ pr_debug("Read_old block %d for r-m-w\n",
+ i);
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantread, &dev->flags);
s->locked++;
@@ -3162,10 +3169,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
!(test_bit(R5_UPTODATE, &dev->flags) ||
test_bit(R5_Wantcompute, &dev->flags))) {
rcw++;
- if (!test_bit(R5_Insync, &dev->flags))
- continue; /* it's a failed drive */
- if (
- test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
+ if (test_bit(R5_Insync, &dev->flags) &&
+ test_bit(STRIPE_PREREAD_ACTIVE,
+ &sh->state)) {
pr_debug("Read_old block "
"%d for Reconstruct\n", i);
set_bit(R5_LOCKED, &dev->flags);
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2014-05-20 5:42 ` NeilBrown
@ 2014-05-20 10:07 ` Patrik Horník
2014-05-20 11:08 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Patrik Horník @ 2014-05-20 10:07 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
2014-05-20 7:42 GMT+02:00 NeilBrown <neilb@suse.de>:
> On Thu, 15 May 2014 09:50:49 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>
>> OK, it seems that because of that my copy operations will not be
>> finished yet by next week... :)
>>
>> BTW this time layout is left-symetric but the problem I guess is in
>> whole strip' write detection with degraded RAID6.
>>
>> Patrik
>>
>> 2014-05-15 9:18 GMT+02:00 NeilBrown <neilb@suse.de>:
>> > On Thu, 15 May 2014 09:04:27 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>> >
>> >> Hello Neil,
>> >>
>> >> did you make some progress on this issue by any chance?
>> >
>> > No I haven't - sorry.
>> > After 2 year, I guess I really should.
>> >
>> > I'll make another note for first thing next week.
>
> Can you try the following patch and let me know if it helps?
I don't want to test it on a production system... But I have a
degraded array which does not have production data on it, so I will
think about how to test it.
> I definitely reduced the number of reads significantly, but my measurements
> (of a very simple test case) didn't show much speed-up.
>
I did not look at the patch itself, but according to your description
it should eliminate the problem, should it not? What was your read /
write ratio after the patch?
Thanks.
Patrik
> This is against current mainline. If you want it against another version and
> it doesn't apply easily, just ask.
>
> Thanks,
> NeilBrown
>
> From 98c411f93391be0dbda98d43835dd9e042faa78f Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Mon, 19 May 2014 11:16:49 +1000
> Subject: [PATCH] md/raid56: Don't perform reads to support writes until stripe
> is ready.
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> If it is found that we need to pre-read some blocks before a write
> can succeed, we normally set STRIPE_DELAYED and don't actually perform
> the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
>
> However for a degraded RAID6 we currently perform the reads as soon
> as we see that a write is pending. This significantly hurts
> throughput.
>
> So:
> - when handle_stripe_dirtying find a block that it wants on a device
> that is failed, set STRIPE_DELAY, instead of doing nothing, and
> - when fetch_block detects that a read might be required to satisfy a
> write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
> and if we would actually need to read something to complete the write.
>
> This also helps RAID5, though less often as RAID5 supports a
> read-modify-write cycle. For RAID5 the read is performed too early
> only if the write is not a full 4K aligned write (i.e. no an
> R5_OVERWRITE).
>
> Also clean up a couple of horrible bits of formatting.
>
> Reported-by: Patrik Horník <patrik@dsl.sk>
> Signed-off-by: NeilBrown <neilb@suse.de>
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 633e20a96b34..d67202bd9118 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -292,9 +292,12 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
> BUG_ON(atomic_read(&conf->active_stripes)==0);
> if (test_bit(STRIPE_HANDLE, &sh->state)) {
> if (test_bit(STRIPE_DELAYED, &sh->state) &&
> - !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> + !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> list_add_tail(&sh->lru, &conf->delayed_list);
> - else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> + if (atomic_read(&conf->preread_active_stripes)
> + < IO_THRESHOLD)
> + md_wakeup_thread(conf->mddev->thread);
> + } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> sh->bm_seq - conf->seq_write > 0)
> list_add_tail(&sh->lru, &conf->bitmap_list);
> else {
> @@ -2908,8 +2911,11 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
> (s->failed >= 1 && fdev[0]->toread) ||
> (s->failed >= 2 && fdev[1]->toread) ||
> (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
> + (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&
> !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
> - (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
> + (sh->raid_conf->level == 6 && s->failed && s->to_write &&
> + s->towrite < sh->raid_conf->raid_disks - 2 &&
> + (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))))) {
> /* we would like to get this block, possibly by computing it,
> * otherwise read it if the backing disk is insync
> */
> @@ -3115,7 +3121,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> !test_bit(R5_LOCKED, &dev->flags) &&
> !(test_bit(R5_UPTODATE, &dev->flags) ||
> test_bit(R5_Wantcompute, &dev->flags))) {
> - if (test_bit(R5_Insync, &dev->flags)) rcw++;
> + if (test_bit(R5_Insync, &dev->flags))
> + rcw++;
> else
> rcw += 2*disks;
> }
> @@ -3136,10 +3143,10 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> !(test_bit(R5_UPTODATE, &dev->flags) ||
> test_bit(R5_Wantcompute, &dev->flags)) &&
> test_bit(R5_Insync, &dev->flags)) {
> - if (
> - test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> - pr_debug("Read_old block "
> - "%d for r-m-w\n", i);
> + if (test_bit(STRIPE_PREREAD_ACTIVE,
> + &sh->state)) {
> + pr_debug("Read_old block %d for r-m-w\n",
> + i);
> set_bit(R5_LOCKED, &dev->flags);
> set_bit(R5_Wantread, &dev->flags);
> s->locked++;
> @@ -3162,10 +3169,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> !(test_bit(R5_UPTODATE, &dev->flags) ||
> test_bit(R5_Wantcompute, &dev->flags))) {
> rcw++;
> - if (!test_bit(R5_Insync, &dev->flags))
> - continue; /* it's a failed drive */
> - if (
> - test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> + if (test_bit(R5_Insync, &dev->flags) &&
> + test_bit(STRIPE_PREREAD_ACTIVE,
> + &sh->state)) {
> pr_debug("Read_old block "
> "%d for Reconstruct\n", i);
> set_bit(R5_LOCKED, &dev->flags);
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Sequential writing to degraded RAID6 causing a lot of reading
2014-05-20 10:07 ` Patrik Horník
@ 2014-05-20 11:08 ` NeilBrown
0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2014-05-20 11:08 UTC (permalink / raw)
To: patrik; +Cc: linux-raid
On Tue, 20 May 2014 12:07:11 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> 2014-05-20 7:42 GMT+02:00 NeilBrown <neilb@suse.de>:
> > On Thu, 15 May 2014 09:50:49 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> >
> >> OK, it seems that because of that my copy operations will not be
> >> finished yet by next week... :)
> >>
> >> BTW this time layout is left-symetric but the problem I guess is in
> >> whole strip' write detection with degraded RAID6.
> >>
> >> Patrik
> >>
> >> 2014-05-15 9:18 GMT+02:00 NeilBrown <neilb@suse.de>:
> >> > On Thu, 15 May 2014 09:04:27 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> >> >
> >> >> Hello Neil,
> >> >>
> >> >> did you make some progress on this issue by any chance?
> >> >
> >> > No I haven't - sorry.
> >> > After 2 year, I guess I really should.
> >> >
> >> > I'll make another note for first thing next week.
> >
> > Can you try the following patch and let me know if it helps?
>
> I dont want to test it on production system... But I have some
> degraded array which does not have production data on it so I will
> think about how to test it.
>
> > I definitely reduced the number of reads significantly, but my measurements
> > (of a very simple test case) didn't show much speed-up.
> >
>
> I did not look at the patch itself but according to your description
> is should eliminate the problem, should it not? What was your read /
> write ratio after the patch?
It depends a bit on what particular tests I ran and what other hacks were in
the kernel - I did get zero reads, but that was with some hacks that aren't
general enough to be used.
Provided the stripe_cache_size was reasonably large, I got somewhere between
1:100 and 1:10. When things were bad, it was often close to 1:1.
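For completeness, the stripe cache mentioned here is the per-array md
tunable under sysfs; a minimal sketch of reading and raising it before a
test run (the md device name and the value 4096 are only examples):

#include <stdio.h>

int main(void)
{
	/* device name and new value are examples - adjust for your array */
	const char *path = "/sys/block/md0/md/stripe_cache_size";
	FILE *fp;
	int cur = 0;

	fp = fopen(path, "r");
	if (fp) {
		if (fscanf(fp, "%d", &cur) == 1)
			printf("current stripe_cache_size = %d\n", cur);
		fclose(fp);
	}

	fp = fopen(path, "w");		/* writing needs root */
	if (!fp) {
		perror(path);
		return 1;
	}
	fprintf(fp, "4096\n");
	fclose(fp);
	return 0;
}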
NeilBrown
>
> Thanks.
>
> Patrik
>
> > This is against current mainline. If you want it against another version and
> > it doesn't apply easily, just ask.
> >
> > Thanks,
> > NeilBrown
> >
> > From 98c411f93391be0dbda98d43835dd9e042faa78f Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.de>
> > Date: Mon, 19 May 2014 11:16:49 +1000
> > Subject: [PATCH] md/raid56: Don't perform reads to support writes until stripe
> > is ready.
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=UTF-8
> > Content-Transfer-Encoding: 8bit
> >
> > If it is found that we need to pre-read some blocks before a write
> > can succeed, we normally set STRIPE_DELAYED and don't actually perform
> > the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
> >
> > However for a degraded RAID6 we currently perform the reads as soon
> > as we see that a write is pending. This significantly hurts
> > throughput.
> >
> > So:
> > - when handle_stripe_dirtying find a block that it wants on a device
> > that is failed, set STRIPE_DELAY, instead of doing nothing, and
> > - when fetch_block detects that a read might be required to satisfy a
> > write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
> > and if we would actually need to read something to complete the write.
> >
> > This also helps RAID5, though less often as RAID5 supports a
> > read-modify-write cycle. For RAID5 the read is performed too early
> > only if the write is not a full 4K aligned write (i.e. no an
> > R5_OVERWRITE).
> >
> > Also clean up a couple of horrible bits of formatting.
> >
> > Reported-by: Patrik Horník <patrik@dsl.sk>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 633e20a96b34..d67202bd9118 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -292,9 +292,12 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
> > BUG_ON(atomic_read(&conf->active_stripes)==0);
> > if (test_bit(STRIPE_HANDLE, &sh->state)) {
> > if (test_bit(STRIPE_DELAYED, &sh->state) &&
> > - !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> > + !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> > list_add_tail(&sh->lru, &conf->delayed_list);
> > - else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> > + if (atomic_read(&conf->preread_active_stripes)
> > + < IO_THRESHOLD)
> > + md_wakeup_thread(conf->mddev->thread);
> > + } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> > sh->bm_seq - conf->seq_write > 0)
> > list_add_tail(&sh->lru, &conf->bitmap_list);
> > else {
> > @@ -2908,8 +2911,11 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
> > (s->failed >= 1 && fdev[0]->toread) ||
> > (s->failed >= 2 && fdev[1]->toread) ||
> > (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
> > + (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&
> > !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
> > - (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
> > + (sh->raid_conf->level == 6 && s->failed && s->to_write &&
> > + s->towrite < sh->raid_conf->raid_disks - 2 &&
> > + (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))))) {
> > /* we would like to get this block, possibly by computing it,
> > * otherwise read it if the backing disk is insync
> > */
> > @@ -3115,7 +3121,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> > !test_bit(R5_LOCKED, &dev->flags) &&
> > !(test_bit(R5_UPTODATE, &dev->flags) ||
> > test_bit(R5_Wantcompute, &dev->flags))) {
> > - if (test_bit(R5_Insync, &dev->flags)) rcw++;
> > + if (test_bit(R5_Insync, &dev->flags))
> > + rcw++;
> > else
> > rcw += 2*disks;
> > }
> > @@ -3136,10 +3143,10 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> > !(test_bit(R5_UPTODATE, &dev->flags) ||
> > test_bit(R5_Wantcompute, &dev->flags)) &&
> > test_bit(R5_Insync, &dev->flags)) {
> > - if (
> > - test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> > - pr_debug("Read_old block "
> > - "%d for r-m-w\n", i);
> > + if (test_bit(STRIPE_PREREAD_ACTIVE,
> > + &sh->state)) {
> > + pr_debug("Read_old block %d for r-m-w\n",
> > + i);
> > set_bit(R5_LOCKED, &dev->flags);
> > set_bit(R5_Wantread, &dev->flags);
> > s->locked++;
> > @@ -3162,10 +3169,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> > !(test_bit(R5_UPTODATE, &dev->flags) ||
> > test_bit(R5_Wantcompute, &dev->flags))) {
> > rcw++;
> > - if (!test_bit(R5_Insync, &dev->flags))
> > - continue; /* it's a failed drive */
> > - if (
> > - test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> > + if (test_bit(R5_Insync, &dev->flags) &&
> > + test_bit(STRIPE_PREREAD_ACTIVE,
> > + &sh->state)) {
> > pr_debug("Read_old block "
> > "%d for Reconstruct\n", i);
> > set_bit(R5_LOCKED, &dev->flags);
^ permalink raw reply [flat|nested] 11+ messages in thread
Thread overview: 11+ messages
2012-05-23 19:01 Sequential writing to degraded RAID6 causing a lot of reading Patrik Horník
2012-05-24 4:48 ` NeilBrown
2012-05-24 12:37 ` Patrik Horník
2012-05-25 16:07 ` Patrik Horník
2012-05-28 1:31 ` NeilBrown
2014-05-15 7:04 ` Patrik Horník
2014-05-15 7:18 ` NeilBrown
2014-05-15 7:50 ` Patrik Horník
2014-05-20 5:42 ` NeilBrown
2014-05-20 10:07 ` Patrik Horník
2014-05-20 11:08 ` NeilBrown