with raid-6 any writes access all disks

* with raid-6 any writes access all disks
@ 2011-10-26 21:01 Chris Pearson
  2011-10-26 21:23 ` Peter W. Morreale
  2011-10-26 21:23 ` NeilBrown
  0 siblings, 2 replies; 8+ messages in thread
From: Chris Pearson @ 2011-10-26 21:01 UTC (permalink / raw)
  To: linux-raid

In 2.6.39.1, any writes to a raid-6 array cause all disks to be
accessed.  Though I don't understand the math behind raid-6, I have
tested on LSI cards that it is possible to only access 3 disks.

* Re: with raid-6 any writes access all disks
  2011-10-26 21:01 with raid-6 any writes access all disks Chris Pearson
@ 2011-10-26 21:23 ` Peter W. Morreale
  2011-10-26 21:23 ` NeilBrown
  1 sibling, 0 replies; 8+ messages in thread
From: Peter W. Morreale @ 2011-10-26 21:23 UTC (permalink / raw)
  To: Chris Pearson; +Cc: linux-raid

On Wed, 2011-10-26 at 16:01 -0500, Chris Pearson wrote: 
> In 2.6.39.1, any writes to a raid-6 array cause all disks to be
> accessed.  Though I don't understand the math behind raid-6, I have
> tested on LSI cards that it is possible to only access 3 disks.
> --

Hi Chris,

Did you mean to ask a question?

Thx,
-PWM


* Re: with raid-6 any writes access all disks
  2011-10-26 21:01 with raid-6 any writes access all disks Chris Pearson
  2011-10-26 21:23 ` Peter W. Morreale
@ 2011-10-26 21:23 ` NeilBrown
  2011-10-26 22:30   ` H. Peter Anvin
  1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2011-10-26 21:23 UTC (permalink / raw)
  To: Chris Pearson; +Cc: linux-raid, H. Peter Anvin

On Wed, 26 Oct 2011 16:01:19 -0500 Chris Pearson <kermit4@gmail.com> wrote:

> In 2.6.39.1, any writes to a raid-6 array cause all disks to be
> accessed.  Though I don't understand the math behind raid-6, I have
> tested on LSI cards that it is possible to only access 3 disks.

You are correct.  md/raid6 doesn't do the required maths.

i.e.  it always adds all data together to calculate the parity.
It never subtracts old data from the parity and then adds new data.

This was a decision made by the original implementer (hpa) and no-one has
offered code to change it.

(yes, I review and accept patches :-)

NeilBrown

* Re: with raid-6 any writes access all disks
  2011-10-26 21:23 ` NeilBrown
@ 2011-10-26 22:30   ` H. Peter Anvin
  2011-10-27  9:29     ` David Brown
  0 siblings, 1 reply; 8+ messages in thread
From: H. Peter Anvin @ 2011-10-26 22:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Chris Pearson, linux-raid

On 10/26/2011 11:23 PM, NeilBrown wrote:
> On Wed, 26 Oct 2011 16:01:19 -0500 Chris Pearson
> <kermit4@gmail.com> wrote:
> 
>> In 2.6.39.1, any writes to a raid-6 array cause all disks to be 
>> accessed.  Though I don't understand the math behind raid-6, I
>> have tested on LSI cards that it is possible to only access 3
>> disks.
> 
> You are correct.  md/raid6 doesn't do the required maths.
> 
> i.e.  it always adds all data together to calculate the parity. It
> never subtracts old data from the parity and then adds new data.
> 
> This was a decision made by the original implementer (hpa) and
> no-one has offered code to change it.
> 
> (yes, I review and accept patches :-)
> 

This was based on benchmarks at the time that indicated that
performance suffered more than it helped.  However, since then CPUs
have gotten much faster whereas disks haven't.  There was also the
issue of getting something working reliably first.

Getting a set of hardware acceleration routines for arbitrary GF
multiplies (as is possible with SSSE3) might change that tradeoff
dramatically.
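
For readers who have not seen the idea, here is a minimal sketch of how a multiply by an arbitrary constant in GF(2^8) can be reduced to two 16-entry table lookups, one per nibble. That is the kind of operation a byte-shuffle instruction can apply to many bytes at once. This is an illustration in Python, not code from md: the reduction polynomial 0x11D is an assumption, and the names gf_mul2, gf_mul, nibble_tables and mul_const are made up for this example.

```python
POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul2(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8)."""
    x <<= 1
    if x & 0x100:
        x ^= POLY
    return x


def gf_mul(a: int, b: int) -> int:
    """General GF(2^8) multiply by shift-and-add (slow reference version)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_mul2(a)
        b >>= 1
    return result


def nibble_tables(c: int):
    """Two 16-entry tables: c times each low nibble, c times each high nibble."""
    low = [gf_mul(c, n) for n in range(16)]
    high = [gf_mul(c, n << 4) for n in range(16)]
    return low, high


def mul_const(block: bytes, c: int) -> bytes:
    """Multiply every byte of block by c using only the two small tables."""
    low, high = nibble_tables(c)
    return bytes(low[b & 0x0F] ^ high[b >> 4] for b in block)
```

The split works because multiplication distributes over xor: c.b = c.(low nibble of b) + c.(high nibble of b shifted left by 4).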

	-hpa


* Re: with raid-6 any writes access all disks
  2011-10-26 22:30   ` H. Peter Anvin
@ 2011-10-27  9:29     ` David Brown
  2011-10-27 12:22       ` H. Peter Anvin
  0 siblings, 1 reply; 8+ messages in thread
From: David Brown @ 2011-10-27  9:29 UTC (permalink / raw)
  Cc: NeilBrown, Chris Pearson, linux-raid

On 27/10/2011 00:30, H. Peter Anvin wrote:
> On 10/26/2011 11:23 PM, NeilBrown wrote:
>> On Wed, 26 Oct 2011 16:01:19 -0500 Chris Pearson
>> <kermit4@gmail.com>  wrote:
>>
>>> In 2.6.39.1, any writes to a raid-6 array cause all disks to be
>>> accessed.  Though I don't understand the math behind raid-6, I
>>> have tested on LSI cards that it is possible to only access 3
>>> disks.
>>
>> You are correct.  md/raid6 doesn't do the required maths.
>>
>> i.e.  it always adds all data together to calculate the parity. It
>> never subtracts old data from the parity and then adds new data.
>>
>> This was a decision made by the original implementer (hpa) and
>> no-one has offered code to change it.
>>
>> (yes, I review and accept patches :-)
>>
>
> This was based on benchmarks at the time that indicated that
> performance suffered more than it helped.  However, since then CPUs
> have gotten much faster whereas disks haven't.  There was also the
> issue of getting something working reliably first.
>
> Getting a set of hardware acceleration routines for arbitrary GF
> multiplies (as is possible with SSSE3) might change that tradeoff
> dramatically.
>
> 	-hpa
>

I can add a little of the theory here (HPA knows it, of course, but 
others might not).  I'm not well versed in the implementation, however.

With RAID5, writes to a single data disk are handled by RMW code, 
writing to the data disk and the parity disk.  The parity is calculated as:

P = D0 + D1 + D2 + .. + Dn

So if Di is to be written, you can use:

P_new = P_old - Di_old + Di_new

Since "-" is the same as "+" (in raid calculations over GF(2^8)), and is 
just "xor", that's easy to calculate.
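
The RAID5 update above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how md implements it; the function names (xor_blocks, full_parity, rmw_parity) and the sample data are invented for the example.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise xor of two equal-length blocks (both "+" and "-" here)."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(data_blocks):
    """P = D0 + D1 + .. + Dn, recomputed from every data block."""
    return reduce(xor_blocks, data_blocks)


def rmw_parity(p_old: bytes, d_old: bytes, d_new: bytes) -> bytes:
    """P_new = P_old - Di_old + Di_new, using only the block being changed."""
    return xor_blocks(xor_blocks(p_old, d_old), d_new)


# Three tiny "disks" of four bytes each; D1 is rewritten.
data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
p_old = full_parity(data)
d1_new = bytes([0xAA, 0xBB, 0xCC, 0xDD])
p_new = rmw_parity(p_old, data[1], d1_new)
```

The read-modify-write path needs only the old data block and the old parity, so two disks are read and two are written regardless of how many disks are in the array.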

As far as I know, the RAID5 code implements this as a special case.  If 
more than one data disk in the stripe needs to be changed, the whole 
stripe is re-written.

It would be possible to do RMW writes for more than one data disk 
without writing the whole stripe, but I suspect the overall speed gains 
would be small - I can imagine that small (single disk) writes happen a 
lot, but writes that affect more than one data disk without affecting 
most of the stripe would be rarer.

RAID6 is more complicated.  The parity calculations are:

P = D0 + D1 + D2 + .. + Dn
Q = D0 + 2.D1 + 2^2.D2 + .. + 2^n.Dn

(All adds, multiplies and powers being done over GF(2^8).)

If you want to re-write Di, you have to calculate:

P_new = P_old - Di_old + Di_new
Q_new = Q_old - 2^i.Di_old + 2^i.Di_new

The P_new calculation is the same as for RAID5.

Q_new can be simplified to:

Q_new = Q_old + 2^i . (Di_old + Di_new)

"Multiplying" by 2 is relatively speaking quite time-consuming in 
GF(2^8).  "Multiplying" by 2^i can be done by either pre-calculating 
a multiply table, or using a loop to repeatedly multiply by 2.
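
As a rough sketch of the Q update, the Python below takes the coefficient of Di to be 2^i (so that D0 has coefficient 1, matching the first terms of the Q formula) and uses the simple loop of repeated multiplies by 2. The reduction polynomial 0x11D is an assumption, and the helper names (gf_mul2, gf_mul_pow2, compute_q, rmw_q) are made up for this example.

```python
POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul2(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8)."""
    x <<= 1
    if x & 0x100:
        x ^= POLY
    return x


def gf_mul_pow2(x: int, i: int) -> int:
    """Multiply one byte by 2^i by looping over gf_mul2."""
    for _ in range(i):
        x = gf_mul2(x)
    return x


def compute_q(data_blocks):
    """Q recomputed from every data block, Di weighted by 2^i."""
    q = bytearray(len(data_blocks[0]))
    for i, block in enumerate(data_blocks):
        for k, byte in enumerate(block):
            q[k] ^= gf_mul_pow2(byte, i)
    return bytes(q)


def rmw_q(q_old: bytes, i: int, d_old: bytes, d_new: bytes) -> bytes:
    """Q_new = Q_old + 2^i . (Di_old + Di_new), touching only block i."""
    return bytes(
        q ^ gf_mul_pow2(o ^ n, i) for q, o, n in zip(q_old, d_old, d_new)
    )


# Four tiny "disks" of two bytes each; D2 is rewritten.
data = [bytes([0x11, 0x22]), bytes([0x33, 0x44]),
        bytes([0x55, 0x66]), bytes([0x77, 0x88])]
q_old = compute_q(data)
d2_new = bytes([0xDE, 0xAD])
q_new = rmw_q(q_old, 2, data[2], d2_new)
```

With this update a single-block write reads the old data, P and Q, and writes the new data, P and Q: three disks in total, however wide the array is.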

When RAID6 was originally implemented in md, cpus were slower and disks 
faster (relatively speaking).  And of course simple, correct code is far 
more important than faster, riskier code.  Because of the way the 
standard Q calculation is implemented (using Horner's rule), the 
re-calculation of the whole of Q doesn't take much longer than the 
worst-case Q_new calculation (when it is the last disk changed), once 
you have the other disks read in (which takes disk time and real time, 
but not cpu time).  Thus the choice was to always re-write the whole stripe.
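
To illustrate the Horner's rule point, here is a small Python sketch that computes Q as D0 + 2.(D1 + 2.(D2 + ..)), so the whole stripe costs one multiply by 2 and one xor per data block, and compares it with the term-by-term form. It is not the md code: the polynomial 0x11D is an assumption, and the names q_naive and q_horner are invented for the example.

```python
POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul2(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8)."""
    x <<= 1
    if x & 0x100:
        x ^= POLY
    return x


def q_naive(data_blocks):
    """Q built term by term: block i is multiplied by 2 exactly i times."""
    q = bytearray(len(data_blocks[0]))
    for i, block in enumerate(data_blocks):
        for k, byte in enumerate(block):
            for _ in range(i):
                byte = gf_mul2(byte)
            q[k] ^= byte
    return bytes(q)


def q_horner(data_blocks):
    """Q built with Horner's rule, starting from the highest-numbered block."""
    q = bytearray(len(data_blocks[0]))
    for block in reversed(data_blocks):
        for k, byte in enumerate(block):
            q[k] = gf_mul2(q[k]) ^ byte
    return bytes(q)


data = [bytes([0x01, 0xFE]), bytes([0x80, 0x7F]),
        bytes([0xC3, 0x3C]), bytes([0xFF, 0x00])]
```

The Horner form never needs a multiply by anything other than 2, which is why the full recalculation is cheap in cpu terms once all the data blocks are in memory.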

However, since then, we have faster cpus, slower disks (relatively 
speaking), more disks in arrays, more SIMD cpu instructions, and better 
compilers.  This means the balance has changed, and implementing RMW in 
RAID6 would almost certainly speed up small writes, as well as reducing 
the wear on the disks.

I don't know what compiler versions are typically used to compile the 
kernel, but from gcc 4.4 onwards there is a "target" function attribute 
that can be used to change the target cpu for a function.  What this 
means is that the C code can be written once, and multiple versions of 
it can be compiled with features such as "sse", "sse4", "altivec", 
"neon", etc.  And newer versions of the compiler are getting better at 
using these cpu features automatically.  It should therefore be 
practical to get high-speed code suited to the particular cpu you are 
running on, without needing hand-written SSE/Altivec assembly code. 
That would save a lot of time and effort on writing, testing and 
maintenance.

That's the theory, anyway - in case anyone has the time and ability to 
implement it!

* Re: with raid-6 any writes access all disks
  2011-10-27  9:29     ` David Brown
@ 2011-10-27 12:22       ` H. Peter Anvin
  2011-10-27 13:05         ` David Brown
  0 siblings, 1 reply; 8+ messages in thread
From: H. Peter Anvin @ 2011-10-27 12:22 UTC (permalink / raw)
  To: David Brown; +Cc: NeilBrown, Chris Pearson, linux-raid

On 10/27/2011 11:29 AM, David Brown wrote:
> 
> Q_new can be simplified to:
> 
> Q_new = Q_old + 2^i . (Di_old + Di_new)
> 
> "Multiplying" by 2 is relatively speaking quite time-consuming in
> GF(2^8).  "Multiplying" by 2^i can be done by either pre-calculating
> a multiply table, or using a loop to repeatedly multiply by 2.
> 

Multiplying by 2 is cheap.  Multiplying by an arbitrary number is more
expensive, in the absence of tricks that can be played on specific
hardware implementations (e.g. SSSE3) as mentioned in my paper.
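
One way to see why the multiply by 2 is cheap: it can be done on many bytes packed into a machine word with a handful of shifts, masks and xors, with no per-byte branch. The Python sketch below models a 64-bit word holding eight bytes. It is an illustration only; the polynomial 0x11D and the helper names (nbytes, mul2_packed, gf_mul2) are assumptions made for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def nbytes(b: int) -> int:
    """Replicate one byte value into all eight bytes of a 64-bit word."""
    return b * 0x0101010101010101


def mul2_packed(v: int) -> int:
    """Multiply eight packed bytes by 2 in GF(2^8) in one pass."""
    shifted = (v << 1) & nbytes(0xFE)          # per-byte shift, no carry across bytes
    high = v & nbytes(0x80)                    # bytes whose top bit was set
    mask = ((high << 1) - (high >> 7)) & MASK64  # 0xFF in exactly those bytes
    return shifted ^ (mask & nbytes(0x1D))     # reduce only where needed


def gf_mul2(x: int) -> int:
    """Byte-at-a-time reference version, for comparison."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x


raw = bytes([0x00, 0x01, 0x7F, 0x80, 0x81, 0xFF, 0x1D, 0xC3])
packed_result = mul2_packed(int.from_bytes(raw, "little")).to_bytes(8, "little")
```

A multiply by an arbitrary constant has no equally short sequence, which is where table lookups or special instructions come in.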

> 
> I don't know what compiler versions are typically used to compile the
> kernel, but from gcc 4.4 onwards there is a "target" function attribute
> that can be used to change the target cpu for a function.  What this
> means is that the C code can be written once, and multiple versions of
> it can be compiled with features such as "sse", "sse4", "altivec",
> "neon", etc.  And newer versions of the compiler are getting better at
> using these cpu features automatically.  It should therefore be
> practical to get high-speed code suited to the particular cpu you are
> running on, without needing hand-written SSE/Altivec assembly code. That
> would save a lot of time and effort on writing, testing and maintenance.
> 

Nice in theory; doesn't work in practice in my experience.

	-hpa


* Re: with raid-6 any writes access all disks
  2011-10-27 12:22       ` H. Peter Anvin
@ 2011-10-27 13:05         ` David Brown
  2011-11-01 22:22           ` H. Peter Anvin
  0 siblings, 1 reply; 8+ messages in thread
From: David Brown @ 2011-10-27 13:05 UTC (permalink / raw)
  Cc: NeilBrown, Chris Pearson, linux-raid

On 27/10/2011 14:22, H. Peter Anvin wrote:
> On 10/27/2011 11:29 AM, David Brown wrote:
>>
>> Q_new can be simplified to:
>>
>> Q_new = Q_old + 2^i . (Di_old + Di_new)
>>
>> "Multiplying" by 2 is relatively speaking quite time-consuming in
>> GF(2^8).  "Multiplying" by 2^i can be done by either pre-calculating
>> a multiply table, or using a loop to repeatedly multiply by 2.
>>
>
> Multiplying by 2 is cheap.  Multiplying by an arbitrary number is more
> expensive, in the absence of tricks that can be played on specific
> hardware implementations (e.g. SSSE3) as mentioned in my paper.

Of course, it all depends on the comparisons - multiplying by 2 is 
fairly cheap, but still more work than the simple "add" (xor) used in 
RAID5.  But I agree that the looping for arbitrary powers of 2 is much 
more costly.

Perhaps it makes sense to have functions dedicated to multiplying 
particular powers-of-two (over a full block).  The loop overhead will 
dominate for small powers, so these could be split off into individual 
implementations.  For larger powers, a loop would be used.  And for 
still larger powers, a lookup table would be faster.  I don't know where 
the boundaries go for these.
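
A minimal Python sketch of the two ends of that range, assuming the polynomial 0x11D: a loop of repeated multiplies by 2, and a 256-entry table built once per power and then applied to a whole block. The names (mul_block_loop, make_pow2_table, mul_block_table) are invented for the example, and nothing here says where the crossover between the approaches lies.

```python
POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul2(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8)."""
    x <<= 1
    if x & 0x100:
        x ^= POLY
    return x


def gf_mul_pow2(x: int, i: int) -> int:
    """Multiply one byte by 2^i with a loop of i doublings."""
    for _ in range(i):
        x = gf_mul2(x)
    return x


def mul_block_loop(block: bytes, i: int) -> bytes:
    """Loop version: cost per byte grows with i."""
    return bytes(gf_mul_pow2(b, i) for b in block)


def make_pow2_table(i: int) -> bytes:
    """256-entry table for "times 2^i"; built once, reused for every block."""
    return bytes(gf_mul_pow2(x, i) for x in range(256))


def mul_block_table(block: bytes, table: bytes) -> bytes:
    """Table version: one lookup per byte, independent of i."""
    return block.translate(table)


block = bytes(range(0, 256, 5))
table_for_6 = make_pow2_table(6)
```

The table costs 256 bytes per power, so an array with n data disks would need n such tables if every power got its own.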

>
>>
>> I don't know what compiler versions are typically used to compile the
>> kernel, but from gcc 4.4 onwards there is a "target" function attribute
>> that can be used to change the target cpu for a function.  What this
>> means is that the C code can be written once, and multiple versions of
>> it can be compiled with features such as "sse", "sse4", "altivec",
>> "neon", etc.  And newer versions of the compiler are getting better at
>> using these cpu features automatically.  It should therefore be
>> practical to get high-speed code suited to the particular cpu you are
>> running on, without needing hand-written SSE/Altivec assembly code. That
>> would save a lot of time and effort on writing, testing and maintenance.
>>
>
> Nice in theory; doesn't work in practice in my experience.
>

Where does it go wrong?  Is it the automatic vectorisation with SSE, 
etc., that is still too limited with gcc?  I have done very little work 
with x86/amd64 assembly (most of my experience is with microcontrollers 
rather than "big" processors), so I haven't tried looking at gcc's SSE 
code and comparing it to hand-optimised code.

mvh.,

David

* Re: with raid-6 any writes access all disks
  2011-10-27 13:05         ` David Brown
@ 2011-11-01 22:22           ` H. Peter Anvin
  0 siblings, 0 replies; 8+ messages in thread
From: H. Peter Anvin @ 2011-11-01 22:22 UTC (permalink / raw)
  To: David Brown; +Cc: NeilBrown, Chris Pearson, linux-raid

On 10/27/2011 06:05 AM, David Brown wrote:
> 
> Where does it go wrong?  Is it the automatic vectorisation with SSE,
> etc., that is still too limited with gcc?  I have done very little work
> with x86/amd64 assembly (most of my experience is with microcontrollers
> rather than "big" processors), so I haven't tried looking at gcc's SSE
> code and comparing it to hand-optimised code.
> 

The autovectorization isn't good enough to understand the tricks that
are necessary to get good performance.  They require leaning pretty hard
on the instruction set.

	-hpa

