* Accelerating Linux software raid
From: Dan Williams @ 2005-09-06 18:24 UTC
To: linux-raid; +Cc: dave.jiang
Hello,
I am writing to the list to gauge interest in a modification of the
md driver that allows it to take advantage of raid acceleration
hardware. I/O processors like the Intel IOP333
(http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
engine for raid5 and raid6 calculations, but currently the md driver
does not fully utilize these resources.
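
For concreteness, the raid5 parity such an engine computes is just the
byte-wise XOR of the data blocks in a stripe, so any single missing
block can be rebuilt by xoring the survivors with parity. A minimal C
sketch of the operation (an illustration only, not the md
implementation) looks like this:

  #include <stddef.h>
  #include <stdint.h>

  /* P = D0 ^ D1 ^ ... ^ Dn-1, computed a byte at a time */
  static void xor_parity(uint8_t *parity, uint8_t *const src[],
                         size_t nsrc, size_t len)
  {
          for (size_t i = 0; i < len; i++) {
                  uint8_t p = src[0][i];
                  for (size_t s = 1; s < nsrc; s++)
                          p ^= src[s][i];
                  parity[i] = p;
          }
  }

The kernel's software xor routines and an offload engine both perform
this same operation; the difference is where the bytes get crunched and
how much it costs to start each operation.
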
Dave Jiang wrote a driver that re-routed calls to xor_block() to use
the hardware xor engine. However, from my understanding, he found that
performance did not improve, due to the fact that md deals in
PAGE_SIZE (4K) blocks. At 4K the overhead of setting up the engine
destroys any performance advantage over a software xor. The goal of
the modification would be to enable md to understand the capacity of
the platform's xor resources and allow it to issue optimal block
sizes.
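
As a toy model of why block size matters (the 10us setup cost and
2 GB/s engine bandwidth below are assumptions chosen for illustration,
not IOP333 measurements), an engine with a fixed per-operation setup
cost only approaches its raw bandwidth once blocks are large enough to
amortize that cost:

  #include <stdio.h>

  int main(void)
  {
          const double setup_s   = 10e-6; /* assumed setup + completion cost */
          const double engine_bw = 2e9;   /* assumed engine bandwidth, bytes/s */
          const double sizes[]   = { 4096, 64 * 1024, 1024 * 1024 };

          for (int i = 0; i < 3; i++) {
                  double t = setup_s + sizes[i] / engine_bw;
                  printf("%8.0f-byte xor: effective %6.1f MB/s\n",
                         sizes[i], sizes[i] / t / 1e6);
          }
          return 0;
  }

With these made-up numbers a 4K operation reaches only a few hundred
MB/s, while a 1MB operation runs near the engine's full rate, which is
the effect described above.
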
The first question is whether a solution along these lines would be
valued by the community. The effort is non-trivial.
Assuming a positive response I will solicit implementation ideas and
acceptance criteria from the list.
Thank you for your consideration,
Dan

* Re: Accelerating Linux software raid
From: Molle Bestefich @ 2005-09-06 21:52 UTC
To: linux-raid

Dan Williams wrote:
> The first question is whether a solution along these lines would be
> valued by the community. The effort is non-trivial.

I don't represent the community, but I think the idea is great.

When will it be finished and where can I buy the hardware? :-)

And if you don't mind terribly, could you also add hardware
acceleration support to loop-aes now that you're at it? :-)

* Re: Accelerating Linux software raid
From: Mark Hahn @ 2005-09-10 4:51 UTC
To: Dan Williams; +Cc: linux-raid

> I am writing to the list to gauge interest in a modification of the
> md driver that allows it to take advantage of raid acceleration

not that much, I think.

> hardware. I/O processors like the Intel IOP333
> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
> engine for raid5 and raid6 calculations, but currently the md driver
> does not fully utilize these resources.

the worst insult in the linux world is "solution in search of a
problem".  that applies here: are you sure that there is a problem?
yes, offload can be a lovely thing, but it often falls behind the main
driver of the industry - host cpu performance.  unless you're
specifically targeting a high-IO device with very little CPU power, I
think you'll find a lot of skepticism about IO coprocessors.

I have a server that can do the raid5 checksum at 8 GB/s, and have no
reason to ever want more than ~100 MB/s on that machine.  do I care
about "wasting" 1/80th of the machine?  not really, even though it's a
supercomputing cluster node.  for fileservers, I mind even less
wasting CPU using the host for parity, since the cycles aren't going
to be used for anything else...

> Dave Jiang wrote a driver that re-routed calls to xor_block() to use
> the hardware xor engine. However, from my understanding, he found
> that performance did not improve, due to the fact that md deals in
> PAGE_SIZE (4K) blocks.

well, it calls xor_block with STRIPE_SIZE, which is indeed PAGE_SIZE.

> destroys any performance advantage over a software xor. The goal of
> the modification would be to enable md to understand the capacity of
> the platform's xor resources and allow it to issue optimal block
> sizes.

this argument assumes that the HW xor is actually faster than the
host, though.  is that true?  otherwise, HW xor starts out behind due
to the setup overhead, and falls further behind for larger stripe
sizes!

> The first question is whether a solution along these lines would be
> valued by the community. The effort is non-trivial.

if you're interested in speeding up MD, then looking at how to make
STRIPE_SIZE bigger might be worthwhile (independent of HW xor).

regards, mark hahn.

* Re: Accelerating Linux software raid
From: Ric Wheeler @ 2005-09-10 12:58 UTC
To: Mark Hahn; +Cc: Dan Williams, linux-raid

Mark Hahn wrote:
>> hardware. I/O processors like the Intel IOP333
>> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
>> engine for raid5 and raid6 calculations, but currently the md driver
>> does not fully utilize these resources.
>
> the worst insult in the linux world is "solution in search of a
> problem".  that applies here: are you sure that there is a problem?
> yes, offload can be a lovely thing, but it often falls behind the
> main driver of the industry - host cpu performance.  unless you're
> specifically targeting a high-IO device with very little CPU power,
> I think you'll find a lot of skepticism about IO coprocessors.
>
> I have a server that can do the raid5 checksum at 8 GB/s, and have no
> reason to ever want more than ~100 MB/s on that machine.  do I care
> about "wasting" 1/80th of the machine?  not really, even though it's
> a supercomputing cluster node.  for fileservers, I mind even less
> wasting CPU using the host for parity, since the cycles aren't going
> to be used for anything else...

I think that the above holds for server applications, but there are
lots of places where you will start to see a need for serious IO
capabilities in low power, multi-core designs.  Think of your Tivo
starting to store family photos - you don't want to bolt a server
class box under your TV in order to get some reasonable data
protection ;-)

In the Centera group where I work, we have a linux based box that is
used for archival storage.  Customers understand why the cost of a box
is related to the number of disks, but the strength of the CPU, memory
subsystem, etc. are all more or less thought of as overhead (not to
mention that nasty software stuff that I work on ;-)).  In this kind
of environment as well, finding an elegant way to take advantage of
the capabilities of the new system-on-a-chip parts is a win.

Also keep in mind that the xor done for simple RAID is not the whole
story - think of compression offload, encryption, etc., which might
also be able to leverage a well thought out solution.

ric

* Re: Accelerating Linux software raid
From: Mark Hahn @ 2005-09-10 15:35 UTC
To: Ric Wheeler; +Cc: Dan Williams, linux-raid

> I think that the above holds for server applications, but there are
> lots of places where you will start to see a need for serious IO
> capabilities in low power, multi-core designs.  Think of your Tivo
> starting to store family photos - you don't want to bolt a server
> class box under your TV in order to get some reasonable data
> protection ;-)

I understand your point, but are the numbers right?  it seems to me
that the main factor in appliance design is power dissipation, and I'm
guessing a budget of say 20W for the CPU.  these days, that's a pretty
fast processor, of the mobile-athlon-64 range - probably 3 GB/s xor
performance.  I'd guess it amounts to perhaps 5-10% cpu overhead if
the appliance were, for some reason, writing at 100 MB/s.  of course,
it is NOT writing at that rate (remember, reading doesn't require
xors, and appliances probably do more reads than writes...)

> In the Centera group where I work, we have a linux based box that is
> used for archival storage.  Customers understand why the cost of a
> box is related to the number of disks, but the strength of the CPU,
> memory subsystem, etc. are all more or less thought of as overhead
> (not to mention that nasty software stuff that I work on ;-)).

again, no offense meant, but I hear you saying "we under-designed the
centera host processor, and over-priced it, so that people are trying
to stretch their budget by piling on too many disks".  I'm actually a
little surprised, since I figured the Centera design would be a sane,
modern, building-block-based one, where you could cheaply scale the
number of host processors, not just disks (like an old-fashioned,
not-mourned SAN).  I see a lot of people using a high-performance
network like IB as an internal backplane-like way to tie together a
cluster-in-a-box (and I expect they'll sprint from IB to 10G real soon
now).

but then again, you did say this was an archive box.  so what is the
bandwidth of data coming in?  that's the number that sizes your host
cpu.  being able to do xor at 12 GB/s is kind of pointless if the
server has just one or two 2 Gb net links...

> In this kind of environment as well, finding an elegant way to take
> advantage of the capabilities of the new system-on-a-chip parts is a
> win.

I understand completely - aesthetic elegance does tend to inspire most
solutions-in-search-of-problems.  it's one of those urges that we all
must simply learn to stifle every day.

> Also keep in mind that the xor done for simple RAID is not the whole
> story - think of compression offload, encryption, etc., which might
> also be able to leverage a well thought out solution.

this is an excellent point, and one that argues *against* HW
coprocessing.  consider the NIC market: TOE never happened because
adding tcp/ssl to a separate card just moves the complexity and bugs
from an easy-to-patch place into a harder-to-patch place.  I'd much
rather upgrade from a uni server to a dual and run the tcp/ssl in
software than spend the same amount of money on a $2000 nic that runs
its own OS.  my tcp stack bugs get fixed in a few hours if I email
netdev, but who knows how long bugs would linger in the firmware stack
of a TOE card?

same thing here, except more so.  making storage appliances smarter is
great, but why put that smarts in some kind of opaque, inaccessible
and hard-to-use coprocessor?  good, thoughtful design leads towards a
loosely-coupled cluster of off-the-shelf components...

regards, mark hahn.
(I run a large supercomputing center, and spend a lot of effort
specifying and using big compute and storage hardware...)

* Re: Accelerating Linux software raid
From: Dan Williams @ 2005-09-10 19:13 UTC
To: Mark Hahn; +Cc: Ric Wheeler, linux-raid

> this is an excellent point, and one that argues *against* HW
> coprocessing.  consider the NIC market: TOE never happened because
> adding tcp/ssl to a separate card just moves the complexity and bugs
> from an easy-to-patch place into a harder-to-patch place.  I'd much
> rather upgrade from a uni server to a dual and run the tcp/ssl in
> software than spend the same amount of money on a $2000 nic that
> runs its own OS.  my tcp stack bugs get fixed in a few hours if I
> email netdev, but who knows how long bugs would linger in the
> firmware stack of a TOE card?
>
> same thing here, except more so.  making storage appliances smarter
> is great, but why put that smarts in some kind of opaque,
> inaccessible and hard-to-use coprocessor?  good, thoughtful design
> leads towards a loosely-coupled cluster of off-the-shelf
> components...

The question here is not whether a modern server can outperform a
coprocessor at a given task.  Of course it can.  The issue is how to
scale embedded Linux I/O performance for system-on-a-chip storage
silicon designs.  An embedded design breaks some of the assumptions of
the current driver: dedicated raid5/6 offload logic is available, and,
in general, system resources are biased towards the I/O subsystem.

I disagree that it is a solution looking for a problem.  The problem
is that the MD driver performs suboptimally on these platforms.

I'm learning MD by reading the source and stepping through it with a
debugger.  If anyone knows of other documentation or talks given about
MD, please point me to it.

Thanks,
Dan

* Re: Accelerating Linux software raid
From: Ric Wheeler @ 2005-09-11 2:06 UTC
To: Mark Hahn; +Cc: Dan Williams, linux-raid

Mark Hahn wrote:
> I understand your point, but are the numbers right?  it seems to me
> that the main factor in appliance design is power dissipation, and
> I'm guessing a budget of say 20W for the CPU.  these days, that's a
> pretty fast processor, of the mobile-athlon-64 range - probably
> 3 GB/s xor performance.  I'd guess it amounts to perhaps 5-10% cpu
> overhead if the appliance were, for some reason, writing at 100 MB/s.
> of course, it is NOT writing at that rate (remember, reading doesn't
> require xors, and appliances probably do more reads than writes...)

I think that one thing your response shows is a small misunderstanding
of what this class of part is.  It is not a TOE in the classic sense,
rather a generally useful (non-standard) execution unit that can do
some restricted set of operations well but is not intended to be used
as a full second (or third or fourth) CPU.  If you get the code and
design right, this will be a very simple driver calling functions that
offload specific computations to these specialized execution units.

If you look at public numbers for power for modern Intel architecture
CPUs, say Tom's Hardware at:

http://www.tomshardware.com/cpu/20050525/pentium4-02.html

you will see that the 20W budget you allocate is much closer to the
power budget for these embedded parts than to that of any modern
desktop or server CPU.  Mobile parts draw much less power than server
CPUs and come somewhat closer to your number.

> again, no offense meant, but I hear you saying "we under-designed the
> centera host processor, and over-priced it, so that people are trying
> to stretch their budget by piling on too many disks".  I'm actually a
> little surprised, since I figured the Centera design would be a sane,
> modern, building-block-based one, where you could cheaply scale the
> number of host processors, not just disks (like an old-fashioned,
> not-mourned SAN).  I see a lot of people using a high-performance
> network like IB as an internal backplane-like way to tie together a
> cluster-in-a-box (and I expect they'll sprint from IB to 10G real
> soon now).

These operations are not done only during ingest; they can be used to
check the integrity of the already stored data, regenerate data, etc.
I don't want to hawk Centera here, but we are definitely a scalable
design using building blocks ;-)

What I tried to get across is the opposite of your summary, i.e. a
customer who buys storage devices prefers to pay for storage capacity
(media) instead of the infrastructure used to provide storage, and
they expect engineers to do the hard work to give them that storage at
the best possible price.  We definitely use commodity hardware, we
just try to get as much out of it as possible.

> but then again, you did say this was an archive box.  so what is the
> bandwidth of data coming in?  that's the number that sizes your host
> cpu.  being able to do xor at 12 GB/s is kind of pointless if the
> server has just one or two 2 Gb net links...

Storage arrays like Centera are not block devices; we do a lot more
high level functions (real file systems, scrubbing, indexing, etc.).
All of these functions require CPU, disk, etc., so anything that we
can save can be used to provide added functionality.

> this is an excellent point, and one that argues *against* HW
> coprocessing.  consider the NIC market: TOE never happened because
> adding tcp/ssl to a separate card just moves the complexity and bugs
> from an easy-to-patch place into a harder-to-patch place.  I'd much
> rather upgrade from a uni server to a dual and run the tcp/ssl in
> software than spend the same amount of money on a $2000 nic that
> runs its own OS.  my tcp stack bugs get fixed in a few hours if I
> email netdev, but who knows how long bugs would linger in the
> firmware stack of a TOE card?

Again, I think you misunderstand the part and the intention of the
project.  Not everyone (much to our sorrow) wants a huge storage
system - some people might be able to do with very small, quiet
appliances for their archives.

> same thing here, except more so.  making storage appliances smarter
> is great, but why put that smarts in some kind of opaque,
> inaccessible and hard-to-use coprocessor?  good, thoughtful design
> leads towards a loosely-coupled cluster of off-the-shelf
> components...
>
> regards, mark hahn.
> (I run a large supercomputing center, and spend a lot of effort
> specifying and using big compute and storage hardware...)

I am an ex-Thinking Machines OS developer who spent time working on
the Paragon OS at OSF, and I have a fair appreciation for large
customers with deep wallets.  If everyone wanted to buy large
installations built with high powered hardware, my life would be much
easier ;-)

regards,
ric

* Re: Accelerating Linux software raid
From: Konstantin Olchanski @ 2005-09-11 2:35 UTC
To: Ric Wheeler; +Cc: Mark Hahn, Dan Williams, linux-raid

On Sat, Sep 10, 2005 at 10:06:21PM -0400, Ric Wheeler wrote:
> If you look at public numbers for power for modern Intel architecture
> CPUs, say Tom's Hardware at:
...
> you will see that the 20W budget you allocate...

I am now confused.  Is somebody trying to save power by adding an i/o
coprocessor (with its own power overhead for memory, i/o, etc)?

To me it is simple:

1) If you have an infinite power budget (big box), you might as well
   let the main cpus do the raid stuff.  If you are short on power
   (embedded), you cannot afford to power an extra processor (+memory
   and stuff).

2) If you have rich customers (big box), let them pay for a bigger
   main cpu to do the raid; if you want to be cheap (embedded,
   appliance), you cannot afford to plop an extra cpu (+support chips)
   on your custom pcb.

--
Konstantin Olchanski
Data Acquisition Systems: The Bytes Must Flow!
Email: olchansk-at-triumf-dot-ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada

* Re: Accelerating Linux software raid
From: Ric Wheeler @ 2005-09-11 12:00 UTC
To: Konstantin Olchanski; +Cc: Mark Hahn, Dan Williams, linux-raid

Konstantin Olchanski wrote:
> I am now confused.  Is somebody trying to save power by adding an i/o
> coprocessor (with its own power overhead for memory, i/o, etc)?
>
> To me it is simple:
>
> 1) If you have an infinite power budget (big box), you might as well
>    let the main cpus do the raid stuff.  If you are short on power
>    (embedded), you cannot afford to power an extra processor (+memory
>    and stuff).
>
> 2) If you have rich customers (big box), let them pay for a bigger
>    main cpu to do the raid; if you want to be cheap (embedded,
>    appliance), you cannot afford to plop an extra cpu (+support
>    chips) on your custom pcb.

The actual facts don't support this view, since the gap in power
consumption is huge.  Most of these system-on-a-chip designs provide
the main CPU/northbridge/southbridge and extra execution units for a
small fraction of one standard CPU - say under 20 watts for all of the
above, versus up to (sometimes over) 100 watts for a standard CPU
(without its system chip sets).

* Re: Accelerating Linux software raid
From: Mark Hahn @ 2005-09-11 20:19 UTC
To: linux-raid

>> 1) If you have an infinite power budget (big box), you might as well
>>    let the main cpus do the raid stuff.  If you are short on power
>>    (embedded), you cannot afford to power an extra processor
>>    (+memory and stuff).
>>
>> 2) If you have rich customers (big box), let them pay for a bigger
>>    main cpu to do the raid; if you want to be cheap (embedded,
>>    appliance), you cannot afford to plop an extra cpu (+support
>>    chips) on your custom pcb.
>
> The actual facts don't support this view, since the gap in power
> consumption is huge.  Most of these system-on-a-chip designs provide
> the main CPU/northbridge/southbridge and extra execution units for a
> small fraction of one standard CPU - say under 20 watts for all of
> the above, versus up to (sometimes over) 100 watts for a standard
> CPU (without its system chip sets).

we appear to be talking about different things.  the original
suggestion was for appliances like tivo, which clearly have a limited
power budget, but certainly > 20W.  I responded by suggesting that
current mainstream mobile CPUs (like mobile athlon64's) have PLENTY of
power to run MD - in fact, more than an appliance could possibly have
any need for.  and they dissipate 20-30W.

then the topic somehow mutated into SoC designs, such as Intel's,
which are actually intended to *be* the raid card, and have an ARM
core peaking at 600 MHz, and probably are challenged to sustain even
100 MB/s (in spite of bragging about a multi-GB/s internal bus).

in other words, there's a third category: OEM customers who want to
build a smart raid card that consists of a SoC running linux actually
implementing the raid.  the main technical impediment is that Intel's
solution has XOR and DMA engines to make up for the wimpy CPU, but
those engines are barely profitable with 4k block sizes.  since MD
does XORs in 4k chunks, some hacking would be necessary to expand this
size.  I expect this would have some modest benefit for systems other
than the Intel SoC.

(but I question the sanity of the Intel approach in the first place,
since I believe the trend in storage is away from this kind of
integration.  obviously, it also doesn't make much sense for linux
hosts, but such a card would probably find more of a market in the
windows world.)

regards, mark hahn.

* Re: Accelerating Linux software raid
From: Colonel Hell @ 2005-09-10 8:35 UTC
To: dan.j.williams; +Cc: linux-raid

Hi, Dan:

> the md driver that allows it to take advantage of raid acceleration
> hardware. I/O processors like the Intel IOP333
> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
> engine for raid5 and raid6 calculations, but currently the md driver
> does not fully utilize these resources.

does that mean offloading RAID to the IOP board, or simply using the
XOR engine in the IOP?  if you have some benchmarks which show that
under certain workloads md is not IO-limited (due to excessive XOR
computation), and you concluded from them that offloading XOR is a
good idea, then it must be :) ...  I was going through the IOP manual
myself and saw that they have a raid6 accelerator too :)

Let me know.
-Johri.

* Re: Accelerating Linux software raid
From: Neil Brown @ 2005-09-11 23:14 UTC
To: dan.j.williams; +Cc: linux-raid, dave.jiang

On Tuesday September 6, dan.j.williams@gmail.com wrote:
> Hello,
>
> I am writing to the list to gauge interest in a modification of the
> md driver that allows it to take advantage of raid acceleration
> hardware. I/O processors like the Intel IOP333
> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
> engine for raid5 and raid6 calculations, but currently the md driver
> does not fully utilize these resources.
>
> Dave Jiang wrote a driver that re-routed calls to xor_block() to use
> the hardware xor engine. However, from my understanding, he found
> that performance did not improve, due to the fact that md deals in
> PAGE_SIZE (4K) blocks. At 4K the overhead of setting up the engine
> destroys any performance advantage over a software xor. The goal of
> the modification would be to enable md to understand the capacity of
> the platform's xor resources and allow it to issue optimal block
> sizes.
>
> The first question is whether a solution along these lines would be
> valued by the community. The effort is non-trivial.

If the effort is non-trivial, then I suggest you only do it if it has
real value to *you*.  If it does, community involvement is more likely
to provide value *to* you (such as guidance, bug-fixes, long-term
maintenance) than to get value *from* you, though hopefully it would
be a win-win situation.

I'm not surprised that simply replacing xor_block with calls into the
hardware engine didn't help much.  xor_block is currently called under
a spinlock, so the main processor will probably be completely idle
while the AA is doing the XOR calculation, so there isn't much room
for improvement.

If I were to try to implement this, here is how I would do it:

1/ Get the xor calc out from under the spinlock.

   This will require a fairly deep understanding of the
   handle_stripe() function.  The 'stripe_head' works somewhat like a
   state machine.  handle_stripe assesses the current state and
   advances it one step.  Currently, if it determines that it is time
   to write some data, it will
     - copy data out of file-system buffers into its own cache
     - perform the xor calculations in the cache, locking all blocks
       that then become dirty
     - schedule a write on all those locked blocks.
   The stripe won't be ready to be handled again until all the writes
   complete.

   This should be changed so that we don't copy+xor, but instead just
   lock the blocks and flag them as needing xor.  Then, after sh->lock
   is dropped, you will send the copy+xor request to the AA, or do it
   in-line.  Once the copy+xor is completed, the stripe needs to get
   flagged for handling again.  handle_stripe will then need to notice
   that parity has been calculated, so writing can commence.

2/ Then I would try to find the best internal API to provide for the
   AA (Application Accelerator, for those who haven't read the spec
   yet).

   My guess is that it should work much like the crypto API.  I'm not
   up-to-date with that, so I don't know if the async-crypto-API is
   complete and merged yet (the async-crypto-API is for sending data
   to separate processors for crypto manipulation and being alerted
   asynchronously when they complete).  If it is, definitely look into
   using it.  If it isn't, certainly look into it and maybe even help
   its development to make sure it can handle multiple-input xor
   operations.

Step 1 is probably quite useful anyway and is unlikely to slow current
performance - it just re-arranges the existing work.  Once that is
done, plugging in async xor should be fairly easy whether you use the
crypto-api or not.

I don't think it is practical to use larger block sizes for the xor
operations, and I doubt it is needed.  The DMA engine in the AA has a
very nice chaining arrangement where new operations can be added to
the end of the chain at any time, and I doubt the effort of loading a
new chain descriptor would be a substantial fraction of the time it
takes to xor a 4k block.  As long as you keep everything async
(i.e. keep the main processor busy while the copy+xor is happening)
you should notice some speed-up ... or at least a drop in cpu
activity.

One last note: if you do decide to give '1/' a try, remember to keep
patches small and well defined.  handle_stripe currently does xor in
four places: two in compute_parity (one prior to write, one for
resync) and two in compute_block (one for degraded read, one for
recovery).  Don't try to change these all at once.  One, or at most
two, at a time makes the patches much easier to review.

Good luck,
NeilBrown
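
A rough sketch of the restructuring described in step 1/ above; the
names time_to_write, lock_blocks_for_write, STRIPE_XOR_PENDING,
async_copy_xor and raid5_xor_done are hypothetical placeholders for
illustration, not the existing raid5.c code:

  /* Inside handle_stripe(), under sh->lock: only record that parity
   * needs computing, instead of doing the copy+xor on the spot. */
  if (time_to_write(sh)) {
          lock_blocks_for_write(sh);                 /* hypothetical */
          set_bit(STRIPE_XOR_PENDING, &sh->state);   /* hypothetical flag */
  }

  /* After sh->lock has been dropped: hand the copy+xor to the offload
   * engine, or run it in-line as a software fallback. */
  if (test_bit(STRIPE_XOR_PENDING, &sh->state))
          async_copy_xor(sh, raid5_xor_done);        /* hypothetical */

  /* Completion callback: parity is now up to date, so flag the stripe
   * for handling again; handle_stripe() can then schedule the writes. */
  static void raid5_xor_done(struct stripe_head *sh)
  {
          clear_bit(STRIPE_XOR_PENDING, &sh->state);
          set_bit(STRIPE_HANDLE, &sh->state);
          release_stripe(sh);
  }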