* [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
@ 2006-02-03 1:46 Dan Williams
2006-02-03 2:01 ` Jeff Garzik
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Dan Williams @ 2006-02-03 1:46 UTC (permalink / raw)
To: linux-kernel
Cc: Dan Williams, linux-raid, Evgeniy Polyakov, Chris Leech,
Grover, Andrew, Deepak Saxena
This patch set was originally posted to linux-raid; Neil suggested that
I send it to linux-kernel as well:

Per the discussion in this thread
(http://marc.theaimsgroup.com/?t=112603120100004&r=1&w=2) these patches
implement the first phase of MD acceleration, pre-emptible xor. To date
these patches only cover raid5 calls to compute_parity for read/modify
and reconstruct writes. The plan is to expand this to cover raid5 check
parity, raid5 compute block, as well as the raid6 equivalents.
The ADMA (Asynchronous / Application Specific DMA) interface is proposed
as a cross-platform mechanism for supporting system CPU offload engines.
The goal is to provide a unified asynchronous interface to support
memory copies, block xor, block pattern setting, block compare, CRC
calculation, cryptography, etc. The ADMA interface should support a PIO
fallback mode allowing a given ADMA engine implementation to use the
system CPU for operations without a hardware-accelerated backend. In
other words, a client coded to the ADMA interface transparently receives
hardware acceleration for its operations depending on the features of
the underlying platform.
Ideally, with some coordination, the I/OAT DMA, ACRYPTO, and ADMA
efforts can unify to present a consistent asynchronous off-load engine
interface.
[RFC][PATCH 001 of 3] MD Acceleration: Base ADMA interface
[RFC][PATCH 002 of 3] MD Acceleration: md_adma driver for raid5 offload
[RFC][PATCH 003 of 3] MD Acceleration: raid5 changes to support asynchronous operation
NOTE: These patches are against Linus' git tree as of commit_id
0271fc2db6260dd46f196191e24281af2fddb879 (they should apply to
2.6.16-rc1). They have only passed basic run-time sanity checks on ARM,
and compile testing on i386.
The next phase of this development will target the remainder of raid5,
raid6, and the outcome of the asynchronous off-load engine unification
effort. The known asynchronous interface stakeholders have been CC'd.
Dan Williams
Linux Development Team
Storage Group - Intel Corporation
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Jeff Garzik @ 2006-02-03 2:01 UTC (permalink / raw)
To: Dan Williams
Cc: linux-kernel, linux-raid, Evgeniy Polyakov, Chris Leech, Grover, Andrew, Deepak Saxena

Dan Williams wrote:
> This patch set was originally posted to linux-raid; Neil suggested that I send it to linux-kernel as well:
>
> Per the discussion in this thread (http://marc.theaimsgroup.com/?t=112603120100004&r=1&w=2) these patches implement the first phase of MD acceleration, pre-emptible xor. To date these patches only cover raid5 calls to compute_parity for read/modify and reconstruct writes. The plan is to expand this to cover raid5 check parity, raid5 compute block, as well as the raid6 equivalents.
>
> The ADMA (Asynchronous / Application Specific DMA) interface is proposed as a cross-platform mechanism for supporting system CPU offload engines. The goal is to provide a unified asynchronous interface to support memory copies, block xor, block pattern setting, block compare, CRC calculation, cryptography, etc. The ADMA interface should support a PIO fallback mode allowing a given ADMA engine implementation to use the system CPU for operations without a hardware-accelerated backend. In other words, a client coded to the ADMA interface transparently receives hardware acceleration for its operations depending on the features of the underlying platform.

Here are some other things out there worth considering:

* SCSI XOR commands

* Figuring out how to support Promise SX4 (e.g. device offload), which is a chip with an integrated XOR engine and attached DIMM. RAID1 and RAID5 are best implemented on-card, but the Linux driver is responsible for implementing all on-card actions, not firmware.

	Jeff
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Pierre Ossman @ 2006-02-03 18:21 UTC (permalink / raw)
To: Dan Williams
Cc: linux-kernel, linux-raid, Evgeniy Polyakov, Chris Leech, Grover, Andrew, Deepak Saxena

Dan Williams wrote:
> The ADMA (Asynchronous / Application Specific DMA) interface is proposed as a cross-platform mechanism for supporting system CPU offload engines. The goal is to provide a unified asynchronous interface to support memory copies, block xor, block pattern setting, block compare, CRC calculation, cryptography, etc. The ADMA interface should support a PIO fallback mode allowing a given ADMA engine implementation to use the system CPU for operations without a hardware-accelerated backend. In other words, a client coded to the ADMA interface transparently receives hardware acceleration for its operations depending on the features of the underlying platform.

I'm wondering, how common is this ADMA acronym? I've been writing an MMC driver for some hardware where specifications aren't available. I have found one document which lists an "ADMA system address" register, with a width of 64 bits. What are the odds of this being something that conforms to said interface?

Rgds
Pierre
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Randy.Dunlap @ 2006-02-03 18:25 UTC (permalink / raw)
To: Pierre Ossman
Cc: Dan Williams, linux-kernel, linux-raid, Evgeniy Polyakov, Chris Leech, Grover, Andrew, Deepak Saxena

On Fri, 3 Feb 2006, Pierre Ossman wrote:
> Dan Williams wrote:
> > The ADMA (Asynchronous / Application Specific DMA) interface is proposed as a cross-platform mechanism for supporting system CPU offload engines. The goal is to provide a unified asynchronous interface to support memory copies, block xor, block pattern setting, block compare, CRC calculation, cryptography, etc. The ADMA interface should support a PIO fallback mode allowing a given ADMA engine implementation to use the system CPU for operations without a hardware-accelerated backend. In other words, a client coded to the ADMA interface transparently receives hardware acceleration for its operations depending on the features of the underlying platform.
>
> I'm wondering, how common is this ADMA acronym? I've been writing an MMC driver for some hardware where specifications aren't available. I have found one document which lists an "ADMA system address" register, with a width of 64 bits. What are the odds of this being something that conforms to said interface?

oh dear, i thought it was either Advanced or Accelerated DMA, fwiw.

--
~Randy
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Jeff Garzik @ 2006-02-03 18:58 UTC (permalink / raw)
To: Pierre Ossman
Cc: Dan Williams, linux-kernel, linux-raid, Evgeniy Polyakov, Chris Leech, Grover, Andrew, Deepak Saxena

Pierre Ossman wrote:
> Dan Williams wrote:
> > The ADMA (Asynchronous / Application Specific DMA) interface is proposed as a cross-platform mechanism for supporting system CPU offload engines. The goal is to provide a unified asynchronous interface to support memory copies, block xor, block pattern setting, block compare, CRC calculation, cryptography, etc. The ADMA interface should support a PIO fallback mode allowing a given ADMA engine implementation to use the system CPU for operations without a hardware-accelerated backend. In other words, a client coded to the ADMA interface transparently receives hardware acceleration for its operations depending on the features of the underlying platform.
>
> I'm wondering, how common is this ADMA acronym? I've been writing a MMC

In ATA land, ADMA is a hardware ATA controller interface, similar to AHCI. We even have a pdc_adma (Pacific Digital ADMA) driver in the tree, and NVIDIA uses a variant of the ADMA interface in their SATA controllers.

	Jeff
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Neil Brown @ 2006-02-06 3:30 UTC (permalink / raw)
To: Dan Williams
Cc: linux-kernel, linux-raid, Evgeniy Polyakov, Chris Leech, Grover, Andrew, Deepak Saxena

On Thursday February 2, dan.j.williams@intel.com wrote:
> This patch set was originally posted to linux-raid; Neil suggested that I send it to linux-kernel as well:
>
> Per the discussion in this thread (http://marc.theaimsgroup.com/?t=112603120100004&r=1&w=2) these patches implement the first phase of MD acceleration, pre-emptible xor. To date these patches only cover raid5 calls to compute_parity for read/modify and reconstruct writes. The plan is to expand this to cover raid5 check parity, raid5 compute block, as well as the raid6 equivalents.
>
> The ADMA (Asynchronous / Application Specific DMA) interface is proposed as a cross-platform mechanism for supporting system CPU offload engines. The goal is to provide a unified asynchronous interface to support memory copies, block xor, block pattern setting, block compare, CRC calculation, cryptography, etc. The ADMA interface should support a PIO fallback mode allowing a given ADMA engine implementation to use the system CPU for operations without a hardware-accelerated backend. In other words, a client coded to the ADMA interface transparently receives hardware acceleration for its operations depending on the features of the underlying platform.

I've looked through the patches - not exhaustively, but hopefully enough to get a general idea of what is happening. There are some things I'm not clear on and some things that I could suggest alternates to...

- Each ADMA client (e.g. a raid5 array) gets a dedicated adma thread to handle all its requests, and it handles them all in series. I wonder if this is really optimal. If there are multiple adma engines, then a single client could only make use of one of them reliably. It would seem to make more sense to have just one thread - or maybe one per processor or one per adma engine - and have any ordering between requests made explicit in the interface.

  Actually, as each processor could be seen as an ADMA engine, maybe you want one thread per processor AND one per engine. If there are no engines, the per-processor threads run with high priority, else with low.

- I have thought that the way md/raid5 currently does the 'copy-to-buffer' and 'xor' in two separate operations may not be the best use of the memory bus. If you could have a 3-address operation that read from A, stored into B, and xorred into C, then A would have to be read half as often. Would such an interface make sense with ADMA? I don't have sufficient knowledge of assembler to do it myself for the current 'xor' code.

- Your handling of highmem doesn't seem right. You shouldn't kmap it until you have decided that you have to do the operation 'by hand' (i.e. in the cpu, not in the DMA engine). If the dma engine can be used at all, kmap isn't needed at all.

- The interfacing between raid5 and adma seems clumsy... Maybe this is just because you were trying to minimise changes to raid5.c. I think it would be better to make substantial but elegant changes to raid5.c - handle_stripe in particular - so that what is happening becomes very obvious.

  For example, once it has been decided to initiate a write (there is enough data to correctly update the parity block), you need to perform a sequence of copies and xor operations, and then submit write requests. This is currently done by the copy/xor happening inline under the sh->lock spinlock, and then R5_WantWrite is set. Then, outside the spinlock, if WantWrite is set, generic_make_request is called as appropriate.

  I would change this so that a sequence of descriptors is assembled which describes the copies and xors. Appropriate call-backs would be set so that generic_make_request is called at the right time (after the copy, or after the last xor for the parity block). Then outside the sh->lock spinlock this sequence is passed to the ADMA manager. If there is no ADMA engine present, everything is performed inline - multiple xors are possibly combined into multi-way xors automatically. If there is an ADMA engine, it is scheduled to do the work.

  The relevant blocks are all 'locked' as they are added to the sequence, and unlocked as the writes complete or, for unchanged blocks in RECONSTRUCT_WRITE, when the copy/xor that uses them completes.

  resync operations would construct similar descriptor sequences, and have a different call-back on completion.

  Doing this would require making sure that get_desc always succeeds. I notice that you currently allow for the possible failure of adma_get_desc and fall back to 'pio' in that case (I think). I think it would be better to use a mempool (or similar) to ensure that you never fail. There is a simple upper bound on the number of descriptors you need for one stripe. Make sure you have that many from the pool at the top of handle_stripe. Then use as many as you need inside the spinlock in handle_stripe without worrying if you have enough or not.

I hope that is helpful, and I look forward to seeing future progress in this area.

NeilBrown
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Dan Williams @ 2006-02-06 19:25 UTC (permalink / raw)
To: Neil Brown
Cc: Dan Williams, linux-kernel, linux-raid, Evgeniy Polyakov, Chris Leech, Grover, Andrew, Deepak Saxena

On 2/5/06, Neil Brown <neilb@suse.de> wrote:
> I've looked through the patches - not exhaustively, but hopefully enough to get a general idea of what is happening. There are some things I'm not clear on and some things that I could suggest alternates to...

I have a few questions to check that I understand your suggestions.

> - Each ADMA client (e.g. a raid5 array) gets a dedicated adma thread to handle all its requests, and it handles them all in series. I wonder if this is really optimal. If there are multiple adma engines, then a single client could only make use of one of them reliably. It would seem to make more sense to have just one thread - or maybe one per processor or one per adma engine - and have any ordering between requests made explicit in the interface.
>
> Actually, as each processor could be seen as an ADMA engine, maybe you want one thread per processor AND one per engine. If there are no engines, the per-processor threads run with high priority, else with low.

...so the engine thread would handle explicit client-requested ordering constraints and then hand the operations off to per-processor worker threads in the "pio" case, or queue directly to hardware in the presence of such an engine. In md_thread you talk about priority inversion deadlocks; do those same concerns apply here?

> - I have thought that the way md/raid5 currently does the 'copy-to-buffer' and 'xor' in two separate operations may not be the best use of the memory bus. If you could have a 3-address operation that read from A, stored into B, and xorred into C, then A would have to be read half as often. Would such an interface make sense with ADMA? I don't have sufficient knowledge of assembler to do it myself for the current 'xor' code.

At the very least I can add a copy+xor command to ADMA; that way developers implementing engines can optimize for this case, if the hardware supports it, and the hand-coded assembly guys can do their thing.

> - Your handling of highmem doesn't seem right. You shouldn't kmap it until you have decided that you have to do the operation 'by hand' (i.e. in the cpu, not in the DMA engine). If the dma engine can be used at all, kmap isn't needed at all.

I made the assumption that if CONFIG_HIGHMEM is not set then the kmap call resolves to a simple page_address() call. I think it's "ok", but it does look fishy, so I will revise this code. I was also looking to handle the case where the underlying hardware DMA engine does not support high memory addresses.

> - The interfacing between raid5 and adma seems clumsy... Maybe this is just because you were trying to minimise changes to raid5.c. I think it would be better to make substantial but elegant changes to raid5.c - handle_stripe in particular - so that what is happening becomes very obvious.

Yes, I went into this with the idea of being minimally intrusive, but you are right: the end result should have MD optimized for ADMA rather than ADMA shoe-horned into MD.

> For example, once it has been decided to initiate a write (there is enough data to correctly update the parity block), you need to perform a sequence of copies and xor operations, and then submit write requests. This is currently done by the copy/xor happening inline under the sh->lock spinlock, and then R5_WantWrite is set. Then, outside the spinlock, if WantWrite is set, generic_make_request is called as appropriate.
>
> I would change this so that a sequence of descriptors is assembled which describes the copies and xors. Appropriate call-backs would be set so that generic_make_request is called at the right time (after the copy, or after the last xor for the parity block). Then outside the sh->lock spinlock this sequence is passed to the ADMA manager. If there is no ADMA engine present, everything is performed inline - multiple xors are possibly combined into multi-way xors automatically. If there is an ADMA engine, it is scheduled to do the work.

I like this idea of clearly separated stripe assembly (finding work while under the lock) and stripe execute (running copy+xor / touching disks) stages.

Can you elaborate on a scenario where xors are combined into multi-way xors?

> The relevant blocks are all 'locked' as they are added to the sequence, and unlocked as the writes complete or, for unchanged blocks in RECONSTRUCT_WRITE, when the copy/xor that uses them completes.
>
> resync operations would construct similar descriptor sequences, and have a different call-back on completion.
>
> Doing this would require making sure that get_desc always succeeds. I notice that you currently allow for the possible failure of adma_get_desc and fall back to 'pio' in that case (I think). I think it would be better to use a mempool (or similar) to ensure that you never fail. There is a simple upper bound on the number of descriptors you need for one stripe. Make sure you have that many from the pool at the top of handle_stripe. Then use as many as you need inside the spinlock in handle_stripe without worrying if you have enough or not.

Yes, this is common feedback: don't handle per-descriptor allocation failures. So, to check your intention, you envision handle_stripe deciding early and only once whether copy+xor work will be done synchronously under the lock or allowed to run asynchronously.

> I hope that is helpful, and I look forward to seeing future progress in this area.

Thanks,
Dan
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Evgeniy Polyakov @ 2006-02-07 7:23 UTC (permalink / raw)
To: Dan Williams
Cc: Neil Brown, Dan Williams, linux-kernel, linux-raid, Chris Leech, Grover, Andrew, Deepak Saxena

On Mon, Feb 06, 2006 at 12:25:22PM -0700, Dan Williams (dan.j.williams@gmail.com) wrote:
> On 2/5/06, Neil Brown <neilb@suse.de> wrote:
> > - Each ADMA client (e.g. a raid5 array) gets a dedicated adma thread to handle all its requests, and it handles them all in series. I wonder if this is really optimal. If there are multiple adma engines, then a single client could only make use of one of them reliably. It would seem to make more sense to have just one thread - or maybe one per processor or one per adma engine - and have any ordering between requests made explicit in the interface.
> >
> > Actually, as each processor could be seen as an ADMA engine, maybe you want one thread per processor AND one per engine. If there are no engines, the per-processor threads run with high priority, else with low.
>
> ...so the engine thread would handle explicit client-requested ordering constraints and then hand the operations off to per-processor worker threads in the "pio" case, or queue directly to hardware in the presence of such an engine. In md_thread you talk about priority inversion deadlocks; do those same concerns apply here?

Just for reference: the more threads you have, the less stable your system is. Ping-ponging work between several completely independent entities is always a bad idea. Even completion of a request postponed to a workqueue from the current execution unit introduces noticeable latencies. A system should be able to process as much as possible of its work in one flow.

--
	Evgeniy Polyakov
* Re: [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction
From: Neil Brown @ 2006-02-14 3:29 UTC (permalink / raw)
To: Dan Williams
Cc: linux-kernel, linux-raid, Evgeniy Polyakov, Chris Leech, Grover, Andrew, Deepak Saxena

On Monday February 6, dan.j.williams@gmail.com wrote:
> On 2/5/06, Neil Brown <neilb@suse.de> wrote:
> > I've looked through the patches - not exhaustively, but hopefully enough to get a general idea of what is happening. There are some things I'm not clear on and some things that I could suggest alternates to...
>
> I have a few questions to check that I understand your suggestions.

(sorry for the delay).

> > - Each ADMA client (e.g. a raid5 array) gets a dedicated adma thread to handle all its requests, and it handles them all in series. I wonder if this is really optimal. If there are multiple adma engines, then a single client could only make use of one of them reliably. It would seem to make more sense to have just one thread - or maybe one per processor or one per adma engine - and have any ordering between requests made explicit in the interface.
> >
> > Actually, as each processor could be seen as an ADMA engine, maybe you want one thread per processor AND one per engine. If there are no engines, the per-processor threads run with high priority, else with low.
>
> ...so the engine thread would handle explicit client-requested ordering constraints and then hand the operations off to per-processor worker threads in the "pio" case, or queue directly to hardware in the presence of such an engine. In md_thread you talk about priority inversion deadlocks; do those same concerns apply here?

That comment in md.c about priority inversion deadlocks predates my involvement - making it "soooo last millennium"... I don't think it is relevant any more, and possibly never was. I don't see any room for priority inversion here.

I probably wouldn't even have an 'engine thread'. If I were to write 'md' today, it probably wouldn't have a dedicated thread but would use 'schedule_work' to arrange for code to be run in process context. The ADMA engine could do the same. Note: I'm not saying this is the "right" way to go, but I do think it is worth exploring.

I'm not sure about threads for the 'pio' case. It would probably be easiest that way, but I would explore the 'schedule_work' family of services first.

But yes, the ADMA engine would handle explicit client-requested ordering and arrange for work to be done somehow.

> > - I have thought that the way md/raid5 currently does the 'copy-to-buffer' and 'xor' in two separate operations may not be the best use of the memory bus. If you could have a 3-address operation that read from A, stored into B, and xorred into C, then A would have to be read half as often. Would such an interface make sense with ADMA? I don't have sufficient knowledge of assembler to do it myself for the current 'xor' code.
>
> At the very least I can add a copy+xor command to ADMA; that way developers implementing engines can optimize for this case, if the hardware supports it, and the hand-coded assembly guys can do their thing.
>
> > - Your handling of highmem doesn't seem right. You shouldn't kmap it until you have decided that you have to do the operation 'by hand' (i.e. in the cpu, not in the DMA engine). If the dma engine can be used at all, kmap isn't needed at all.
>
> I made the assumption that if CONFIG_HIGHMEM is not set then the kmap call resolves to a simple page_address() call. I think it's "ok", but it does look fishy, so I will revise this code. I was also looking to handle the case where the underlying hardware DMA engine does not support high memory addresses.

I think the only way to handle the ADMA engine not supporting high memory is to do the operation 'polled' - i.e. in the CPU. The alternative is to copy it to somewhere that the DMA engine can reach, and if you are going to do that, you have done most of the work already. Possibly you could still gain by using the engine for RAID6 calculations, but not for copy, compare, or xor operations.

And if you are using the DMA engine, then you don't want the page_address. You want to use pci_map_page (or similar?) to get a dma_handle.

> > For example, once it has been decided to initiate a write (there is enough data to correctly update the parity block), you need to perform a sequence of copies and xor operations, and then submit write requests. This is currently done by the copy/xor happening inline under the sh->lock spinlock, and then R5_WantWrite is set. Then, outside the spinlock, if WantWrite is set, generic_make_request is called as appropriate.
> >
> > I would change this so that a sequence of descriptors is assembled which describes the copies and xors. Appropriate call-backs would be set so that generic_make_request is called at the right time (after the copy, or after the last xor for the parity block). Then outside the sh->lock spinlock this sequence is passed to the ADMA manager. If there is no ADMA engine present, everything is performed inline - multiple xors are possibly combined into multi-way xors automatically. If there is an ADMA engine, it is scheduled to do the work.
>
> I like this idea of clearly separated stripe assembly (finding work while under the lock) and stripe execute (running copy+xor / touching disks) stages.
>
> Can you elaborate on a scenario where xors are combined into multi-way xors?

Any xor operation in raid5 is (potentially) over multiple blocks. You xor n-1 data blocks together to get the parity block. The xor_block function takes an array of blocks and xors them all together into the first. I don't know if a separate DMA engine would be able to get any benefit from xoring multiple things at once, but the CPU does, so the ADMA engine would need to know about it at least when using the CPU to perform xor operations.

> > The relevant blocks are all 'locked' as they are added to the sequence, and unlocked as the writes complete or, for unchanged blocks in RECONSTRUCT_WRITE, when the copy/xor that uses them completes.
> >
> > resync operations would construct similar descriptor sequences, and have a different call-back on completion.
> >
> > Doing this would require making sure that get_desc always succeeds. I notice that you currently allow for the possible failure of adma_get_desc and fall back to 'pio' in that case (I think). I think it would be better to use a mempool (or similar) to ensure that you never fail. There is a simple upper bound on the number of descriptors you need for one stripe. Make sure you have that many from the pool at the top of handle_stripe. Then use as many as you need inside the spinlock in handle_stripe without worrying if you have enough or not.
>
> Yes, this is common feedback: don't handle per-descriptor allocation failures. So, to check your intention, you envision handle_stripe deciding early and only once whether copy+xor work will be done synchronously under the lock or allowed to run asynchronously.

Not exactly. I would always have the copy+xor done separately, outside the lock under which we examine the stripe and decide what to do next. To achieve this, we would pre-allocate enough descriptors so that a suitable number of operations could all be happening at once. Then if handle_stripe cannot get enough descriptors at the start (i.e. if the pool is empty), it delays handling of the stripe until descriptors are available.

NeilBrown
Thread overview: 9+ messages [newest: ~2006-02-14 3:29 UTC]:
2006-02-03  1:46 [RFC][PATCH 000 of 3] MD Acceleration and the ADMA interface: Introduction -- Dan Williams
2006-02-03  2:01 ` Jeff Garzik
2006-02-03 18:21 ` Pierre Ossman
2006-02-03 18:25   ` Randy.Dunlap
2006-02-03 18:58   ` Jeff Garzik
2006-02-06  3:30 ` Neil Brown
2006-02-06 19:25   ` Dan Williams
2006-02-07  7:23     ` Evgeniy Polyakov
2006-02-14  3:29     ` Neil Brown