netdev.vger.kernel.org archive mirror
* [PATCH 0/10] [IOAT] I/OAT patches repost
@ 2006-04-20 20:49 Andrew Grover
  2006-04-20 21:33 ` Olof Johansson
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Grover @ 2006-04-20 20:49 UTC (permalink / raw)
  To: netdev

Hi, I'm reposting these patches, originally posted by Chris Leech a few
weeks ago. There is one extra part this time, since I broke up a patch
that was too big for netdev last time into two (patches 2 and 3).

Of course we're always looking for more style comments, but more
importantly we're posting these to talk about the larger issues around
I/OAT and around this code making it upstream at some point.

These are also available on the wiki,  
http://linux-net.osdl.org/index.php/I/OAT .

Thanks -- Andy



* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-20 20:49 [PATCH 0/10] [IOAT] I/OAT patches repost Andrew Grover
@ 2006-04-20 21:33 ` Olof Johansson
  2006-04-20 22:14   ` Andrew Grover
  2006-04-21  0:27   ` David S. Miller
  0 siblings, 2 replies; 20+ messages in thread
From: Olof Johansson @ 2006-04-20 21:33 UTC (permalink / raw)
  To: Andrew Grover; +Cc: netdev

On Thu, Apr 20, 2006 at 01:49:16PM -0700, Andrew Grover wrote:
> Hi I'm reposting these, originally posted by Chris Leech a few weeks ago.
> However, there is an extra part since I broke up one patch that was too 
> big for netdev last time into two (patches 2 and 3).
> 
> Of course we're always looking for more style improvement comments, but 
> more importantly we're posting these to talk about the larger issues 
> around I/OAT and this code making it in upstream at some point.
> 
> These are also available on the wiki,  
> http://linux-net.osdl.org/index.php/I/OAT .

Hi,

Since you didn't provide the current issues in this email, I will copy
and paste them from the wiki page.

I guess the overall question is how much of this needs to be addressed
in the implementation before merging, and how much should be done when
more drivers (with more features) are merged down the road. It might not
make sense to implement all of it now if the only available public
driver lacks those abilities, but I'm bringing up the points anyway.

Maybe it could make sense to add a software-based driver for reference,
and for others to play around with.

I would also prefer to see the series clearly split between the DMA
framework, the first clients (networking) and the I/OAT driver. Right
now "I/OAT" and "DMA" are used interchangeably, especially when
describing the later patches. Splitting them might also help with the
perception that this is something unique to the Intel chipsets.  :-)

(I have also proposed DMA offload discussions as a topic for the Kernel
Summit. I have kept Chris Leech Cc:d on most of the emails in question. It
should be a good place to get input from other subsystems regarding what
functionality they would like to see provided, etc.)


>From the wiki:

> Current issues of concern:
>
>    1. Performance improvement may be on too narrow a set of workloads

Maybe for I/OAT and the current client, but the introduction of the
DMA infrastructure opens the door to other uses that are not yet
possible through the API. For example, DMA combined with compute
functions is a very natural extension, and something that's very common
on various platforms (XOR for RAID use, checksums, encryption).

The API needs to be expanded to cover this: add function types and
plumb them through the channel allocation interface and logic.
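
To make that concrete, here is a rough sketch of what such an extension
could look like. All of the names below are hypothetical; none of this
is in the posted patches:

/* Hypothetical sketch: clients request a channel by the operation
 * they need; drivers advertise a capability mask. */
struct dma_client;
struct dma_chan;

enum dma_function {
	DMA_FUNC_MEMCPY,	/* plain copy -- what I/OAT offers today */
	DMA_FUNC_XOR,		/* parity generation for RAID */
	DMA_FUNC_CSUM,		/* checksum while copying */
	DMA_FUNC_CRYPTO,	/* encrypt/decrypt offload */
};

/* The allocator would only hand out channels whose driver advertises
 * the requested function in its capability mask. */
struct dma_chan *dma_request_channel(struct dma_client *client,
				     enum dma_function function);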

>    2. Limited availability of hardware supporting I/OAT

DMA engines are fairly common, even though I/OAT might not be yet. They
just haven't had a common infrastructure until now.

For people who might want to play with it, a reference software-based
implementation might be useful.

>    3. Data copied by I/OAT is not cached

This is an I/OAT device limitation, not a global statement about the
DMA infrastructure. Other platforms might be able to prime caches
with the DMA traffic. Hint flags should be added to either the channel
allocation calls or the per-operation calls, depending on where it
makes sense driver- and client-wise.
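
As an illustration only (these flags exist neither in the posted code
nor in I/OAT itself), a per-operation hint could look like the sketch
below; hardware that cannot honour a hint would simply ignore it:

#include <stddef.h>	/* size_t */

struct dma_chan;
typedef int dma_cookie_t;	/* stand-in for the async completion handle */

/* Hypothetical per-operation hints. */
#define DMA_HINT_NONE		0x0
#define DMA_HINT_WARM_DST	0x1	/* try to leave the destination in a cache */
#define DMA_HINT_NO_SRC_ALLOC	0x2	/* don't let the source displace cache lines */

dma_cookie_t dma_async_memcpy_hinted(struct dma_chan *chan,
				     void *dst, const void *src,
				     size_t len, unsigned int hints);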

>    4. Intrusiveness of net stack modifications
>    5. Compatibility with upcoming VJ net channel architecture 

Both of these are outside my scope, so I won't comment on them at this
time.


I would like to add, for longer term:

   * Userspace interfaces:
Are there any plans yet on how to export some of this to userspace? It
might not make full sense for just memcpy due to overheads, but it makes
sense for more advanced dma/offload engines.


-Olof


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-20 21:33 ` Olof Johansson
@ 2006-04-20 22:14   ` Andrew Grover
  2006-04-20 23:33     ` Olof Johansson
  2006-04-21  0:38     ` David S. Miller
  2006-04-21  0:27   ` David S. Miller
  1 sibling, 2 replies; 20+ messages in thread
From: Andrew Grover @ 2006-04-20 22:14 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Andrew Grover, netdev

Hah, I was just writing an email covering those. I'll incorporate that
into this response.

On 4/20/06, Olof Johansson <olof@lixom.net> wrote:
> I guess the overall question is, how much of this needs to be addressed
> in the implementation before merge, and how much should be done when
> more drivers (with more features) are merged down the road. It might not
> make sense to implement all of it now if the only available public
> driver lacks the abilities.   But I'm bringing up the points anyway.

Yeah. But I would think maybe this is a reason to merge at least the
DMA subsystem code, so people with other HW (ARM? I'm still not
exactly sure) can start trying to write a DMA driver and see where the
architecture needs to be generalized further.
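
FWIW the shape of such a driver ends up pretty small. A skeleton is
sketched below; the callback names follow the posted dmaengine
interface as I read it, so treat them as approximate rather than
authoritative:

#include <linux/dmaengine.h>	/* header added by the posted series */
#include <linux/init.h>
#include <linux/module.h>

/*
 * Skeleton of a non-I/OAT engine driver plugging into the proposed
 * framework.  The my_*() bodies are the hardware-specific part a
 * driver author would fill in.
 */
extern int my_alloc_chan_resources(struct dma_chan *chan);
extern void my_free_chan_resources(struct dma_chan *chan);
extern dma_cookie_t my_memcpy_buf_to_buf(struct dma_chan *chan,
					 void *dest, void *src, size_t len);
extern enum dma_status my_memcpy_complete(struct dma_chan *chan,
					  dma_cookie_t cookie,
					  dma_cookie_t *last,
					  dma_cookie_t *used);
extern void my_memcpy_issue_pending(struct dma_chan *chan);

static struct dma_device my_dma = {
	.device_alloc_chan_resources	= my_alloc_chan_resources,
	.device_free_chan_resources	= my_free_chan_resources,
	.device_memcpy_buf_to_buf	= my_memcpy_buf_to_buf,
	.device_memcpy_complete		= my_memcpy_complete,
	.device_memcpy_issue_pending	= my_memcpy_issue_pending,
};

static int __init my_dma_init(void)
{
	/* probe the hardware and populate my_dma's channel list here */
	return dma_async_device_register(&my_dma);
}
module_init(my_dma_init);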

> Maybe it could make sense to add a software-based driver for reference,
> and for others to play around with.

We wrote one, but just for testing. I think we've been focused on the
performance story, so it didn't seem a priority.

> I would also prefer to see the series clearly split between the DMA
> framework and first clients (networking) and the I/OAT driver. Right now
> "I/OAT" and "DMA" is used interchangeably, especially when describing
> the later patches. It might help you in the perception that this is
> something unique to the Intel chipsets as well.  :-)

I think we have this reasonably well split-out in the patches, but yes
you're right about how we've been using the terms.

> (I have also proposed DMA offload discussions as a topic for the Kernel
> Summit. I have kept Chris Leech Cc:d on most of the emails in question. It
> should be a good place to get input from other subsystems regarding what
> functionality they would like to see provided, etc.)

I think that would be a good topic for the KS - like you say not
necessarily I/OAT but general DMA offload.

> >    1. Performance improvement may be on too narrow a set of workloads
> Maybe from I/OAT and the current client, but the introduction of the
> DMA infrastructure opens up for other uses that are not yet possible in
> the API. For example, DMA with functions is a very natural extension,
> and something that's very common on various platforms (XOR for RAID use,
> checksums, encryption).

Yes. Does this hardware exist in shipping platforms, so we could use
actual hw to start evaluating the DMA interfaces?

While you may not care (:-) I'd like to address the network
performance aspect above, for other netdev readers:

First, obviously it's a technology for RX CPU improvement, so there's
no benefit on TX workloads. Second, it depends on there being buffers to
copy the data into *before* the data arrives. This happens to be the
case for benchmarks like netperf and Chariot, but real apps using
poll/select wouldn't see a benefit. Just laying the cards out here.
BUT we are seeing very good CPU savings on some workloads, so for
those apps (and if select/poll apps could make use of a
yet-to-be-implemented async net interface) it would be a win.

I don't know what the breakdown is of apps doing blocking reads vs.
waiting in poll/select; does anyone know?
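
For readers less familiar with the distinction, the two receive models
look roughly like this (plain userspace sockets, nothing I/OAT
specific):

#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Model 1: blocking read.  The user buffer is handed to the kernel
 * before the data arrives, so an early asynchronous copy has a
 * target to aim at. */
static ssize_t blocking_read(int sock, char *buf, size_t len)
{
	return recv(sock, buf, len, 0);
}

/* Model 2: poll then read.  The kernel only learns the destination
 * after data has already been queued, which is what defeats the
 * early-copy scheme described above. */
static ssize_t polled_read(int sock, char *buf, size_t len)
{
	struct pollfd pfd = { .fd = sock, .events = POLLIN };

	if (poll(&pfd, 1, -1) <= 0 || !(pfd.revents & POLLIN))
		return -1;
	return recv(sock, buf, len, 0);
}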

> >    2. Limited availability of hardware supporting I/OAT
>
> DMA engines are fairly common, even though I/OAT might not be yet. They
> just haven't had a common infrastructure until now.

We've engaged early, and that's a good thing :) I think we'd like to
see some netdev people do some independent performance analysis of it.
If anyone is willing and has the time to do so, email us and let's
see what we can work out.

> For people who might want to play with it, a reference software-based
> implementation might be useful.

Yeah I'll ask if I can post the one we have. Or it would be trivial to write.

> >    3. Data copied by I/OAT is not cached
>
> This is a I/OAT device limitation and not a global statement of the
> DMA infrastructure. Other platforms might be able to prime caches
> with the DMA traffic. Hint flags should be added on either the channel
> allocation calls, or per-operation calls, depending on where it makes
> sense driver/client wise.

Furthermore, in our implementation's defense, I think the smart
prefetching that modern CPUs do is helping here. In any case, we
are seeing performance gains (see the benchmarks), which seems to
indicate this is not an immediate deal-breaker for the technology. In
addition, there may be workloads (file serving? backup?) where we
could do a skb->page-in-page-cache copy and avoid cache pollution?

> >    4. Intrusiveness of net stack modifications
> >    5. Compatibility with upcoming VJ net channel architecture
> Both of these are outside my scope, so I won't comment on them at this
> time.

Yeah, I don't have much to say about these except that we made the
patch as unintrusive as we could, and we think there may be ways to use
async DMA to help VJ channels, whenever they arrive.

> I would like to add, for longer term:
>    * Userspace interfaces:
> Are there any plans yet on how to export some of this to userspace? It
> might not make full sense for just memcpy due to overheads, but it makes
> sense for more advanced dma/offload engines.

I agree.

Regards -- Andy


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-20 22:14   ` Andrew Grover
@ 2006-04-20 23:33     ` Olof Johansson
  2006-04-21  0:44       ` David S. Miller
  2006-04-21  0:38     ` David S. Miller
  1 sibling, 1 reply; 20+ messages in thread
From: Olof Johansson @ 2006-04-20 23:33 UTC (permalink / raw)
  To: Andrew Grover; +Cc: netdev

On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote:
> Hah, I was just writing an email covering those. I'll incorporate that
> into this response.
> 
> On 4/20/06, Olof Johansson <olof@lixom.net> wrote:
> > I guess the overall question is, how much of this needs to be addressed
> > in the implementation before merge, and how much should be done when
> > more drivers (with more features) are merged down the road. It might not
> > make sense to implement all of it now if the only available public
> > driver lacks the abilities.   But I'm bringing up the points anyway.
> 
> Yeah. But I would think maybe this is a reason to merge at least the
> DMA subsystem code, so people with other HW (ARM? I'm still not
> exactly sure) can start trying to write a DMA driver and see where the
> architecture needs to be generalized further.

The interfaces need to evolve as people implement drivers, yes.
Whether that should happen before or after merging can be discussed,
but as long as everyone is on the same page w.r.t. the interfaces being
volatile for a while, merging should be OK.

Having a roadmap of known-todo improvements could be beneficial for
everyone involved, especially if several people start looking at drivers
in parallel. However, so far, (public) activity seems to have been
fairly low.

> > I would also prefer to see the series clearly split between the DMA
> > framework and first clients (networking) and the I/OAT driver. Right now
> > "I/OAT" and "DMA" is used interchangeably, especially when describing
> > the later patches. It might help you in the perception that this is
> > something unique to the Intel chipsets as well.  :-)
> 
> I think we have this reasonably well split-out in the patches, but yes
> you're right about how we've been using the terms.

The patches are well split up already; it was mostly that the network
stack changes were labelled as I/OAT changes instead of DMA ones.

> > >    1. Performance improvement may be on too narrow a set of workloads
> > Maybe from I/OAT and the current client, but the introduction of the
> > DMA infrastructure opens up for other uses that are not yet possible in
> > the API. For example, DMA with functions is a very natural extension,
> > and something that's very common on various platforms (XOR for RAID use,
> > checksums, encryption).
> 
> Yes. Does this hardware exist in shipping platforms, so we could use
> actual hw to start evaluating the DMA interfaces?

Freescale has it on several processors that are shipping, as far as I
know. Other embedded families likely have them as well (MIPS, ARM), but
I don't know the details. The platform I am working on is not yet
shipping; I've just started looking at drivers.

> > For people who might want to play with it, a reference software-based
> > implementation might be useful.
> 
> Yeah I'll ask if I can post the one we have. Or it would be trivial to write.

I was going to look at it myself, but if you have one to post that's
even more trivial. :-)

> > >    3. Data copied by I/OAT is not cached
> >
> > This is a I/OAT device limitation and not a global statement of the
> > DMA infrastructure. Other platforms might be able to prime caches
> > with the DMA traffic. Hint flags should be added on either the channel
> > allocation calls, or per-operation calls, depending on where it makes
> > sense driver/client wise.
> 
> Furthermore in our implementation's defense I would say I think the
> smart prefetching that modern CPUs do is helping here.

Yes. It's also not obvious that warming the cache at copy time is
always a gain; it will depend on the receiver and what it does with the
data.

> In any case, we
> are seeing performance gains (see benchmarks), which seems to indicate
> this is not an immediate deal-breaker for the technology..

There's always the good old benefit-vs-added-complexity tradeoff, which
I guess is the sore spot right now.

> In
> addition, there may be workloads (file serving? backup?) where we
> could do a skb->page-in-page-cache copy and avoid cache pollution?

Yes, NFS is probably a prime example of where most of the data isn't
looked at; just written to disk. I'm not sure how well-optimized the
receive path is there already w.r.t. avoiding copying though. I don't
remember seeing memcpy and friends being high on the profile when I
looked at SPECsfs last.

> > >    4. Intrusiveness of net stack modifications
> > >    5. Compatibility with upcoming VJ net channel architecture
> > Both of these are outside my scope, so I won't comment on them at this
> > time.
> 
> Yeah I don't have much to say about these except we made the patch as
> unintrusive as we could, and we think there may be ways to use async
> DMA to
> help VJ channels, whenever they arrive.

Not that I know all the tricks they are using, but it seems to me that
it would be hard both to be efficient w.r.t. memory use (i.e. more than
one IP packet per page) AND to avoid copying even once. At least
without device-level flow classification and per-flow (process) buffer
rings.



-Olof


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-20 21:33 ` Olof Johansson
  2006-04-20 22:14   ` Andrew Grover
@ 2006-04-21  0:27   ` David S. Miller
  2006-04-21  1:00     ` Rick Jones
                       ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: David S. Miller @ 2006-04-21  0:27 UTC (permalink / raw)
  To: olof; +Cc: andrew.grover, netdev

From: Olof Johansson <olof@lixom.net>
Date: Thu, 20 Apr 2006 16:33:05 -0500

> From the wiki:
> 
> >    3. Data copied by I/OAT is not cached
> 
> This is a I/OAT device limitation and not a global statement of the
> DMA infrastructure. Other platforms might be able to prime caches
> with the DMA traffic. Hint flags should be added on either the channel
> allocation calls, or per-operation calls, depending on where it makes
> sense driver/client wise.

This sidesteps the whole question of _which_ cache to warm.  And if
you choose wrongly, then what?

Besides the control overhead of the DMA engines, the biggest thing
lost in my opinion is the perfect cache warming that a cpu based copy
does from the kernel socket buffer into userspace.

The first thing an application is going to do is touch that data.  So
I think it's very important to prewarm the caches and the only
straightforward way I know of to always warm up the correct cpu's
caches is copy_to_user().
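
To spell out what the existing path buys us, here is a heavily
simplified sketch; the real code is tcp_recvmsg() feeding
skb_copy_datagram_iovec(), this just illustrates the property being
discussed:

#include <linux/skbuff.h>
#include <asm/uaccess.h>

/* The copy runs in the context of the task that called recvmsg(), on
 * the CPU that is about to consume the data, so the destination lines
 * land in that CPU's cache. */
static int copy_skb_to_user(const struct sk_buff *skb,
			    char __user *to, int len)
{
	if (copy_to_user(to, skb->data, len))
		return -EFAULT;
	/* The application's next read of 'to' hits warm cache lines
	 * instead of taking the misses itself. */
	return 0;
}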

Unfortunately, many benchmarks just do raw bandwidth tests sending to
a receiver that just doesn't even look at the data.  They just return
from recvmsg() and loop back into it.  This is not what applications
using networking actually do, so it's important to make sure we look
intelligently at any benchmarks done and do not fall into the trap of
saying "even without cache warming it made things faster" when in fact
the tested receiver did not touch the data at all so was a false test.


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-20 22:14   ` Andrew Grover
  2006-04-20 23:33     ` Olof Johansson
@ 2006-04-21  0:38     ` David S. Miller
  2006-04-21  1:02       ` Rick Jones
  2006-04-21  2:23       ` Herbert Xu
  1 sibling, 2 replies; 20+ messages in thread
From: David S. Miller @ 2006-04-21  0:38 UTC (permalink / raw)
  To: andy.grover; +Cc: olof, andrew.grover, netdev

From: "Andrew Grover" <andy.grover@gmail.com>
Date: Thu, 20 Apr 2006 15:14:15 -0700

> First obviously it's a technology for RX CPU improvement so there's no
> benefit on TX workloads. Second it depends on there being buffers to
> copy the data into *before* the data arrives. This happens to be the
> case for benchmarks like netperf and Chariot, but real apps using
> poll/select wouldn't see a benefit,  Just laying the cards out here.
> BUT we are seeing very good CPU savings on some workloads, so for
> those apps (and if select/poll apps could make use of a
> yet-to-be-implemented async net interface) it would be a win.
> 
> I don't know what the breakdown is of apps doing blocking reads vs.
> waiting, does anyone know?

All the bandwidth benchmarks tend to block; real-world servers (and
most clients, to some extent) tend to use non-blocking reads and
poll/select, except in some very limited cases and in designs doing
something like one thread per connection.

This is an issue for the TCP prequeue and, as a consequence, for VJ's
net channel ideas.  We need something to wake up some context in order
to push channel data.

All the net channel stuff really wants is an execution context to
run the TCP stack outside of software interrupts.  I/O AT wants
something similar.

For net channels, probably the best thing to do is to just queue to
the socket's net channel, mark the poll state appropriately, and wait
for the thread to get back into recvmsg() to run the queue.  So I think
net channels can be handled in all cases and application I/O models.

For I/O AT you'd really want to get the DMA engine going as soon
as you had those packets, but I do not see a clean and reliable way
to determine the target pages before the app gets back to recvmsg().

I/O AT really expects a lot of things to be in place in order for it
to function at all.  And sadly, that set of requirements isn't
actually very common outside of benchmarking tools and a few
uncommonly designed servers.  Even a web browser does non-blocking
reads and poll().


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-20 23:33     ` Olof Johansson
@ 2006-04-21  0:44       ` David S. Miller
  2006-04-21  3:09         ` Olof Johansson
  0 siblings, 1 reply; 20+ messages in thread
From: David S. Miller @ 2006-04-21  0:44 UTC (permalink / raw)
  To: olof; +Cc: andy.grover, netdev

From: Olof Johansson <olof@lixom.net>
Date: Thu, 20 Apr 2006 18:33:43 -0500

> On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote:
> > In
> > addition, there may be workloads (file serving? backup?) where we
> > could do a skb->page-in-page-cache copy and avoid cache pollution?
> 
> Yes, NFS is probably a prime example of where most of the data isn't
> looked at; just written to disk. I'm not sure how well-optimized the
> receive path is there already w.r.t. avoiding copying though. I don't
> remember seeing memcpy and friends being high on the profile when I
> looked at SPECsfs last.

If that makes sense then the cpu copy can be made to use non-temporal
stores.
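
For reference, non-temporal stores amount to something like the
userspace sketch below (the kernel would use its own copy routine
rather than SSE intrinsics; this only shows the technique):

#include <emmintrin.h>
#include <stddef.h>

/* Copy with non-temporal (streaming) stores: the destination lines
 * bypass the cache instead of displacing useful data.  Assumes
 * 16-byte aligned buffers and a length that is a multiple of 16; a
 * real routine needs head/tail handling too. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / 16; i++)
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
	_mm_sfence();	/* make the streaming stores globally visible */
}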


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  0:27   ` David S. Miller
@ 2006-04-21  1:00     ` Rick Jones
  2006-04-21  1:13       ` David S. Miller
  2006-04-21  3:04     ` Olof Johansson
  2006-04-21 17:13     ` Ingo Oeser
  2 siblings, 1 reply; 20+ messages in thread
From: Rick Jones @ 2006-04-21  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: olof, andrew.grover, netdev

> Unfortunately, many benchmarks just do raw bandwidth tests sending to
> a receiver that just doesn't even look at the data.  They just return
> from recvmsg() and loop back into it.  This is not what applications
> using networking actually do, so it's important to make sure we look
> intelligently at any benchmarks done and do not fall into the trap of
> saying "even without cache warming it made things faster" when in fact
> the tested receiver did not touch the data at all so was a false test.

FWIW, netperf can be configured to access the buffers it gives to send() 
or gets from recv().  A ./configure --enable-dirty in TOT:

http://www.netperf.org/svn/netperf2/trunk

will enable two global options:

  -k dirty,clean # bytes to dirty, bytes to read clean on netperf side

  -K dirty,clean # as above, on netserver side.

And in such a netperf the test banner will include the string "dirty 
data" (alas the default output will not say how much :)

In say a TCP_STREAM test -k will affect what is done with a buffer 
before send() is called, and -K will affect what is done with a buffer 
_before_ recv() is called with that buffer.

-k N will cause the first N bytes of the buffer to be dirtied, and the 
next N bytes to be read clean

-k N, will cause the first N bytes of the buffer to be dirtied

-k ,N will cause the first N bytes of the buffer to be read clean

-k M,N will cause the first M bytes to be dirtied, the next N bytes to 
be read clean

Actually, that brings up a question - presently, and for reasons that 
are lost to me in the mists of time - netperf will "access" the buffer 
before it calls recv().  I'm wondering if that should be changed to an 
access of the buffer after it calls recv()?
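
For concreteness, the post-recv() access being discussed amounts to
something like this in the receive loop (simplified C, not netperf's
actual code from nettest_bsd.c):

#include <sys/socket.h>
#include <sys/types.h>

/* Receive, then touch the data the way a real consumer would: write
 * the first 'dirty' bytes and read the next 'clean' bytes, so the
 * benchmark actually pays for any cold destination cache lines. */
static ssize_t recv_and_touch(int sock, char *buf, size_t len,
			      size_t dirty, size_t clean)
{
	ssize_t n = recv(sock, buf, len, 0);
	volatile char sum = 0;
	size_t i;

	if (n <= 0)
		return n;
	for (i = 0; i < dirty && i < (size_t)n; i++)
		buf[i] += 1;		/* "dirty": read-modify-write */
	for (i = dirty; i < dirty + clean && i < (size_t)n; i++)
		sum += buf[i];		/* "clean": read only */
	(void)sum;
	return n;
}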

And I suspect related to all this is whether or not one should alter the 
size of the buffer ring being used by netperf, which by default is the 
SO_*BUF size divided by the send_size (or recv_size), plus one buffer - 
the -W option can control that.

rick jones


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  0:38     ` David S. Miller
@ 2006-04-21  1:02       ` Rick Jones
  2006-04-21  2:23       ` Herbert Xu
  1 sibling, 0 replies; 20+ messages in thread
From: Rick Jones @ 2006-04-21  1:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: andy.grover, olof, andrew.grover, netdev

David S. Miller wrote:
> From: "Andrew Grover" <andy.grover@gmail.com>
> Date: Thu, 20 Apr 2006 15:14:15 -0700
> 
> 
>>First obviously it's a technology for RX CPU improvement so there's no
>>benefit on TX workloads. Second it depends on there being buffers to
>>copy the data into *before* the data arrives. This happens to be the
>>case for benchmarks like netperf and Chariot, but real apps using
>>poll/select wouldn't see a benefit,  Just laying the cards out here.
>>BUT we are seeing very good CPU savings on some workloads, so for
>>those apps (and if select/poll apps could make use of a
>>yet-to-be-implemented async net interface) it would be a win.
>>
>>I don't know what the breakdown is of apps doing blocking reads vs.
>>waiting, does anyone know?
> 
> 
> All the bandwidth benchmarks tend to block, real world servers (and
> most clients to some extent) tend to use non-blocking reads and
> poll/select except in some very limited cases and designs doing
> something like 1 thread per connection.

Another netperf2 option :) (not exported via configure though): if a 
certain define is set - look at recv_tcp_stream() in nettest_bsd.c - 
then netperf will call select() before it calls recv().

rick jones




* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  1:00     ` Rick Jones
@ 2006-04-21  1:13       ` David S. Miller
  2006-04-21 17:12         ` Rick Jones
  0 siblings, 1 reply; 20+ messages in thread
From: David S. Miller @ 2006-04-21  1:13 UTC (permalink / raw)
  To: rick.jones2; +Cc: olof, andrew.grover, netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 20 Apr 2006 18:00:37 -0700

> Actually, that brings-up a question - presently, and for reasons that 
> are lost to me in the mists of time - netperf will "access" the buffer 
> before it calls recv().  I'm wondering if that should be changed to an 
> access of the buffer after it calls recv()?

Yes, that's what it should do, as this is what a real
application would do.


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  0:38     ` David S. Miller
  2006-04-21  1:02       ` Rick Jones
@ 2006-04-21  2:23       ` Herbert Xu
  1 sibling, 0 replies; 20+ messages in thread
From: Herbert Xu @ 2006-04-21  2:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: andy.grover, olof, andrew.grover, netdev

David S. Miller <davem@davemloft.net> wrote:
> 
> For I/O AT you'd really want to get the DMA engine going as soon
> as you had those packets, but I do not see a clean and reliable way
> to determine the target pages before the app gets back to recvmsg().

The vmsplice() system call proposed by Linus might be a good fit.

http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/0854.html
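
A sketch of how the call is used in its basic form is below, assuming
the glibc wrapper for the (then brand-new) vmsplice(); on a 2.6.17-rc
kernel a syscall() stub would be needed instead.  How the receive path
would map onto it is exactly the open question:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/uio.h>

/* The application "gifts" a description of its own pages to a pipe
 * with vmsplice(); having the pages described to the kernel up front
 * is the kind of information an early DMA copy would need. */
static int gift_buffer(int pipe_wr_fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return vmsplice(pipe_wr_fd, &iov, 1, SPLICE_F_GIFT) < 0 ? -1 : 0;
}
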
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  0:27   ` David S. Miller
  2006-04-21  1:00     ` Rick Jones
@ 2006-04-21  3:04     ` Olof Johansson
  2006-04-21  3:42       ` David S. Miller
  2006-04-21 17:13     ` Ingo Oeser
  2 siblings, 1 reply; 20+ messages in thread
From: Olof Johansson @ 2006-04-21  3:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: olof, andrew.grover, netdev

On Thu, Apr 20, 2006 at 05:27:42PM -0700, David S. Miller wrote:
> From: Olof Johansson <olof@lixom.net>
> Date: Thu, 20 Apr 2006 16:33:05 -0500
> 
> > From the wiki:
> > 
> > >    3. Data copied by I/OAT is not cached
> > 
> > This is a I/OAT device limitation and not a global statement of the
> > DMA infrastructure. Other platforms might be able to prime caches
> > with the DMA traffic. Hint flags should be added on either the channel
> > allocation calls, or per-operation calls, depending on where it makes
> > sense driver/client wise.
> 
> This sidesteps the whole question of _which_ cache to warm.  And if
> you choose wrongly, then what?
>
> Besides the control overhead of the DMA engines, the biggest thing
> lost in my opinion is the perfect cache warming that a cpu based copy
> does from the kernel socket buffer into userspace.

It's definitely the easiest way to always make sure the right caches
are warm for the app, that I agree with.

But, when warming those caches by copying, the data is pulled in through
a potentially cold cache in the first place. So the cache misses are
just moved from the copy loop to userspace with dma offload. Or am I
missing something?

> The first thing an application is going to do is touch that data.  So
> I think it's very important to prewarm the caches and the only
> straightforward way I know of to always warm up the correct cpu's
> caches is copy_to_user().

The other way (assuming the hardware supports cache warming) would be
to pass down affinities (or look them up during receive processing;
I'm not sure that's practical the way things work now), and dispatch
on a DMA channel with the right cache affinity. I've got a feeling
that "straightforward" is not the right term for describing that
solution, though.
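
Purely as a strawman (none of these helpers exist anywhere today), the
dispatch side of that idea might look like:

#include <linux/sched.h>	/* task_cpu() */

struct dma_chan;

/* Hypothetical: return a channel whose completions can warm the given
 * CPU's cache, or NULL if the hardware has no such notion. */
struct dma_chan *dma_get_channel_near(int cpu);

static struct dma_chan *chan_for_consumer(struct task_struct *consumer)
{
	int cpu = task_cpu(consumer);	/* CPU the consumer last ran on */

	/* NULL here would mean "fall back to a plain CPU copy". */
	return dma_get_channel_near(cpu);
}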

> Unfortunately, many benchmarks just do raw bandwidth tests sending to
> a receiver that just doesn't even look at the data.  They just return
> from recvmsg() and loop back into it.  This is not what applications
> using networking actually do, so it's important to make sure we look
> intelligently at any benchmarks done and do not fall into the trap of
> saying "even without cache warming it made things faster" when in fact
> the tested receiver did not touch the data at all so was a false test.

Yes, some real-life-like benchmarking is definitely needed.
Unfortunately I'm not in a position where I can do much (and share
numbers) myself at the moment.


-Olof


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  0:44       ` David S. Miller
@ 2006-04-21  3:09         ` Olof Johansson
  0 siblings, 0 replies; 20+ messages in thread
From: Olof Johansson @ 2006-04-21  3:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: andy.grover, netdev

On Thu, Apr 20, 2006 at 05:44:38PM -0700, David S. Miller wrote:
> From: Olof Johansson <olof@lixom.net>
> Date: Thu, 20 Apr 2006 18:33:43 -0500
> 
> > On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote:
> > > In
> > > addition, there may be workloads (file serving? backup?) where we
> > > could do a skb->page-in-page-cache copy and avoid cache pollution?
> > 
> > Yes, NFS is probably a prime example of where most of the data isn't
> > looked at; just written to disk. I'm not sure how well-optimized the
> > receive path is there already w.r.t. avoiding copying though. I don't
> > remember seeing memcpy and friends being high on the profile when I
> > looked at SPECsfs last.
> 
> If that makes sense then the cpu copy can be made to use non-temporal
> stores.

I'm not sure that would buy anything. I didn't mean caching was
necessarily bad, just that lack of it might not hurt as much under that
specific type of workload.

NFS has to look at RPC/NFS headers anyway, so it will benefit from the
cache being warm.


-Olof


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  3:04     ` Olof Johansson
@ 2006-04-21  3:42       ` David S. Miller
  2006-04-21  4:42         ` Olof Johansson
  2006-04-27 23:45         ` Chris Leech
  0 siblings, 2 replies; 20+ messages in thread
From: David S. Miller @ 2006-04-21  3:42 UTC (permalink / raw)
  To: olof; +Cc: andrew.grover, netdev

From: Olof Johansson <olof@lixom.net>
Date: Thu, 20 Apr 2006 22:04:26 -0500

> On Thu, Apr 20, 2006 at 05:27:42PM -0700, David S. Miller wrote:
> > Besides the control overhead of the DMA engines, the biggest thing
> > lost in my opinion is the perfect cache warming that a cpu based copy
> > does from the kernel socket buffer into userspace.
> 
> It's definitely the easiest way to always make sure the right caches
> are warm for the app, that I agree with.
> 
> But, when warming those caches by copying, the data is pulled in through
> a potentially cold cache in the first place. So the cache misses are
> just moved from the copy loop to userspace with dma offload. Or am I
> missing something?

Yes, and it means that the memory bandwidth costs are equivalent
between I/O AT and cpu copy.

In the cpu copy case you eat the read cache miss, but on the write
side you'll prewarm the cache properly.  In the I/O AT case you
eat the same read cost, but the cache will not be prewarmed, so you'll
eat the read cache miss in the application.  It's moving the same
exact cost from one place to another.

The time it takes to get the app to make forward progress (meaning
returned from the recvmsg() system call and back in userspace) must by
definition take at least as long with I/O AT as it does with cpu
copies.  Yet in the I/O AT case, the application must wait that long
and also then take in the delays of the cache misses when it tries to
read the data that the I/O AT engine copied.  Instead of eating the
cache miss cost in the kernel, we eat it in the app because in the I/O
AT case the cpu won't have the user data fresh and loaded into the cpu
cache.

And I say I/O AT must take "at least as long" as cpu copies because
the same memory copy cost is there, and on top of that I/O AT has to
program the DMA controller and touch a _lot_ of other state to get
things going and then wake the task up.  We're talking non-trivial
overheads like grabbing the page mappings out of the page tables using
get_user_pages().  Evgeniy has posted some very nice performance graphs
showing how poorly that function scales.

This is basically why none of the performance gains add up to me.  I
am thus very concerned that the current non-cache-warming
implementation may fall flat performance-wise.


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  3:42       ` David S. Miller
@ 2006-04-21  4:42         ` Olof Johansson
  2006-04-27 23:45         ` Chris Leech
  1 sibling, 0 replies; 20+ messages in thread
From: Olof Johansson @ 2006-04-21  4:42 UTC (permalink / raw)
  To: David S. Miller; +Cc: olof, andrew.grover, netdev

On Thu, Apr 20, 2006 at 08:42:00PM -0700, David S. Miller wrote:

> This is basically why none of the performance gains add up to me.  I
> am thus very concerned that the current non-cache-warming
> implmentation may fall flat performance wise.

Ok, I buy your arguments. It does seem unlikely that a DMA offload
without cache warmth will be a net gain. More performance data is
definitely required.

After digging through PDFs, it seems the Freescale 85xx (at least;
probably earlier models as well) can warm the L2 with the DMA
destination data. However, I don't have any hardware with it to play
around with for benchmarking, to see what cache warming might bring
(back), performance-wise.

I think there is still use for a common multi-function DMA framework
across platforms and client components, even if net receive doesn't end
up being {a,the first} user.


-Olof


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  1:13       ` David S. Miller
@ 2006-04-21 17:12         ` Rick Jones
  2006-04-27 23:49           ` Chris Leech
  0 siblings, 1 reply; 20+ messages in thread
From: Rick Jones @ 2006-04-21 17:12 UTC (permalink / raw)
  To: David S. Miller; +Cc: olof, andrew.grover, netdev

David S. Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 20 Apr 2006 18:00:37 -0700
> 
> 
>>Actually, that brings-up a question - presently, and for reasons that 
>>are lost to me in the mists of time - netperf will "access" the buffer 
>>before it calls recv().  I'm wondering if that should be changed to an 
>>access of the buffer after it calls recv()?
> 
> 
> Yes, that's what it should do, as this is what a real
> application would do.

Netperf2 TOT now accesses the buffer that was just recv()'d rather than 
the one that is about to be recv()'d.

rick jones


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  0:27   ` David S. Miller
  2006-04-21  1:00     ` Rick Jones
  2006-04-21  3:04     ` Olof Johansson
@ 2006-04-21 17:13     ` Ingo Oeser
  2 siblings, 0 replies; 20+ messages in thread
From: Ingo Oeser @ 2006-04-21 17:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: olof, andrew.grover, netdev, Ingo Oeser

David S. Miller wrote:
> The first thing an application is going to do is touch that data.  So
> I think it's very important to prewarm the caches and the only
> straightforward way I know of to always warm up the correct cpu's
> caches is copy_to_user().

Hmm, what if the application is something like an MPEG demultiplexer?

There you don't want to look at everything, and you explicitly ignore
some of the received data[1].

Yes, I know this is usually done with hardware demuxers and filters,
but the principle might apply to other applications as well, for which
no hardware solution exists.


Regards

Ingo Oeser

[1] which you cannot ignore properly with Linux yet.


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21  3:42       ` David S. Miller
  2006-04-21  4:42         ` Olof Johansson
@ 2006-04-27 23:45         ` Chris Leech
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Leech @ 2006-04-27 23:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: olof, andrew.grover, netdev

On 4/20/06, David S. Miller <davem@davemloft.net> wrote:
> Yes, and it means that the memory bandwidth costs are equivalent
> between I/O AT and cpu copy.

The following is a response from the I/OAT architects.  I only point
out that this is not coming directly from me because I have not seen
the data to verify the claims regarding the speed of a copy vs a load
and the cost of the rep mov instruction.  I'll encourage more direct
participation in this discussion from the architects moving forward.

    - Chris

Let's talk about the caching benefit that is seemingly lost when
using the DMA engine. The intent of the DMA engine is to save the CPU
cycles spent in copying data (rep mov). In cases where the destination
is already warm in the cache (due to destination buffer re-use) and the
source is in memory, the cycles spent in a host copy are not just due
to the cache misses it encounters in the process of bringing in the
source, but also due to the execution of rep mov itself within the
host core. If you contrast this with simply touching (loading) the data
residing in memory, the cost of that load is primarily the cost of the
cache misses and not so much CPU execution time. Given this, some of
the following points are noteworthy:

1. While the DMA engine forces the destination to be in memory and
touching it may cause the same number of observable cache misses as a
host copy assuming a cache warmed destination, the cost of the host
copy (in terms of CPU cycles) is much more than the cost of the touch.

2. CPU hardware prefetchers do a pretty good job of staying ahead of
the fetch stream to minimize cache misses. So for loads of medium to
large buffers, cache misses form a much smaller component of the data
fetch time; most of it is dominated by front side bus (FSB) or memory
bandwidth. For small buffers we do not use the DMA engine, but if we
had to, we would insert SW prefetches, which do reasonably well.

3. If the destination wasn't already warm in the cache, i.e. it was in
memory or in some other CPU's cache, the host copy will have to snoop
and bring the destination in, and will encounter additional misses on
the destination buffer as well. These misses are the same as those
encountered in #1 above when using the DMA engine and touching the
data afterwards, so in effect it becomes a wash compared to the
DMA engine's behavior. The case where the destination is already warm
in the cache is common in benchmarks such as iperf, ttcp etc., where
the same buffer is reused over and over again. Real applications
typically will not exhibit this aggressive buffer re-use behavior.

4. It may take a large number of packets (and several interrupts) to
satisfy a large posted buffer (say 64KB). Even if you use a host copy
to warm the cache with the destination, there is no guarantee that some
or all of the destination will still be in the cache by the time the
application has a chance to read the data.

5. The source data payload (skb->data) is typically needed only once,
for the copy, and has no use later. The host copy brings it into the
cache, potentially polluting the cache and consuming FSB bandwidth,
whereas the DMA engine avoids this altogether.

The IxChariot data posted earlier, which touches the data and yet
shows an I/OAT benefit, does so for some of the reasons above. The
bottom line is that I agree with the cache-benefit argument for the
host copy on small buffers (64B to 512B), but for larger buffers and
certain application scenarios (destination in memory), the DMA engine
will show better performance regardless of where the destination
buffer resided to begin with and where it is accessed from.


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-21 17:12         ` Rick Jones
@ 2006-04-27 23:49           ` Chris Leech
  2006-04-27 23:53             ` Rick Jones
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Leech @ 2006-04-27 23:49 UTC (permalink / raw)
  To: Rick Jones; +Cc: David S. Miller, olof, andrew.grover, netdev

> Netperf2 TOT now accesses the buffer that was just recv()'d rather than
> the one that is about to be recv()'d.

We've posted netperf2 results with I/OAT enabled/disabled and the data
access option on/off at
http://kernel.org/pub/linux/kernel/people/grover/ioat/netperf-icb-1.5-postscaling-both.pdf

This link has also been added to the I/OAT page on the LinuxNet wiki.

- Chris


* Re: [PATCH 0/10] [IOAT] I/OAT patches repost
  2006-04-27 23:49           ` Chris Leech
@ 2006-04-27 23:53             ` Rick Jones
  0 siblings, 0 replies; 20+ messages in thread
From: Rick Jones @ 2006-04-27 23:53 UTC (permalink / raw)
  To: chris.leech; +Cc: David S. Miller, olof, andrew.grover, netdev

Chris Leech wrote:
>>Netperf2 TOT now accesses the buffer that was just recv()'d rather than
>>the one that is about to be recv()'d.
> 
> 
> We've posted netperf2 results with I/OAT enabled/disabled and the data
> access option on/off at
> http://kernel.org/pub/linux/kernel/people/grover/ioat/netperf-icb-1.5-postscaling-both.pdf
> 

Calling it netperf data verification is quite overstated - all netperf 
does is read from or write to the buffer.  There is no check of data 
integrity or anything.

rick jones

