* [PATCH 0/10] [IOAT] I/OAT patches repost @ 2006-04-20 20:49 Andrew Grover 2006-04-20 21:33 ` Olof Johansson 0 siblings, 1 reply; 20+ messages in thread From: Andrew Grover @ 2006-04-20 20:49 UTC (permalink / raw) To: netdev Hi, I'm reposting these, originally posted by Chris Leech a few weeks ago. However, there is an extra part since I broke up one patch that was too big for netdev last time into two (patches 2 and 3). Of course we're always looking for more style improvement comments, but more importantly we're posting these to talk about the larger issues around I/OAT and this code making it upstream at some point. These are also available on the wiki, http://linux-net.osdl.org/index.php/I/OAT . Thanks -- Andy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-20 20:49 [PATCH 0/10] [IOAT] I/OAT patches repost Andrew Grover @ 2006-04-20 21:33 ` Olof Johansson 2006-04-20 22:14 ` Andrew Grover 2006-04-21 0:27 ` David S. Miller 0 siblings, 2 replies; 20+ messages in thread From: Olof Johansson @ 2006-04-20 21:33 UTC (permalink / raw) To: Andrew Grover; +Cc: netdev On Thu, Apr 20, 2006 at 01:49:16PM -0700, Andrew Grover wrote: > Hi, I'm reposting these, originally posted by Chris Leech a few weeks ago. > However, there is an extra part since I broke up one patch that was too > big for netdev last time into two (patches 2 and 3). > > Of course we're always looking for more style improvement comments, but > more importantly we're posting these to talk about the larger issues > around I/OAT and this code making it upstream at some point. > > These are also available on the wiki, > http://linux-net.osdl.org/index.php/I/OAT . Hi, Since you didn't provide the current issues in this email, I will copy and paste them from the wiki page. I guess the overall question is, how much of this needs to be addressed in the implementation before merge, and how much should be done when more drivers (with more features) are merged down the road. It might not make sense to implement all of it now if the only available public driver lacks the abilities. But I'm bringing up the points anyway. Maybe it could make sense to add a software-based driver for reference, and for others to play around with. I would also prefer to see the series clearly split between the DMA framework and first clients (networking) and the I/OAT driver. Right now "I/OAT" and "DMA" are used interchangeably, especially when describing the later patches. It might help you with the perception that this is something unique to the Intel chipsets as well. :-) (I have also proposed DMA offload discussions as a topic for the Kernel Summit. I have kept Chris Leech Cc:d on most of the emails in question. It should be a good place to get input from other subsystems regarding what functionality they would like to see provided, etc.) From the wiki: > Current issues of concern: > > 1. Performance improvement may be on too narrow a set of workloads Maybe from I/OAT and the current client, but the introduction of the DMA infrastructure opens the door to other uses that are not yet possible in the API. For example, DMA with functions is a very natural extension, and something that's very common on various platforms (XOR for RAID use, checksums, encryption). The API needs to be expanded to cover this by adding function types and adding them to the channel allocation interface and logic. > 2. Limited availability of hardware supporting I/OAT DMA engines are fairly common, even though I/OAT might not be yet. They just haven't had a common infrastructure until now. For people who might want to play with it, a reference software-based implementation might be useful. > 3. Data copied by I/OAT is not cached This is a I/OAT device limitation and not a global statement of the DMA infrastructure. Other platforms might be able to prime caches with the DMA traffic. Hint flags should be added on either the channel allocation calls, or per-operation calls, depending on where it makes sense driver/client wise. > 4. Intrusiveness of net stack modifications > 5. Compatibility with upcoming VJ net channel architecture Both of these are outside my scope, so I won't comment on them at this time.
I would like to add, for longer term: * Userspace interfaces: Are there any plans yet on how to export some of this to userspace? It might not make full sense for just memcpy due to overheads, but it makes sense for more advanced dma/offload engines. -Olof ^ permalink raw reply [flat|nested] 20+ messages in thread
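To make the function-type suggestion above a bit more concrete, here is a minimal, self-contained C sketch of a channel-allocation scheme in which channels advertise capability bits (memcpy, XOR, checksum) and a client requests a channel by capability, with a trivial software memcpy backend of the kind mentioned as a possible reference driver. Every identifier in it (dma_cap, dma_request_channel, the flag names) is hypothetical and is not taken from the posted I/OAT patches; it only illustrates the shape of the extension being discussed.

/* Hypothetical sketch, not from the posted patches: models the idea of
 * channels advertising operation types (memcpy, XOR, ...) so clients can
 * request a channel by capability, with a plain-CPU fallback acting as a
 * software reference "engine".  All names are made up for illustration. */
#include <stdio.h>
#include <string.h>
#include <stddef.h>

enum dma_cap {
	DMA_CAP_MEMCPY = 1 << 0,   /* plain copy (what I/OAT provides) */
	DMA_CAP_XOR    = 1 << 1,   /* RAID parity generation */
	DMA_CAP_CRC32C = 1 << 2,   /* checksum while copying */
};

struct dma_chan {
	const char  *name;
	unsigned int caps;
	/* a real driver would queue a descriptor; the fallback just copies */
	void (*memcpy_op)(void *dst, const void *src, size_t len);
};

static void cpu_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);   /* software reference path */
}

static struct dma_chan channels[] = {
	{ "soft-dma0", DMA_CAP_MEMCPY,               cpu_copy },
	{ "soft-dma1", DMA_CAP_MEMCPY | DMA_CAP_XOR, cpu_copy },
};

/* Return the first channel whose capability mask covers the request. */
static struct dma_chan *dma_request_channel(unsigned int wanted)
{
	size_t i;

	for (i = 0; i < sizeof(channels) / sizeof(channels[0]); i++)
		if ((channels[i].caps & wanted) == wanted)
			return &channels[i];
	return NULL;
}

int main(void)
{
	char src[] = "payload", dst[sizeof(src)];
	struct dma_chan *chan = dma_request_channel(DMA_CAP_MEMCPY | DMA_CAP_XOR);

	if (!chan)
		return 1;
	chan->memcpy_op(dst, src, sizeof(src));
	printf("%s copied via %s\n", dst, chan->name);
	return 0;
}

The same lookup could carry per-operation hint flags (for example, "warm the destination cache") for hardware that supports it, which is the other extension raised under point 3 above.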
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-20 21:33 ` Olof Johansson @ 2006-04-20 22:14 ` Andrew Grover 2006-04-20 23:33 ` Olof Johansson 2006-04-21 0:38 ` David S. Miller 0 siblings, 2 replies; 20+ messages in thread From: Andrew Grover @ 2006-04-20 22:14 UTC (permalink / raw) To: Olof Johansson; +Cc: Andrew Grover, netdev Hah, I was just writing an email covering those. I'll incorporate that into this response. On 4/20/06, Olof Johansson <olof@lixom.net> wrote: > I guess the overall question is, how much of this needs to be addressed > in the implementation before merge, and how much should be done when > more drivers (with more features) are merged down the road. It might not > make sense to implement all of it now if the only available public > driver lacks the abilities. But I'm bringing up the points anyway. Yeah. But I would think maybe this is a reason to merge at least the DMA subsystem code, so people with other HW (ARM? I'm still not exactly sure) can start trying to write a DMA driver and see where the architecture needs to be generalized further. > Maybe it could make sense to add a software-based driver for reference, > and for others to play around with. We wrote one, but just for testing. I think we've been focused on the performance story, so it didn't seem a priority. > I would also prefer to see the series clearly split between the DMA > framework and first clients (networking) and the I/OAT driver. Right now > "I/OAT" and "DMA" are used interchangeably, especially when describing > the later patches. It might help you with the perception that this is > something unique to the Intel chipsets as well. :-) I think we have this reasonably well split-out in the patches, but yes you're right about how we've been using the terms. > (I have also proposed DMA offload discussions as a topic for the Kernel > Summit. I have kept Chris Leech Cc:d on most of the emails in question. It > should be a good place to get input from other subsystems regarding what > functionality they would like to see provided, etc.) I think that would be a good topic for the KS - like you say not necessarily I/OAT but general DMA offload. > > 1. Performance improvement may be on too narrow a set of workloads > Maybe from I/OAT and the current client, but the introduction of the > DMA infrastructure opens the door to other uses that are not yet possible in > the API. For example, DMA with functions is a very natural extension, > and something that's very common on various platforms (XOR for RAID use, > checksums, encryption). Yes. Does this hardware exist in shipping platforms, so we could use actual hw to start evaluating the DMA interfaces? While you may not care (:-) I'd like to address the network performance aspect above, for other netdev readers: First obviously it's a technology for RX CPU improvement so there's no benefit on TX workloads. Second it depends on there being buffers to copy the data into *before* the data arrives. This happens to be the case for benchmarks like netperf and Chariot, but real apps using poll/select wouldn't see a benefit. Just laying the cards out here. BUT we are seeing very good CPU savings on some workloads, so for those apps (and if select/poll apps could make use of a yet-to-be-implemented async net interface) it would be a win. I don't know what the breakdown is of apps doing blocking reads vs. waiting, does anyone know? > > 2.
Limited availability of hardware supporting I/OAT > > DMA engines are fairly common, even though I/OAT might not be yet. They > just haven't had a common infrastructure until now. We've engaged early; that's a good thing :) I think we'd like to see some netdev people do some independent performance analysis of it. If anyone is willing and has time to do so, email us and let's see what we can work out. > For people who might want to play with it, a reference software-based > implementation might be useful. Yeah I'll ask if I can post the one we have. Or it would be trivial to write. > > 3. Data copied by I/OAT is not cached > > This is a I/OAT device limitation and not a global statement of the > DMA infrastructure. Other platforms might be able to prime caches > with the DMA traffic. Hint flags should be added on either the channel > allocation calls, or per-operation calls, depending on where it makes > sense driver/client wise. Furthermore in our implementation's defense I would say I think the smart prefetching that modern CPUs do is helping here. In any case, we are seeing performance gains (see benchmarks), which seems to indicate this is not an immediate deal-breaker for the technology. In addition, there may be workloads (file serving? backup?) where we could do a skb->page-in-page-cache copy and avoid cache pollution? > > 4. Intrusiveness of net stack modifications > > 5. Compatibility with upcoming VJ net channel architecture > Both of these are outside my scope, so I won't comment on them at this time. Yeah I don't have much to say about these except we made the patch as unintrusive as we could, and we think there may be ways to use async DMA to help VJ channels, whenever they arrive. > I would like to add, for longer term: > * Userspace interfaces: > Are there any plans yet on how to export some of this to userspace? It > might not make full sense for just memcpy due to overheads, but it makes > sense for more advanced dma/offload engines. I agree. Regards -- Andy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-20 22:14 ` Andrew Grover @ 2006-04-20 23:33 ` Olof Johansson 2006-04-21 0:44 ` David S. Miller 2006-04-21 0:38 ` David S. Miller 1 sibling, 1 reply; 20+ messages in thread From: Olof Johansson @ 2006-04-20 23:33 UTC (permalink / raw) To: Andrew Grover; +Cc: netdev On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote: > Hah, I was just writing an email covering those. I'll incorporate that > into this response. > > On 4/20/06, Olof Johansson <olof@lixom.net> wrote: > > I guess the overall question is, how much of this needs to be addressed > > in the implementation before merge, and how much should be done when > > more drivers (with more features) are merged down the road. It might not > > make sense to implement all of it now if the only available public > > driver lacks the abilities. But I'm bringing up the points anyway. > > Yeah. But I would think maybe this is a reason to merge at least the > DMA subsystem code, so people with other HW (ARM? I'm still not > exactly sure) can start trying to write a DMA driver and see where the > architecture needs to be generalized further. The interfaces need to evolve as people implement drivers, yes. Whether it should be before or after merging can be discussed, but as long as everyone is on the same page w.r.t. the interfaces being volatile for a while, merge should be OK. Having a roadmap of known-todo improvements could be beneficial for everyone involved, especially if several people start looking at drivers in parallel. However, so far, (public) activity seems to have been fairly low. > > I would also prefer to see the series clearly split between the DMA > > framework and first clients (networking) and the I/OAT driver. Right now > > "I/OAT" and "DMA" are used interchangeably, especially when describing > > the later patches. It might help you with the perception that this is > > something unique to the Intel chipsets as well. :-) > > I think we have this reasonably well split-out in the patches, but yes > you're right about how we've been using the terms. The patches are well split up already; it was mostly that the network stack changes were marked as I/OAT changes instead of DMA ditto. > > > 1. Performance improvement may be on too narrow a set of workloads > > Maybe from I/OAT and the current client, but the introduction of the > > DMA infrastructure opens the door to other uses that are not yet possible in > > the API. For example, DMA with functions is a very natural extension, > > and something that's very common on various platforms (XOR for RAID use, > > checksums, encryption). > > Yes. Does this hardware exist in shipping platforms, so we could use > actual hw to start evaluating the DMA interfaces? Freescale has it on several processors that are shipping, as far as I know. Other embedded families likely have them as well (MIPS, ARM), but I don't know details. The platform I am working on is not yet shipping; I've just started looking at drivers. > > For people who might want to play with it, a reference software-based > > implementation might be useful. > > Yeah I'll ask if I can post the one we have. Or it would be trivial to write. I was going to look at it myself, but if you have one to post that's even more trivial. :-) > > > 3. Data copied by I/OAT is not cached > > > > This is a I/OAT device limitation and not a global statement of the > > DMA infrastructure. Other platforms might be able to prime caches > > with the DMA traffic.
Hint flags should be added on either the channel > > allocation calls, or per-operation calls, depending on where it makes > > sense driver/client wise. > > Furthermore in our implementation's defense I would say I think the > smart prefetching that modern CPUs do is helping here. Yes. It's also not obvious that warming the cache at copy time is always a gain; it will depend on the receiver and what it does with the data. > In any case, we > are seeing performance gains (see benchmarks), which seems to indicate > this is not an immediate deal-breaker for the technology. There's always the good old benefit-vs-added-complexity tradeoff, which I guess is the sore spot right now. > In > addition, there may be workloads (file serving? backup?) where we > could do a skb->page-in-page-cache copy and avoid cache pollution? Yes, NFS is probably a prime example of where most of the data isn't looked at; just written to disk. I'm not sure how well-optimized the receive path is there already w.r.t. avoiding copying though. I don't remember seeing memcpy and friends being high on the profile when I looked at SPECsfs last. > > > 4. Intrusiveness of net stack modifications > > > 5. Compatibility with upcoming VJ net channel architecture > > Both of these are outside my scope, so I won't comment on them at this > > time. > > Yeah I don't have much to say about these except we made the patch as > unintrusive as we could, and we think there may be ways to use async > DMA to > help VJ channels, whenever they arrive. Not that I know all the tricks they are using, but it seems to me that it would be hard to both be efficient w.r.t. memory use (i.e. more than one IP packet per page) AND avoid copying once. At least without device-level flow classification and per-flow (process) buffer rings. -Olof ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-20 23:33 ` Olof Johansson @ 2006-04-21 0:44 ` David S. Miller 2006-04-21 3:09 ` Olof Johansson 0 siblings, 1 reply; 20+ messages in thread From: David S. Miller @ 2006-04-21 0:44 UTC (permalink / raw) To: olof; +Cc: andy.grover, netdev From: Olof Johansson <olof@lixom.net> Date: Thu, 20 Apr 2006 18:33:43 -0500 > On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote: > > In > > addition, there may be workloads (file serving? backup?) where we > > could do a skb->page-in-page-cache copy and avoid cache pollution? > > Yes, NFS is probably a prime example of where most of the data isn't > looked at; just written to disk. I'm not sure how well-optimized the > receive path is there already w.r.t. avoiding copying though. I don't > remember seeing memcpy and friends being high on the profile when I > looked at SPECsfs last. If that makes sense then the cpu copy can be made to use non-temporal stores. ^ permalink raw reply [flat|nested] 20+ messages in thread
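For readers unfamiliar with the non-temporal store idea David mentions, a minimal user-space sketch is below. The kernel would use its own copy primitives for this rather than compiler intrinsics, so treat it purely as an illustration of "copy without polluting the cache"; it assumes 16-byte-aligned buffers and a length that is a multiple of 16.

/* User-space illustration of a copy done with non-temporal (streaming)
 * stores, for data that will not be read again soon.  SSE2 intrinsics
 * are used only to show the idea; requires 16-byte aligned buffers and
 * a length that is a multiple of 16. */
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>

static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)dst;
	size_t i;

	for (i = 0; i < len / 16; i++) {
		__m128i v = _mm_load_si128(&s[i]);   /* read the source */
		_mm_stream_si128(&d[i], v);          /* write, bypassing the cache */
	}
	_mm_sfence();   /* make the streaming stores globally visible */
}

The trade-off is exactly the one debated elsewhere in this thread: the destination ends up in memory rather than in the consumer's cache, so the reader pays the misses instead of the copier.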
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 0:44 ` David S. Miller @ 2006-04-21 3:09 ` Olof Johansson 0 siblings, 0 replies; 20+ messages in thread From: Olof Johansson @ 2006-04-21 3:09 UTC (permalink / raw) To: David S. Miller; +Cc: andy.grover, netdev On Thu, Apr 20, 2006 at 05:44:38PM -0700, David S. Miller wrote: > From: Olof Johansson <olof@lixom.net> > Date: Thu, 20 Apr 2006 18:33:43 -0500 > > > On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote: > > > In > > > addition, there may be workloads (file serving? backup?) where we > > > could do a skb->page-in-page-cache copy and avoid cache pollution? > > > > Yes, NFS is probably a prime example of where most of the data isn't > > looked at; just written to disk. I'm not sure how well-optimized the > > receive path is there already w.r.t. avoiding copying though. I don't > > remember seeing memcpy and friends being high on the profile when I > > looked at SPECsfs last. > > If that makes sense then the cpu copy can be made to use non-temporal > stores. I'm not sure that would buy anything. I didn't mean caching was necessarily bad, just that lack of it might not hurt as much under that specific type of workload. NFS has to look at RPC/NFS headers anyway, so it will benefit from the cache being warm. -Olof ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-20 22:14 ` Andrew Grover 2006-04-20 23:33 ` Olof Johansson @ 2006-04-21 0:38 ` David S. Miller 2006-04-21 1:02 ` Rick Jones 2006-04-21 2:23 ` Herbert Xu 1 sibling, 2 replies; 20+ messages in thread From: David S. Miller @ 2006-04-21 0:38 UTC (permalink / raw) To: andy.grover; +Cc: olof, andrew.grover, netdev From: "Andrew Grover" <andy.grover@gmail.com> Date: Thu, 20 Apr 2006 15:14:15 -0700 > First obviously it's a technology for RX CPU improvement so there's no > benefit on TX workloads. Second it depends on there being buffers to > copy the data into *before* the data arrives. This happens to be the > case for benchmarks like netperf and Chariot, but real apps using > poll/select wouldn't see a benefit. Just laying the cards out here. > BUT we are seeing very good CPU savings on some workloads, so for > those apps (and if select/poll apps could make use of a > yet-to-be-implemented async net interface) it would be a win. > > I don't know what the breakdown is of apps doing blocking reads vs. > waiting, does anyone know? All the bandwidth benchmarks tend to block; real world servers (and most clients to some extent) tend to use non-blocking reads and poll/select except in some very limited cases and designs doing something like 1 thread per connection. This is an issue for the TCP prequeue and as a consequence VJ's net channel ideas. We need something to wake up some context in order to push channel data. All the net channel stuff really wants is an execution context to run the TCP stack outside of software interrupts. I/O AT wants something similar. For net channels probably the best thing to do is to just queue to the socket's netchannel, mark poll state appropriately, and just wait for the thread to get back into recvmsg() to run the queue. So I think net channels can be handled in all cases and application I/O models. For I/O AT you'd really want to get the DMA engine going as soon as you had those packets, but I do not see a clean and reliable way to determine the target pages before the app gets back to recvmsg(). I/O AT really expects a lot of things to be in place in order for it to function at all. And sadly, that set of requirements isn't actually very common outside of benchmarking tools and a few uncommonly designed servers. Even a web browser does non-blocking reads and poll(). ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 0:38 ` David S. Miller @ 2006-04-21 1:02 ` Rick Jones 2006-04-21 2:23 ` Herbert Xu 1 sibling, 0 replies; 20+ messages in thread From: Rick Jones @ 2006-04-21 1:02 UTC (permalink / raw) To: David S. Miller; +Cc: andy.grover, olof, andrew.grover, netdev David S. Miller wrote: > From: "Andrew Grover" <andy.grover@gmail.com> > Date: Thu, 20 Apr 2006 15:14:15 -0700 > > >>First obviously it's a technology for RX CPU improvement so there's no >>benefit on TX workloads. Second it depends on there being buffers to >>copy the data into *before* the data arrives. This happens to be the >>case for benchmarks like netperf and Chariot, but real apps using >>poll/select wouldn't see a benefit. Just laying the cards out here. >>BUT we are seeing very good CPU savings on some workloads, so for >>those apps (and if select/poll apps could make use of a >>yet-to-be-implemented async net interface) it would be a win. >> >>I don't know what the breakdown is of apps doing blocking reads vs. >>waiting, does anyone know? > > > All the bandwidth benchmarks tend to block; real world servers (and > most clients to some extent) tend to use non-blocking reads and > poll/select except in some very limited cases and designs doing > something like 1 thread per connection. Another netperf2 option :) (not exported via configure though): if a certain define is set (look at recv_tcp_stream() in nettest_bsd.c), then netperf will call select() before it calls recv(). rick jones ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 0:38 ` David S. Miller 2006-04-21 1:02 ` Rick Jones @ 2006-04-21 2:23 ` Herbert Xu 1 sibling, 0 replies; 20+ messages in thread From: Herbert Xu @ 2006-04-21 2:23 UTC (permalink / raw) To: David S. Miller; +Cc: andy.grover, olof, andrew.grover, netdev David S. Miller <davem@davemloft.net> wrote: > > For I/O AT you'd really want to get the DMA engine going as soon > as you had those packets, but I do not see a clean and reliable way > to determine the target pages before the app gets back to recvmsg(). The vmsplice() system call proposed by Linus might be a good fit. http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/0854.html -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 20+ messages in thread
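For context on the suggestion above, here is a rough sketch of the usage model behind vmsplice(): the application donates ("gifts") whole pages of its address space into a pipe so the kernel can later fill or consume them without a copy at read time. The system call was still only a proposal when this thread was written, so the exact prototype and flag shown here should be read as illustrative rather than as a settled interface.

/* Rough sketch of the vmsplice() usage model: gift a page-aligned user
 * buffer into a pipe so the kernel can own it without copying.  The call
 * was only a proposal at the time of this thread, so treat the signature
 * and SPLICE_F_GIFT flag as illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice, SPLICE_F_GIFT */
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>     /* pipe */
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
	int pipefd[2];
	void *buf;
	struct iovec iov;
	ssize_t n;

	if (pipe(pipefd) < 0)
		return 1;
	/* Gifted pages must be whole, page-aligned pages. */
	if (posix_memalign(&buf, 4096, 4096) != 0)
		return 1;

	iov.iov_base = buf;
	iov.iov_len  = 4096;

	/* Hand the page to the kernel; the app must not touch it afterwards. */
	n = vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT);
	if (n < 0) {
		perror("vmsplice");
		return 1;
	}
	printf("gifted %zd bytes into the pipe\n", n);
	return 0;
}

The relevance to the target-page problem David raises is that the application names its pages to the kernel up front, instead of the kernel having to discover them at recvmsg() time.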
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-20 21:33 ` Olof Johansson 2006-04-20 22:14 ` Andrew Grover @ 2006-04-21 0:27 ` David S. Miller 2006-04-21 1:00 ` Rick Jones ` (2 more replies) 1 sibling, 3 replies; 20+ messages in thread From: David S. Miller @ 2006-04-21 0:27 UTC (permalink / raw) To: olof; +Cc: andrew.grover, netdev From: Olof Johansson <olof@lixom.net> Date: Thu, 20 Apr 2006 16:33:05 -0500 > From the wiki: > > > 3. Data copied by I/OAT is not cached > > This is a I/OAT device limitation and not a global statement of the > DMA infrastructure. Other platforms might be able to prime caches > with the DMA traffic. Hint flags should be added on either the channel > allocation calls, or per-operation calls, depending on where it makes > sense driver/client wise. This sidesteps the whole question of _which_ cache to warm. And if you choose wrongly, then what? Besides the control overhead of the DMA engines, the biggest thing lost in my opinion is the perfect cache warming that a cpu based copy does from the kernel socket buffer into userspace. The first thing an application is going to do is touch that data. So I think it's very important to prewarm the caches and the only straightforward way I know of to always warm up the correct cpu's caches is copy_to_user(). Unfortunately, many benchmarks just do raw bandwidth tests sending to a receiver that just doesn't even look at the data. They just return from recvmsg() and loop back into it. This is not what applications using networking actually do, so it's important to make sure we look intelligently at any benchmarks done and do not fall into the trap of saying "even without cache warming it made things faster" when in fact the tested receiver did not touch the data at all so was a false test. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 0:27 ` David S. Miller @ 2006-04-21 1:00 ` Rick Jones 2006-04-21 1:13 ` David S. Miller 0 siblings, 1 reply; 20+ messages in thread From: Rick Jones @ 2006-04-21 1:00 UTC (permalink / raw) To: David S. Miller; +Cc: olof, andrew.grover, netdev > Unfortunately, many benchmarks just do raw bandwidth tests sending to > a receiver that just doesn't even look at the data. They just return > from recvmsg() and loop back into it. This is not what applications > using networking actually do, so it's important to make sure we look > intelligently at any benchmarks done and do not fall into the trap of > saying "even without cache warming it made things faster" when in fact > the tested receiver did not touch the data at all so was a false test. FWIW, netperf can be configured to access the buffers it gives to send() or gets from recv(). A ./configure --enable-dirty in TOT: http://www.netperf.org/svn/netperf2/trunk will enable two global options:
-k dirty,clean # bytes to dirty, bytes to read clean on netperf side
-K dirty,clean # as above, on netserver side.
And in such a netperf the test banner will include the string "dirty data" (alas the default output will not say how much :) In say a TCP_STREAM test -k will affect what is done with a buffer before send() is called, and -K will affect what is done with a buffer _before_ recv() is called with that buffer.
-k N will cause the first N bytes of the buffer to be dirtied, and the next N bytes to be read clean
-k N, will cause the first N bytes of the buffer to be dirtied
-k ,N will cause the first N bytes of the buffer to be read clean
-k M,N will cause the first M bytes to be dirtied, the next N bytes to be read clean
Actually, that brings-up a question - presently, and for reasons that are lost to me in the mists of time - netperf will "access" the buffer before it calls recv(). I'm wondering if that should be changed to an access of the buffer after it calls recv()? And I suspect related to all this is whether or not one should alter the size of the buffer ring being used by netperf, which by default is the SO_*BUF size divided by the send_size (or recv_size) plus one buffers - the -W option can control that. rick jones ^ permalink raw reply [flat|nested] 20+ messages in thread
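To spell out the access pattern those options describe, the loop below mirrors the wording above ("dirty the first N bytes, read the next N bytes clean"). It is written from that description, not lifted from netperf's source, so the names and details are illustrative.

/* Sketch of the dirty/clean access pattern described for -k/-K: write
 * ("dirty") the first <dirty> bytes of a buffer, then do read-only
 * touches of the next <clean> bytes.  Illustrative only, not netperf's
 * actual code. */
#include <stddef.h>

static void access_buffer(char *buf, size_t dirty, size_t clean)
{
	volatile char sink = 0;
	size_t i;

	for (i = 0; i < dirty; i++)
		buf[i] = (char)i;         /* dirty: bring the line in for writing */
	for (i = 0; i < clean; i++)
		sink += buf[dirty + i];   /* clean: read-only touch */
	(void)sink;
}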
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 1:00 ` Rick Jones @ 2006-04-21 1:13 ` David S. Miller 2006-04-21 17:12 ` Rick Jones 0 siblings, 1 reply; 20+ messages in thread From: David S. Miller @ 2006-04-21 1:13 UTC (permalink / raw) To: rick.jones2; +Cc: olof, andrew.grover, netdev From: Rick Jones <rick.jones2@hp.com> Date: Thu, 20 Apr 2006 18:00:37 -0700 > Actually, that brings-up a question - presently, and for reasons that > are lost to me in the mists of time - netperf will "access" the buffer > before it calls recv(). I'm wondering if that should be changed to an > access of the buffer after it calls recv()? Yes, that's what it should do, as this is what a real application would do. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 1:13 ` David S. Miller @ 2006-04-21 17:12 ` Rick Jones 2006-04-27 23:49 ` Chris Leech 0 siblings, 1 reply; 20+ messages in thread From: Rick Jones @ 2006-04-21 17:12 UTC (permalink / raw) To: David S. Miller; +Cc: olof, andrew.grover, netdev David S. Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Thu, 20 Apr 2006 18:00:37 -0700 > > >>Actually, that brings-up a question - presently, and for reasons that >>are lost to me in the mists of time - netperf will "access" the buffer >>before it calls recv(). I'm wondering if that should be changed to an >>access of the buffer after it calls recv()? > > > Yes, that's what it should do, as this is what a real > application would do. Netperf2 TOT now accesses the buffer that was just recv()'d rather than the one that is about to be recv()'d. rick jones ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 17:12 ` Rick Jones @ 2006-04-27 23:49 ` Chris Leech 2006-04-27 23:53 ` Rick Jones 0 siblings, 1 reply; 20+ messages in thread From: Chris Leech @ 2006-04-27 23:49 UTC (permalink / raw) To: Rick Jones; +Cc: David S. Miller, olof, andrew.grover, netdev > Netperf2 TOT now accesses the buffer that was just recv()'d rather than > the one that is about to be recv()'d. We've posted netperf2 results with I/OAT enabled/disabled and the data access option on/off at http://kernel.org/pub/linux/kernel/people/grover/ioat/netperf-icb-1.5-postscaling-both.pdf This link has also been added to the I/OAT page on the LinuxNet wiki. - Chris ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-27 23:49 ` Chris Leech @ 2006-04-27 23:53 ` Rick Jones 0 siblings, 0 replies; 20+ messages in thread From: Rick Jones @ 2006-04-27 23:53 UTC (permalink / raw) To: chris.leech; +Cc: David S. Miller, olof, andrew.grover, netdev Chris Leech wrote: >>Netperf2 TOT now accesses the buffer that was just recv()'d rather than >>the one that is about to be recv()'d. > > > We've posted netperf2 results with I/OAT enabled/disabled and the data > access option on/off at > http://kernel.org/pub/linux/kernel/people/grover/ioat/netperf-icb-1.5-postscaling-both.pdf Calling it netperf data verification is quite overstated - all netperf does is read from or write to the buffer. There is no check of data integrity or anything. rick jones ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 0:27 ` David S. Miller 2006-04-21 1:00 ` Rick Jones @ 2006-04-21 3:04 ` Olof Johansson 2006-04-21 3:42 ` David S. Miller 2006-04-21 17:13 ` Ingo Oeser 2 siblings, 1 reply; 20+ messages in thread From: Olof Johansson @ 2006-04-21 3:04 UTC (permalink / raw) To: David S. Miller; +Cc: olof, andrew.grover, netdev On Thu, Apr 20, 2006 at 05:27:42PM -0700, David S. Miller wrote: > From: Olof Johansson <olof@lixom.net> > Date: Thu, 20 Apr 2006 16:33:05 -0500 > > > From the wiki: > > > > > 3. Data copied by I/OAT is not cached > > > > This is a I/OAT device limitation and not a global statement of the > > DMA infrastructure. Other platforms might be able to prime caches > > with the DMA traffic. Hint flags should be added on either the channel > > allocation calls, or per-operation calls, depending on where it makes > > sense driver/client wise. > > This sidesteps the whole question of _which_ cache to warm. And if > you choose wrongly, then what? > > Besides the control overhead of the DMA engines, the biggest thing > lost in my opinion is the perfect cache warming that a cpu based copy > does from the kernel socket buffer into userspace. It's definitely the easiest way to always make sure the right caches are warm for the app, that I agree with. But, when warming those caches by copying, the data is pulled in through a potentially cold cache in the first place. So the cache misses are just moved from the copy loop to userspace with dma offload. Or am I missing something? > The first thing an application is going to do is touch that data. So > I think it's very important to prewarm the caches and the only > straightforward way I know of to always warm up the correct cpu's > caches is copy_to_user(). The other way (assuming the hardware supports cache warming) would be to pass down affinities (or look them up during receive processing, I'm not sure that's practical the way things work now), and dispatch on a DMA channel with the right cache affinity. I've got a feeling that "straightforward" is not a term to use for describing that solution though. > Unfortunately, many benchmarks just do raw bandwidth tests sending to > a receiver that just doesn't even look at the data. They just return > from recvmsg() and loop back into it. This is not what applications > using networking actually do, so it's important to make sure we look > intelligently at any benchmarks done and do not fall into the trap of > saying "even without cache warming it made things faster" when in fact > the tested receiver did not touch the data at all so was a false test. Yes, some real-life-like benchmarking is definitely needed. Unfortunately I'm not in a position where I can do much (and share numbers) at the moment myself. -Olof ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 3:04 ` Olof Johansson @ 2006-04-21 3:42 ` David S. Miller 2006-04-21 4:42 ` Olof Johansson 2006-04-27 23:45 ` Chris Leech 0 siblings, 2 replies; 20+ messages in thread From: David S. Miller @ 2006-04-21 3:42 UTC (permalink / raw) To: olof; +Cc: andrew.grover, netdev From: Olof Johansson <olof@lixom.net> Date: Thu, 20 Apr 2006 22:04:26 -0500 > On Thu, Apr 20, 2006 at 05:27:42PM -0700, David S. Miller wrote: > > Besides the control overhead of the DMA engines, the biggest thing > > lost in my opinion is the perfect cache warming that a cpu based copy > > does from the kernel socket buffer into userspace. > > It's definitely the easiest way to always make sure the right caches > are warm for the app, that I agree with. > > But, when warming those caches by copying, the data is pulled in through > a potentially cold cache in the first place. So the cache misses are > just moved from the copy loop to userspace with dma offload. Or am I > missing something? Yes, and it means that the memory bandwidth costs are equivalent between I/O AT and cpu copy. In the cpu copy case you eat the read cache miss, but on the write side you'll prewarm the cache properly. In the I/O AT case you eat the same read cost, but the cache will not be prewarmed, so you'll eat the read cache miss in the application. It's moving the same exact cost from one place to another. The time it takes to get the app to make forward progress (meaning returned from the recvmsg() system call and back in userspace) must by definition take at least as long with I/O AT as it does with cpu copies. Yet in the I/O AT case, the application must wait that long and also then take in the delays of the cache misses when it tries to read the data that the I/O AT engine copied. Instead of eating the cache miss cost in the kernel, we eat it in the app because in the I/O AT case the cpu won't have the user data fresh and loaded into the cpu cache. And I say I/O AT must take "at least as long" as cpu copies because the same memory copy cost is there, and on top of that I/O AT has to program the DMA controller and touch a _lot_ of other state to get things going and then wake the task up. We're talking non-trivial overheads like grabbing the page mappings out of the page tables using get_user_pages(). Evgeniy has posted some very nice performance graphs showing how poorly that function scales. This is basically why none of the performance gains add up to me. I am thus very concerned that the current non-cache-warming implementation may fall flat performance wise. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 3:42 ` David S. Miller @ 2006-04-21 4:42 ` Olof Johansson 2006-04-27 23:45 ` Chris Leech 1 sibling, 0 replies; 20+ messages in thread From: Olof Johansson @ 2006-04-21 4:42 UTC (permalink / raw) To: David S. Miller; +Cc: olof, andrew.grover, netdev On Thu, Apr 20, 2006 at 08:42:00PM -0700, David S. Miller wrote: > This is basically why none of the performance gains add up to me. I > am thus very concerned that the current non-cache-warming > implementation may fall flat performance wise. Ok, I buy your arguments. It does seem unlikely that a DMA offload without cache warmth will be a net gain. More performance data is definitely required. After digging through PDFs, it seems as if the Freescale 85xx (at least, probably earlier models as well) can warm L2 for the DMA destination data. However, I don't have any hardware with it to play around with for benchmarking to see what cache warming might bring (back), performance-wise. I think there is still use for a common multi-function DMA framework across platforms and client components, even if net receive doesn't end up being {a,the first} user. -Olof ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 3:42 ` David S. Miller 2006-04-21 4:42 ` Olof Johansson @ 2006-04-27 23:45 ` Chris Leech 1 sibling, 0 replies; 20+ messages in thread From: Chris Leech @ 2006-04-27 23:45 UTC (permalink / raw) To: David S. Miller; +Cc: olof, andrew.grover, netdev On 4/20/06, David S. Miller <davem@davemloft.net> wrote: > Yes, and it means that the memory bandwidth costs are equivalent > between I/O AT and cpu copy. The following is a response from the I/OAT architects. I only point out that this is not coming directly from me because I have not seen the data to verify the claims regarding the speed of a copy vs a load and the cost of the rep mov instruction. I'll encourage more direct participation in this discussion from the architects moving forward. - Chris Let's talk about the caching benefit that is seemingly lost when using the DMA engine. The intent of the DMA engine is to save CPU cycles spent in copying data (rep mov). In cases where the destination is already warm in cache (due to destination buffer re-use) and the source is in memory, the cycles spent in a host copy are not just due to the cache misses it encounters in the process of bringing in the source but also due to the execution of rep mov itself within the host core. If you contrast this to simply touching (loading) the data residing in memory, the cost of this load is primarily the cost of the cache misses and not so much CPU execution time. Given this, some of the following points are noteworthy: 1. While the DMA engine forces the destination to be in memory and touching it may cause the same number of observable cache misses as a host copy assuming a cache warmed destination, the cost of the host copy (in terms of CPU cycles) is much more than the cost of the touch. 2. CPU hardware prefetchers do a pretty good job of staying ahead of the fetch stream to minimize cache misses. So for loads of medium to large buffers, cache misses form a much smaller component of the data fetch time…most of it is dominated by front side bus (FSB) or memory bandwidth. For small buffers, we do not use the DMA engine but if we had to, we would insert SW prefetches that do reasonably well. 3. If the destination wasn't already warm in cache, i.e., it was in memory or some other CPU's cache, host copy will have to snoop and bring the destination in and will encounter additional misses on the destination buffer as well. These misses are the same as those encountered in #1 above when using the DMA engine and touching the data afterwards. So in effect it becomes a wash when compared to the DMA engine's behavior. The case where the destination is already warm in cache is common in benchmarks such as iperf, ttcp etc. where the same buffer is reused over and over again. Real applications typically will not exhibit this aggressive buffer re-use behavior. 4. It may take a large number of packets (and several interrupts) to satisfy a large posted buffer (say 64KB). Even if you use host copy to warm the cache with the destination, there is no guarantee that some or all of the destination will stay in the cache before the application has a chance to read the data. 5. The source data payload (skb->data) is typically needed only once for the copy and has no use later. The host copy brings it into the cache and may end up polluting the cache, and consuming FSB bandwidth whereas the DMA engine avoids this altogether.
The IxChariot data posted earlier, which touches the data and yet shows an I/OAT benefit, is explained by some of the reasons above. Bottom line is that I agree with the cache benefit argument of host copy for small buffers (64B to 512B), but for larger buffers and certain application scenarios (destination in memory), the DMA engine will show better performance regardless of where the destination buffer resided to begin with and where it is accessed from. ^ permalink raw reply [flat|nested] 20+ messages in thread
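The copy-versus-touch claim above is easy to probe, at least crudely, from user space. The sketch below times copying cache-cold 64KB chunks into a warm, reused destination (the host-copy pattern the architects describe) against merely reading the same cold chunks one cache line at a time (roughly what an application does after a DMA engine has landed the data in memory). It ignores DMA programming overhead, get_user_pages(), and real socket behaviour entirely, so it is only an illustration of the argument, not a substitute for the netperf or IxChariot numbers.

/* Crude microbenchmark sketch: CPU cost of copying cold data into a warm,
 * reused destination versus merely reading the cold data.  Illustrative
 * only; it models neither DMA setup cost nor real receive-path behaviour. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SZ   (64 * 1024)           /* one large posted buffer */
#define POOL_SZ  (256 * 1024 * 1024)   /* walk far more data than fits in cache */

static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *pool = malloc(POOL_SZ);
	char *dst  = malloc(BUF_SZ);
	volatile unsigned long sum = 0;
	size_t off;
	double t0, t_copy, t_touch;

	if (!pool || !dst)
		return 1;
	memset(pool, 1, POOL_SZ);   /* fault the pages in */
	memset(dst, 0, BUF_SZ);     /* keep the destination warm and reused */

	t0 = now_sec();
	for (off = 0; off + BUF_SZ <= POOL_SZ; off += BUF_SZ)
		memcpy(dst, pool + off, BUF_SZ);   /* cold source -> warm destination */
	t_copy = now_sec() - t0;

	t0 = now_sec();
	for (off = 0; off + BUF_SZ <= POOL_SZ; off += BUF_SZ) {
		size_t i;
		for (i = 0; i < BUF_SZ; i += 64)   /* one load per cache line */
			sum += pool[off + i];
	}
	t_touch = now_sec() - t0;

	printf("copy: %.3fs  touch: %.3fs (sum %lu)\n", t_copy, t_touch, sum);
	free(pool);
	free(dst);
	return 0;
}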
* Re: [PATCH 0/10] [IOAT] I/OAT patches repost 2006-04-21 0:27 ` David S. Miller 2006-04-21 1:00 ` Rick Jones 2006-04-21 3:04 ` Olof Johansson @ 2006-04-21 17:13 ` Ingo Oeser 2 siblings, 0 replies; 20+ messages in thread From: Ingo Oeser @ 2006-04-21 17:13 UTC (permalink / raw) To: David S. Miller; +Cc: olof, andrew.grover, netdev, Ingo Oeser David S. Miller wrote: > The first thing an application is going to do is touch that data. So > I think it's very important to prewarm the caches and the only > straightforward way I know of to always warm up the correct cpu's > caches is copy_to_user(). Hmm, what if the application is something like an MPEG demultiplexer? There you don't want to look at everything and explicitly ignore received data[1]. Yes, I know this is usually done with hardware demuxers and filters, but the principle might apply to other applications as well, for which no hardware solutions exist. Regards Ingo Oeser [1] which you cannot ignore properly with Linux yet. ^ permalink raw reply [flat|nested] 20+ messages in thread