* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-17 6:06 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-17 6:06 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
Hi Dave,
> I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
> run a longer one tonight). The results are (results in KB/s, and %):
I ran an 8.5-hour run with no batching plus another 8.5-hour run with
batching (Buffer sizes: "32 128 512 4096 16384", Threads: "1 8 32",
Each test run time: 3 minutes, Iterations to average: 5). TCP seems
to get a small improvement.
Thanks,
- KK
-------------------------------------------------------------------------------
TCP (KB/s)
-----------
                        No batching    Batching    % Change
Size:32 Procs:1 3415 3321 -2.75
Size:128 Procs:1 13094 13388 2.24
Size:512 Procs:1 49037 50683 3.35
Size:4096 Procs:1 114646 114619 -.02
Size:16384 Procs:1 114626 114644 .01
Size:32 Procs:8 22675 22633 -.18
Size:128 Procs:8 77994 77297 -.89
Size:512 Procs:8 114716 114711 0
Size:4096 Procs:8 114637 114636 0
Size:16384 Procs:8 95814 114638 19.64
Size:32 Procs:32 23240 23349 .46
Size:128 Procs:32 82284 82247 -.04
Size:512 Procs:32 114885 114769 -.10
Size:4096 Procs:32 95735 114634 19.74
Size:16384 Procs:32 114736 114641 -.08
Average: 1151534 1190210 3.36%
-------------------------------------------------------------------------------
No Delay (KB/s):
---------
                        No batching    Batching    % Change
Size:32 Procs:1 3002 2873 -4.29
Size:128 Procs:1 11853 11801 -.43
Size:512 Procs:1 45565 45837 .59
Size:4096 Procs:1 114511 114485 -.02
Size:16384 Procs:1 114521 114555 .02
Size:32 Procs:8 8026 8029 .03
Size:128 Procs:8 31589 31573 -.05
Size:512 Procs:8 111506 105766 -5.14
Size:4096 Procs:8 114455 114454 0
Size:16384 Procs:8 95833 114491 19.46
Size:32 Procs:32 8005 8027 .27
Size:128 Procs:32 31475 31505 .09
Size:512 Procs:32 114558 113687 -.76
Size:4096 Procs:32 114784 114447 -.29
Size:16384 Procs:32 114719 114496 -.19
Average: 1046026 1034402 -1.11%
-------------------------------------------------------------------------------
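For reference, a test matrix like the one above can be driven with a small
shell loop around netperf. The sketch below is only illustrative: the
receiver host name and the exact harness used for these numbers are
assumptions, not taken from this thread (add the test-specific -D option
after the "--" for the "No Delay" runs):

    #!/bin/sh
    # Hypothetical sweep: buffer sizes x thread counts, 3-minute TCP_STREAM
    # runs, 5 iterations each, against a receiver host named "receiver".
    for size in 32 128 512 4096 16384; do
        for procs in 1 8 32; do
            for iter in 1 2 3 4 5; do
                for i in $(seq $procs); do
                    netperf -H receiver -l 180 -t TCP_STREAM -- -m $size &
                done
                wait
            done
        done
    done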
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 7:18 David Miller
From: David Miller @ 2007-08-21 7:18 UTC (permalink / raw)
To: krkumar2
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 17 Aug 2007 11:36:03 +0530

> > I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
> > run a longer one tonight). The results are (results in KB/s, and %):
>
> I ran an 8.5-hour run with no batching plus another 8.5-hour run with
> batching (Buffer sizes: "32 128 512 4096 16384", Threads: "1 8 32",
> Each test run time: 3 minutes, Iterations to average: 5). TCP seems
> to get a small improvement.

Using 16K buffer size really isn't going to keep the pipe full enough
for TSO.  And realistically, applications queue much more data at a
time.

Also, smaller buffer sizes can have negative effects on the dynamic
receive and send buffer growth algorithm the kernel uses; it might
consider the connection application-limited for too long.

I would really prefer to see numbers that use buffer sizes more in line
with the amount of data that is typically in flight on a 1G connection
on a local network.

Do a tcpdump during the height of the transfer to see about what this
value is.  When an ACK comes in, compare the sequence number it's
ACK'ing with the sequence number of the most recently sent frame.  The
difference is approximately the pipe size at maximum congestion window,
assuming a loss-free local network.
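The pipe-size estimate David describes can be read off a capture; the
commands below are only a sketch, with the interface name, port, and file
name as assumptions:

    # Capture TCP headers on the sender during the middle of the transfer
    # (eth2 and port 5001 are placeholders):
    tcpdump -i eth2 -s 96 -w flow.pcap 'tcp port 5001'
    # Read the capture back; for each incoming ACK, subtract its ack number
    # from the highest sequence number sent so far -- the difference
    # approximates the data in flight (pipe size) on a loss-free LAN:
    tcpdump -nr flow.pcap | less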
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 12:30 jamal
From: jamal @ 2007-08-21 12:30 UTC (permalink / raw)
To: David Miller
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan, tgraf,
johnpol, shemminger, kaber, herbert

On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:

> Using 16K buffer size really isn't going to keep the pipe full enough
> for TSO.

Why the comparison with TSO (or GSO for that matter)?
Seems to me that is only valid/fair if you have a single flow.
Batching is multi-flow focused (or i should say flow-unaware).

cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 18:51 David Miller
From: David Miller @ 2007-08-21 18:51 UTC (permalink / raw)
To: hadi
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan, tgraf,
johnpol, shemminger, kaber, herbert

From: jamal <hadi@cyberus.ca>
Date: Tue, 21 Aug 2007 08:30:22 -0400

> On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
>
> > Using 16K buffer size really isn't going to keep the pipe full enough
> > for TSO.
>
> Why the comparison with TSO (or GSO for that matter)?

Because TSO does batching already, so it's a very good "tit for tat"
comparison of the new batching scheme vs. an existing one.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 21:09 jamal
From: jamal @ 2007-08-21 21:09 UTC (permalink / raw)
To: David Miller
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan, tgraf,
johnpol, shemminger, kaber, herbert

On Tue, 2007-21-08 at 11:51 -0700, David Miller wrote:

> Because TSO does batching already, so it's a very good "tit for tat"
> comparison of the new batching scheme vs. an existing one.

Fair enough - I may have read too much into your email then ;->
For bulk type of apps (where TSO will make a difference) this is a fair
test.  Hence I agree the 16KB buffer size is not sensible if the goal
is to simulate such an app.

However (and this is where I read too much into what you were saying),
the test by itself is an insufficient comparison.  You gotta look at
the other side of the coin, i.e. at apps where TSO won't buy much.
Examples: a busy ssh or irc server; and you could go as far as looking
at the most predominant app on the wild west, http (average page size
from a few years back was in the range of 10-20K and can be simulated
with good ole netperf/iperf).

cheers,
jamal
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 22:50 David Miller
From: David Miller @ 2007-08-21 22:50 UTC (permalink / raw)
To: hadi
Cc: krkumar2, gaagaan, general, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

From: jamal <hadi@cyberus.ca>
Date: Tue, 21 Aug 2007 17:09:12 -0400

> Examples: a busy ssh or irc server; and you could go as far as
> looking at the most predominant app on the wild west, http (average
> page size from a few years back was in the range of 10-20K and can
> be simulated with good ole netperf/iperf).

Pages have chunked up considerably in recent years.

Just bringing up a myspace page can give you megabytes of video,
images, flash, and other stuff.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 4:11 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-22 4:11 UTC (permalink / raw)
To: David Miller
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan, tgraf,
sri, shemminger, kaber, herbert

David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:

> From: jamal <hadi@cyberus.ca>
> Date: Tue, 21 Aug 2007 08:30:22 -0400
>
> > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> >
> > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > for TSO.
> >
> > Why the comparison with TSO (or GSO for that matter)?
>
> Because TSO does batching already, so it's a very good
> "tit for tat" comparison of the new batching scheme
> vs. an existing one.

I am planning to do more testing on your suggestion over the
weekend, but I had a comment.  Are you saying that TSO and
batching should be mutually exclusive so hardware that doesn't
support TSO (like IB) only would benefit?

But even if they can co-exist, aren't cases like sending
multiple small skbs better handled with batching?

Thanks,

- KK
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 4:22 David Miller
From: David Miller @ 2007-08-22 4:22 UTC (permalink / raw)
To: krkumar2
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan, tgraf,
sri, shemminger, kaber, herbert

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 22 Aug 2007 09:41:52 +0530

> David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:
>
> > From: jamal <hadi@cyberus.ca>
> > Date: Tue, 21 Aug 2007 08:30:22 -0400
> >
> > > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> > >
> > > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > > for TSO.
> > >
> > > Why the comparison with TSO (or GSO for that matter)?
> >
> > Because TSO does batching already, so it's a very good
> > "tit for tat" comparison of the new batching scheme
> > vs. an existing one.
>
> I am planning to do more testing on your suggestion over the
> weekend, but I had a comment.  Are you saying that TSO and
> batching should be mutually exclusive so hardware that doesn't
> support TSO (like IB) only would benefit?
>
> But even if they can co-exist, aren't cases like sending
> multiple small skbs better handled with batching?

I'm not making any suggestions, so don't read that into anything I've
said :-)

I think the jury is still out, but seeing TSO perform even slightly
worse with the batching changes in place would be very worrisome.
This applies to both throughput and cpu utilization.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 7:03 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-22 7:03 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

Hi Dave,

David Miller <davem@davemloft.net> wrote on 08/22/2007 09:52:29 AM:

> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 22 Aug 2007 09:41:52 +0530
>
<snip>
> > > Because TSO does batching already, so it's a very good
> > > "tit for tat" comparison of the new batching scheme
> > > vs. an existing one.
> >
> > I am planning to do more testing on your suggestion over the
> > weekend, but I had a comment.  Are you saying that TSO and
> > batching should be mutually exclusive so hardware that doesn't
> > support TSO (like IB) only would benefit?
> >
> > But even if they can co-exist, aren't cases like sending
> > multiple small skbs better handled with batching?
>
> I'm not making any suggestions, so don't read that into anything I've
> said :-)
>
> I think the jury is still out, but seeing TSO perform even slightly
> worse with the batching changes in place would be very worrisome.
> This applies to both throughput and cpu utilization.

Does turning off batching solve that problem?  What I mean by that is:
batching can be disabled if a TSO device is worse for some cases.

In fact, something that I had changed in my latest code is to not
enable batching in register_netdevice (in Rev4, which I am sending in
a few mins); rather, the user has to explicitly turn 'on' batching.
Wondering if that is what you are concerned about.

In any case, I will test your case on Monday (I am on vacation for the
next couple of days).

Thanks,

- KK
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 9:14 David Miller
From: David Miller @ 2007-08-22 9:14 UTC (permalink / raw)
To: krkumar2
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 22 Aug 2007 12:33:04 +0530

> Does turning off batching solve that problem?  What I mean by that is:
> batching can be disabled if a TSO device is worse for some cases.

This new batching stuff isn't going to be enabled or disabled on a
per-device basis just to get "parity" with how things are now.

It should be enabled by default, and give at least as good performance
as what can be obtained right now.

Otherwise it's a clear regression.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 2:43 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-23 2:43 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

David Miller <davem@davemloft.net> wrote on 08/22/2007 02:44:40 PM:

> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 22 Aug 2007 12:33:04 +0530
>
> > Does turning off batching solve that problem?  What I mean by that is:
> > batching can be disabled if a TSO device is worse for some cases.
>
> This new batching stuff isn't going to be enabled or disabled
> on a per-device basis just to get "parity" with how things are
> now.
>
> It should be enabled by default, and give at least as good
> performance as what can be obtained right now.

That was how it was in earlier revisions.  In revision 4 I coded it so
that it is enabled only if explicitly set by the user.  I can revert
that change.

> Otherwise it's a clear regression.

Definitely.  For drivers that support it, it should not reduce
performance.

Thanks,

- KK
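For context, per-device offload features are already toggled with ethtool,
which is the sort of interface a user-controlled batching knob could
resemble; the tso/gso flags below are real, while the "batching" flag is
purely hypothetical and not an existing ethtool option:

    ethtool -K eth2 tso off      # existing offload toggles, as used later in this thread
    ethtool -K eth2 gso on
    ethtool -K eth2 batching on  # hypothetical: how a per-device batching switch might look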
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 17:09 Rick Jones
From: Rick Jones @ 2007-08-22 17:09 UTC (permalink / raw)
To: David Miller
Cc: jagana, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, kaber, sri

David Miller wrote:

> I think the jury is still out, but seeing TSO perform even slightly
> worse with the batching changes in place would be very worrisome.
> This applies to both throughput and cpu utilization.

Should it be any more or less worrisome than small packet performance
(eg the TCP_RR stuff I posted recently) being rather worse with TSO
enabled than with it disabled?

rick jones
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 20:21 David Miller
From: David Miller @ 2007-08-22 20:21 UTC (permalink / raw)
To: rick.jones2
Cc: krkumar2, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma

From: Rick Jones <rick.jones2@hp.com>
Date: Wed, 22 Aug 2007 10:09:37 -0700

> Should it be any more or less worrisome than small packet
> performance (eg the TCP_RR stuff I posted recently) being rather
> worse with TSO enabled than with it disabled?

That, like any such thing shown by the batching changes, is a bug to
fix.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:04 jamal
From: jamal @ 2007-08-23 22:04 UTC (permalink / raw)
To: David Miller
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, netdev,
johnpol, shemminger, kaber, sri

On Wed, 2007-22-08 at 13:21 -0700, David Miller wrote:

> From: Rick Jones <rick.jones2@hp.com>
> Date: Wed, 22 Aug 2007 10:09:37 -0700
>
> > Should it be any more or less worrisome than small packet
> > performance (eg the TCP_RR stuff I posted recently) being rather
> > worse with TSO enabled than with it disabled?
>
> That, like any such thing shown by the batching changes, is a bug
> to fix.

Possibly a bug - but you really should turn off TSO if you are doing
huge interactive transactions (which is fair because there is a clear
demarcation).

The litmus test is the same as for any change that is supposed to
improve net performance - it has to demonstrate that it is not
intrusive and that it improves (consistently) performance.  The
standard metrics are {throughput, cpu-utilization, latency}, i.e. as
long as one improves and the others don't regress, it would make
sense.  Yes, I am religious about batching after all the invested
sweat (and I continue to work on it hoping to demystify) - the theory
makes a lot of sense.

cheers,
jamal
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:25 jamal
From: jamal @ 2007-08-23 22:25 UTC (permalink / raw)
To: David Miller
Cc: rick.jones2, krkumar2, gaagaan, general, herbert, jagana, jeff,
johnpol, kaber, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma

On Thu, 2007-23-08 at 18:04 -0400, jamal wrote:

> The litmus test is the same as for any change that is supposed to
> improve net performance - it has to demonstrate that it is not
> intrusive and that it improves (consistently) performance.  The
> standard metrics are {throughput, cpu-utilization, latency}, i.e. as
> long as one improves and the others don't regress, it would make
> sense.  Yes, I am religious about batching after all the invested
> sweat (and I continue to work on it hoping to demystify) - the theory
> makes a lot of sense.

Before someone jumps and strangles me ;->  By "litmus test" I meant as
applied to batching.  [TSO already passed - iirc, it has been
demonstrated to really not add much to throughput (can't improve much
over closeness to wire speed) but to improve CPU utilization].

cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:35 Rick Jones
From: Rick Jones @ 2007-08-23 22:35 UTC (permalink / raw)
To: hadi
Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

jamal wrote:

> [TSO already passed - iirc, it has been demonstrated to really not
> add much to throughput (can't improve much over closeness to wire
> speed) but to improve CPU utilization].

In the one gig space sure, but in the 10 Gig space, TSO on/off does
make a difference for throughput.

rick jones
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:41 jamal
From: jamal @ 2007-08-23 22:41 UTC (permalink / raw)
To: Rick Jones
Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

On Thu, 2007-23-08 at 15:35 -0700, Rick Jones wrote:

> jamal wrote:
> > [TSO already passed - iirc, it has been demonstrated to really not
> > add much to throughput (can't improve much over closeness to wire
> > speed) but to improve CPU utilization].
>
> In the one gig space sure, but in the 10 Gig space, TSO on/off does
> make a difference for throughput.

I am still so 1GigE ;->  I stand corrected again ;->

cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 3:18 Bill Fink
From: Bill Fink @ 2007-08-24 3:18 UTC (permalink / raw)
To: Rick Jones
Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, David Miller, sri

On Thu, 23 Aug 2007, Rick Jones wrote:

> jamal wrote:
> > [TSO already passed - iirc, it has been demonstrated to really not
> > add much to throughput (can't improve much over closeness to wire
> > speed) but to improve CPU utilization].
>
> In the one gig space sure, but in the 10 Gig space, TSO on/off does
> make a difference for throughput.

Not too much.

TSO enabled:

[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

TSO disabled:

[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

Pretty negligible difference it seems.  This is with a 2.6.20.7 kernel,
Myricom 10-GigE NICs, and 9000 byte jumbo frames, in a LAN environment.

For grins, I also did a couple of tests with an MSS of 1460 to emulate
a standard 1500 byte Ethernet MTU.

TSO enabled:

[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5102.8503 MB / 10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled:

[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5399.5625 MB / 10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

Here you can see there is a major difference in the TX CPU utilization
(99 % with TSO disabled versus only 39 % with TSO enabled), although
the TSO disabled case was able to squeeze out a little extra
performance from its extra CPU utilization.  Interestingly, with TSO
enabled, the receiver actually consumed more CPU than with TSO
disabled, so I guess the receiver CPU saturation in that case (99 %)
was what restricted its performance somewhat (this was consistent
across a few test runs).

						-Bill
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 12:14 jamal
From: jamal @ 2007-08-24 12:14 UTC (permalink / raw)
To: Bill Fink
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..]
> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra
> performance from its extra CPU utilization.

Good stuff.  What kind of machine?  SMP?

Seems the receive side of the sender is also consuming a lot more cpu;
I suspect because the receiver is generating a lot more ACKs with TSO.

Does the choice of the TCP congestion control algorithm affect results?
It would be interesting to see both MTUs with either TCP BIC vs good
old Reno on the sender (probably without changing what the receiver
does).  BIC seems to be the default lately.

> Interestingly, with TSO enabled, the receiver actually consumed more
> CPU than with TSO disabled,

I would suspect the fact that a lot more packets make it into the
receiver with TSO contributes.

> so I guess the receiver CPU saturation in that case (99 %) was what
> restricted its performance somewhat (this was consistent across a few
> test runs).

Unfortunately the receiver plays a big role in such tests - if it is
bottlenecked then you are not really testing the limits of the
transmitter.

cheers,
jamal
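For anyone repeating the congestion-control comparison jamal asks about,
the algorithm is selected per host through sysctl; a small sketch,
assuming the cubic, bic, and reno modules are available, as on most 2.6
kernels of that era:

    sysctl net.ipv4.tcp_available_congestion_control   # list what the kernel offers
    sysctl -w net.ipv4.tcp_congestion_control=bic      # switch the sender before a run
    sysctl -w net.ipv4.tcp_congestion_control=reno
    sysctl -w net.ipv4.tcp_congestion_control=cubic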
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 18:08 Bill Fink
From: Bill Fink @ 2007-08-24 18:08 UTC (permalink / raw)
To: hadi
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

On Fri, 24 Aug 2007, jamal wrote:

> On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
>
> [..]
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra
> > performance from its extra CPU utilization.
>
> Good stuff.  What kind of machine?  SMP?

Tyan Thunder K8WE S2895ANRF motherboard with Nvidia nForce Professional
2200+2050 chipset, 2 AMD Opteron 254 2.8 GHz CPUs, 4 GB PC3200 ECC
REG-DDR 400 memory, and 2 PCI-Express x16 slots (2 buses).

It is SMP, but both the NIC interrupts and nuttcp are bound to CPU 0,
and all other non-kernel system processes are bound to CPU 1.

> Seems the receive side of the sender is also consuming a lot more cpu;
> I suspect because the receiver is generating a lot more ACKs with TSO.

Odd.  I just reran the TCP CUBIC "-M1460" tests, and with TSO enabled
on the transmitter, there were about 153709 eth2 interrupts on the
receiver, while with TSO disabled there was actually a somewhat higher
number (164988) of receiver side eth2 interrupts, although the receive
side CPU utilization was actually lower in that case.

On the transmit side (different test run), the TSO enabled case had
about 161773 eth2 interrupts whereas the TSO disabled case had about
165179 eth2 interrupts.

> Does the choice of the TCP congestion control algorithm affect results?
> It would be interesting to see both MTUs with either TCP BIC vs good
> old Reno on the sender (probably without changing what the receiver
> does).  BIC seems to be the default lately.

These tests were with the default TCP CUBIC (with initial_ssthresh set
to 0).

With TCP BIC (and initial_ssthresh set to 0):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11751.3750 MB / 10.00 sec = 9853.9839 Mbps 100 %TX 83 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4999.3321 MB / 10.06 sec = 4167.7872 Mbps 38 %TX 100 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB / 10.00 sec = 9910.0682 Mbps 99 %TX 81 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5502.6250 MB / 10.00 sec = 4614.3297 Mbps 100 %TX 84 %RX

And with TCP Reno:

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11782.6250 MB / 10.00 sec = 9880.2613 Mbps 100 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5024.6649 MB / 10.06 sec = 4191.6574 Mbps 38 %TX 99 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0860 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5284.0000 MB / 10.00 sec = 4430.9604 Mbps 99 %TX 79 %RX

Very similar results to the original TCP CUBIC tests.

> > Interestingly, with TSO enabled, the receiver actually consumed more
> > CPU than with TSO disabled,
>
> I would suspect the fact that a lot more packets make it into the
> receiver with TSO contributes.
>
> > so I guess the receiver CPU saturation in that case (99 %) was what
> > restricted its performance somewhat (this was consistent across a
> > few test runs).
>
> Unfortunately the receiver plays a big role in such tests - if it is
> bottlenecked then you are not really testing the limits of the
> transmitter.

It might be interesting to see what effect the LRO changes would have
on this.  Once they are in a stable released kernel, I might try that
out, or maybe even before if I get some spare time (but that's in very
short supply right now).

						-Thanks

						-Bill
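The CPU binding Bill describes is normally done with the IRQ affinity
masks in /proc plus taskset; a rough sketch, where the IRQ number and
interface name are placeholders for illustration:

    grep eth2 /proc/interrupts                # find the NIC's IRQ number
    echo 1 > /proc/irq/1270/smp_affinity      # pin that IRQ to CPU0 (1270 is a placeholder)
    taskset -c 0 nuttcp -w10m 192.168.88.16   # run the benchmark pinned to CPU0 as well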
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 21:25 David Miller
From: David Miller @ 2007-08-24 21:25 UTC (permalink / raw)
To: hadi
Cc: jagana, billfink, peter.p.waskiewicz.jr, herbert, gaagaan,
Robert.Olsson, netdev, rdreier, mcarlson, jeff, general, mchan,
tgraf, johnpol, shemminger, kaber, sri

From: jamal <hadi@cyberus.ca>
Date: Fri, 24 Aug 2007 08:14:16 -0400

> Seems the receive side of the sender is also consuming a lot more cpu;
> I suspect because the receiver is generating a lot more ACKs with TSO.

I've seen this behavior before on a low cpu powered receiver, and the
issue is that batching too much actually hurts a receiver.

If the data packets were better spaced out, the receiver would handle
the load better.

This is the thing the TOE guys keep talking about overcoming with
their packet pacing algorithms in their on-card TOE stack.

My hunch is that even if in the non-TSO case the TX packets were all
back to back in the card's TX ring, TSO still spits them out faster on
the wire.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 23:11 Herbert Xu
From: Herbert Xu @ 2007-08-24 23:11 UTC (permalink / raw)
To: David Miller
Cc: hadi, billfink, rick.jones2, krkumar2, gaagaan, general, jagana,
jeff, johnpol, kaber, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma

On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:
>
> My hunch is that even if in the non-TSO case the TX packets were all
> back to back in the card's TX ring, TSO still spits them out faster on
> the wire.

If this is the case then we should see an improvement by disabling TSO
and enabling GSO.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
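The combination Herbert suggests is set per interface with ethtool,
assuming a kernel and ethtool new enough to expose the GSO flag (eth2 is
taken from Bill's setup):

    ethtool -K eth2 tso off   # disable hardware TCP segmentation offload
    ethtool -K eth2 gso on    # enable software generic segmentation offload
    ethtool -k eth2           # verify the resulting offload settings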
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-25 23:45 Bill Fink
From: Bill Fink @ 2007-08-25 23:45 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, hadi, rick.jones2, krkumar2, gaagaan, general,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

On Sat, 25 Aug 2007, Herbert Xu wrote:

> On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:
> >
> > My hunch is that even if in the non-TSO case the TX packets were all
> > back to back in the card's TX ring, TSO still spits them out faster
> > on the wire.
>
> If this is the case then we should see an improvement by
> disabling TSO and enabling GSO.

TSO disabled and GSO enabled:

[root@lang2 redhat]# nuttcp -w10m 192.168.88.16
11806.7500 MB / 10.00 sec = 9900.6278 Mbps 100 %TX 84 %RX

[root@lang2 redhat]# nuttcp -M1460 -w10m 192.168.88.16
4872.0625 MB / 10.00 sec = 4085.5690 Mbps 100 %TX 64 %RX

In the "-M1460" case, there was generally less receiver CPU
utilization, but the transmitter utilization was generally pegged at
100 %, even though there wasn't any improvement in throughput compared
to the TSO enabled case (in fact the throughput generally seemed to be
somewhat less than the TSO enabled case).  Note there was a fair degree
of variability across runs for the receiver CPU utilization (the one
shown I considered to be representative of the average behavior).

Repeat of previous test results:

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5102.8503 MB / 10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5399.5625 MB / 10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

						-Bill
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 18:46 Rick Jones
From: Rick Jones @ 2007-08-24 18:46 UTC (permalink / raw)
To: Bill Fink
Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, David Miller, sri

Bill Fink wrote:
> On Thu, 23 Aug 2007, Rick Jones wrote:
>
>> jamal wrote:
>>> [TSO already passed - iirc, it has been demonstrated to really not
>>> add much to throughput (can't improve much over closeness to wire
>>> speed) but to improve CPU utilization].
>>
>> In the one gig space sure, but in the 10 Gig space, TSO on/off does
>> make a difference for throughput.
>
> Not too much.
>
> TSO enabled:
>
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
>
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX
>
> TSO disabled:
>
> [root@lang2 ~]# ethtool -K eth2 tso off
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: off
>
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX
>
> Pretty negligible difference it seems.

Leaves one wondering how often more than one segment was sent to the
card in the 9000 byte case :)

rick jones
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-25 0:42 John Heffner
From: John Heffner @ 2007-08-25 0:42 UTC (permalink / raw)
To: Bill Fink
Cc: Rick Jones, hadi, David Miller, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

Bill Fink wrote:

> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra
> performance from its extra CPU utilization.  Interestingly, with TSO
> enabled, the receiver actually consumed more CPU than with TSO
> disabled, so I guess the receiver CPU saturation in that case (99 %)
> was what restricted its performance somewhat (this was consistent
> across a few test runs).

One possibility is that I think the receive-side processing tends to do
better when receiving into an empty queue.  When the (non-TSO) sender
is the flow's bottleneck, this is going to be the case.  But when you
switch to TSO, the receiver becomes the bottleneck and you're always
going to have to put the packets at the back of the receive queue.
This might help account for the reason why you have both lower
throughput and higher CPU utilization -- there's a point of instability
right where the receiver becomes the bottleneck and you end up pushing
it over to the bad side.  :)

Just a theory.  I'm honestly surprised this effect would be so
significant.  What do the numbers from netstat -s look like in the two
cases?

  -John
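A before/after counter comparison like the one Bill posts next is easy to
script; this is just one way to do it -- the "beforeafter" utility is
assumed to be installed (a plain diff works as a cruder fallback):

    netstat -s > netstat.before
    nuttcp -M1460 -w10m 192.168.88.16
    netstat -s > netstat.after
    beforeafter netstat.before netstat.after   # field-by-field delta, if available
    diff netstat.before netstat.after          # or just eyeball the changed lines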
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-26 8:41 Bill Fink
From: Bill Fink @ 2007-08-26 8:41 UTC (permalink / raw)
To: John Heffner
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
mcarlson, rdreier, hadi, kaber, jeff, general, mchan, tgraf, netdev,
johnpol, shemminger, David Miller, sri

On Fri, 24 Aug 2007, John Heffner wrote:

> Bill Fink wrote:
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra
> > performance from its extra CPU utilization.  Interestingly, with TSO
> > enabled, the receiver actually consumed more CPU than with TSO
> > disabled, so I guess the receiver CPU saturation in that case (99 %)
> > was what restricted its performance somewhat (this was consistent
> > across a few test runs).
>
> One possibility is that I think the receive-side processing tends to do
> better when receiving into an empty queue.  When the (non-TSO) sender
> is the flow's bottleneck, this is going to be the case.  But when you
> switch to TSO, the receiver becomes the bottleneck and you're always
> going to have to put the packets at the back of the receive queue.
> This might help account for the reason why you have both lower
> throughput and higher CPU utilization -- there's a point of instability
> right where the receiver becomes the bottleneck and you end up pushing
> it over to the bad side.  :)
>
> Just a theory.  I'm honestly surprised this effect would be so
> significant.  What do the numbers from netstat -s look like in the two
> cases?

Well, I was going to check this out, but I happened to reboot the
system and now I get somewhat different results.  Here are the new
results, which should hopefully be more accurate since they are on a
freshly booted system.

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11610.6875 MB / 10.00 sec = 9735.9526 Mbps 100 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5029.6875 MB / 10.06 sec = 4194.6931 Mbps 36 %TX 100 %RX

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.9375 MB / 10.00 sec = 9909.7773 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5823.3125 MB / 10.00 sec = 4883.2429 Mbps 100 %TX 82 %RX

The TSO disabled case got a little better performance even for 9000
byte jumbo frames.  For the "-M1460" case emulating a standard 1500
byte Ethernet MTU, the performance was significantly better and used
less CPU on the receiver (82 % versus 100 %), although it did use
significantly more CPU on the transmitter (100 % versus 36 %).

TSO disabled and GSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11609.5625 MB / 10.00 sec = 9734.9859 Mbps 99 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5001.4375 MB / 10.06 sec = 4170.6739 Mbps 52 %TX 100 %RX

The GSO enabled case is very similar to the TSO enabled case, except
that for the "-M1460" test the transmitter used more CPU (52 % versus
36 %), which is to be expected since TSO has hardware assist.

Here's the beforeafter delta of the receiver's "netstat -s" statistics
for the TSO enabled case:

Ip:
    3659898 total packets received
    3659898 incoming packets delivered
    80050 requests sent out
Tcp:
    2 passive connection openings
    3659897 segments received
    80050 segments send out
TcpExt:
    33 packets directly queued to recvmsg prequeue.
    104956 packets directly received from backlog
    705528 packets directly received from prequeue
    3654842 packets header predicted
    193 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
    4107083 total packets received
    4107083 incoming packets delivered
    1401376 requests sent out
Tcp:
    2 passive connection openings
    4107083 segments received
    1401376 segments send out
TcpExt:
    2 TCP sockets finished time wait in fast timer
    48486 packets directly queued to recvmsg prequeue.
    1056111048 packets directly received from backlog
    2273357712 packets directly received from prequeue
    1819317 packets header predicted
    2287497 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    10 predicted acknowledgments

For the TSO disabled case, there are far more TCP segments sent out
(1401376 versus 80050), which I assume are ACKs, and which could
possibly contribute to the higher throughput for the TSO disabled case
due to faster feedback, but not explain the lower CPU utilization.
There are many more packets directly queued to recvmsg prequeue
(48486 versus 33).  The numbers for packets directly received from
backlog and prequeue in the TSO disabled case seem bogus to me, so
I don't know how to interpret that.  There are only about half as
many packets header predicted (1819317 versus 3654842), but there
are many more packets header predicted and directly queued to user
(2287497 versus 193).  I'll leave the analysis of all this to those
who might actually know what it all means.

I also ran another set of tests that may be of interest.  I changed
the rx-usecs/tx-usecs interrupt coalescing parameter from the
recommended optimum value of 75 usecs to 0 (no coalescing), but only
on the transmitter.  The comparison discussions below are relative to
the previous tests where rx-usecs/tx-usecs were set to 75 usecs.

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11812.8125 MB / 10.00 sec = 9905.6640 Mbps 100 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
7701.8750 MB / 10.00 sec = 6458.5541 Mbps 100 %TX 56 %RX

For 9000 byte jumbo frames it now gets a little better performance and
almost matches the 10-GigE line rate performance of the TSO disabled
case.  For the "-M1460" test, it gets substantially better performance
(6458.5541 Mbps versus 4194.6931 Mbps) at the expense of much higher
transmitter CPU utilization (100 % versus 36 %), although the receiver
CPU utilization is much less (56 % versus 100 %).

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.3125 MB / 10.00 sec = 9909.4058 Mbps 100 %TX 76 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4081.2500 MB / 10.00 sec = 3422.3994 Mbps 99 %TX 41 %RX

For 9000 byte jumbo frames the results are essentially the same.  For
the "-M1460" test, the performance is significantly worse (3422.3994
Mbps versus 4883.2429 Mbps) even though the transmitter CPU utilization
is saturated in both cases, but the receiver CPU utilization is about
half (41 % versus 82 %).

TSO disabled and GSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.3750 MB / 10.00 sec = 9906.1090 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
3939.1875 MB / 10.00 sec = 3303.2814 Mbps 100 %TX 41 %RX

For 9000 byte jumbo frames the performance is a little better, again
approaching 10-GigE line rate.  But for the "-M1460" test, the
performance is significantly worse (3303.2814 Mbps versus 4170.6739
Mbps) even though the transmitter consumes much more CPU (100 % versus
52 %).  In this case, though, the receiver has a much lower CPU
utilization (41 % versus 100 %).

						-Bill
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-27 1:32 John Heffner
From: John Heffner @ 2007-08-27 1:32 UTC (permalink / raw)
To: Bill Fink
Cc: Rick Jones, hadi, David Miller, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

Bill Fink wrote:
> Here's the beforeafter delta of the receiver's "netstat -s"
> statistics for the TSO enabled case:
>
> Ip:
>     3659898 total packets received
>     3659898 incoming packets delivered
>     80050 requests sent out
> Tcp:
>     2 passive connection openings
>     3659897 segments received
>     80050 segments send out
> TcpExt:
>     33 packets directly queued to recvmsg prequeue.
>     104956 packets directly received from backlog
>     705528 packets directly received from prequeue
>     3654842 packets header predicted
>     193 packets header predicted and directly queued to user
>     4 acknowledgments not containing data received
>     6 predicted acknowledgments
>
> And here it is for the TSO disabled case (GSO also disabled):
>
> Ip:
>     4107083 total packets received
>     4107083 incoming packets delivered
>     1401376 requests sent out
> Tcp:
>     2 passive connection openings
>     4107083 segments received
>     1401376 segments send out
> TcpExt:
>     2 TCP sockets finished time wait in fast timer
>     48486 packets directly queued to recvmsg prequeue.
>     1056111048 packets directly received from backlog
>     2273357712 packets directly received from prequeue
>     1819317 packets header predicted
>     2287497 packets header predicted and directly queued to user
>     4 acknowledgments not containing data received
>     10 predicted acknowledgments
>
> For the TSO disabled case, there are far more TCP segments sent out
> (1401376 versus 80050), which I assume are ACKs, and which could
> possibly contribute to the higher throughput for the TSO disabled case
> due to faster feedback, but not explain the lower CPU utilization.
> There are many more packets directly queued to recvmsg prequeue
> (48486 versus 33).  The numbers for packets directly received from
> backlog and prequeue in the TSO disabled case seem bogus to me, so
> I don't know how to interpret that.  There are only about half as
> many packets header predicted (1819317 versus 3654842), but there
> are many more packets header predicted and directly queued to user
> (2287497 versus 193).  I'll leave the analysis of all this to those
> who might actually know what it all means.

There are a few interesting things here.  For one, the bursts caused
by TSO seem to be causing the receiver to do stretch acks.  This may
have a negative impact on flow performance, but it's hard to say for
sure how much.  Interestingly, it will even further reduce the CPU
load on the sender, since it has to process fewer acks.

As I suspected, in the non-TSO case the receiver gets lots of packets
directly queued to user.  This should result in somewhat lower CPU
utilization on the receiver.  I don't know if it can account for all
the difference you see.

The backlog and prequeue values are probably correct, but netstat's
description is wrong.  A quick look at the code reveals these values
are in units of bytes, not packets.

  -John
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-27 2:04 David Miller
From: David Miller @ 2007-08-27 2:04 UTC (permalink / raw)
To: jheffner
Cc: billfink, rick.jones2, hadi, krkumar2, gaagaan, general, herbert,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

From: John Heffner <jheffner@psc.edu>
Date: Sun, 26 Aug 2007 21:32:26 -0400

> There are a few interesting things here.  For one, the bursts caused
> by TSO seem to be causing the receiver to do stretch acks.  This may
> have a negative impact on flow performance, but it's hard to say for
> sure how much.  Interestingly, it will even further reduce the CPU
> load on the sender, since it has to process fewer acks.
>
> As I suspected, in the non-TSO case the receiver gets lots of packets
> directly queued to user.  This should result in somewhat lower CPU
> utilization on the receiver.  I don't know if it can account for all
> the difference you see.

I had completely forgotten these stretch ACK and ucopy issues.

When the receiver gets inundated with a backlog of receive queue
packets, it just spins there copying into userspace _every_ _single_
packet in that queue, then spits out one ACK.  Meanwhile the sender
has to pause long enough for the pipe to empty slightly.

The transfer is much better behaved if we ACK every two full sized
frames we copy into the receiver, and therefore don't stretch ACK, but
at the cost of cpu utilization.

These effects are particularly pronounced on systems where the bus
bandwidth is also one of the limiting factors.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-27 23:23 jamal
From: jamal @ 2007-08-27 23:23 UTC (permalink / raw)
To: David Miller
Cc: jheffner, billfink, rick.jones2, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:

> The transfer is much better behaved if we ACK every two full sized
> frames we copy into the receiver, and therefore don't stretch ACK, but
> at the cost of cpu utilization.

The rx coalescing in theory should help by accumulating more ACKs on
the rx side of the sender.  But it doesn't seem to do that, i.e. for
the 9K MTU you are better off turning off the coalescing if you want
higher numbers.  Also some of the TOE vendors (chelsio?) claim to have
fixed this by reducing bursts on outgoing packets.

Bill:
Who suggested (as per your email) the 75 usec value, and what was it
based on measurement-wise?

BTW, thanks for finding the energy to run those tests and a very
refreshing perspective.  I don't mean to add more work, but I had some
queries: on your earlier tests, I think that Reno showed some
significant differences on the lower MTU case over BIC.  I wonder if
this is consistent?

A side note: although the experimentation reduces the variables (eg
tying all to CPU0), it would be more exciting to see the multi-cpu and
multi-flow sender effect (which IMO is more real world).

Last note: you need a newer netstat.

> These effects are particularly pronounced on systems where the
> bus bandwidth is also one of the limiting factors.

Can you elucidate this a little more, Dave?  Did you mean memory
bandwidth?

cheers,
jamal
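The coalescing parameters under discussion are the per-interface
rx-usecs/tx-usecs values, read and set with ethtool; whether a given
driver honors both is driver-specific, and the myri10ge driver of that
era may also expose its own module parameters for this (eth2 and the
75 usec value are taken from the thread):

    ethtool -c eth2                          # show current interrupt coalescing settings
    ethtool -C eth2 rx-usecs 75 tx-usecs 75  # set 75 usec coalescing
    ethtool -C eth2 rx-usecs 0 tx-usecs 0    # disable coalescing entirely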
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB 2007-08-27 23:23 ` jamal @ 2007-09-14 7:20 ` Bill Fink 2007-09-14 13:44 ` TSO, TCP Cong control etc jamal 2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller 0 siblings, 2 replies; 37+ messages in thread From: Bill Fink @ 2007-09-14 7:20 UTC (permalink / raw) To: hadi Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson, netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf, johnpol, shemminger, David Miller, jheffner, sri On Mon, 27 Aug 2007, jamal wrote: > On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote: > > > The transfer is much better behaved if we ACK every two full sized > > frames we copy into the receiver, and therefore don't stretch ACK, but > > at the cost of cpu utilization. > > The rx coalescing in theory should help by accumulating more ACKs on the > rx side of the sender. But it doesnt seem to do that i.e For the 9K MTU, > you are better off to turn off the coalescing if you want higher > numbers. Also some of the TOE vendors (chelsio?) claim to have fixed > this by reducing bursts on outgoing packets. > > Bill: > who suggested (as per your email) the 75usec value and what was it based > on measurement-wise? Belatedly getting back to this thread. There was a recent myri10ge patch that changed the default value for tx/rx interrupt coalescing to 75 usec claiming it was an optimum value for maximum throughput (and is also mentioned in their external README documentation). I also did some empirical testing to determine the effect of different values of TX/RX interrupt coalescing on 10-GigE network performance, both with TSO enabled and with TSO disabled. The actual test runs are attached at the end of this message, but the results are summarized in the following table (network performance in Mbps). TX/RX interrupt coalescing in usec (both sides) 0 15 30 45 60 75 90 105 TSO enabled 8909 9682 9716 9725 9739 9745 9688 9648 TSO disabled 9113 9910 9910 9910 9910 9910 9910 9910 TSO disabled performance is always better than equivalent TSO enabled performance. With TSO enabled, the optimum performance is indeed at a TX/RX interrupt coalescing value of 75 usec. With TSO disabled, performance is the full 10-GigE line rate of 9910 Mbps for any value of TX/RX interrupt coalescing from 15 usec to 105 usec. > BTW, thanks for the finding the energy to run those tests and a very > refreshing perspective. I dont mean to add more work, but i had some > queries; > On your earlier tests, i think that Reno showed some significant > differences on the lower MTU case over BIC. I wonder if this is > consistent? 
Here's a retest (5 tests each):

TSO enabled:

TCP Cubic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5007.6295 MB / 10.06 sec = 4176.1807 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4950.9279 MB / 10.06 sec = 4130.2528 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4917.1742 MB / 10.05 sec = 4102.5772 Mbps 35 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4948.7920 MB / 10.05 sec = 4128.7990 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4937.5765 MB / 10.05 sec = 4120.6460 Mbps 35 %TX 99 %RX

TCP Bic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5005.5335 MB / 10.06 sec = 4172.9571 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.0625 MB / 10.06 sec = 4169.2960 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.7500 MB / 10.06 sec = 4135.7355 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.3777 MB / 10.06 sec = 4135.6252 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5059.1815 MB / 10.05 sec = 4221.3546 Mbps 37 %TX 99 %RX

TCP Reno:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4973.3532 MB / 10.06 sec = 4147.3589 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4984.4375 MB / 10.06 sec = 4155.2131 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4995.6841 MB / 10.06 sec = 4166.2734 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4982.2500 MB / 10.05 sec = 4156.7586 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4989.9796 MB / 10.05 sec = 4163.0949 Mbps 36 %TX 99 %RX

TSO disabled:

TCP Cubic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5075.8125 MB / 10.02 sec = 4247.3408 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5056.0000 MB / 10.03 sec = 4229.9621 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5047.4375 MB / 10.03 sec = 4223.1203 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5066.1875 MB / 10.03 sec = 4239.1659 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4986.3750 MB / 10.03 sec = 4171.9906 Mbps 99 %TX 100 %RX

TCP Bic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5040.5625 MB / 10.03 sec = 4217.3521 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5049.7500 MB / 10.03 sec = 4225.4585 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5076.5000 MB / 10.03 sec = 4247.6632 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5017.2500 MB / 10.03 sec = 4197.4990 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5013.3125 MB / 10.03 sec = 4194.8851 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5036.0625 MB / 10.03 sec = 4213.9195 Mbps 100 %TX 100 %RX

TCP Reno:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5006.8750 MB / 10.02 sec = 4189.6051 Mbps 99 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5028.1250 MB / 10.02 sec = 4207.4553 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5021.9375 MB / 10.02 sec = 4202.2668 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5000.5625 MB / 10.03 sec = 4184.3109 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5025.1250 MB / 10.03 sec = 4204.7378 Mbps 99 %TX 100 %RX

Not too much variation here, and not quite as high results as
previously.  Some further testing reveals that while this time I
mainly get results like (here for TCP Bic with TSO disabled):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX

I also sometimes get results like:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX

The higher performing results seem to correspond to when there's a
somewhat lower receiver CPU utilization.  I'm not sure, but there
could also have been an effect from running the "-M1460" test after
the 9000 byte jumbo frame test (no jumbo tests were done at all prior
to running the above sets of 5 tests, although I did always discard
an initial "warmup" test, and now that I think about it, some of
those initial discarded "warmup" tests did have somewhat anomalously
high results).

> A side note: Although the experimentation reduces the variables (eg
> tying all to CPU0), it would be more exciting to see multi-cpu and
> multi-flow sender effect (which IMO is more real world).

These systems are intended as test systems for 10-GigE networks,
and as such it's important to get as consistently close to full
10-GigE line rate as possible, and that's why the interrupts and
nuttcp application are tied to CPU0, with almost all other system
applications tied to CPU1.

Now on another system that's intended as a 10-GigE firewall system,
it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to
CPU0 and the interrupts for the other NIC tied to CPU1.  In IP
forwarding tests of this system, I have basically achieved full
bidirectional 10-GigE line rate IP forwarding with 9000 byte jumbo
frames.

chance4 -> chance6 -> chance9   4.85 Gbps rate limited TCP stream
chance5 -> chance6 -> chance9   4.85 Gbps rate limited TCP stream
chance7 <- chance6 <- chance8   10.0 Gbps non-rate limited TCP stream

[root@chance7 ~]# nuttcp -Ic4tc9 -Ri4.85g -w10m 192.168.88.8 192.168.89.16 & \
    nuttcp -Ic5tc9 -Ri4.85g -w10m -P5100 -p5101 192.168.88.9 192.168.89.16 & \
    nuttcp -Ic7rc8 -r -w10m 192.168.89.15
c4tc9:  5778.6875 MB / 10.01 sec = 4842.7158 Mbps 100 %TX 42 %RX
c5tc9:  5778.9375 MB / 10.01 sec = 4843.1595 Mbps 100 %TX 40 %RX
c7rc8: 11509.1875 MB / 10.00 sec = 9650.8009 Mbps 99 %TX 74 %RX

If there's some other specific test you'd like to see, and it's not
too difficult to set up and I have some spare time, I'll see what I
can do.
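For reference, the per-run setup for the above was along these lines
(a rough sketch rather than the exact script used; the IRQ number and
the use of taskset for CPU binding are assumptions, and the
initial_ssthresh path assumes the cubic/bic modules expose it under
/sys/module):

# select the congestion control algorithm for the run
sysctl -w net.ipv4.tcp_congestion_control=cubic   # or bic, reno

# zero the hard-coded initial slow-start threshold (cubic/bic module parameter)
echo 0 > /sys/module/tcp_cubic/parameters/initial_ssthresh
echo 0 > /sys/module/tcp_bic/parameters/initial_ssthresh

# tie the NIC interrupt and the nuttcp sender to CPU0
echo 1 > /proc/irq/<myri10ge-irq>/smp_affinity    # <myri10ge-irq> is a placeholder
taskset -c 0 nuttcp -M1460 -w10m 192.168.88.16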
						-Bill

Testing of effect of RX/TX interrupt coalescing on 10-GigE network
performance (both with TSO enabled and with TSO disabled):
--------------------------------------------------------------------------------

No RX/TX interrupt coalescing (either side):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10649.8750 MB / 10.03 sec = 8908.9806 Mbps 97 %TX 100 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10879.5000 MB / 10.02 sec = 9112.5141 Mbps 99 %TX 99 %RX

RX/TX interrupt coalescing set to 15 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11546.7500 MB / 10.00 sec = 9682.0785 Mbps 99 %TX 90 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.9375 MB / 10.00 sec = 9910.3702 Mbps 100 %TX 92 %RX

RX/TX interrupt coalescing set to 30 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11587.1250 MB / 10.00 sec = 9715.9489 Mbps 99 %TX 81 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.8125 MB / 10.00 sec = 9910.3040 Mbps 100 %TX 81 %RX

RX/TX interrupt coalescing set to 45 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11597.8750 MB / 10.00 sec = 9724.9902 Mbps 99 %TX 76 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.6250 MB / 10.00 sec = 9910.0933 Mbps 100 %TX 77 %RX

RX/TX interrupt coalescing set to 60 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11614.7500 MB / 10.00 sec = 9739.1323 Mbps 100 %TX 74 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9909.9995 Mbps 100 %TX 76 %RX

RX/TX interrupt coalescing set to 75 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11621.7500 MB / 10.00 sec = 9745.0993 Mbps 100 %TX 72 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.0625 MB / 10.00 sec = 9909.7881 Mbps 100 %TX 75 %RX

RX/TX interrupt coalescing set to 90 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11553.1250 MB / 10.00 sec = 9687.6458 Mbps 100 %TX 71 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9910.0837 Mbps 100 %TX 73 %RX

RX/TX interrupt coalescing set to 105 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11505.7500 MB / 10.00 sec = 9647.8558 Mbps 99 %TX 69 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9910.0530 Mbps 100 %TX 74 %RX

^ permalink raw reply	[flat|nested] 37+ messages in thread
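Coalescing values like those in the runs above would typically be set
per interface with ethtool on both hosts; a minimal sketch (the
interface name is a placeholder, and whether this driver honors
tx-usecs separately from rx-usecs is an assumption):

ethtool -C eth2 rx-usecs 75 tx-usecs 75   # coalescing delay under test
ethtool -K eth2 tso off                   # or "tso on" for the TSO-enabled runs
nuttcp -w10m 192.168.88.16                # then re-run the test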
* TSO, TCP Cong control etc
  2007-09-14  7:20 ` [ofa-general] " Bill Fink
@ 2007-09-14 13:44 ` jamal
  2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: jamal @ 2007-09-14 13:44 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, jheffner, rick.jones2, krkumar2, gaagaan, general,
	herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
	tgraf, xma

I've changed the subject to match content..

On Fri, 2007-14-09 at 03:20 -0400, Bill Fink wrote:
> On Mon, 27 Aug 2007, jamal wrote:
>
> > Bill:
> > who suggested (as per your email) the 75usec value and what was it based
> > on measurement-wise?
>
> Belatedly getting back to this thread.  There was a recent myri10ge
> patch that changed the default value for tx/rx interrupt coalescing
> to 75 usec claiming it was an optimum value for maximum throughput
> (and is also mentioned in their external README documentation).

I would think such a value would be very specific to the ring size and
maybe even the machine in use.

> I also did some empirical testing to determine the effect of different
> values of TX/RX interrupt coalescing on 10-GigE network performance,
> both with TSO enabled and with TSO disabled.  The actual test runs
> are attached at the end of this message, but the results are summarized
> in the following table (network performance in Mbps).
>
>                TX/RX interrupt coalescing in usec (both sides)
>                   0     15     30     45     60     75     90    105
>
> TSO enabled    8909   9682   9716   9725   9739   9745   9688   9648
> TSO disabled   9113   9910   9910   9910   9910   9910   9910   9910
>
> TSO disabled performance is always better than equivalent TSO enabled
> performance.  With TSO enabled, the optimum performance is indeed at
> a TX/RX interrupt coalescing value of 75 usec.  With TSO disabled,
> performance is the full 10-GigE line rate of 9910 Mbps for any value
> of TX/RX interrupt coalescing from 15 usec to 105 usec.

Interesting results.
I think J Heffner made a very compelling description the other day
based on your netstat results at the receiver as to what is going on
(refer to the comments on stretch ACKs). If the receiver is fixed,
then you'd see better numbers from TSO.
The 75 microsecs is very benchmarky in my opinion. If I was to pick a
different app or different NIC or run on many cpus with many apps
doing TSO, I highly doubt that will be the right number.

> Here's a retest (5 tests each):
>
> TSO enabled:
>
> TCP Cubic (initial_ssthresh set to 0):
[..]
> TCP Bic (initial_ssthresh set to 0):
[..]
>
> TCP Reno:
> [..]
> TSO disabled:
>
> TCP Cubic (initial_ssthresh set to 0):
> [..]
> TCP Bic (initial_ssthresh set to 0):
> [..]
> TCP Reno:
> [..]
> Not too much variation here, and not quite as high results
> as previously.

BIC seems to be on average better, followed by CUBIC, followed by Reno.
The difference this time may be because you set the ssthresh to 0
(hopefully every run), and so Reno is definitely going to perform less
well since it is a lot less aggressive in comparison to the other two.

> Some further testing reveals that while this
> time I mainly get results like (here for TCP Bic with TSO
> disabled):
>
> [root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
>  4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX
>
> I also sometimes get results like:
>
> [root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
>  5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX

not good.
> The higher performing results seem to correspond to when there's a
> somewhat lower receiver CPU utilization.  I'm not sure, but there
> could also have been an effect from running the "-M1460" test after
> the 9000 byte jumbo frame test (no jumbo tests were done at all prior
> to running the above sets of 5 tests, although I did always discard
> an initial "warmup" test, and now that I think about it, some of
> those initial discarded "warmup" tests did have somewhat anomalously
> high results).

If you didn't reset the ssthresh on every run, could it have been
cached and used on subsequent runs?

> > A side note: Although the experimentation reduces the variables (eg
> > tying all to CPU0), it would be more exciting to see multi-cpu and
> > multi-flow sender effect (which IMO is more real world).
>
> These systems are intended as test systems for 10-GigE networks,
> and as such it's important to get as consistently close to full
> 10-GigE line rate as possible, and that's why the interrupts and
> nuttcp application are tied to CPU0, with almost all other system
> applications tied to CPU1.

Sure, good benchmark. You get to know how well you can do.

> Now on another system that's intended as a 10-GigE firewall system,
> it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to
> CPU0 and the interrupts for the other NIC tied to CPU1.  In IP
> forwarding tests of this system, I have basically achieved full
> bidirectional 10-GigE line rate IP forwarding with 9000 byte jumbo
> frames.

In forwarding, a more meaningful metric would be pps. The cost per
packet tends to dominate the results over the cost per byte.
9K jumbo frames at 10G is less than 500Kpps - so I don't see that
machine you are using sweating at all. To give you a comparison: on a
lower-end Opteron, with a single CPU, I can generate 1Mpps with
batching pktgen; Robert says he can do that even without batching on
an Opteron closer to what you are using.
So if you want to run that test, you'd need to use incrementally
smaller packets.

> If there's some other specific test you'd like to see, and it's not
> too difficult to set up and I have some spare time, I'll see what I
> can do.

Well, the more interesting tests would be to go full throttle on all
the CPUs you have and target one (or more) receivers, i.e. you
simulate a real server.
Can the utility you have be bound to a cpu? If yes, you should be able
to achieve this without much effort.

Thanks a lot Bill for the effort.

cheers,
jamal

^ permalink raw reply	[flat|nested] 37+ messages in thread
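For reference, the rough arithmetic behind the "less than 500Kpps"
figure, ignoring Ethernet preamble/IFG/CRC overhead for simplicity (a
back-of-the-envelope sketch, not an exact wire-rate calculation):

# approximate packets/sec needed to saturate 10 Gbps at a given frame size
echo $(( 10000000000 / (9000 * 8) ))   # ~139 Kpps with 9000-byte jumbo frames
echo $(( 10000000000 / (1500 * 8) ))   # ~833 Kpps with 1500-byte frames
echo $(( 10000000000 / (64 * 8) ))     # ~19.5 Mpps with minimum-size frames

which is why exercising the per-packet forwarding cost requires
stepping down to much smaller packet sizes.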
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-09-14  7:20 ` [ofa-general] " Bill Fink
  2007-09-14 13:44 ` TSO, TCP Cong control etc jamal
@ 2007-09-14 17:24 ` David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: David Miller @ 2007-09-14 17:24 UTC (permalink / raw)
  To: billfink
  Cc: hadi, jheffner, rick.jones2, krkumar2, gaagaan, general, herbert,
	jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
	tgraf, xma

From: Bill Fink <billfink@mindspring.com>
Date: Fri, 14 Sep 2007 03:20:55 -0400

> TSO disabled performance is always better than equivalent TSO enabled
> performance.  With TSO enabled, the optimum performance is indeed at
> a TX/RX interrupt coalescing value of 75 usec.  With TSO disabled,
> performance is the full 10-GigE line rate of 9910 Mbps for any value
> of TX/RX interrupt coalescing from 15 usec to 105 usec.

Note that the systems where the coalescing tweaking is often necessary
are the heavily NUMA'd systems where cpu to device latency can be
huge, like the big SGI ones, which is where we had to tweak things in
the tg3 driver in the first place.

On most systems, as you saw mostly in the non-TSO case, the value
chosen for the most part is arbitrary and not critical.

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:04 ` [ofa-general] " jamal
  2007-08-23 22:25 ` jamal
@ 2007-08-23 22:30 ` David Miller
  2007-08-23 22:38 ` [ofa-general] " jamal
  1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2007-08-23 22:30 UTC (permalink / raw)
  To: hadi
  Cc: rick.jones2, krkumar2, gaagaan, general, herbert, jagana, jeff,
	johnpol, kaber, kumarkr, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
	tgraf, xma

From: jamal <hadi@cyberus.ca>
Date: Thu, 23 Aug 2007 18:04:10 -0400

> Possibly a bug - but you really should turn off TSO if you are doing
> huge interactive transactions (which is fair because there is a clear
> demarcation).

I don't see how this can matter.

TSO only ever does anything if you accumulate more than one MSS
worth of data.

And when that does happen, all it does is take what's in the send
queue and send as much as possible at once.  The packets are already
built in big chunks, so there is no extra work to do.

The card is going to send the things back to back and as fast as in
the non-TSO case as well.

It doesn't change application scheduling, and it absolutely does not
penalize small sends by the application unless we have a bug
somewhere.

So I see no reason to disable TSO for any reason other than hardware
implementation deficiencies.  And for the drivers I am familiar with,
they do make smart default TSO enabling decisions based upon how well
the chip does TSO.

^ permalink raw reply	[flat|nested] 37+ messages in thread
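A quick way to see which default the driver picked, and to flip it for
a comparison run (a sketch; the interface name is a placeholder, and
different ethtool versions print the offload line slightly
differently):

ethtool -k eth2 | grep -i segmentation   # show whether TSO is currently on
ethtool -K eth2 tso off                  # disable it for a test run
ethtool -K eth2 tso on                   # re-enable it afterwards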
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:30 ` David Miller
@ 2007-08-23 22:38 ` jamal
  2007-08-24  3:34 ` Stephen Hemminger
  0 siblings, 1 reply; 37+ messages in thread
From: jamal @ 2007-08-23 22:38 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, jeff, general, mchan, tgraf, johnpol,
	shemminger, kaber, sri

On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Thu, 23 Aug 2007 18:04:10 -0400
>
> > Possibly a bug - but you really should turn off TSO if you are doing
> > huge interactive transactions (which is fair because there is a clear
> > demarcation).
>
> I don't see how this can matter.
>
> TSO only ever does anything if you accumulate more than one MSS
> worth of data.

I stand corrected then.

cheers,
jamal

^ permalink raw reply	[flat|nested] 37+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:38 ` [ofa-general] " jamal
@ 2007-08-24  3:34 ` Stephen Hemminger
  2007-08-24 12:36 ` jamal
  2007-08-24 16:25 ` Rick Jones
  0 siblings, 2 replies; 37+ messages in thread
From: Stephen Hemminger @ 2007-08-24  3:34 UTC (permalink / raw)
  To: hadi
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
	johnpol, David Miller, sri

On Thu, 23 Aug 2007 18:38:22 -0400
jamal <hadi@cyberus.ca> wrote:

> On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> > From: jamal <hadi@cyberus.ca>
> > Date: Thu, 23 Aug 2007 18:04:10 -0400
> >
> > > Possibly a bug - but you really should turn off TSO if you are doing
> > > huge interactive transactions (which is fair because there is a clear
> > > demarcation).
> >
> > I don't see how this can matter.
> >
> > TSO only ever does anything if you accumulate more than one MSS
> > worth of data.
>
> I stand corrected then.
>
> cheers,
> jamal
>

For most normal Internet TCP connections, you will see only 2 or 3
packets per TSO because of ACK clocking.  If you turn off delayed ACK
on the receiver it will be even less.

A current hot topic of research is reducing the number of ACK's to
make TCP work better over asymmetric links like 3G.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24  3:34 ` Stephen Hemminger
@ 2007-08-24 12:36 ` jamal
  2007-08-24 16:25 ` Rick Jones
  1 sibling, 0 replies; 37+ messages in thread
From: jamal @ 2007-08-24 12:36 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, rick.jones2, krkumar2, gaagaan, general, herbert,
	jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, sri, tgraf, xma

On Thu, 2007-23-08 at 20:34 -0700, Stephen Hemminger wrote:

> A current hot topic of research is reducing the number of ACK's to make TCP
> work better over asymmetric links like 3G.

One other good reason to reduce ACKs to battery powered (3G) terminals
is that it reduces power consumption, i.e. you have longer battery
life.

cheers,
jamal

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24  3:34 ` Stephen Hemminger
  2007-08-24 12:36 ` jamal
@ 2007-08-24 16:25 ` Rick Jones
  1 sibling, 0 replies; 37+ messages in thread
From: Rick Jones @ 2007-08-24 16:25 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hadi, David Miller, krkumar2, gaagaan, general, herbert, jagana,
	jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, sri, tgraf, xma

>
> A current hot topic of research is reducing the number of ACK's to make TCP
> work better over asymmetric links like 3G.

Oy.  People running Solaris and HP-UX have been "researching" ACK
reductions since 1997 if not earlier.

rick jones

^ permalink raw reply	[flat|nested] 37+ messages in thread
end of thread, other threads: [~2007-09-14 17:25 UTC | newest]

Thread overview: 37+ messages

2007-08-17  6:06 [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar2
2007-08-21  7:18 ` David Miller
2007-08-21 12:30 ` [ofa-general] " jamal
2007-08-21 18:51 ` David Miller
2007-08-21 21:09 ` jamal
2007-08-21 22:50 ` David Miller
2007-08-22  4:11 ` [ofa-general] " Krishna Kumar2
2007-08-22  4:22 ` David Miller
2007-08-22  7:03 ` Krishna Kumar2
2007-08-22  9:14 ` David Miller
2007-08-23  2:43 ` Krishna Kumar2
2007-08-22 17:09 ` [ofa-general] " Rick Jones
2007-08-22 20:21 ` David Miller
2007-08-23 22:04 ` [ofa-general] " jamal
2007-08-23 22:25 ` jamal
2007-08-23 22:35 ` [ofa-general] " Rick Jones
2007-08-23 22:41 ` jamal
2007-08-24  3:18 ` Bill Fink
2007-08-24 12:14 ` jamal
2007-08-24 18:08 ` Bill Fink
2007-08-24 21:25 ` David Miller
2007-08-24 23:11 ` Herbert Xu
2007-08-25 23:45 ` Bill Fink
2007-08-24 18:46 ` [ofa-general] " Rick Jones
2007-08-25  0:42 ` John Heffner
2007-08-26  8:41 ` [ofa-general] " Bill Fink
2007-08-27  1:32 ` John Heffner
2007-08-27  2:04 ` David Miller
2007-08-27 23:23 ` jamal
2007-09-14  7:20 ` [ofa-general] " Bill Fink
2007-09-14 13:44 ` TSO, TCP Cong control etc jamal
2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
2007-08-23 22:30 ` David Miller
2007-08-23 22:38 ` [ofa-general] " jamal
2007-08-24  3:34 ` Stephen Hemminger
2007-08-24 12:36 ` jamal
2007-08-24 16:25 ` Rick Jones