* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-17 6:06 Krishna Kumar2
2007-08-21 7:18 ` David Miller
0 siblings, 1 reply; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-17 6:06 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
Hi Dave,
> I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
> run a longer one tonight). The results are (results in KB/s, and %):
I ran an 8.5-hour run with no batching plus another 8.5-hour run with
batching (Buffer sizes: "32 128 512 4096 16384", Threads: "1 8 32",
Each test run time: 3 minutes, Iterations to average: 5). TCP seems
to get a small improvement.
Thanks,
- KK
-------------------------------------------------------------------------------
TCP  (columns: no-batching KB/s, batching KB/s, % change)
-----------
Size:32 Procs:1 3415 3321 -2.75
Size:128 Procs:1 13094 13388 2.24
Size:512 Procs:1 49037 50683 3.35
Size:4096 Procs:1 114646 114619 -.02
Size:16384 Procs:1 114626 114644 .01
Size:32 Procs:8 22675 22633 -.18
Size:128 Procs:8 77994 77297 -.89
Size:512 Procs:8 114716 114711 0
Size:4096 Procs:8 114637 114636 0
Size:16384 Procs:8 95814 114638 19.64
Size:32 Procs:32 23240 23349 .46
Size:128 Procs:32 82284 82247 -.04
Size:512 Procs:32 114885 114769 -.10
Size:4096 Procs:32 95735 114634 19.74
Size:16384 Procs:32 114736 114641 -.08
Average: 1151534 1190210 3.36%
-------------------------------------------------------------------------------
No Delay  (columns: no-batching KB/s, batching KB/s, % change)
---------
Size:32 Procs:1 3002 2873 -4.29
Size:128 Procs:1 11853 11801 -.43
Size:512 Procs:1 45565 45837 .59
Size:4096 Procs:1 114511 114485 -.02
Size:16384 Procs:1 114521 114555 .02
Size:32 Procs:8 8026 8029 .03
Size:128 Procs:8 31589 31573 -.05
Size:512 Procs:8 111506 105766 -5.14
Size:4096 Procs:8 114455 114454 0
Size:16384 Procs:8 95833 114491 19.46
Size:32 Procs:32 8005 8027 .27
Size:128 Procs:32 31475 31505 .09
Size:512 Procs:32 114558 113687 -.76
Size:4096 Procs:32 114784 114447 -.29
Size:16384 Procs:32 114719 114496 -.19
Average: 1046026 1034402 -1.11%
-------------------------------------------------------------------------------
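For reference, the change column relates the two KB/s columns roughly as
sketched below (an inference from the reported values; the first KB/s column
is taken to be the no-batching run):

    def pct_change(no_batch_kbs, batch_kbs):
        # Positive means the batching run was faster than the no-batching run.
        return 100.0 * (batch_kbs - no_batch_kbs) / no_batch_kbs

    # First TCP row (Size:32 Procs:1):
    print(round(pct_change(3415, 3321), 2))        # -> -2.75

    # The TCP "Average:" row sums each KB/s column first and then applies
    # the same formula to the totals:
    print(round(pct_change(1151534, 1190210), 2))  # -> 3.36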
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-17 6:06 [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar2
@ 2007-08-21 7:18 ` David Miller
2007-08-21 12:30 ` [ofa-general] " jamal
0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-21 7:18 UTC (permalink / raw)
To: krkumar2
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 17 Aug 2007 11:36:03 +0530
> > I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
> > run a longer one tonight). The results are (results in KB/s, and %):
>
> I ran an 8.5-hour run with no batching plus another 8.5-hour run with
> batching (Buffer sizes: "32 128 512 4096 16384", Threads: "1 8 32",
> Each test run time: 3 minutes, Iterations to average: 5). TCP seems
> to get a small improvement.
Using 16K buffer size really isn't going to keep the pipe full enough
for TSO. And realistically applications queue much more data at a
time. Also, smaller buffer sizes can have negative effects on
the dynamic receive and send buffer growth algorithm the kernel uses;
it might consider the connection application-limited for too long.
I would really prefer to see numbers that use buffer sizes more in
line with the amount of data that is typically inflight on a 1G
connection on a local network.
Do a tcpdump during the height of the transfer to see roughly what this
value is. When an ACK comes in, compare the sequence number it's
ACKing with the sequence number of the most recently sent frame.
The difference is approximately the pipe size at the maximum congestion
window, assuming a loss-free local network.
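A rough sketch of that bookkeeping, assuming the capture is saved as text
from something like "tcpdump -n tcp port <port>" on the sender (with
tcpdump's default relative sequence numbers):

    import re
    import sys

    SEQ_RE = re.compile(r'seq (\d+):(\d+)')   # data segments: "seq a:b"
    ACK_RE = re.compile(r'\back (\d+)')       # cumulative ACKs coming back

    highest_sent = 0    # highest relative sequence number put on the wire
    max_inflight = 0    # largest (sent - acked) gap seen, ~ the pipe size

    for line in open(sys.argv[1]):
        m = SEQ_RE.search(line)
        if m:
            highest_sent = max(highest_sent, int(m.group(2)))
            continue                          # ignore the ack field on data packets
        m = ACK_RE.search(line)
        if m:
            max_inflight = max(max_inflight, highest_sent - int(m.group(1)))

    print("approximate pipe size: %d bytes" % max_inflight)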
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-21 7:18 ` David Miller
@ 2007-08-21 12:30 ` jamal
2007-08-21 18:51 ` David Miller
0 siblings, 1 reply; 48+ messages in thread
From: jamal @ 2007-08-21 12:30 UTC (permalink / raw)
To: David Miller
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan,
tgraf, johnpol, shemminger, kaber, herbert
On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> Using 16K buffer size really isn't going to keep the pipe full enough
> for TSO.
Why the comparison with TSO (or GSO for that matter)?
Seems to me that is only valid/fair if you have a single flow.
Batching is multi-flow focused (or i should say flow-unaware).
cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-21 12:30 ` [ofa-general] " jamal
@ 2007-08-21 18:51 ` David Miller
2007-08-21 21:09 ` jamal
2007-08-22 4:11 ` [ofa-general] " Krishna Kumar2
0 siblings, 2 replies; 48+ messages in thread
From: David Miller @ 2007-08-21 18:51 UTC (permalink / raw)
To: hadi
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan,
tgraf, johnpol, shemminger, kaber, herbert
From: jamal <hadi@cyberus.ca>
Date: Tue, 21 Aug 2007 08:30:22 -0400
> On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
>
> > Using 16K buffer size really isn't going to keep the pipe full enough
> > for TSO.
>
> Why the comparison with TSO (or GSO for that matter)?
Because TSO does batching already, so it's a very good
"tit for tat" comparison of the new batching scheme
vs. an existing one.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-21 18:51 ` David Miller
@ 2007-08-21 21:09 ` jamal
2007-08-21 22:50 ` David Miller
2007-08-22 4:11 ` [ofa-general] " Krishna Kumar2
1 sibling, 1 reply; 48+ messages in thread
From: jamal @ 2007-08-21 21:09 UTC (permalink / raw)
To: David Miller
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan,
tgraf, johnpol, shemminger, kaber, herbert
On Tue, 2007-21-08 at 11:51 -0700, David Miller wrote:
> Because TSO does batching already, so it's a very good
> "tit for tat" comparison of the new batching scheme
> vs. an existing one.
Fair enough - I may have read too much into your email then ;->
For bulk type of apps (where TSO will make a difference) this is a fair
test. Hence I agree the 16KB buffer size is not sensible if the goal is
to simulate such an app.
However (and this is where I read too much into what you were saying),
the test by itself is an insufficient comparison. You gotta look at
the other side of the coin, i.e. at apps where TSO won't buy much.
Examples: a busy ssh or irc server, and you could go as far as looking at
the most predominant app on the wild west, http (average page size from
a few years back was in the range of 10-20K and can be simulated with
good ole netperf/iperf).
cheers,
jamal
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-21 21:09 ` jamal
@ 2007-08-21 22:50 ` David Miller
0 siblings, 0 replies; 48+ messages in thread
From: David Miller @ 2007-08-21 22:50 UTC (permalink / raw)
To: hadi
Cc: krkumar2, gaagaan, general, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
From: jamal <hadi@cyberus.ca>
Date: Tue, 21 Aug 2007 17:09:12 -0400
> Examples: a busy ssh or irc server, and you could go as far as
> looking at the most predominant app on the wild west, http (average
> page size from a few years back was in the range of 10-20K and can
> be simulated with good ole netperf/iperf).
Pages have chunked up considerably in recent years.
Just bringing up a myspace page can give you megabytes of video,
images, flash, and other stuff.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-21 18:51 ` David Miller
2007-08-21 21:09 ` jamal
@ 2007-08-22 4:11 ` Krishna Kumar2
2007-08-22 4:22 ` David Miller
1 sibling, 1 reply; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-22 4:11 UTC (permalink / raw)
To: David Miller
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:
> From: jamal <hadi@cyberus.ca>
> Date: Tue, 21 Aug 2007 08:30:22 -0400
>
> > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> >
> > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > for TSO.
> >
> > Why the comparison with TSO (or GSO for that matter)?
>
> Because TSO does batching already, so it's a very good
> "tit for tat" comparison of the new batching scheme
> vs. an existing one.
I am planning to do more testing on your suggestion over the
weekend, but I had a comment. Are you saying that TSO and
batching should be mutually exclusive, so that only hardware that
doesn't support TSO (like IB) would benefit?
But even if they can co-exist, aren't cases like sending
multiple small skbs better handled with batching?
Thanks,
- KK
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-22 4:11 ` [ofa-general] " Krishna Kumar2
@ 2007-08-22 4:22 ` David Miller
2007-08-22 7:03 ` Krishna Kumar2
2007-08-22 17:09 ` [ofa-general] " Rick Jones
0 siblings, 2 replies; 48+ messages in thread
From: David Miller @ 2007-08-22 4:22 UTC (permalink / raw)
To: krkumar2
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 22 Aug 2007 09:41:52 +0530
> David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:
>
> > From: jamal <hadi@cyberus.ca>
> > Date: Tue, 21 Aug 2007 08:30:22 -0400
> >
> > > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> > >
> > > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > > for TSO.
> > >
> > > Why the comparison with TSO (or GSO for that matter)?
> >
> > Because TSO does batching already, so it's a very good
> > "tit for tat" comparison of the new batching scheme
> > vs. an existing one.
>
> I am planning to do more testing on your suggestion over the
> weekend, but I had a comment. Are you saying that TSO and
> batching should be mutually exclusive, so that only hardware that
> doesn't support TSO (like IB) would benefit?
>
> But even if they can co-exist, aren't cases like sending
> multiple small skbs better handled with batching?
I'm not making any suggestions, so don't read that into anything I've
said :-)
I think the jury is still out, but seeing TSO perform even slightly
worse with the batching changes in place would be very worrisome.
This applies to both throughput and cpu utilization.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-22 4:22 ` David Miller
@ 2007-08-22 7:03 ` Krishna Kumar2
2007-08-22 9:14 ` David Miller
2007-08-22 17:09 ` [ofa-general] " Rick Jones
1 sibling, 1 reply; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-22 7:03 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
Hi Dave,
David Miller <davem@davemloft.net> wrote on 08/22/2007 09:52:29 AM:
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 22 Aug 2007 09:41:52 +0530
>
<snip>
> > > Because TSO does batching already, so it's a very good
> > > "tit for tat" comparison of the new batching scheme
> > > vs. an existing one.
> >
> > I am planning to do more testing on your suggestion over the
> > weekend, but I had a comment. Are you saying that TSO and
> > batching should be mutually exclusive, so that only hardware that
> > doesn't support TSO (like IB) would benefit?
> >
> > But even if they can co-exist, aren't cases like sending
> > multiple small skbs better handled with batching?
>
> I'm not making any suggestions, so don't read that into anything I've
> said :-)
>
> I think the jury is still out, but seeing TSO perform even slightly
> worse with the batching changes in place would be very worrisome.
> This applies to both throughput and cpu utilization.
Does turning off batching solve that problem? What I mean by that is:
batching can be disabled if a TSO device is worse for some cases. In fact,
something that I changed in my latest code is to not enable batching
in register_netdevice (in Rev4, which I am sending in a few mins); rather,
the user has to explicitly turn batching 'on'.
I'm wondering if that is what you are concerned about. In any case, I will
test your case on Monday (I am on vacation for the next couple of days).
Thanks,
- KK
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-22 7:03 ` Krishna Kumar2
@ 2007-08-22 9:14 ` David Miller
2007-08-23 2:43 ` Krishna Kumar2
0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-22 9:14 UTC (permalink / raw)
To: krkumar2
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 22 Aug 2007 12:33:04 +0530
> Does turning off batching solve that problem? What I mean by that is:
> batching can be disabled if a TSO device is worse for some cases.
This new batching stuff isn't going to be enabled or disabled
on a per-device basis just to get "parity" with how things are
now.
It should be enabled by default, and give at least as good
performance as what can be obtained right now.
Otherwise it's a clear regression.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-22 9:14 ` David Miller
@ 2007-08-23 2:43 ` Krishna Kumar2
0 siblings, 0 replies; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-23 2:43 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
David Miller <davem@davemloft.net> wrote on 08/22/2007 02:44:40 PM:
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 22 Aug 2007 12:33:04 +0530
>
> > Does turning off batching solve that problem? What I mean by that is:
> > batching can be disabled if a TSO device is worse for some cases.
>
> This new batching stuff isn't going to be enabled or disabled
> on a per-device basis just to get "parity" with how things are
> now.
>
> It should be enabled by default, and give at least as good
> performance as what can be obtained right now.
That was how it was in earlier revisions. In revision4 I coded it so
that it is enabled only if explicitly set by the user. I can revert
that change.
> Otherwise it's a clear regression.
Definitely. For drivers that support it, it should not reduce performance.
Thanks,
- KK
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-22 4:22 ` David Miller
2007-08-22 7:03 ` Krishna Kumar2
@ 2007-08-22 17:09 ` Rick Jones
2007-08-22 20:21 ` David Miller
1 sibling, 1 reply; 48+ messages in thread
From: Rick Jones @ 2007-08-22 17:09 UTC (permalink / raw)
To: David Miller
Cc: jagana, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general, mchan,
tgraf, netdev, johnpol, shemminger, kaber, sri
David Miller wrote:
> I think the jury is still out, but seeing TSO perform even slightly
> worse with the batching changes in place would be very worrisome.
> This applies to both throughput and cpu utilization.
Should it be any more or less worrisome than small packet performance (eg the
TCP_RR stuff I posted recently) being rather worse with TSO enabled than with it
disabled?
rick jones
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-22 17:09 ` [ofa-general] " Rick Jones
@ 2007-08-22 20:21 ` David Miller
2007-08-23 22:04 ` [ofa-general] " jamal
0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-22 20:21 UTC (permalink / raw)
To: rick.jones2
Cc: krkumar2, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma
From: Rick Jones <rick.jones2@hp.com>
Date: Wed, 22 Aug 2007 10:09:37 -0700
> Should it be any more or less worrisome than small packet
> performance (eg the TCP_RR stuff I posted recently) being rather
> worse with TSO enabled than with it disabled?
That, like any such thing shown by the batching changes, is a bug
to fix.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-22 20:21 ` David Miller
@ 2007-08-23 22:04 ` jamal
2007-08-23 22:25 ` jamal
2007-08-23 22:30 ` David Miller
0 siblings, 2 replies; 48+ messages in thread
From: jamal @ 2007-08-23 22:04 UTC (permalink / raw)
To: David Miller
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, netdev,
johnpol, shemminger, kaber, sri
On Wed, 2007-22-08 at 13:21 -0700, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Wed, 22 Aug 2007 10:09:37 -0700
>
> > Should it be any more or less worrisome than small packet
> > performance (eg the TCP_RR stuff I posted recently) being rather
> > worse with TSO enabled than with it disabled?
>
> That, like any such thing shown by the batching changes, is a bug
> to fix.
Possibly a bug - but you really should turn off TSO if you are doing
huge interactive transactions (which is fair because there is a clear
demarcation).
The litmus test is the same as for any change that is supposed to improve
net performance - it has to demonstrate that it is not intrusive and that it
improves performance (consistently). The standard metrics are
{throughput, cpu-utilization, latency}, i.e. as long as one improves and
the others don't regress, it would make sense. Yes, I am religious about
batching after all the invested sweat (and I continue to work on it
hoping to demystify it) - the theory makes a lot of sense.
cheers,
jamal
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-23 22:04 ` [ofa-general] " jamal
@ 2007-08-23 22:25 ` jamal
2007-08-23 22:35 ` [ofa-general] " Rick Jones
2007-08-23 22:30 ` David Miller
1 sibling, 1 reply; 48+ messages in thread
From: jamal @ 2007-08-23 22:25 UTC (permalink / raw)
To: David Miller
Cc: rick.jones2, krkumar2, gaagaan, general, herbert, jagana, jeff,
johnpol, kaber, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma
On Thu, 2007-23-08 at 18:04 -0400, jamal wrote:
> The litmus test is the same as for any change that is supposed to improve
> net performance - it has to demonstrate that it is not intrusive and that it
> improves performance (consistently). The standard metrics are
> {throughput, cpu-utilization, latency}, i.e. as long as one improves and
> the others don't regress, it would make sense. Yes, I am religious about
> batching after all the invested sweat (and I continue to work on it
> hoping to demystify it) - the theory makes a lot of sense.
Before someone jumps and strangles me ;-> By "litmus test" I meant as
applied to batching. [TSO already passed - IIRC, it has been
demonstrated to really not add much to throughput (it can't improve much
when throughput is already close to wire speed) but to improve CPU utilization].
cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-23 22:25 ` jamal
@ 2007-08-23 22:35 ` Rick Jones
2007-08-23 22:41 ` jamal
2007-08-24 3:18 ` Bill Fink
0 siblings, 2 replies; 48+ messages in thread
From: Rick Jones @ 2007-08-23 22:35 UTC (permalink / raw)
To: hadi
Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
tgraf, johnpol, shemminger, David Miller, sri
jamal wrote:
> [TSO already passed - IIRC, it has been
> demonstrated to really not add much to throughput (it can't improve much
> when throughput is already close to wire speed) but to improve CPU utilization].
In the one gig space sure, but in the 10 Gig space, TSO on/off does make a
difference for throughput.
rick jones
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-23 22:35 ` [ofa-general] " Rick Jones
@ 2007-08-23 22:41 ` jamal
2007-08-24 3:18 ` Bill Fink
1 sibling, 0 replies; 48+ messages in thread
From: jamal @ 2007-08-23 22:41 UTC (permalink / raw)
To: Rick Jones
Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
tgraf, johnpol, shemminger, David Miller, sri
On Thu, 2007-23-08 at 15:35 -0700, Rick Jones wrote:
> jamal wrote:
> > [TSO already passed - IIRC, it has been
> > demonstrated to really not add much to throughput (it can't improve much
> > when throughput is already close to wire speed) but to improve CPU utilization].
>
> In the one gig space sure, but in the 10 Gig space, TSO on/off does make a
> difference for throughput.
I am still so 1Gige;-> I stand corrected again ;->
cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-23 22:35 ` [ofa-general] " Rick Jones
2007-08-23 22:41 ` jamal
@ 2007-08-24 3:18 ` Bill Fink
2007-08-24 12:14 ` jamal
` (2 more replies)
1 sibling, 3 replies; 48+ messages in thread
From: Bill Fink @ 2007-08-24 3:18 UTC (permalink / raw)
To: Rick Jones
Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, David Miller, sri
On Thu, 23 Aug 2007, Rick Jones wrote:
> jamal wrote:
> > [TSO already passed - IIRC, it has been
> > demonstrated to really not add much to throughput (it can't improve much
> > when throughput is already close to wire speed) but to improve CPU utilization].
>
> In the one gig space sure, but in the 10 Gig space, TSO on/off does make a
> difference for throughput.
Not too much.
TSO enabled:
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX
TSO disabled:
[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX
Pretty negligible difference it seems.
This is with a 2.6.20.7 kernel, Myricom 10-GigE NICs, and 9000 byte
jumbo frames, in a LAN environment.
For grins, I also did a couple of tests with an MSS of 1460 to
emulate a standard 1500 byte Ethernet MTU.
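(The 1460 here is simply the standard 1500-byte Ethernet MTU minus the basic
IPv4 and TCP headers, a quick check:

    mtu = 1500
    mss = mtu - 20 - 20     # minus 20-byte IPv4 header and 20-byte TCP header
    print(mss)              # -> 1460
)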
TSO enabled:
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5102.8503 MB / 10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX
TSO disabled:
[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5399.5625 MB / 10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX
Here you can see there is a major difference in the TX CPU utilization
(99 % with TSO disabled versus only 39 % with TSO enabled), although
the TSO disabled case was able to squeeze out a little extra performance
from its extra CPU utilization. Interestingly, with TSO enabled, the
receiver actually consumed more CPU than with TSO disabled, so I guess
the receiver CPU saturation in that case (99 %) was what restricted
its performance somewhat (this was consistent across a few test runs).
-Bill
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 3:18 ` Bill Fink
@ 2007-08-24 12:14 ` jamal
2007-08-24 18:08 ` Bill Fink
2007-08-24 21:25 ` David Miller
2007-08-24 18:46 ` [ofa-general] " Rick Jones
2007-08-25 0:42 ` John Heffner
2 siblings, 2 replies; 48+ messages in thread
From: jamal @ 2007-08-24 12:14 UTC (permalink / raw)
To: Bill Fink
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri
On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
[..]
> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra performance
> from its extra CPU utilization.
Good stuff. What kind of machine? SMP?
Seems the receive side of the sender is also consuming a lot more CPU;
I suspect that's because the receiver is generating a lot more ACKs with TSO.
Does the choice of the TCP congestion control algorithm affect results?
It would be interesting to see both MTUs with TCP BIC vs. good old
Reno on the sender (probably without changing what the receiver does). BIC
seems to be the default lately.
> Interestingly, with TSO enabled, the
> receiver actually consumed more CPU than with TSO disabled,
I suspect the fact that a lot more packets make it into the
receiver with TSO contributes.
> so I guess
> the receiver CPU saturation in that case (99 %) was what restricted
> its performance somewhat (this was consistent across a few test runs).
Unfortunately the receiver plays a big role in such tests - if it is
bottlenecked then you are not really testing the limits of the
transmitter.
cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 12:14 ` jamal
@ 2007-08-24 18:08 ` Bill Fink
2007-08-24 21:25 ` David Miller
1 sibling, 0 replies; 48+ messages in thread
From: Bill Fink @ 2007-08-24 18:08 UTC (permalink / raw)
To: hadi
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri
On Fri, 24 Aug 2007, jamal wrote:
> On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
>
> [..]
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra performance
> > from its extra CPU utilization.
>
> Good stuff. What kind of machine? SMP?
Tyan Thunder K8WE S2895ANRF motherboard with Nvidia nForce
Professional 2200+2050 chipset, 2 AMD Opteron 254 2.8 GHz CPUs,
4 GB PC3200 ECC REG-DDR 400 memory, and 2 PCI-Express x16 slots
(2 buses).
It is SMP but both the NIC interrupts and nuttcp are bound to
CPU 0. And all other non-kernel system processes are bound to
CPU 1.
> Seems the receive side of the sender is also consuming a lot more CPU;
> I suspect that's because the receiver is generating a lot more ACKs with TSO.
Odd. I just reran the TCP CUBIC "-M1460" tests, and with TSO enabled
on the transmitter, there were about 153709 eth2 interrupts on the
receiver, while with TSO disabled there was actually a somewhat higher
number (164988) of receiver side eth2 interrupts, although the receive
side CPU utilization was actually lower in that case.
On the transmit side (different test run), the TSO enabled case had
about 161773 eth2 interrupts whereas the TSO disabled case had about
165179 eth2 interrupts.
> Does the choice of the TCP congestion control algorithm affect results?
> It would be interesting to see both MTUs with TCP BIC vs. good old
> Reno on the sender (probably without changing what the receiver does). BIC
> seems to be the default lately.
These tests were with the default TCP CUBIC (with initial_ssthresh
set to 0).
With TCP BIC (and initial_ssthresh set to 0):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11751.3750 MB / 10.00 sec = 9853.9839 Mbps 100 %TX 83 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4999.3321 MB / 10.06 sec = 4167.7872 Mbps 38 %TX 100 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB / 10.00 sec = 9910.0682 Mbps 99 %TX 81 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5502.6250 MB / 10.00 sec = 4614.3297 Mbps 100 %TX 84 %RX
And with TCP Reno:
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11782.6250 MB / 10.00 sec = 9880.2613 Mbps 100 %TX 77 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5024.6649 MB / 10.06 sec = 4191.6574 Mbps 38 %TX 99 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0860 Mbps 99 %TX 77 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5284.0000 MB / 10.00 sec = 4430.9604 Mbps 99 %TX 79 %RX
Very similar results to the original TCP CUBIC tests.
> > Interestingly, with TSO enabled, the
> > receiver actually consumed more CPU than with TSO disabled,
>
> I suspect the fact that a lot more packets make it into the
> receiver with TSO contributes.
>
> > so I guess
> > the receiver CPU saturation in that case (99 %) was what restricted
> > its performance somewhat (this was consistent across a few test runs).
>
> Unfortunately the receiver plays a big role in such tests - if it is
> bottlenecked then you are not really testing the limits of the
> transmitter.
It might be interesting to see what effect the LRO changes would have
on this. Once they are in a stable released kernel, I might try that
out, or maybe even before if I get some spare time (but that's in very
short supply right now).
-Thanks
-Bill
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 12:14 ` jamal
2007-08-24 18:08 ` Bill Fink
@ 2007-08-24 21:25 ` David Miller
2007-08-24 23:11 ` Herbert Xu
1 sibling, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-24 21:25 UTC (permalink / raw)
To: hadi
Cc: jagana, billfink, peter.p.waskiewicz.jr, herbert, gaagaan,
Robert.Olsson, netdev, rdreier, mcarlson, jeff, general, mchan,
tgraf, johnpol, shemminger, kaber, sri
From: jamal <hadi@cyberus.ca>
Date: Fri, 24 Aug 2007 08:14:16 -0400
> Seems the receive side of the sender is also consuming a lot more CPU;
> I suspect that's because the receiver is generating a lot more ACKs with TSO.
I've seen this behavior before on a low cpu powered receiver and the
issue is that batching too much actually hurts a receiver.
If the data packets were better spaced out, the receiver would handle
the load better.
This is the thing the TOE guys keep talking about overcoming with
their packet pacing algorithms in their on-card TOE stack.
My hunch is that even if in the non-TSO case the TX packets were all
back to back in the card's TX ring, TSO still spits them out faster on
the wire.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 21:25 ` David Miller
@ 2007-08-24 23:11 ` Herbert Xu
2007-08-25 23:45 ` Bill Fink
0 siblings, 1 reply; 48+ messages in thread
From: Herbert Xu @ 2007-08-24 23:11 UTC (permalink / raw)
To: David Miller
Cc: hadi, billfink, rick.jones2, krkumar2, gaagaan, general, jagana,
jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:
>
> My hunch is that even if in the non-TSO case the TX packets were all
> back to back in the card's TX ring, TSO still spits them out faster on
> the wire.
If this is the case then we should see an improvement by
disabling TSO and enabling GSO.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 23:11 ` Herbert Xu
@ 2007-08-25 23:45 ` Bill Fink
0 siblings, 0 replies; 48+ messages in thread
From: Bill Fink @ 2007-08-25 23:45 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, hadi, rick.jones2, krkumar2, gaagaan, general,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
On Sat, 25 Aug 2007, Herbert Xu wrote:
> On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:
> >
> > My hunch is that even if in the non-TSO case the TX packets were all
> > back to back in the card's TX ring, TSO still spits them out faster on
> > the wire.
>
> If this is the case then we should see an improvement by
> disabling TSO and enabling GSO.
TSO disabled and GSO enabled:
[root@lang2 redhat]# nuttcp -w10m 192.168.88.16
11806.7500 MB / 10.00 sec = 9900.6278 Mbps 100 %TX 84 %RX
[root@lang2 redhat]# nuttcp -M1460 -w10m 192.168.88.16
4872.0625 MB / 10.00 sec = 4085.5690 Mbps 100 %TX 64 %RX
In the "-M1460" case, there was generally less receiver CPU utilization,
but the transmitter utilization was generally pegged at 100 %, even
though there wasn't any improvement in throughput compared to the
TSO enabled case (in fact the throughput generally seemed to be somewhat
less than the TSO enabled case). Note there was a fair degree of
variability across runs for the receiver CPU utilization (the one
shown I considered to be representative of the average behavior).
Repeat of previous test results:
TSO enabled and GSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5102.8503 MB / 10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX
TSO disabled and GSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5399.5625 MB / 10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX
-Bill
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 3:18 ` Bill Fink
2007-08-24 12:14 ` jamal
@ 2007-08-24 18:46 ` Rick Jones
2007-08-25 0:42 ` John Heffner
2 siblings, 0 replies; 48+ messages in thread
From: Rick Jones @ 2007-08-24 18:46 UTC (permalink / raw)
To: Bill Fink
Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, David Miller, sri
Bill Fink wrote:
> On Thu, 23 Aug 2007, Rick Jones wrote:
>
>
>>jamal wrote:
>>
>>>[TSO already passed - IIRC, it has been
>>>demonstrated to really not add much to throughput (it can't improve much
>>>when throughput is already close to wire speed) but to improve CPU utilization].
>>
>>In the one gig space sure, but in the 10 Gig space, TSO on/off does make a
>>difference for throughput.
>
>
> Not too much.
>
> TSO enabled:
>
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
>
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX
>
> TSO disabled:
>
> [root@lang2 ~]# ethtool -K eth2 tso off
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: off
>
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX
>
> Pretty negligible difference it seems.
Leaves one wondering how often more than one segment was sent to the card in the
9000 byte case :)
rick jones
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 3:18 ` Bill Fink
2007-08-24 12:14 ` jamal
2007-08-24 18:46 ` [ofa-general] " Rick Jones
@ 2007-08-25 0:42 ` John Heffner
2007-08-26 8:41 ` [ofa-general] " Bill Fink
2 siblings, 1 reply; 48+ messages in thread
From: John Heffner @ 2007-08-25 0:42 UTC (permalink / raw)
To: Bill Fink
Cc: Rick Jones, hadi, David Miller, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
Bill Fink wrote:
> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra performance
> from its extra CPU utilization. Interestingly, with TSO enabled, the
> receiver actually consumed more CPU than with TSO disabled, so I guess
> the receiver CPU saturation in that case (99 %) was what restricted
> its performance somewhat (this was consistent across a few test runs).
One possibility is that I think the receive-side processing tends to do
better when receiving into an empty queue. When the (non-TSO) sender is
the flow's bottleneck, this is going to be the case. But when you
switch to TSO, the receiver becomes the bottleneck and you're always
going to have to put the packets at the back of the receive queue. This
might help account for the reason why you have both lower throughput and
higher CPU utilization -- there's a point of instability right where the
receiver becomes the bottleneck and you end up pushing it over to the
bad side. :)
Just a theory. I'm honestly surprised this effect would be so
significant. What do the numbers from netstat -s look like in the two
cases?
-John
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-25 0:42 ` John Heffner
@ 2007-08-26 8:41 ` Bill Fink
2007-08-27 1:32 ` John Heffner
0 siblings, 1 reply; 48+ messages in thread
From: Bill Fink @ 2007-08-26 8:41 UTC (permalink / raw)
To: John Heffner
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
mcarlson, rdreier, hadi, kaber, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, David Miller, sri
On Fri, 24 Aug 2007, John Heffner wrote:
> Bill Fink wrote:
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra performance
> > from its extra CPU utilization. Interestingly, with TSO enabled, the
> > receiver actually consumed more CPU than with TSO disabled, so I guess
> > the receiver CPU saturation in that case (99 %) was what restricted
> > its performance somewhat (this was consistent across a few test runs).
>
> One possibility is that I think the receive-side processing tends to do
> better when receiving into an empty queue. When the (non-TSO) sender is
> the flow's bottleneck, this is going to be the case. But when you
> switch to TSO, the receiver becomes the bottleneck and you're always
> going to have to put the packets at the back of the receive queue. This
> might help account for the reason why you have both lower throughput and
> higher CPU utilization -- there's a point of instability right where the
> receiver becomes the bottleneck and you end up pushing it over to the
> bad side. :)
>
> Just a theory. I'm honestly surprised this effect would be so
> significant. What do the numbers from netstat -s look like in the two
> cases?
Well, I was going to check this out, but I happened to reboot the
system and now I get somewhat different results.
Here are the new results, which should hopefully be more accurate
since they are on a freshly booted system.
TSO enabled and GSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11610.6875 MB / 10.00 sec = 9735.9526 Mbps 100 %TX 75 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5029.6875 MB / 10.06 sec = 4194.6931 Mbps 36 %TX 100 %RX
TSO disabled and GSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.9375 MB / 10.00 sec = 9909.7773 Mbps 99 %TX 77 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5823.3125 MB / 10.00 sec = 4883.2429 Mbps 100 %TX 82 %RX
The TSO disabled case got a little better performance even for
9000 byte jumbo frames. For the "-M1460" case emulating a
standard 1500 byte Ethernet MTU, the performance was significantly
better and used less CPU on the receiver (82 % versus 100 %)
although it did use significantly more CPU on the transmitter
(100 % versus 36 %).
TSO disabled and GSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11609.5625 MB / 10.00 sec = 9734.9859 Mbps 99 %TX 75 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5001.4375 MB / 10.06 sec = 4170.6739 Mbps 52 %TX 100 %RX
The GSO enabled case is very similar to the TSO enabled case,
except that for the "-M1460" test the transmitter used more
CPU (52 % versus 36 %), which is to be expected since TSO has
hardware assist.
Here's the beforeafter delta of the receiver's "netstat -s"
statistics for the TSO enabled case:
Ip:
3659898 total packets received
3659898 incoming packets delivered
80050 requests sent out
Tcp:
2 passive connection openings
3659897 segments received
80050 segments send out
TcpExt:
33 packets directly queued to recvmsg prequeue.
104956 packets directly received from backlog
705528 packets directly received from prequeue
3654842 packets header predicted
193 packets header predicted and directly queued to user
4 acknowledgments not containing data received
6 predicted acknowledgments
And here it is for the TSO disabled case (GSO also disabled):
Ip:
4107083 total packets received
4107083 incoming packets delivered
1401376 requests sent out
Tcp:
2 passive connection openings
4107083 segments received
1401376 segments send out
TcpExt:
2 TCP sockets finished time wait in fast timer
48486 packets directly queued to recvmsg prequeue.
1056111048 packets directly received from backlog
2273357712 packets directly received from prequeue
1819317 packets header predicted
2287497 packets header predicted and directly queued to user
4 acknowledgments not containing data received
10 predicted acknowledgments
For the TSO disabled case, there are far more TCP segments
sent out (1401376 versus 80050), which I assume are ACKs, and which
could possibly contribute to the higher throughput for the TSO disabled
case due to faster feedback, but not explain the lower CPU utilization.
There are many more packets directly queued to recvmsg prequeue
(48486 versus 33). The numbers for packets directly received from
backlog and prequeue in the TSO disabled case seem bogus to me, so
I don't know how to interpret that. There are only about half as
many packets header predicted (1819317 versus 3654842), but there
are many more packets header predicted and directly queued to user
(2287497 versus 193). I'll leave the analysis of all this to those
who might actually know what it all means.
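The deltas above are simply the difference between a "netstat -s" snapshot
taken before the run and one taken after it; a rough stand-in for that step
(an assumption about the snapshot format, not the exact tool used here):

    import re
    import sys

    def parse_netstat_s(path):
        """Map 'section + counter description' -> count for a saved snapshot."""
        counters, section = {}, ""
        for line in open(path):
            if not line.startswith((" ", "\t")) and line.rstrip().endswith(":"):
                section = line.strip()        # e.g. "Ip:", "Tcp:", "TcpExt:"
                continue
            m = re.match(r'\s*(\d+) (.+)', line)
            if m:
                counters[section + " " + m.group(2).strip()] = int(m.group(1))
        return counters

    before = parse_netstat_s(sys.argv[1])
    after = parse_netstat_s(sys.argv[2])
    for name, value in sorted(after.items()):
        delta = value - before.get(name, 0)
        if delta:
            print("%12d  %s" % (delta, name))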
I also ran another set of tests that may be of interest. I changed
the rx-usecs/tx-usecs interrupt coalescing parameter from the
recommended optimum value of 75 usecs to 0 (no coalescing), but
only on the transmitter. The comparison discussions below are
relative to the previous tests where rx-usecs/tx-usecs were set
to 75 usecs.
TSO enabled and GSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11812.8125 MB / 10.00 sec = 9905.6640 Mbps 100 %TX 75 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
7701.8750 MB / 10.00 sec = 6458.5541 Mbps 100 %TX 56 %RX
For 9000 byte jumbo frames it now gets a little better performance
and almost matches the 10-GigE line rate performance of the TSO
disabled case. For the "-M1460" test, it gets substantially better
performance (6458.5541 Mbps versus 4194.6931 Mbps) at the expense
of much higher transmitter CPU utilization (100 % versus 36 %),
although the receiver CPU utilization is much less (56 % versus 100 %).
TSO disabled and GSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.3125 MB / 10.00 sec = 9909.4058 Mbps 100 %TX 76 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4081.2500 MB / 10.00 sec = 3422.3994 Mbps 99 %TX 41 %RX
For 9000 byte jumbo frames the results are essentially the same.
For the "-M1460" test, the performance is significantly worse
(3422.3994 Mbps versus 4883.2429 Mbps) even though the transmitter
CPU utilization is saturated in both cases, but the receiver CPU
utilization is about half (41 % versus 82 %).
TSO disabled and GSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.3750 MB / 10.00 sec = 9906.1090 Mbps 99 %TX 77 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
3939.1875 MB / 10.00 sec = 3303.2814 Mbps 100 %TX 41 %RX
For 9000 byte jumbo frames the performance is a little better,
again approaching 10-GigE line rate. But for the "-M1460" test,
the performance is significantly worse (3303.2814 Mbps versus
4170.6739 Mbps) even though the transmitter consumes much more
CPU (100 % versus 52 %). In this case though the receiver has
a much lower CPU utilization (41 % versus 100 %).
-Bill
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-26 8:41 ` [ofa-general] " Bill Fink
@ 2007-08-27 1:32 ` John Heffner
2007-08-27 2:04 ` David Miller
0 siblings, 1 reply; 48+ messages in thread
From: John Heffner @ 2007-08-27 1:32 UTC (permalink / raw)
To: Bill Fink
Cc: Rick Jones, hadi, David Miller, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
Bill Fink wrote:
> Here's the beforeafter delta of the receiver's "netstat -s"
> statistics for the TSO enabled case:
>
> Ip:
> 3659898 total packets received
> 3659898 incoming packets delivered
> 80050 requests sent out
> Tcp:
> 2 passive connection openings
> 3659897 segments received
> 80050 segments send out
> TcpExt:
> 33 packets directly queued to recvmsg prequeue.
> 104956 packets directly received from backlog
> 705528 packets directly received from prequeue
> 3654842 packets header predicted
> 193 packets header predicted and directly queued to user
> 4 acknowledgments not containing data received
> 6 predicted acknowledgments
>
> And here it is for the TSO disabled case (GSO also disabled):
>
> Ip:
> 4107083 total packets received
> 4107083 incoming packets delivered
> 1401376 requests sent out
> Tcp:
> 2 passive connection openings
> 4107083 segments received
> 1401376 segments send out
> TcpExt:
> 2 TCP sockets finished time wait in fast timer
> 48486 packets directly queued to recvmsg prequeue.
> 1056111048 packets directly received from backlog
> 2273357712 packets directly received from prequeue
> 1819317 packets header predicted
> 2287497 packets header predicted and directly queued to user
> 4 acknowledgments not containing data received
> 10 predicted acknowledgments
>
> For the TSO disabled case, there are far more TCP segments
> sent out (1401376 versus 80050), which I assume are ACKs, and which
> could possibly contribute to the higher throughput for the TSO disabled
> case due to faster feedback, but not explain the lower CPU utilization.
> There are many more packets directly queued to recvmsg prequeue
> (48486 versus 33). The numbers for packets directly received from
> backlog and prequeue in the TSO disabled case seem bogus to me, so
> I don't know how to interpret that. There are only about half as
> many packets header predicted (1819317 versus 3654842), but there
> are many more packets header predicted and directly queued to user
> (2287497 versus 193). I'll leave the analysis of all this to those
> who might actually know what it all means.
There are a few interesting things here. For one, the bursts caused by
TSO seem to be causing the receiver to do stretch acks. This may have a
negative impact on flow performance, but it's hard to say for sure how
much. Interestingly, it will even further reduce the CPU load on the
sender, since it has to process fewer acks.
As I suspected, in the non-TSO case the receiver gets lots of packets
directly queued to user. This should result in somewhat lower CPU
utilization on the receiver. I don't know if it can account for all the
difference you see.
The backlog and prequeue values are probably correct, but netstat's
description is wrong. A quick look at the code reveals these values are
in units of bytes, not packets.
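As a quick sanity check, treating those two counters from the TSO disabled
run as bytes gives plausible totals for a multi-gigabyte transfer (packet
counts that large would far exceed what a 10-second run could carry):

    for name, value in (("directly received from backlog", 1056111048),
                        ("directly received from prequeue", 2273357712)):
        print("%-34s %8.1f MB" % (name, value / 1048576.0))
    # -> about 1007 MB from the backlog and 2168 MB from the prequeue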
-John
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-27 1:32 ` John Heffner
@ 2007-08-27 2:04 ` David Miller
2007-08-27 23:23 ` jamal
0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-27 2:04 UTC (permalink / raw)
To: jheffner
Cc: billfink, rick.jones2, hadi, krkumar2, gaagaan, general, herbert,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
From: John Heffner <jheffner@psc.edu>
Date: Sun, 26 Aug 2007 21:32:26 -0400
> There are a few interesting things here. For one, the bursts caused by
> TSO seem to be causing the receiver to do stretch acks. This may have a
> negative impact on flow performance, but it's hard to say for sure how
> much. Interestingly, it will even further reduce the CPU load on the
> sender, since it has to process fewer acks.
>
> As I suspected, in the non-TSO case the receiver gets lots of packets
> directly queued to user. This should result in somewhat lower CPU
> utilization on the receiver. I don't know if it can account for all the
> difference you see.
I had completely forgotten these stretch ACK and ucopy issues.
When the receiver gets inundated with a backlog of receive
queue packets, it just spins there copying into userspace
_every_ _single_ packet in that queue, then spits out one ACK.
Meanwhile the sender has to pause long enough for the pipe to empty
slightly.
The transfer is much better behaved if we ACK every two full sized
frames we copy into the receiver, and therefore don't stretch ACK, but
at the cost of cpu utilization.
These effects are particularly pronounced on systems where the
bus bandwidth is also one of the limiting factors.
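A toy model of the two behaviors (illustration only, with an assumed MSS;
this is not the kernel's actual code path):

    MSS = 1448                       # assumed full-sized payload per segment
    burst = [MSS] * 8                # a TSO-sized burst sitting in the receive queue

    # Stretch ACK: drain the whole backlog into userspace, then ACK once.
    print("stretch ACK     : 1 ACK for %d bytes" % sum(burst))

    # ACK every two full-sized frames copied to the receiver.
    pending, acks = 0, 0
    for seg in burst:
        pending += seg
        if pending >= 2 * MSS:
            acks += 1
            pending = 0
    print("ACK every 2 MSS : %d ACKs for %d bytes" % (acks, sum(burst)))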
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-27 2:04 ` David Miller
@ 2007-08-27 23:23 ` jamal
2007-09-14 7:20 ` [ofa-general] " Bill Fink
0 siblings, 1 reply; 48+ messages in thread
From: jamal @ 2007-08-27 23:23 UTC (permalink / raw)
To: David Miller
Cc: jheffner, billfink, rick.jones2, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:
> The transfer is much better behaved if we ACK every two full sized
> frames we copy into the receiver, and therefore don't stretch ACK, but
> at the cost of cpu utilization.
The rx coalescing in theory should help by accumulating more ACKs on the
rx side of the sender. But it doesn't seem to do that, i.e. for the 9K MTU
you are better off turning off the coalescing if you want higher
numbers. Also, some of the TOE vendors (chelsio?) claim to have fixed
this by reducing bursts on outgoing packets.
Bill:
who suggested (as per your email) the 75usec value and what was it based
on measurement-wise?
BTW, thanks for finding the energy to run those tests and a very
refreshing perspective. I don't mean to add more work, but I had some
queries:
On your earlier tests, I think that Reno showed some significant
differences in the lower MTU case over BIC. I wonder if this is
consistent?
A side note: although the experimentation reduces the variables (e.g.
tying all to CPU0), it would be more exciting to see multi-cpu and
multi-flow sender effects (which IMO is more real world).
Last note: you need a newer netstat.
> These effects are particularly pronounced on systems where the
> bus bandwidth is also one of the limiting factors.
Can you elucidate this a little more Dave? Did you mean memory
bandwidth?
cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-27 23:23 ` jamal
@ 2007-09-14 7:20 ` Bill Fink
2007-09-14 13:44 ` TSO, TCP Cong control etc jamal
2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
0 siblings, 2 replies; 48+ messages in thread
From: Bill Fink @ 2007-09-14 7:20 UTC (permalink / raw)
To: hadi
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, jheffner, sri
On Mon, 27 Aug 2007, jamal wrote:
> On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:
>
> > The transfer is much better behaved if we ACK every two full sized
> > frames we copy into the receiver, and therefore don't stretch ACK, but
> > at the cost of cpu utilization.
>
> The rx coalescing in theory should help by accumulating more ACKs on the
> rx side of the sender. But it doesn't seem to do that, i.e. for the 9K MTU
> you are better off turning off the coalescing if you want higher
> numbers. Also, some of the TOE vendors (chelsio?) claim to have fixed
> this by reducing bursts on outgoing packets.
>
> Bill:
> who suggested (as per your email) the 75usec value and what was it based
> on measurement-wise?
Belatedly getting back to this thread. There was a recent myri10ge
patch that changed the default value for tx/rx interrupt coalescing
to 75 usec claiming it was an optimum value for maximum throughput
(and is also mentioned in their external README documentation).
I also did some empirical testing to determine the effect of different
values of TX/RX interrupt coalescing on 10-GigE network performance,
both with TSO enabled and with TSO disabled. The actual test runs
are attached at the end of this message, but the results are summarized
in the following table (network performance in Mbps).
              TX/RX interrupt coalescing in usec (both sides)
                   0     15     30     45     60     75     90    105
TSO enabled     8909   9682   9716   9725   9739   9745   9688   9648
TSO disabled    9113   9910   9910   9910   9910   9910   9910   9910
TSO disabled performance is always better than equivalent TSO enabled
performance. With TSO enabled, the optimum performance is indeed at
a TX/RX interrupt coalescing value of 75 usec. With TSO disabled,
performance is the full 10-GigE line rate of 9910 Mbps for any value
of TX/RX interrupt coalescing from 15 usec to 105 usec.
> BTW, thanks for finding the energy to run those tests and a very
> refreshing perspective. I don't mean to add more work, but I had some
> queries:
> On your earlier tests, I think that Reno showed some significant
> differences in the lower MTU case over BIC. I wonder if this is
> consistent?
Here's a retest (5 tests each):
TSO enabled:
TCP Cubic (initial_ssthresh set to 0):
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5007.6295 MB / 10.06 sec = 4176.1807 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4950.9279 MB / 10.06 sec = 4130.2528 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4917.1742 MB / 10.05 sec = 4102.5772 Mbps 35 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4948.7920 MB / 10.05 sec = 4128.7990 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4937.5765 MB / 10.05 sec = 4120.6460 Mbps 35 %TX 99 %RX
TCP Bic (initial_ssthresh set to 0):
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5005.5335 MB / 10.06 sec = 4172.9571 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5001.0625 MB / 10.06 sec = 4169.2960 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4957.7500 MB / 10.06 sec = 4135.7355 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4957.3777 MB / 10.06 sec = 4135.6252 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5059.1815 MB / 10.05 sec = 4221.3546 Mbps 37 %TX 99 %RX
TCP Reno:
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4973.3532 MB / 10.06 sec = 4147.3589 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4984.4375 MB / 10.06 sec = 4155.2131 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4995.6841 MB / 10.06 sec = 4166.2734 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4982.2500 MB / 10.05 sec = 4156.7586 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4989.9796 MB / 10.05 sec = 4163.0949 Mbps 36 %TX 99 %RX
TSO disabled:
TCP Cubic (initial_ssthresh set to 0):
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5075.8125 MB / 10.02 sec = 4247.3408 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5056.0000 MB / 10.03 sec = 4229.9621 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5047.4375 MB / 10.03 sec = 4223.1203 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5066.1875 MB / 10.03 sec = 4239.1659 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4986.3750 MB / 10.03 sec = 4171.9906 Mbps 99 %TX 100 %RX
TCP Bic (initial_ssthresh set to 0):
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5040.5625 MB / 10.03 sec = 4217.3521 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5049.7500 MB / 10.03 sec = 4225.4585 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5076.5000 MB / 10.03 sec = 4247.6632 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5017.2500 MB / 10.03 sec = 4197.4990 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5013.3125 MB / 10.03 sec = 4194.8851 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5036.0625 MB / 10.03 sec = 4213.9195 Mbps 100 %TX 100 %RX
TCP Reno:
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5006.8750 MB / 10.02 sec = 4189.6051 Mbps 99 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5028.1250 MB / 10.02 sec = 4207.4553 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5021.9375 MB / 10.02 sec = 4202.2668 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5000.5625 MB / 10.03 sec = 4184.3109 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5025.1250 MB / 10.03 sec = 4204.7378 Mbps 99 %TX 100 %RX
Not too much variation here, and not quite as high results
as previously. Some further testing reveals that while this
time I mainly get results like (here for TCP Bic with TSO
disabled):
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX
I also sometimes get results like:
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX
The higher performing results seem to correspond to when there's a
somewhat lower receiver CPU utilization. I'm not sure but there
could also have been an effect from running the "-M1460" test after
the 9000 byte jumbo frame test (no jumbo tests were done at all prior
to running the above sets of 5 tests, although I did always discard
an initial "warmup" test, and now that I think about it some of
those initial discarded "warmup" tests did have somewhat anomalously
high results).
> A side note: Although the experimentation reduces the variables (eg
> tying all to CPU0), it would be more exciting to see multi-cpu and
> multi-flow sender effect (which IMO is more real world).
These systems are intended as test systems for 10-GigE networks,
and as such it's important to get as consistently close to full
10-GigE line rate as possible, and that's why the interrupts and
nuttcp application are tied to CPU0, with almost all other system
applications tied to CPU1.
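(Purely as an illustration of this kind of pinning, and not the actual setup used
here: binding the test process and a NIC interrupt to CPU0 looks roughly like the
sketch below; the IRQ number is a made-up placeholder.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Bind the calling process to CPU0. */
static int pin_self_to_cpu0(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = this process */
}

/* Steer one IRQ to CPU0 by writing a CPU mask to procfs. */
static int steer_irq_to_cpu0(int irq)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "1\n");              /* mask 0x1 == CPU0 */
        return fclose(f);
}

int main(void)
{
        pin_self_to_cpu0();
        steer_irq_to_cpu0(90);          /* 90 is a placeholder IRQ number */
        return 0;
}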
Now on another system that's intended as a 10-GigE firewall system,
it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to
CPU0 and the interrupts for the other NIC tied to CPU1. In IP forwarding
tests of this system, I have basically achieved full bidirectional
10-GigE line rate IP forwarding with 9000 byte jumbo frames.
chance4 -> chance6 -> chance9 4.85 Gbps rate limited TCP stream
chance5 -> chance6 -> chance9 4.85 Gbps rate limited TCP stream
chance7 <- chance6 <- chance8 10.0 Gbps non-rate limited TCP stream
[root@chance7 ~]# nuttcp -Ic4tc9 -Ri4.85g -w10m 192.168.88.8 192.168.89.16 & \
nuttcp -Ic5tc9 -Ri4.85g -w10m -P5100 -p5101 192.168.88.9 192.168.89.16 & \
nuttcp -Ic7rc8 -r -w10m 192.168.89.15
c4tc9: 5778.6875 MB / 10.01 sec = 4842.7158 Mbps 100 %TX 42 %RX
c5tc9: 5778.9375 MB / 10.01 sec = 4843.1595 Mbps 100 %TX 40 %RX
c7rc8: 11509.1875 MB / 10.00 sec = 9650.8009 Mbps 99 %TX 74 %RX
If there's some other specific test you'd like to see, and it's not
too difficult to set up and I have some spare time, I'll see what I
can do.
-Bill
Testing of effect of RX/TX interrupt coalescing on 10-GigE network performance
(both with TSO enabled and with TSO disabled):
--------------------------------------------------------------------------------
No RX/TX interrupt coalescing (either side):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10649.8750 MB / 10.03 sec = 8908.9806 Mbps 97 %TX 100 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10879.5000 MB / 10.02 sec = 9112.5141 Mbps 99 %TX 99 %RX
RX/TX interrupt coalescing set to 15 usec (both sides):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11546.7500 MB / 10.00 sec = 9682.0785 Mbps 99 %TX 90 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.9375 MB / 10.00 sec = 9910.3702 Mbps 100 %TX 92 %RX
RX/TX interrupt coalescing set to 30 usec (both sides):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11587.1250 MB / 10.00 sec = 9715.9489 Mbps 99 %TX 81 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.8125 MB / 10.00 sec = 9910.3040 Mbps 100 %TX 81 %RX
RX/TX interrupt coalescing set to 45 usec (both sides):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11597.8750 MB / 10.00 sec = 9724.9902 Mbps 99 %TX 76 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.6250 MB / 10.00 sec = 9910.0933 Mbps 100 %TX 77 %RX
RX/TX interrupt coalescing set to 60 usec (both sides):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11614.7500 MB / 10.00 sec = 9739.1323 Mbps 100 %TX 74 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9909.9995 Mbps 100 %TX 76 %RX
RX/TX interrupt coalescing set to 75 usec (both sides):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11621.7500 MB / 10.00 sec = 9745.0993 Mbps 100 %TX 72 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.0625 MB / 10.00 sec = 9909.7881 Mbps 100 %TX 75 %RX
RX/TX interrupt coalescing set to 90 usec (both sides):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11553.1250 MB / 10.00 sec = 9687.6458 Mbps 100 %TX 71 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9910.0837 Mbps 100 %TX 73 %RX
RX/TX interrupt coalescing set to 105 usec (both sides):
TSO enabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11505.7500 MB / 10.00 sec = 9647.8558 Mbps 99 %TX 69 %RX
TSO disabled:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9910.0530 Mbps 100 %TX 74 %RX
^ permalink raw reply [flat|nested] 48+ messages in thread
* TSO, TCP Cong control etc
2007-09-14 7:20 ` [ofa-general] " Bill Fink
@ 2007-09-14 13:44 ` jamal
2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
1 sibling, 0 replies; 48+ messages in thread
From: jamal @ 2007-09-14 13:44 UTC (permalink / raw)
To: Bill Fink
Cc: David Miller, jheffner, rick.jones2, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
I've changed the subject to match the content.
On Fri, 2007-14-09 at 03:20 -0400, Bill Fink wrote:
> On Mon, 27 Aug 2007, jamal wrote:
>
> > Bill:
> > who suggested (as per your email) the 75usec value and what was it based
> > on measurement-wise?
>
> Belatedly getting back to this thread. There was a recent myri10ge
> patch that changed the default value for tx/rx interrupt coalescing
> to 75 usec claiming it was an optimum value for maximum throughput
> (and is also mentioned in their external README documentation).
I would think such a value would be very specific to the ring size and
maybe even the machine in use.
> I also did some empirical testing to determine the effect of different
> values of TX/RX interrupt coalescing on 10-GigE network performance,
> both with TSO enabled and with TSO disabled. The actual test runs
> are attached at the end of this message, but the results are summarized
> in the following table (network performance in Mbps).
>
>              TX/RX interrupt coalescing in usec (both sides)
>                 0      15      30      45      60      75      90     105
>
> TSO enabled  8909    9682    9716    9725    9739    9745    9688    9648
> TSO disabled 9113    9910    9910    9910    9910    9910    9910    9910
>
> TSO disabled performance is always better than equivalent TSO enabled
> performance. With TSO enabled, the optimum performance is indeed at
> a TX/RX interrupt coalescing value of 75 usec. With TSO disabled,
> performance is the full 10-GigE line rate of 9910 Mbps for any value
> of TX/RX interrupt coalescing from 15 usec to 105 usec.
Interesting results. I think J Heffner gave a very compelling
description the other day, based on your netstat results at the receiver,
of what is going on (refer to the comments on stretch ACKs). If the
receiver is fixed, then you'd see better numbers from TSO.
The 75 microseconds is very benchmarky in my opinion. If I were to pick a
different app or a different NIC, or run on many CPUs with many apps doing
TSO, I highly doubt that would be the right number.
> Here's a retest (5 tests each):
>
> TSO enabled:
>
> TCP Cubic (initial_ssthresh set to 0):
[..]
> TCP Bic (initial_ssthresh set to 0):
[..]
>
> TCP Reno:
>
[..]
> TSO disabled:
>
> TCP Cubic (initial_ssthresh set to 0):
>
[..]
> TCP Bic (initial_ssthresh set to 0):
>
[..]
> TCP Reno:
>
[..]
> Not too much variation here, and not quite as high results
> as previously.
BIC seems to be better on average, followed by CUBIC, followed by Reno.
The difference this time may be because you set the ssthresh to 0
(hopefully on every run), so Reno is definitely going to perform worse,
since it is a lot less aggressive than the other two.
> Some further testing reveals that while this
> time I mainly get results like (here for TCP Bic with TSO
> disabled):
>
> [root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
> 4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX
>
> I also sometimes get results like:
>
> [root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
> 5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX
>
not good.
> The higher performing results seem to correspond to when there's a
> somewhat lower receiver CPU utilization. I'm not sure but there
> could also have been an effect from running the "-M1460" test after
> the 9000 byte jumbo frame test (no jumbo tests were done at all prior
> to running the above sets of 5 tests, although I did always discard
> an initial "warmup" test, and now that I think about it some of
> those initial discarded "warmup" tests did have somewhat anomalously
> high results).
If you didn't reset the ssthresh on every run, could it have been cached
and used on subsequent runs?
> > A side note: Although the experimentation reduces the variables (eg
> > tying all to CPU0), it would be more exciting to see multi-cpu and
> > multi-flow sender effect (which IMO is more real world).
>
> These systems are intended as test systems for 10-GigE networks,
> and as such it's important to get as consistently close to full
> 10-GigE line rate as possible, and that's why the interrupts and
> nuttcp application are tied to CPU0, with almost all other system
> applications tied to CPU1.
Sure, good benchmark. You get to know how well you can do.
> Now on another system that's intended as a 10-GigE firewall system,
> it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to
> CPU0 and the interrupts for CPU1 tied to CPU1. In IP forwarding
> tests of this system, I have basically achieved full bidirectional
> 10-GigE line rate IP forwarding with 9000 byte jumbo frames.
In forwarding, a more meaningful metric would be pps. The cost per packet
tends to dominate the results over the cost per byte.
9K jumbo frames at 10G is less than 500 Kpps - so I don't see the machine
you are using sweating at all. To give you a comparison, on a lower-end
single-CPU Opteron I can generate 1 Mpps with batching pktgen; Robert
says he can do that even without batching on an Opteron closer to what
you are using. So if you want to run that test, you'd need to use
incrementally smaller packets.
> If there's some other specific test you'd like to see, and it's not
> too difficult to set up and I have some spare time, I'll see what I
> can do.
Well, the more interesting tests would be to go full throttle on all the
CPUs you have and target one (or more) receivers, i.e. you simulate a
real server. Can the utility you have be bound to a CPU? If yes, you
should be able to achieve this without much effort.
Thanks a lot Bill for the effort.
cheers,
jamal
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-09-14 7:20 ` [ofa-general] " Bill Fink
2007-09-14 13:44 ` TSO, TCP Cong control etc jamal
@ 2007-09-14 17:24 ` David Miller
1 sibling, 0 replies; 48+ messages in thread
From: David Miller @ 2007-09-14 17:24 UTC (permalink / raw)
To: billfink
Cc: hadi, jheffner, rick.jones2, krkumar2, gaagaan, general, herbert,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
From: Bill Fink <billfink@mindspring.com>
Date: Fri, 14 Sep 2007 03:20:55 -0400
> TSO disabled performance is always better than equivalent TSO enabled
> performance. With TSO enabled, the optimum performance is indeed at
> a TX/RX interrupt coalescing value of 75 usec. With TSO disabled,
> performance is the full 10-GigE line rate of 9910 Mbps for any value
> of TX/RX interrupt coalescing from 15 usec to 105 usec.
Note that the systems where the coalescing tweaking is often necessary
are the heavily NUMA'd systems where CPU-to-device latency can be
huge, like the big SGI ones, which is where we had to tweak things in
the tg3 driver in the first place.
On most systems, as you saw mostly in the non-TSO case, the
value chosen is for the most part arbitrary and not critical.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-23 22:04 ` [ofa-general] " jamal
2007-08-23 22:25 ` jamal
@ 2007-08-23 22:30 ` David Miller
2007-08-23 22:38 ` [ofa-general] " jamal
1 sibling, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-23 22:30 UTC (permalink / raw)
To: hadi
Cc: rick.jones2, krkumar2, gaagaan, general, herbert, jagana, jeff,
johnpol, kaber, kumarkr, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma
From: jamal <hadi@cyberus.ca>
Date: Thu, 23 Aug 2007 18:04:10 -0400
> Possibly a bug - but you really should turn off TSO if you are doing
> huge interactive transactions (which is fair because there is a clear
> demarcation).
I don't see how this can matter.
TSO only ever does anything if you accumulate more than one MSS
worth of data.
And when that does happen, all it does is take what's in the send queue
and send as much as possible at once. The packets are already built
in big chunks, so there is no extra work to do.
The card is going to send them back to back, just as fast as it would
in the non-TSO case.
It doesn't change application scheduling, and it absolutely does not
penalize small sends by the application unless we have a bug
somewhere.
So I see no reason to disable TSO other than for hardware
implementation deficiencies. And the drivers I am familiar with
do make smart default TSO-enabling decisions based upon how well
the chip does TSO.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-23 22:30 ` David Miller
@ 2007-08-23 22:38 ` jamal
2007-08-24 3:34 ` Stephen Hemminger
0 siblings, 1 reply; 48+ messages in thread
From: jamal @ 2007-08-23 22:38 UTC (permalink / raw)
To: David Miller
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, jeff, general, mchan, tgraf, johnpol,
shemminger, kaber, sri
On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Thu, 23 Aug 2007 18:04:10 -0400
>
> > Possibly a bug - but you really should turn off TSO if you are doing
> > huge interactive transactions (which is fair because there is a clear
> > demarcation).
>
> I don't see how this can matter.
>
> TSO only ever does anything if you accumulate more than one MSS
> worth of data.
I stand corrected then.
cheers,
jamal
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-23 22:38 ` [ofa-general] " jamal
@ 2007-08-24 3:34 ` Stephen Hemminger
2007-08-24 12:36 ` jamal
2007-08-24 16:25 ` Rick Jones
0 siblings, 2 replies; 48+ messages in thread
From: Stephen Hemminger @ 2007-08-24 3:34 UTC (permalink / raw)
To: hadi
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, David Miller, sri
On Thu, 23 Aug 2007 18:38:22 -0400
jamal <hadi@cyberus.ca> wrote:
> On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> > From: jamal <hadi@cyberus.ca>
> > Date: Thu, 23 Aug 2007 18:04:10 -0400
> >
> > > Possibly a bug - but you really should turn off TSO if you are doing
> > > huge interactive transactions (which is fair because there is a clear
> > > demarcation).
> >
> > I don't see how this can matter.
> >
> > TSO only ever does anything if you accumulate more than one MSS
> > worth of data.
>
> I stand corrected then.
>
> cheers,
> jamal
>
For most normal Internet TCP connections, you will see only 2 or 3 packets per TSO
send because of ACK clocking. If you turn off delayed ACK on the receiver it
will be even fewer.
A current hot topic of research is reducing the number of ACK's to make TCP
work better over asymmetric links like 3G.
--
Stephen Hemminger <shemminger@linux-foundation.org>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 3:34 ` Stephen Hemminger
@ 2007-08-24 12:36 ` jamal
2007-08-24 16:25 ` Rick Jones
1 sibling, 0 replies; 48+ messages in thread
From: jamal @ 2007-08-24 12:36 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, rick.jones2, krkumar2, gaagaan, general, herbert,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, sri, tgraf, xma
On Thu, 2007-23-08 at 20:34 -0700, Stephen Hemminger wrote:
> A current hot topic of research is reducing the number of ACK's to make TCP
> work better over asymmetric links like 3G.
One other good reason to reduce ACKs to battery-powered (3G) terminals
is that it reduces power consumption, i.e. you get longer battery life.
cheers,
jamal
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-24 3:34 ` Stephen Hemminger
2007-08-24 12:36 ` jamal
@ 2007-08-24 16:25 ` Rick Jones
1 sibling, 0 replies; 48+ messages in thread
From: Rick Jones @ 2007-08-24 16:25 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hadi, David Miller, krkumar2, gaagaan, general, herbert, jagana,
jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, sri, tgraf, xma
>
> A current hot topic of research is reducing the number of ACK's to make TCP
> work better over asymmetric links like 3G.
Oy. People running Solaris and HP-UX have been "researching" ACK reductions
since 1997 if not earlier.
rick jones
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-08 9:31 Krishna Kumar
2007-08-08 10:49 ` [ofa-general] " David Miller
0 siblings, 1 reply; 48+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:31 UTC (permalink / raw)
To: johnpol, sri, shemminger, davem, kaber
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, hadi,
netdev
This set of patches implements the batching API, and adds support for this
API in IPoIB.
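To make the patch descriptions below easier to follow, here is a rough sketch of
the core idea (dev->skb_blist and hard_start_xmit_batch are the names this series
introduces; the signatures, the missing TX lock and the missing queue-stopped
checks below are illustrative only, not the posted code):

/* Illustrative sketch only -- not the code in patches 2/9, 3/9 or 9/9.
 * Assumes the patched struct net_device carries a batch list (skb_blist)
 * and a hard_start_xmit_batch() method; locking omitted for brevity. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

static void example_batch_and_xmit(struct net_device *dev, struct Qdisc *q)
{
        struct sk_buff *skb;

        /* move everything the qdisc will give us onto the batch list ... */
        while ((skb = q->dequeue(q)) != NULL)
                __skb_queue_tail(dev->skb_blist, skb);

        /* ... and let the driver drain it under one TX lock / doorbell */
        if (!skb_queue_empty(dev->skb_blist))
                dev->hard_start_xmit_batch(dev);
}

The real series also has to hold the TX lock, honor netif_queue_stopped() and
requeue on failure; the point is only that the driver sees a list of skbs rather
than one skb per ->hard_start_xmit() call.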
List of changes from original submission:
-----------------------------------------
1. [Patrick] Suggestion to remove tx_queue_len check for enabling batching.
2. [Patrick] Move queue purging to dev_deactivate to free references on
device going down.
3. [Patrick] Remove changelog & unrelated changes from sch_generic.c
4. [Patrick] Free skb_blist in unregister_netdev (also suggested to put in
free_netdev, but it is not required as unregister_netdev will not fail
at this location).
5. [Stephen/Patrick] Remove /sysfs support.
6. [Stephen] Add ethtool support.
7. [Evgeniy] Stop interrupts while changing tx_batch_skb value.
8. [Michael Tsirkin] Remove misleading comment in ipoib_send().
9. [KK] Remove NETIF_F_BATCH_SKBS (device supports batching if API present).
10. [KK] Remove xmit_slots from netdev.
11. [KK] [IPoIB]: Use unsigned instead of int for indices, handle the race
between multiple WCs executing on different CPUs by adding a new
lock (or it might be necessary to hold the lock for the entire duration
of the WC - some optimization is possible here), changed the
multiple-skb algorithm to not use xmit_slots, simplified code, minor
performance changes wrt slot counters, etc.
List of changes implemented, tested and dropped:
------------------------------------------------
1. [Patrick] Suggestion to use skb_blist statically in netdevice. This
reduces performance (~1%), possibly due to the extra check for the
dev->hard_start_xmit_batch API.
2. [Patrick] Suggestion to check if hard_start_xmit_batch can be removed:
This reduces performance, as a call to a non-inline function is made
and the driver needs an extra check to see if skb is NULL.
3. [Sridhar] Suggestion to always use batching for the regular xmit case too:
While testing, for some reason the test virtually hangs and
transfers almost no data for higher numbers of processes (64 and
above).
Patches are described as:
Mail 0/9: This mail
Mail 1/9: HOWTO documentation
Mail 2/9: Introduce skb_blist and hard_start_xmit_batch API
Mail 3/9: Modify qdisc_run() to support batching
Mail 4/9: Add ethtool support to enable/disable batching
Mail 5/9: IPoIB header file changes to use batching
Mail 6/9: IPoIB CM & Multicast changes
Mail 7/9: IPoIB verb changes to use batching
Mail 8/9: IPoIB internal post and work completion handler
Mail 9/9: Implement the new batching API
RESULTS: The performance improvement for TCP No Delay is in the range of -8%
to 320% (with -8% being the sole negative), with many individual tests
giving 50% or more improvement (I think it is to do with the hw slots
getting full quicker resulting in more batching when the queue gets
woken). The results for TCP is in the range of -11% to 93%, with most
of the tests (8/12) giving improvements.
ISSUES: I am getting a huge number of retransmissions for both the TCP and TCP No
Delay cases for IPoIB (which explains the slight degradation for some
test cases mentioned above). After a full test run, the regular code
resulted in 74 retransmissions, while there were 1365716 retransmissions with
the batching code - or 18500 retransmissions for every 1 in the regular code.
But even with this huge number of retransmissions there is a 20.7% overall
improvement in BW (which implies batching will improve results even
more if this problem is fixed). I suspect this is some issue in the
driver/firmware since:
a. I see similarly low retransmission numbers for E1000 (so
there is no bug in the core changes).
b. Even with batching limited to a maximum of 2 skbs, I get almost the
same number of retransmissions (which implies the receiver is
probably not dropping skbs). ifconfig/netstat on the receiver
gives no clue (drops/errors, etc).
This issue delayed submitting these patches for the last 2 weeks, as I was
trying to debug it; any help from the openIB community is appreciated.
Please review and provide feedback, and consider for inclusion.
Thanks,
- KK
---------------------------------------------------------------
Test Case ORG NEW % Change
---------------------------------------------------------------
TCP
---
Size:32 Procs:1 2709 4217 55.66
Size:128 Procs:1 10950 15853 44.77
Size:512 Procs:1 35313 68224 93.19
Size:4096 Procs:1 118144 119935 1.51
Size:32 Procs:8 18976 22432 18.21
Size:128 Procs:8 66351 86072 29.72
Size:512 Procs:8 246546 234373 -4.93
Size:4096 Procs:8 268861 251540 -6.44
Size:32 Procs:16 35009 45861 30.99
Size:128 Procs:16 150979 164961 9.26
Size:512 Procs:16 259443 230730 -11.06
Size:4096 Procs:16 265313 246794 -6.98
TCP No Delay
------------
Size:32 Procs:1 1930 1944 .72
Size:128 Procs:1 8573 7831 -8.65
Size:512 Procs:1 28536 29347 2.84
Size:4096 Procs:1 98916 104236 5.37
Size:32 Procs:8 4173 17560 320.80
Size:128 Procs:8 17350 66205 281.58
Size:512 Procs:8 69777 211467 203.06
Size:4096 Procs:8 201096 242578 20.62
Size:32 Procs:16 20570 37778 83.65
Size:128 Procs:16 95005 154464 62.58
Size:512 Procs:16 111677 221570 98.40
Size:4096 Procs:16 204765 240368 17.38
---------------------------------------------------------------
Overall: 2340962 2826340 20.73%
[Summary: 19 Better cases, 5 worse]
Testing environment (on client, server uses 4096 sendq size):
echo "Using 512 size sendq"
modprobe ib_ipoib send_queue_size=512 recv_queue_size=512
echo "4096 524288 4194304" > /proc/sys/net/ipv4/tcp_wmem
echo "4096 1048576 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo 4194304 > /proc/sys/net/core/rmem_max
echo 4194304 > /proc/sys/net/core/wmem_max
echo 120000 > /proc/sys/net/core/netdev_max_backlog
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 9:31 [ofa-general] " Krishna Kumar
@ 2007-08-08 10:49 ` David Miller
2007-08-08 11:09 ` Krishna Kumar2
` (2 more replies)
0 siblings, 3 replies; 48+ messages in thread
From: David Miller @ 2007-08-08 10:49 UTC (permalink / raw)
To: krkumar2
Cc: johnpol, jagana, peter.p.waskiewicz.jr, herbert, gaagaan,
Robert.Olsson, kumarkr, rdreier, mcarlson, jeff, hadi, general,
mchan, tgraf, netdev, shemminger, kaber, sri
From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Wed, 08 Aug 2007 15:01:14 +0530
> RESULTS: The performance improvement for TCP No Delay is in the range of -8%
> to 320% (with -8% being the sole negative), with many individual tests
> giving 50% or more improvement (I think it is to do with the hw slots
> getting full quicker resulting in more batching when the queue gets
> woken). The results for TCP is in the range of -11% to 93%, with most
> of the tests (8/12) giving improvements.
Not because I think it obviates your work, but rather because I'm
curious, could you test a TSO-in-hardware driver converted to
batching and see how TSO alone compares to batching for a pure
TCP workload?
I personally don't think it will help for that case at all as
TSO likely does better job of coalescing the work _and_ reducing
bus traffic as well as work in the TCP stack.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 10:49 ` [ofa-general] " David Miller
@ 2007-08-08 11:09 ` Krishna Kumar2
2007-08-08 22:01 ` [ofa-general] " David Miller
2007-08-08 13:42 ` Herbert Xu
2007-08-14 9:02 ` Krishna Kumar2
2 siblings, 1 reply; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-08 11:09 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
David Miller <davem@davemloft.net> wrote on 08/08/2007 04:19:00 PM:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Wed, 08 Aug 2007 15:01:14 +0530
>
> > RESULTS: The performance improvement for TCP No Delay is in the range of -8%
> > to 320% (with -8% being the sole negative), with many individual tests
> > giving 50% or more improvement (I think it is to do with the hw slots
> > getting full quicker resulting in more batching when the queue gets
> > woken). The results for TCP is in the range of -11% to 93%, with most
> > of the tests (8/12) giving improvements.
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
>
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.
Definitely, I will try to do this.
What do you generally think of the patch/implementation ? :)
thanks,
- KK
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 11:09 ` Krishna Kumar2
@ 2007-08-08 22:01 ` David Miller
2007-08-09 4:19 ` Krishna Kumar2
0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-08 22:01 UTC (permalink / raw)
To: krkumar2
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 8 Aug 2007 16:39:47 +0530
> What do you generally think of the patch/implementation ? :)
We have two driver implementation paths on receive and now
we'll have two on send, and that's not a good trend.
In an ideal world all the drivers would be NAPI and netif_rx()
would only be used by tunneling drivers and similar in the
protocol layers. And likewise all sends would go through
->hard_start_xmit().
If you can come up with a long term strategy that gets rid of
the special transmit method, that'd be great.
We should make Linux network drivers easy to write, not more difficult
by constantly adding more interfaces than we consolidate.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 22:01 ` [ofa-general] " David Miller
@ 2007-08-09 4:19 ` Krishna Kumar2
0 siblings, 0 replies; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 4:19 UTC (permalink / raw)
To: David Miller
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
Hi Dave,
David Miller <davem@davemloft.net> wrote on 08/09/2007 03:31:37 AM:
> > What do you generally think of the patch/implementation ? :)
>
> We have two driver implementation paths on receive and now
> we'll have two on send, and that's not a good trend.
Correct.
> In an ideal world all the drivers would be NAPI and netif_rx()
> would only be used by tunneling drivers and similar in the
> protocol layers. And likewise all sends would go through
> ->hard_start_xmit().
>
> If you can come up with a long term strategy that gets rid of
> the special transmit method, that'd be great.
>
> We should make Linux network drivers easy to write, not more difficult
> by constantly adding more interfaces than we consolidate.
I think that is a good top level view, and I agree with that.
Patrick had suggested calling dev_hard_start_xmit() unconditionally instead
of conditionally calling the new API, and removing the new API
entirely; the driver then determines whether batching is required
based on whether skb is NULL. Would that approach be in line
with this "single interface" goal?
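For concreteness, a rough sketch of what such a merged interface could look like
in a driver (helper names are placeholders, and this assumes the per-device
skb_blist from the patch set; it is not code from any posted patch):

/* Illustrative only: ->hard_start_xmit() with skb == NULL meaning
 * "drain the device's batch list"; otherwise behave as today. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int example_post_to_ring(struct net_device *dev, struct sk_buff *skb)
{
        return 0;       /* stand-in for the real driver's descriptor posting */
}

static void example_flush_ring(struct net_device *dev)
{
        /* stand-in for ringing the hardware doorbell once per batch */
}

static int example_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        if (skb) {
                /* normal single-skb transmit */
                if (example_post_to_ring(dev, skb))
                        return NETDEV_TX_BUSY;
        } else {
                /* batching: send whatever the stack queued on skb_blist */
                while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
                        if (example_post_to_ring(dev, skb)) {
                                __skb_queue_head(dev->skb_blist, skb);
                                netif_stop_queue(dev);
                                break;
                        }
                }
        }
        example_flush_ring(dev);        /* one doorbell per call either way */
        return NETDEV_TX_OK;
}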
Thanks,
- KK
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 10:49 ` [ofa-general] " David Miller
2007-08-08 11:09 ` Krishna Kumar2
@ 2007-08-08 13:42 ` Herbert Xu
2007-08-08 15:14 ` jamal
2007-08-09 3:19 ` Krishna Kumar2
2007-08-14 9:02 ` Krishna Kumar2
2 siblings, 2 replies; 48+ messages in thread
From: Herbert Xu @ 2007-08-08 13:42 UTC (permalink / raw)
To: David Miller
Cc: johnpol, peter.p.waskiewicz.jr, jeff, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, jagana, hadi, general, mchan, tgraf,
netdev, shemminger, kaber, sri
On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
You could even lower the bar by disabling TSO and enabling
software GSO.
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.
I agree. I suspect the bulk of the effort is in getting
these skb's created and processed by the stack so that by
the time that they're exiting the qdisc there's not much
to be saved anymore.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 13:42 ` Herbert Xu
@ 2007-08-08 15:14 ` jamal
2007-08-08 20:55 ` Stephen Hemminger
2007-08-08 22:22 ` David Miller
2007-08-09 3:19 ` Krishna Kumar2
1 sibling, 2 replies; 48+ messages in thread
From: jamal @ 2007-08-08 15:14 UTC (permalink / raw)
To: Herbert Xu
Cc: johnpol, peter.p.waskiewicz.jr, jeff, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, kaber, jagana, general, mchan, tgraf,
netdev, shemminger, David Miller, sri
On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> >
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
>
> You could even lower the bar by disabling TSO and enabling
> software GSO.
From my observation, for TCP packets slightly above the MTU (up to 2K), GSO
gives worse performance than non-GSO throughput-wise. Actually this has
nothing to do with batching, rather the behavior is consistent with or
without batching changes.
> > I personally don't think it will help for that case at all as
> > TSO likely does better job of coalescing the work _and_ reducing
> > bus traffic as well as work in the TCP stack.
>
> I agree.
> I suspect the bulk of the effort is in getting
> these skb's created and processed by the stack so that by
> the time that they're exiting the qdisc there's not much
> to be saved anymore.
pktgen shows a clear win if you test the driver path - which is what you
should test because thats where the batching changes are.
Using TCP or UDP adds other variables[1] that need to be isolated first
in order to quantify the effect of batching.
For throughput and CPU utilization, the benefit will be clear when there
are a lot more flows.
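(For anyone wanting to reproduce that kind of driver-path measurement: pktgen is
driven by writing commands to /proc/net/pktgen after "modprobe pktgen". A
bare-bones illustration; the device, destination and counts are placeholders,
not values from any test in this thread:)

#include <stdio.h>

static int pgwrite(const char *path, const char *cmd)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%s\n", cmd);
        return fclose(f);
}

int main(void)
{
        pgwrite("/proc/net/pktgen/kpktgend_0", "rem_device_all");
        pgwrite("/proc/net/pktgen/kpktgend_0", "add_device eth0");
        pgwrite("/proc/net/pktgen/eth0", "count 1000000");
        pgwrite("/proc/net/pktgen/eth0", "pkt_size 60");
        pgwrite("/proc/net/pktgen/eth0", "dst 192.168.88.16");
        pgwrite("/proc/net/pktgen/eth0", "dst_mac 00:11:22:33:44:55");
        pgwrite("/proc/net/pktgen/pgctrl", "start");    /* blocks until done */
        return 0;
}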
cheers,
jamal
[1] I think there are unfortunately too many other variables in play
when you are dealing with a path that starts above the driver and one
that covers the end-to-end effect: traffic/app source, system clock sources
(as per my recent discovery), congestion control algorithms used, tuning
of the receiver, etc.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 15:14 ` jamal
@ 2007-08-08 20:55 ` Stephen Hemminger
2007-08-08 22:40 ` jamal
2007-08-08 22:22 ` David Miller
1 sibling, 1 reply; 48+ messages in thread
From: Stephen Hemminger @ 2007-08-08 20:55 UTC (permalink / raw)
To: hadi
Cc: johnpol, peter.p.waskiewicz.jr, Herbert Xu, gaagaan,
Robert.Olsson, kumarkr, rdreier, mcarlson, David Miller, jagana,
general, mchan, tgraf, jeff, netdev, kaber, sri
On Wed, 08 Aug 2007 11:14:35 -0400
jamal <hadi@cyberus.ca> wrote:
> On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
> > On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> > >
> > > Not because I think it obviates your work, but rather because I'm
> > > curious, could you test a TSO-in-hardware driver converted to
> > > batching and see how TSO alone compares to batching for a pure
> > > TCP workload?
> >
> > You could even lower the bar by disabling TSO and enabling
> > software GSO.
>
> From my observation, for TCP packets slightly above the MTU (up to 2K), GSO
> gives worse performance than non-GSO throughput-wise. Actually this has
> nothing to do with batching, rather the behavior is consistent with or
> without batching changes.
>
> > > I personally don't think it will help for that case at all as
> > > TSO likely does better job of coalescing the work _and_ reducing
> > > bus traffic as well as work in the TCP stack.
> >
> > I agree.
> > I suspect the bulk of the effort is in getting
> > these skb's created and processed by the stack so that by
> > the time that they're exiting the qdisc there's not much
> > to be saved anymore.
>
> pktgen shows a clear win if you test the driver path - which is what you
> should test because thats where the batching changes are.
> Using TCP or UDP adds other variables[1] that need to be isolated first
> in order to quantify the effect of batching.
> For throughput and CPU utilization, the benefit will be clear when there
> are a lot more flows.
Optimizing for pktgen is a mistake for most users. Please show something
useful like router forwarding, TCP (single and multi flow) and/or better
yet application benchmark improvement.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 20:55 ` Stephen Hemminger
@ 2007-08-08 22:40 ` jamal
0 siblings, 0 replies; 48+ messages in thread
From: jamal @ 2007-08-08 22:40 UTC (permalink / raw)
To: Stephen Hemminger
Cc: johnpol, peter.p.waskiewicz.jr, Herbert Xu, gaagaan,
Robert.Olsson, kumarkr, rdreier, mcarlson, David Miller, jagana,
general, mchan, tgraf, jeff, netdev, kaber, sri
On Wed, 2007-08-08 at 21:55 +0100, Stephen Hemminger wrote:
> > pktgen shows a clear win if you test the driver path - which is what you
> > should test because thats where the batching changes are.
> > Using TCP or UDP adds other variables[1] that need to be isolated first
> > in order to quantify the effect of batching.
> > For throughput and CPU utilization, the benefit will be clear when there
> > are a lot more flows.
>
> Optimizing for pktgen is a mistake for most users.
There is no "optimization for pktgen".
If you are improving the tx path, the first step is to test that you can
show the tx path improved. pktgen happens to be the best test suite for
that because it talks to the driver and exercises the changes.
I.e., if one can't show that exercising the direct path demonstrates
improvements, you probably won't be able to show batching improves TCP -
but I don't even want to swear by that.
Does that make sense?
> Please show something
> useful like router forwarding, TCP (single and multi flow) and/or better
> yet application benchmark improvement.
Absolutely, but first things first. Analysis of why something improves
is extremely important; just saying "TCP throughput improved" is
uninteresting and lazy.
To be scientific, it is important to isolate variables first in order to
come up with meaningful results that can be analysed.
To make a point, I have noticed extremely different results between TCP
BIC and Reno with batching. So congestion control as a variable is
important.
cheers,
jamal
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 15:14 ` jamal
2007-08-08 20:55 ` Stephen Hemminger
@ 2007-08-08 22:22 ` David Miller
2007-08-08 22:53 ` jamal
1 sibling, 1 reply; 48+ messages in thread
From: David Miller @ 2007-08-08 22:22 UTC (permalink / raw)
To: hadi
Cc: johnpol, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan, tgraf,
jeff, shemminger, kaber, sri
From: jamal <hadi@cyberus.ca>
Date: Wed, 08 Aug 2007 11:14:35 -0400
> pktgen shows a clear win if you test the driver path - which is what
> you should test because thats where the batching changes are.
The driver path, however, does not exist on an island and what
we care about is the final result with the changes running
inside the full system.
So, to be honest, besides for initial internal development
feedback, the isolated tests only have minimal merit and
it's the full protocol tests that are really interesting.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 22:22 ` David Miller
@ 2007-08-08 22:53 ` jamal
0 siblings, 0 replies; 48+ messages in thread
From: jamal @ 2007-08-08 22:53 UTC (permalink / raw)
To: David Miller
Cc: johnpol, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan, tgraf,
jeff, shemminger, kaber, sri
On Wed, 2007-08-08 at 15:22 -0700, David Miller wrote:
> The driver path, however, does not exist on an island and what
> we care about is the final result with the changes running
> inside the full system.
>
> So, to be honest, besides for initial internal development
> feedback, the isolated tests only have minimal merit and
> it's the full protocol tests that are really interesting.
But you can't go there if you can't show that the path which is supposedly
improved has indeed improved;-> I would certainly agree with you that if it
doesn't prove consistently useful with protocols it has no value
(remember, that's why I never submitted these patches all this time).
We just need better analysis of the results - I can't ignore that the
selection of the clock source, for example, gives me different results,
and that when I boot I can't be guaranteed the same clock source. I can't
ignore the fact that I get different results when I use a different
congestion control algorithm. And none of this has to do with the
batching patches.
I am using UDP at the moment because it is simpler to analyze. And yes,
it would be "an interesting idea that gets shelved" if we can't achieve
any of the expected goals. We've shelved ideas before. BTW, read the
little doc I wrote on dev->prep_xmit(); you may find it interesting.
cheers,
jamal
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 13:42 ` Herbert Xu
2007-08-08 15:14 ` jamal
@ 2007-08-09 3:19 ` Krishna Kumar2
1 sibling, 0 replies; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 3:19 UTC (permalink / raw)
To: Herbert Xu
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, mcarlson,
peter.p.waskiewicz.jr, hadi, kaber, netdev, general, mchan, tgraf,
sri, shemminger, David Miller, rdreier
Herbert Xu <herbert@gondor.apana.org.au> wrote on 08/08/2007 07:12:47 PM:
> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> >
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
>
> You could even lower the bar by disabling TSO and enabling
> software GSO.
I will try with E1000 (though I didn't see an improvement when I tested a long
time back). The difference I expect is that TSO would help with large
packets but not necessarily small/medium packets, and definitely not in
the case of multiple different skbs (as opposed to a single large skb)
getting queued. I think these are two different workloads.
> > I personally don't think it will help for that case at all as
> > TSO likely does better job of coalescing the work _and_ reducing
> > bus traffic as well as work in the TCP stack.
>
> I agree. I suspect the bulk of the effort is in getting
> these skb's created and processed by the stack so that by
> the time that they're exiting the qdisc there's not much
> to be saved anymore.
However, I am getting a large improvement for IPoIB specifically for this
same case. The reason: batching will help only when the queue gets full and
stopped (and, to a lesser extent, when the tx lock could not be taken, which
results in less batching being possible).
thanks,
- KK
^ permalink raw reply [flat|nested] 48+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 10:49 ` [ofa-general] " David Miller
2007-08-08 11:09 ` Krishna Kumar2
2007-08-08 13:42 ` Herbert Xu
@ 2007-08-14 9:02 ` Krishna Kumar2
2 siblings, 0 replies; 48+ messages in thread
From: Krishna Kumar2 @ 2007-08-14 9:02 UTC (permalink / raw)
To: David Miller
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
Hi Dave,
David Miller <davem@davemloft.net> wrote on 08/08/2007 04:19:00 PM:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Wed, 08 Aug 2007 15:01:14 +0530
>
> > RESULTS: The performance improvement for TCP No Delay is in the range of -8%
> > to 320% (with -8% being the sole negative), with many individual tests
> > giving 50% or more improvement (I think it is to do with the hw slots
> > getting full quicker resulting in more batching when the queue gets
> > woken). The results for TCP is in the range of -11% to 93%, with most
> > of the tests (8/12) giving improvements.
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
>
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.
I used E1000 (I guess the choice is OK, as e1000_tso returns TRUE; my
hw is an 82547GI).
You are right, it doesn't help the TSO case at all (in fact it degrades). Two
things to note though:
- E1000 may not be suitable for adding batching (which is no
longer a new API, as I have changed it already).
- Small skbs, where TSO doesn't come into the picture, still seem to
improve. A couple of cases with large skbs did show some
improvement (like 4K, TCP No Delay, 32 procs).
[Total segment retransmissions for the original-code test run: 2220, and for
the new-code test run: 1620. So the retransmission problem that I was
getting seems to be an IPoIB bug, though I did have to fix one bug
in my networking component, where I was calling qdisc_run(NULL) for the
regular xmit path, and change it to always use batching. The problem is
that skb1 - skb10 may be present in the queue after each of them
failed to be sent out; then net_tx_action fires, which batches all of
these onto the blist and tries to send them out again, which also
fails (e.g. tx lock failure or queue full); then the next single-skb xmit
will send the latest skb, ignoring the 10 skbs that are already waiting
in the batching list. These 10 skbs are sent out only the next time
net_tx_action is called, so out-of-order skbs result. This fix reduced
retransmissions from 180,000 to 55,000 or so. When I changed the IPoIB
driver to use iterative sends of each skb instead of creating multiple
Work Requests, that number went down to 15.]
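The shape of that ordering fix, as an illustrative sketch (using the skb_blist /
hard_start_xmit_batch names from this thread; this is not the actual patch):

/* Illustrative only: a fresh skb must never bypass skbs already parked on
 * the batch list by an earlier failed attempt, so the single-skb path also
 * appends to the list and then drains it in FIFO order. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int example_queue_xmit(struct net_device *dev, struct sk_buff *skb)
{
        __skb_queue_tail(dev->skb_blist, skb);          /* preserve ordering */
        return dev->hard_start_xmit_batch(dev);         /* drains oldest first */
}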
I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
run a longer one tonight). The results are (results in KB/s, and %):
Test Case Org BW New BW % Change
TCP
--------
Size:32 Procs:1 1848 3918 112.01
Size:32 Procs:8 21888 21555 -1.52
Size:32 Procs:32 19317 22433 16.13
Size:256 Procs:1 15584 25991 66.78
Size:256 Procs:8 110937 74565 -32.78
Size:256 Procs:32 105767 98967 -6.42
Size:4096 Procs:1 81910 96073 17.29
Size:4096 Procs:8 113302 94040 -17.00
Size:4096 Procs:32 109664 105522 -3.77
TCP No Delay:
--------------
Size:32 Procs:1 2688 3177 18.19
Size:32 Procs:8 6568 10588 61.20
Size:32 Procs:32 6573 7838 19.24
Size:256 Procs:1 7869 12724 61.69
Size:256 Procs:8 65652 45652 -30.46
Size:256 Procs:32 95114 112279 18.04
Size:4096 Procs:1 95302 84664 -11.16
Size:4096 Procs:8 111119 89111 -19.80
Size:4096 Procs:32 109249 113919 4.27
I will submit Rev4 with suggested changes (including single merged
API) on Thursday after some more testing.
Thanks,
- KK
^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2007-09-14 17:25 UTC | newest]
Thread overview: 48+ messages
2007-08-17 6:06 [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar2
2007-08-21 7:18 ` David Miller
2007-08-21 12:30 ` [ofa-general] " jamal
2007-08-21 18:51 ` David Miller
2007-08-21 21:09 ` jamal
2007-08-21 22:50 ` David Miller
2007-08-22 4:11 ` [ofa-general] " Krishna Kumar2
2007-08-22 4:22 ` David Miller
2007-08-22 7:03 ` Krishna Kumar2
2007-08-22 9:14 ` David Miller
2007-08-23 2:43 ` Krishna Kumar2
2007-08-22 17:09 ` [ofa-general] " Rick Jones
2007-08-22 20:21 ` David Miller
2007-08-23 22:04 ` [ofa-general] " jamal
2007-08-23 22:25 ` jamal
2007-08-23 22:35 ` [ofa-general] " Rick Jones
2007-08-23 22:41 ` jamal
2007-08-24 3:18 ` Bill Fink
2007-08-24 12:14 ` jamal
2007-08-24 18:08 ` Bill Fink
2007-08-24 21:25 ` David Miller
2007-08-24 23:11 ` Herbert Xu
2007-08-25 23:45 ` Bill Fink
2007-08-24 18:46 ` [ofa-general] " Rick Jones
2007-08-25 0:42 ` John Heffner
2007-08-26 8:41 ` [ofa-general] " Bill Fink
2007-08-27 1:32 ` John Heffner
2007-08-27 2:04 ` David Miller
2007-08-27 23:23 ` jamal
2007-09-14 7:20 ` [ofa-general] " Bill Fink
2007-09-14 13:44 ` TSO, TCP Cong control etc jamal
2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
2007-08-23 22:30 ` David Miller
2007-08-23 22:38 ` [ofa-general] " jamal
2007-08-24 3:34 ` Stephen Hemminger
2007-08-24 12:36 ` jamal
2007-08-24 16:25 ` Rick Jones
-- strict thread matches above, loose matches on Subject: below --
2007-08-08 9:31 [ofa-general] " Krishna Kumar
2007-08-08 10:49 ` [ofa-general] " David Miller
2007-08-08 11:09 ` Krishna Kumar2
2007-08-08 22:01 ` [ofa-general] " David Miller
2007-08-09 4:19 ` Krishna Kumar2
2007-08-08 13:42 ` Herbert Xu
2007-08-08 15:14 ` jamal
2007-08-08 20:55 ` Stephen Hemminger
2007-08-08 22:40 ` jamal
2007-08-08 22:22 ` David Miller
2007-08-08 22:53 ` jamal
2007-08-09 3:19 ` Krishna Kumar2
2007-08-14 9:02 ` Krishna Kumar2