* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-17 6:06 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-17 6:06 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
Hi Dave,
> I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
> run a longer one tonight). The results are (results in KB/s, and %):
I ran an 8.5-hour run with no batching plus another 8.5-hour run with
batching (Buffer sizes: "32 128 512 4096 16384", Threads: "1 8 32",
Each test run time: 3 minutes, Iterations to average: 5). TCP seems
to get a small improvement.
Thanks,
- KK
-------------------------------------------------------------------------------
TCP (KB/s)
-----------
                        No batching    Batching    % Change
Size:32 Procs:1 3415 3321 -2.75
Size:128 Procs:1 13094 13388 2.24
Size:512 Procs:1 49037 50683 3.35
Size:4096 Procs:1 114646 114619 -.02
Size:16384 Procs:1 114626 114644 .01
Size:32 Procs:8 22675 22633 -.18
Size:128 Procs:8 77994 77297 -.89
Size:512 Procs:8 114716 114711 0
Size:4096 Procs:8 114637 114636 0
Size:16384 Procs:8 95814 114638 19.64
Size:32 Procs:32 23240 23349 .46
Size:128 Procs:32 82284 82247 -.04
Size:512 Procs:32 114885 114769 -.10
Size:4096 Procs:32 95735 114634 19.74
Size:16384 Procs:32 114736 114641 -.08
Average: 1151534 1190210 3.36%
-------------------------------------------------------------------------------
No Delay (KB/s):
---------
                        No batching    Batching    % Change
Size:32 Procs:1 3002 2873 -4.29
Size:128 Procs:1 11853 11801 -.43
Size:512 Procs:1 45565 45837 .59
Size:4096 Procs:1 114511 114485 -.02
Size:16384 Procs:1 114521 114555 .02
Size:32 Procs:8 8026 8029 .03
Size:128 Procs:8 31589 31573 -.05
Size:512 Procs:8 111506 105766 -5.14
Size:4096 Procs:8 114455 114454 0
Size:16384 Procs:8 95833 114491 19.46
Size:32 Procs:32 8005 8027 .27
Size:128 Procs:32 31475 31505 .09
Size:512 Procs:32 114558 113687 -.76
Size:4096 Procs:32 114784 114447 -.29
Size:16384 Procs:32 114719 114496 -.19
Average: 1046026 1034402 -1.11%
-------------------------------------------------------------------------------
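For reference, a test matrix like the one above can be driven with a small
shell loop around netperf. The sketch below is only illustrative: the
receiver host name and the exact harness used for these numbers are
assumptions, not taken from this thread (add the test-specific -D option
after the "--" for the "No Delay" runs):

    #!/bin/sh
    # Hypothetical sweep: buffer sizes x thread counts, 3-minute TCP_STREAM
    # runs, 5 iterations each, against a receiver host named "receiver".
    for size in 32 128 512 4096 16384; do
        for procs in 1 8 32; do
            for iter in 1 2 3 4 5; do
                for i in $(seq $procs); do
                    netperf -H receiver -l 180 -t TCP_STREAM -- -m $size &
                done
                wait
            done
        done
    done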
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 7:18 David Miller
From: David Miller @ 2007-08-21 7:18 UTC (permalink / raw)
To: krkumar2
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 17 Aug 2007 11:36:03 +0530

> > I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
> > run a longer one tonight). The results are (results in KB/s, and %):
>
> I ran an 8.5-hour run with no batching plus another 8.5-hour run with
> batching (Buffer sizes: "32 128 512 4096 16384", Threads: "1 8 32",
> Each test run time: 3 minutes, Iterations to average: 5). TCP seems
> to get a small improvement.

Using 16K buffer size really isn't going to keep the pipe full enough
for TSO.  And realistically, applications queue much more data at a
time.

Also, smaller buffer sizes can have negative effects on the dynamic
receive and send buffer growth algorithm the kernel uses; it might
consider the connection application-limited for too long.

I would really prefer to see numbers that use buffer sizes more in line
with the amount of data that is typically in flight on a 1G connection
on a local network.

Do a tcpdump during the height of the transfer to see about what this
value is.  When an ACK comes in, compare the sequence number it's
ACK'ing with the sequence number of the most recently sent frame.  The
difference is approximately the pipe size at maximum congestion window,
assuming a loss-free local network.
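The pipe-size estimate David describes can be read off a capture; the
commands below are only a sketch, with the interface name, port, and file
name as assumptions:

    # Capture TCP headers on the sender during the middle of the transfer
    # (eth2 and port 5001 are placeholders):
    tcpdump -i eth2 -s 96 -w flow.pcap 'tcp port 5001'
    # Read the capture back; for each incoming ACK, subtract its ack number
    # from the highest sequence number sent so far -- the difference
    # approximates the data in flight (pipe size) on a loss-free LAN:
    tcpdump -nr flow.pcap | less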
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 12:30 jamal
From: jamal @ 2007-08-21 12:30 UTC (permalink / raw)
To: David Miller
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan, tgraf,
johnpol, shemminger, kaber, herbert

On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:

> Using 16K buffer size really isn't going to keep the pipe full enough
> for TSO.

Why the comparison with TSO (or GSO for that matter)?
Seems to me that is only valid/fair if you have a single flow.
Batching is multi-flow focused (or i should say flow-unaware).

cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 18:51 David Miller
From: David Miller @ 2007-08-21 18:51 UTC (permalink / raw)
To: hadi
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan, tgraf,
johnpol, shemminger, kaber, herbert

From: jamal <hadi@cyberus.ca>
Date: Tue, 21 Aug 2007 08:30:22 -0400

> On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
>
> > Using 16K buffer size really isn't going to keep the pipe full enough
> > for TSO.
>
> Why the comparison with TSO (or GSO for that matter)?

Because TSO does batching already, so it's a very good "tit for tat"
comparison of the new batching scheme vs. an existing one.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 21:09 jamal
From: jamal @ 2007-08-21 21:09 UTC (permalink / raw)
To: David Miller
Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan, tgraf,
johnpol, shemminger, kaber, herbert

On Tue, 2007-21-08 at 11:51 -0700, David Miller wrote:

> Because TSO does batching already, so it's a very good "tit for tat"
> comparison of the new batching scheme vs. an existing one.

Fair enough - I may have read too much into your email then ;->
For bulk type of apps (where TSO will make a difference) this is a fair
test.  Hence I agree the 16KB buffer size is not sensible if the goal
is to simulate such an app.

However (and this is where I read too much into what you were saying),
the test by itself is an insufficient comparison.  You gotta look at
the other side of the coin, i.e. at apps where TSO won't buy much.
Examples: a busy ssh or irc server; and you could go as far as looking
at the most predominant app on the wild west, http (average page size
from a few years back was in the range of 10-20K and can be simulated
with good ole netperf/iperf).

cheers,
jamal
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-21 22:50 David Miller
From: David Miller @ 2007-08-21 22:50 UTC (permalink / raw)
To: hadi
Cc: krkumar2, gaagaan, general, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

From: jamal <hadi@cyberus.ca>
Date: Tue, 21 Aug 2007 17:09:12 -0400

> Examples: a busy ssh or irc server; and you could go as far as
> looking at the most predominant app on the wild west, http (average
> page size from a few years back was in the range of 10-20K and can
> be simulated with good ole netperf/iperf).

Pages have chunked up considerably in recent years.

Just bringing up a myspace page can give you megabytes of video,
images, flash, and other stuff.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 4:11 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-22 4:11 UTC (permalink / raw)
To: David Miller
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan, tgraf,
sri, shemminger, kaber, herbert

David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:

> From: jamal <hadi@cyberus.ca>
> Date: Tue, 21 Aug 2007 08:30:22 -0400
>
> > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> >
> > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > for TSO.
> >
> > Why the comparison with TSO (or GSO for that matter)?
>
> Because TSO does batching already, so it's a very good
> "tit for tat" comparison of the new batching scheme
> vs. an existing one.

I am planning to do more testing on your suggestion over the
weekend, but I had a comment.  Are you saying that TSO and
batching should be mutually exclusive so hardware that doesn't
support TSO (like IB) only would benefit?

But even if they can co-exist, aren't cases like sending
multiple small skbs better handled with batching?

Thanks,

- KK
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 4:22 David Miller
From: David Miller @ 2007-08-22 4:22 UTC (permalink / raw)
To: krkumar2
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan, tgraf,
sri, shemminger, kaber, herbert

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 22 Aug 2007 09:41:52 +0530

> David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:
>
> > From: jamal <hadi@cyberus.ca>
> > Date: Tue, 21 Aug 2007 08:30:22 -0400
> >
> > > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> > >
> > > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > > for TSO.
> > >
> > > Why the comparison with TSO (or GSO for that matter)?
> >
> > Because TSO does batching already, so it's a very good
> > "tit for tat" comparison of the new batching scheme
> > vs. an existing one.
>
> I am planning to do more testing on your suggestion over the
> weekend, but I had a comment.  Are you saying that TSO and
> batching should be mutually exclusive so hardware that doesn't
> support TSO (like IB) only would benefit?
>
> But even if they can co-exist, aren't cases like sending
> multiple small skbs better handled with batching?

I'm not making any suggestions, so don't read that into anything I've
said :-)

I think the jury is still out, but seeing TSO perform even slightly
worse with the batching changes in place would be very worrisome.
This applies to both throughput and cpu utilization.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 7:03 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-22 7:03 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

Hi Dave,

David Miller <davem@davemloft.net> wrote on 08/22/2007 09:52:29 AM:

> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 22 Aug 2007 09:41:52 +0530
>
<snip>
> > > Because TSO does batching already, so it's a very good
> > > "tit for tat" comparison of the new batching scheme
> > > vs. an existing one.
> >
> > I am planning to do more testing on your suggestion over the
> > weekend, but I had a comment.  Are you saying that TSO and
> > batching should be mutually exclusive so hardware that doesn't
> > support TSO (like IB) only would benefit?
> >
> > But even if they can co-exist, aren't cases like sending
> > multiple small skbs better handled with batching?
>
> I'm not making any suggestions, so don't read that into anything I've
> said :-)
>
> I think the jury is still out, but seeing TSO perform even slightly
> worse with the batching changes in place would be very worrisome.
> This applies to both throughput and cpu utilization.

Does turning off batching solve that problem?  What I mean by that is:
batching can be disabled if a TSO device is worse for some cases.

In fact, something that I had changed in my latest code is to not
enable batching in register_netdevice (in Rev4, which I am sending in
a few mins); rather, the user has to explicitly turn 'on' batching.
Wondering if that is what you are concerned about.

In any case, I will test your case on Monday (I am on vacation for the
next couple of days).

Thanks,

- KK
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 9:14 David Miller
From: David Miller @ 2007-08-22 9:14 UTC (permalink / raw)
To: krkumar2
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 22 Aug 2007 12:33:04 +0530

> Does turning off batching solve that problem?  What I mean by that is:
> batching can be disabled if a TSO device is worse for some cases.

This new batching stuff isn't going to be enabled or disabled on a
per-device basis just to get "parity" with how things are now.

It should be enabled by default, and give at least as good performance
as what can be obtained right now.

Otherwise it's a clear regression.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 2:43 Krishna Kumar2
From: Krishna Kumar2 @ 2007-08-23 2:43 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

David Miller <davem@davemloft.net> wrote on 08/22/2007 02:44:40 PM:

> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 22 Aug 2007 12:33:04 +0530
>
> > Does turning off batching solve that problem?  What I mean by that is:
> > batching can be disabled if a TSO device is worse for some cases.
>
> This new batching stuff isn't going to be enabled or disabled
> on a per-device basis just to get "parity" with how things are
> now.
>
> It should be enabled by default, and give at least as good
> performance as what can be obtained right now.

That was how it was in earlier revisions.  In revision 4 I coded it so
that it is enabled only if explicitly set by the user.  I can revert
that change.

> Otherwise it's a clear regression.

Definitely.  For drivers that support it, it should not reduce
performance.

Thanks,

- KK
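For context, per-device offload features are already toggled with ethtool,
which is the sort of interface a user-controlled batching knob could
resemble; the tso/gso flags below are real, while the "batching" flag is
purely hypothetical and not an existing ethtool option:

    ethtool -K eth2 tso off      # existing offload toggles, as used later in this thread
    ethtool -K eth2 gso on
    ethtool -K eth2 batching on  # hypothetical: how a per-device batching switch might look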
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 17:09 Rick Jones
From: Rick Jones @ 2007-08-22 17:09 UTC (permalink / raw)
To: David Miller
Cc: jagana, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, kaber, sri

David Miller wrote:

> I think the jury is still out, but seeing TSO perform even slightly
> worse with the batching changes in place would be very worrisome.
> This applies to both throughput and cpu utilization.

Should it be any more or less worrisome than small packet performance
(eg the TCP_RR stuff I posted recently) being rather worse with TSO
enabled than with it disabled?

rick jones
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-22 20:21 David Miller
From: David Miller @ 2007-08-22 20:21 UTC (permalink / raw)
To: rick.jones2
Cc: krkumar2, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma

From: Rick Jones <rick.jones2@hp.com>
Date: Wed, 22 Aug 2007 10:09:37 -0700

> Should it be any more or less worrisome than small packet
> performance (eg the TCP_RR stuff I posted recently) being rather
> worse with TSO enabled than with it disabled?

That, like any such thing shown by the batching changes, is a bug to
fix.
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:04 jamal
From: jamal @ 2007-08-23 22:04 UTC (permalink / raw)
To: David Miller
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, netdev,
johnpol, shemminger, kaber, sri

On Wed, 2007-22-08 at 13:21 -0700, David Miller wrote:

> From: Rick Jones <rick.jones2@hp.com>
> Date: Wed, 22 Aug 2007 10:09:37 -0700
>
> > Should it be any more or less worrisome than small packet
> > performance (eg the TCP_RR stuff I posted recently) being rather
> > worse with TSO enabled than with it disabled?
>
> That, like any such thing shown by the batching changes, is a bug
> to fix.

Possibly a bug - but you really should turn off TSO if you are doing
huge interactive transactions (which is fair because there is a clear
demarcation).

The litmus test is the same as for any change that is supposed to
improve net performance - it has to demonstrate that it is not
intrusive and that it improves (consistently) performance.  The
standard metrics are {throughput, cpu-utilization, latency}, i.e. as
long as one improves and the others don't regress, it would make
sense.  Yes, I am religious about batching after all the invested
sweat (and I continue to work on it hoping to demystify) - the theory
makes a lot of sense.

cheers,
jamal
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:25 jamal
From: jamal @ 2007-08-23 22:25 UTC (permalink / raw)
To: David Miller
Cc: rick.jones2, krkumar2, gaagaan, general, herbert, jagana, jeff,
johnpol, kaber, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma

On Thu, 2007-23-08 at 18:04 -0400, jamal wrote:

> The litmus test is the same as for any change that is supposed to
> improve net performance - it has to demonstrate that it is not
> intrusive and that it improves (consistently) performance.  The
> standard metrics are {throughput, cpu-utilization, latency}, i.e. as
> long as one improves and the others don't regress, it would make
> sense.  Yes, I am religious about batching after all the invested
> sweat (and I continue to work on it hoping to demystify) - the theory
> makes a lot of sense.

Before someone jumps and strangles me ;->  By "litmus test" I meant as
applied to batching.  [TSO already passed - iirc, it has been
demonstrated to really not add much to throughput (can't improve much
over closeness to wire speed) but to improve CPU utilization].

cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:35 Rick Jones
From: Rick Jones @ 2007-08-23 22:35 UTC (permalink / raw)
To: hadi
Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

jamal wrote:

> [TSO already passed - iirc, it has been demonstrated to really not
> add much to throughput (can't improve much over closeness to wire
> speed) but to improve CPU utilization].

In the one gig space sure, but in the 10 Gig space, TSO on/off does
make a difference for throughput.

rick jones
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-23 22:41 jamal
From: jamal @ 2007-08-23 22:41 UTC (permalink / raw)
To: Rick Jones
Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

On Thu, 2007-23-08 at 15:35 -0700, Rick Jones wrote:

> jamal wrote:
> > [TSO already passed - iirc, it has been demonstrated to really not
> > add much to throughput (can't improve much over closeness to wire
> > speed) but to improve CPU utilization].
>
> In the one gig space sure, but in the 10 Gig space, TSO on/off does
> make a difference for throughput.

I am still so 1GigE ;->  I stand corrected again ;->

cheers,
jamal
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 3:18 Bill Fink
From: Bill Fink @ 2007-08-24 3:18 UTC (permalink / raw)
To: Rick Jones
Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, David Miller, sri

On Thu, 23 Aug 2007, Rick Jones wrote:

> jamal wrote:
> > [TSO already passed - iirc, it has been demonstrated to really not
> > add much to throughput (can't improve much over closeness to wire
> > speed) but to improve CPU utilization].
>
> In the one gig space sure, but in the 10 Gig space, TSO on/off does
> make a difference for throughput.

Not too much.

TSO enabled:

[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

TSO disabled:

[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

Pretty negligible difference it seems.  This is with a 2.6.20.7 kernel,
Myricom 10-GigE NICs, and 9000 byte jumbo frames, in a LAN environment.

For grins, I also did a couple of tests with an MSS of 1460 to emulate
a standard 1500 byte Ethernet MTU.

TSO enabled:

[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5102.8503 MB / 10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled:

[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5399.5625 MB / 10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

Here you can see there is a major difference in the TX CPU utilization
(99 % with TSO disabled versus only 39 % with TSO enabled), although
the TSO disabled case was able to squeeze out a little extra
performance from its extra CPU utilization.  Interestingly, with TSO
enabled, the receiver actually consumed more CPU than with TSO
disabled, so I guess the receiver CPU saturation in that case (99 %)
was what restricted its performance somewhat (this was consistent
across a few test runs).

						-Bill
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 12:14 jamal
From: jamal @ 2007-08-24 12:14 UTC (permalink / raw)
To: Bill Fink
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..]
> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra
> performance from its extra CPU utilization.

Good stuff.  What kind of machine?  SMP?

Seems the receive side of the sender is also consuming a lot more cpu;
I suspect because the receiver is generating a lot more ACKs with TSO.

Does the choice of the TCP congestion control algorithm affect results?
It would be interesting to see both MTUs with either TCP BIC vs good
old Reno on the sender (probably without changing what the receiver
does).  BIC seems to be the default lately.

> Interestingly, with TSO enabled, the receiver actually consumed more
> CPU than with TSO disabled,

I would suspect the fact that a lot more packets make it into the
receiver with TSO contributes.

> so I guess the receiver CPU saturation in that case (99 %) was what
> restricted its performance somewhat (this was consistent across a few
> test runs).

Unfortunately the receiver plays a big role in such tests - if it is
bottlenecked then you are not really testing the limits of the
transmitter.

cheers,
jamal
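For anyone repeating the congestion-control comparison jamal asks about,
the algorithm is selected per host through sysctl; a small sketch,
assuming the cubic, bic, and reno modules are available, as on most 2.6
kernels of that era:

    sysctl net.ipv4.tcp_available_congestion_control   # list what the kernel offers
    sysctl -w net.ipv4.tcp_congestion_control=bic      # switch the sender before a run
    sysctl -w net.ipv4.tcp_congestion_control=reno
    sysctl -w net.ipv4.tcp_congestion_control=cubic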
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 18:08 Bill Fink
From: Bill Fink @ 2007-08-24 18:08 UTC (permalink / raw)
To: hadi
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
johnpol, shemminger, David Miller, sri

On Fri, 24 Aug 2007, jamal wrote:

> On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
>
> [..]
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra
> > performance from its extra CPU utilization.
>
> Good stuff.  What kind of machine?  SMP?

Tyan Thunder K8WE S2895ANRF motherboard with Nvidia nForce Professional
2200+2050 chipset, 2 AMD Opteron 254 2.8 GHz CPUs, 4 GB PC3200 ECC
REG-DDR 400 memory, and 2 PCI-Express x16 slots (2 buses).

It is SMP, but both the NIC interrupts and nuttcp are bound to CPU 0,
and all other non-kernel system processes are bound to CPU 1.

> Seems the receive side of the sender is also consuming a lot more cpu;
> I suspect because the receiver is generating a lot more ACKs with TSO.

Odd.  I just reran the TCP CUBIC "-M1460" tests, and with TSO enabled
on the transmitter, there were about 153709 eth2 interrupts on the
receiver, while with TSO disabled there was actually a somewhat higher
number (164988) of receiver side eth2 interrupts, although the receive
side CPU utilization was actually lower in that case.

On the transmit side (different test run), the TSO enabled case had
about 161773 eth2 interrupts whereas the TSO disabled case had about
165179 eth2 interrupts.

> Does the choice of the TCP congestion control algorithm affect results?
> It would be interesting to see both MTUs with either TCP BIC vs good
> old Reno on the sender (probably without changing what the receiver
> does).  BIC seems to be the default lately.

These tests were with the default TCP CUBIC (with initial_ssthresh set
to 0).

With TCP BIC (and initial_ssthresh set to 0):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11751.3750 MB / 10.00 sec = 9853.9839 Mbps 100 %TX 83 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4999.3321 MB / 10.06 sec = 4167.7872 Mbps 38 %TX 100 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB / 10.00 sec = 9910.0682 Mbps 99 %TX 81 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5502.6250 MB / 10.00 sec = 4614.3297 Mbps 100 %TX 84 %RX

And with TCP Reno:

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11782.6250 MB / 10.00 sec = 9880.2613 Mbps 100 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5024.6649 MB / 10.06 sec = 4191.6574 Mbps 38 %TX 99 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0860 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5284.0000 MB / 10.00 sec = 4430.9604 Mbps 99 %TX 79 %RX

Very similar results to the original TCP CUBIC tests.

> > Interestingly, with TSO enabled, the receiver actually consumed more
> > CPU than with TSO disabled,
>
> I would suspect the fact that a lot more packets make it into the
> receiver with TSO contributes.
>
> > so I guess the receiver CPU saturation in that case (99 %) was what
> > restricted its performance somewhat (this was consistent across a
> > few test runs).
>
> Unfortunately the receiver plays a big role in such tests - if it is
> bottlenecked then you are not really testing the limits of the
> transmitter.

It might be interesting to see what effect the LRO changes would have
on this.  Once they are in a stable released kernel, I might try that
out, or maybe even before if I get some spare time (but that's in very
short supply right now).

						-Thanks

						-Bill
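The CPU binding Bill describes is normally done with the IRQ affinity
masks in /proc plus taskset; a rough sketch, where the IRQ number and
interface name are placeholders for illustration:

    grep eth2 /proc/interrupts                # find the NIC's IRQ number
    echo 1 > /proc/irq/1270/smp_affinity      # pin that IRQ to CPU0 (1270 is a placeholder)
    taskset -c 0 nuttcp -w10m 192.168.88.16   # run the benchmark pinned to CPU0 as well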
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 21:25 David Miller
From: David Miller @ 2007-08-24 21:25 UTC (permalink / raw)
To: hadi
Cc: jagana, billfink, peter.p.waskiewicz.jr, herbert, gaagaan,
Robert.Olsson, netdev, rdreier, mcarlson, jeff, general, mchan,
tgraf, johnpol, shemminger, kaber, sri

From: jamal <hadi@cyberus.ca>
Date: Fri, 24 Aug 2007 08:14:16 -0400

> Seems the receive side of the sender is also consuming a lot more cpu;
> I suspect because the receiver is generating a lot more ACKs with TSO.

I've seen this behavior before on a low cpu powered receiver, and the
issue is that batching too much actually hurts a receiver.

If the data packets were better spaced out, the receiver would handle
the load better.

This is the thing the TOE guys keep talking about overcoming with
their packet pacing algorithms in their on-card TOE stack.

My hunch is that even if in the non-TSO case the TX packets were all
back to back in the card's TX ring, TSO still spits them out faster on
the wire.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 23:11 Herbert Xu
From: Herbert Xu @ 2007-08-24 23:11 UTC (permalink / raw)
To: David Miller
Cc: hadi, billfink, rick.jones2, krkumar2, gaagaan, general, jagana,
jeff, johnpol, kaber, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, Robert.Olsson, shemminger, sri, tgraf, xma

On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:
>
> My hunch is that even if in the non-TSO case the TX packets were all
> back to back in the card's TX ring, TSO still spits them out faster on
> the wire.

If this is the case then we should see an improvement by disabling TSO
and enabling GSO.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
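The combination Herbert suggests is set per interface with ethtool,
assuming a kernel and ethtool new enough to expose the GSO flag (eth2 is
taken from Bill's setup):

    ethtool -K eth2 tso off   # disable hardware TCP segmentation offload
    ethtool -K eth2 gso on    # enable software generic segmentation offload
    ethtool -k eth2           # verify the resulting offload settings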
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-25 23:45 Bill Fink
From: Bill Fink @ 2007-08-25 23:45 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, hadi, rick.jones2, krkumar2, gaagaan, general,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

On Sat, 25 Aug 2007, Herbert Xu wrote:

> On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:
> >
> > My hunch is that even if in the non-TSO case the TX packets were all
> > back to back in the card's TX ring, TSO still spits them out faster
> > on the wire.
>
> If this is the case then we should see an improvement by
> disabling TSO and enabling GSO.

TSO disabled and GSO enabled:

[root@lang2 redhat]# nuttcp -w10m 192.168.88.16
11806.7500 MB / 10.00 sec = 9900.6278 Mbps 100 %TX 84 %RX

[root@lang2 redhat]# nuttcp -M1460 -w10m 192.168.88.16
4872.0625 MB / 10.00 sec = 4085.5690 Mbps 100 %TX 64 %RX

In the "-M1460" case, there was generally less receiver CPU
utilization, but the transmitter utilization was generally pegged at
100 %, even though there wasn't any improvement in throughput compared
to the TSO enabled case (in fact the throughput generally seemed to be
somewhat less than the TSO enabled case).  Note there was a fair degree
of variability across runs for the receiver CPU utilization (the one
shown I considered to be representative of the average behavior).

Repeat of previous test results:

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5102.8503 MB / 10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5399.5625 MB / 10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

						-Bill
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-24 18:46 Rick Jones
From: Rick Jones @ 2007-08-24 18:46 UTC (permalink / raw)
To: Bill Fink
Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
netdev, johnpol, shemminger, David Miller, sri

Bill Fink wrote:
> On Thu, 23 Aug 2007, Rick Jones wrote:
>
>> jamal wrote:
>>> [TSO already passed - iirc, it has been demonstrated to really not
>>> add much to throughput (can't improve much over closeness to wire
>>> speed) but to improve CPU utilization].
>>
>> In the one gig space sure, but in the 10 Gig space, TSO on/off does
>> make a difference for throughput.
>
> Not too much.
>
> TSO enabled:
>
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
>
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX
>
> TSO disabled:
>
> [root@lang2 ~]# ethtool -K eth2 tso off
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: off
>
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX
>
> Pretty negligible difference it seems.

Leaves one wondering how often more than one segment was sent to the
card in the 9000 byte case :)

rick jones
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-25 0:42 John Heffner
From: John Heffner @ 2007-08-25 0:42 UTC (permalink / raw)
To: Bill Fink
Cc: Rick Jones, hadi, David Miller, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

Bill Fink wrote:

> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra
> performance from its extra CPU utilization.  Interestingly, with TSO
> enabled, the receiver actually consumed more CPU than with TSO
> disabled, so I guess the receiver CPU saturation in that case (99 %)
> was what restricted its performance somewhat (this was consistent
> across a few test runs).

One possibility is that I think the receive-side processing tends to do
better when receiving into an empty queue.  When the (non-TSO) sender
is the flow's bottleneck, this is going to be the case.  But when you
switch to TSO, the receiver becomes the bottleneck and you're always
going to have to put the packets at the back of the receive queue.
This might help account for the reason why you have both lower
throughput and higher CPU utilization -- there's a point of instability
right where the receiver becomes the bottleneck and you end up pushing
it over to the bad side.  :)

Just a theory.  I'm honestly surprised this effect would be so
significant.  What do the numbers from netstat -s look like in the two
cases?

  -John
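A before/after counter comparison like the one Bill posts next is easy to
script; this is just one way to do it -- the "beforeafter" utility is
assumed to be installed (a plain diff works as a cruder fallback):

    netstat -s > netstat.before
    nuttcp -M1460 -w10m 192.168.88.16
    netstat -s > netstat.after
    beforeafter netstat.before netstat.after   # field-by-field delta, if available
    diff netstat.before netstat.after          # or just eyeball the changed lines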
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-26 8:41 Bill Fink
From: Bill Fink @ 2007-08-26 8:41 UTC (permalink / raw)
To: John Heffner
Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
mcarlson, rdreier, hadi, kaber, jeff, general, mchan, tgraf, netdev,
johnpol, shemminger, David Miller, sri

On Fri, 24 Aug 2007, John Heffner wrote:

> Bill Fink wrote:
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra
> > performance from its extra CPU utilization.  Interestingly, with TSO
> > enabled, the receiver actually consumed more CPU than with TSO
> > disabled, so I guess the receiver CPU saturation in that case (99 %)
> > was what restricted its performance somewhat (this was consistent
> > across a few test runs).
>
> One possibility is that I think the receive-side processing tends to do
> better when receiving into an empty queue.  When the (non-TSO) sender
> is the flow's bottleneck, this is going to be the case.  But when you
> switch to TSO, the receiver becomes the bottleneck and you're always
> going to have to put the packets at the back of the receive queue.
> This might help account for the reason why you have both lower
> throughput and higher CPU utilization -- there's a point of instability
> right where the receiver becomes the bottleneck and you end up pushing
> it over to the bad side.  :)
>
> Just a theory.  I'm honestly surprised this effect would be so
> significant.  What do the numbers from netstat -s look like in the two
> cases?

Well, I was going to check this out, but I happened to reboot the
system and now I get somewhat different results.  Here are the new
results, which should hopefully be more accurate since they are on a
freshly booted system.

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11610.6875 MB / 10.00 sec = 9735.9526 Mbps 100 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5029.6875 MB / 10.06 sec = 4194.6931 Mbps 36 %TX 100 %RX

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.9375 MB / 10.00 sec = 9909.7773 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5823.3125 MB / 10.00 sec = 4883.2429 Mbps 100 %TX 82 %RX

The TSO disabled case got a little better performance even for 9000
byte jumbo frames.  For the "-M1460" case emulating a standard 1500
byte Ethernet MTU, the performance was significantly better and used
less CPU on the receiver (82 % versus 100 %), although it did use
significantly more CPU on the transmitter (100 % versus 36 %).

TSO disabled and GSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11609.5625 MB / 10.00 sec = 9734.9859 Mbps 99 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5001.4375 MB / 10.06 sec = 4170.6739 Mbps 52 %TX 100 %RX

The GSO enabled case is very similar to the TSO enabled case, except
that for the "-M1460" test the transmitter used more CPU (52 % versus
36 %), which is to be expected since TSO has hardware assist.

Here's the beforeafter delta of the receiver's "netstat -s" statistics
for the TSO enabled case:

Ip:
    3659898 total packets received
    3659898 incoming packets delivered
    80050 requests sent out
Tcp:
    2 passive connection openings
    3659897 segments received
    80050 segments send out
TcpExt:
    33 packets directly queued to recvmsg prequeue.
    104956 packets directly received from backlog
    705528 packets directly received from prequeue
    3654842 packets header predicted
    193 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
    4107083 total packets received
    4107083 incoming packets delivered
    1401376 requests sent out
Tcp:
    2 passive connection openings
    4107083 segments received
    1401376 segments send out
TcpExt:
    2 TCP sockets finished time wait in fast timer
    48486 packets directly queued to recvmsg prequeue.
    1056111048 packets directly received from backlog
    2273357712 packets directly received from prequeue
    1819317 packets header predicted
    2287497 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    10 predicted acknowledgments

For the TSO disabled case, there are far more TCP segments sent out
(1401376 versus 80050), which I assume are ACKs, and which could
possibly contribute to the higher throughput for the TSO disabled case
due to faster feedback, but not explain the lower CPU utilization.
There are many more packets directly queued to recvmsg prequeue
(48486 versus 33).  The numbers for packets directly received from
backlog and prequeue in the TSO disabled case seem bogus to me, so
I don't know how to interpret that.  There are only about half as
many packets header predicted (1819317 versus 3654842), but there
are many more packets header predicted and directly queued to user
(2287497 versus 193).  I'll leave the analysis of all this to those
who might actually know what it all means.

I also ran another set of tests that may be of interest.  I changed
the rx-usecs/tx-usecs interrupt coalescing parameter from the
recommended optimum value of 75 usecs to 0 (no coalescing), but only
on the transmitter.  The comparison discussions below are relative to
the previous tests where rx-usecs/tx-usecs were set to 75 usecs.

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11812.8125 MB / 10.00 sec = 9905.6640 Mbps 100 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
7701.8750 MB / 10.00 sec = 6458.5541 Mbps 100 %TX 56 %RX

For 9000 byte jumbo frames it now gets a little better performance and
almost matches the 10-GigE line rate performance of the TSO disabled
case.  For the "-M1460" test, it gets substantially better performance
(6458.5541 Mbps versus 4194.6931 Mbps) at the expense of much higher
transmitter CPU utilization (100 % versus 36 %), although the receiver
CPU utilization is much less (56 % versus 100 %).

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.3125 MB / 10.00 sec = 9909.4058 Mbps 100 %TX 76 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4081.2500 MB / 10.00 sec = 3422.3994 Mbps 99 %TX 41 %RX

For 9000 byte jumbo frames the results are essentially the same.  For
the "-M1460" test, the performance is significantly worse (3422.3994
Mbps versus 4883.2429 Mbps) even though the transmitter CPU utilization
is saturated in both cases, but the receiver CPU utilization is about
half (41 % versus 82 %).

TSO disabled and GSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.3750 MB / 10.00 sec = 9906.1090 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
3939.1875 MB / 10.00 sec = 3303.2814 Mbps 100 %TX 41 %RX

For 9000 byte jumbo frames the performance is a little better, again
approaching 10-GigE line rate.  But for the "-M1460" test, the
performance is significantly worse (3303.2814 Mbps versus 4170.6739
Mbps) even though the transmitter consumes much more CPU (100 % versus
52 %).  In this case, though, the receiver has a much lower CPU
utilization (41 % versus 100 %).

						-Bill
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-27 1:32 John Heffner
From: John Heffner @ 2007-08-27 1:32 UTC (permalink / raw)
To: Bill Fink
Cc: Rick Jones, hadi, David Miller, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

Bill Fink wrote:
> Here's the beforeafter delta of the receiver's "netstat -s"
> statistics for the TSO enabled case:
>
> Ip:
>     3659898 total packets received
>     3659898 incoming packets delivered
>     80050 requests sent out
> Tcp:
>     2 passive connection openings
>     3659897 segments received
>     80050 segments send out
> TcpExt:
>     33 packets directly queued to recvmsg prequeue.
>     104956 packets directly received from backlog
>     705528 packets directly received from prequeue
>     3654842 packets header predicted
>     193 packets header predicted and directly queued to user
>     4 acknowledgments not containing data received
>     6 predicted acknowledgments
>
> And here it is for the TSO disabled case (GSO also disabled):
>
> Ip:
>     4107083 total packets received
>     4107083 incoming packets delivered
>     1401376 requests sent out
> Tcp:
>     2 passive connection openings
>     4107083 segments received
>     1401376 segments send out
> TcpExt:
>     2 TCP sockets finished time wait in fast timer
>     48486 packets directly queued to recvmsg prequeue.
>     1056111048 packets directly received from backlog
>     2273357712 packets directly received from prequeue
>     1819317 packets header predicted
>     2287497 packets header predicted and directly queued to user
>     4 acknowledgments not containing data received
>     10 predicted acknowledgments
>
> For the TSO disabled case, there are far more TCP segments sent out
> (1401376 versus 80050), which I assume are ACKs, and which could
> possibly contribute to the higher throughput for the TSO disabled case
> due to faster feedback, but not explain the lower CPU utilization.
> There are many more packets directly queued to recvmsg prequeue
> (48486 versus 33).  The numbers for packets directly received from
> backlog and prequeue in the TSO disabled case seem bogus to me, so
> I don't know how to interpret that.  There are only about half as
> many packets header predicted (1819317 versus 3654842), but there
> are many more packets header predicted and directly queued to user
> (2287497 versus 193).  I'll leave the analysis of all this to those
> who might actually know what it all means.

There are a few interesting things here.  For one, the bursts caused
by TSO seem to be causing the receiver to do stretch acks.  This may
have a negative impact on flow performance, but it's hard to say for
sure how much.  Interestingly, it will even further reduce the CPU
load on the sender, since it has to process fewer acks.

As I suspected, in the non-TSO case the receiver gets lots of packets
directly queued to user.  This should result in somewhat lower CPU
utilization on the receiver.  I don't know if it can account for all
the difference you see.

The backlog and prequeue values are probably correct, but netstat's
description is wrong.  A quick look at the code reveals these values
are in units of bytes, not packets.

  -John
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-27 2:04 David Miller
From: David Miller @ 2007-08-27 2:04 UTC (permalink / raw)
To: jheffner
Cc: billfink, rick.jones2, hadi, krkumar2, gaagaan, general, herbert,
jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

From: John Heffner <jheffner@psc.edu>
Date: Sun, 26 Aug 2007 21:32:26 -0400

> There are a few interesting things here.  For one, the bursts caused
> by TSO seem to be causing the receiver to do stretch acks.  This may
> have a negative impact on flow performance, but it's hard to say for
> sure how much.  Interestingly, it will even further reduce the CPU
> load on the sender, since it has to process fewer acks.
>
> As I suspected, in the non-TSO case the receiver gets lots of packets
> directly queued to user.  This should result in somewhat lower CPU
> utilization on the receiver.  I don't know if it can account for all
> the difference you see.

I had completely forgotten these stretch ACK and ucopy issues.

When the receiver gets inundated with a backlog of receive queue
packets, it just spins there copying into userspace _every_ _single_
packet in that queue, then spits out one ACK.  Meanwhile the sender
has to pause long enough for the pipe to empty slightly.

The transfer is much better behaved if we ACK every two full sized
frames we copy into the receiver, and therefore don't stretch ACK, but
at the cost of cpu utilization.

These effects are particularly pronounced on systems where the bus
bandwidth is also one of the limiting factors.
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-27 23:23 jamal
From: jamal @ 2007-08-27 23:23 UTC (permalink / raw)
To: David Miller
Cc: jheffner, billfink, rick.jones2, krkumar2, gaagaan, general,
herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
tgraf, xma

On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:

> The transfer is much better behaved if we ACK every two full sized
> frames we copy into the receiver, and therefore don't stretch ACK, but
> at the cost of cpu utilization.

The rx coalescing in theory should help by accumulating more ACKs on
the rx side of the sender.  But it doesn't seem to do that, i.e. for
the 9K MTU you are better off turning off the coalescing if you want
higher numbers.  Also some of the TOE vendors (chelsio?) claim to have
fixed this by reducing bursts on outgoing packets.

Bill:
Who suggested (as per your email) the 75 usec value, and what was it
based on measurement-wise?

BTW, thanks for finding the energy to run those tests and a very
refreshing perspective.  I don't mean to add more work, but I had some
queries: on your earlier tests, I think that Reno showed some
significant differences on the lower MTU case over BIC.  I wonder if
this is consistent?

A side note: although the experimentation reduces the variables (eg
tying all to CPU0), it would be more exciting to see the multi-cpu and
multi-flow sender effect (which IMO is more real world).

Last note: you need a newer netstat.

> These effects are particularly pronounced on systems where the
> bus bandwidth is also one of the limiting factors.

Can you elucidate this a little more, Dave?  Did you mean memory
bandwidth?

cheers,
jamal
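The coalescing parameters under discussion are the per-interface
rx-usecs/tx-usecs values, read and set with ethtool; whether a given
driver honors both is driver-specific, and the myri10ge driver of that
era may also expose its own module parameters for this (eth2 and the
75 usec value are taken from the thread):

    ethtool -c eth2                          # show current interrupt coalescing settings
    ethtool -C eth2 rx-usecs 75 tx-usecs 75  # set 75 usec coalescing
    ethtool -C eth2 rx-usecs 0 tx-usecs 0    # disable coalescing entirely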
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB 2007-08-27 23:23 ` jamal @ 2007-09-14 7:20 ` Bill Fink 2007-09-14 13:44 ` TSO, TCP Cong control etc jamal 2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller 0 siblings, 2 replies; 37+ messages in thread From: Bill Fink @ 2007-09-14 7:20 UTC (permalink / raw) To: hadi Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson, netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf, johnpol, shemminger, David Miller, jheffner, sri On Mon, 27 Aug 2007, jamal wrote: > On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote: > > > The transfer is much better behaved if we ACK every two full sized > > frames we copy into the receiver, and therefore don't stretch ACK, but > > at the cost of cpu utilization. > > The rx coalescing in theory should help by accumulating more ACKs on the > rx side of the sender. But it doesnt seem to do that i.e For the 9K MTU, > you are better off to turn off the coalescing if you want higher > numbers. Also some of the TOE vendors (chelsio?) claim to have fixed > this by reducing bursts on outgoing packets. > > Bill: > who suggested (as per your email) the 75usec value and what was it based > on measurement-wise? Belatedly getting back to this thread. There was a recent myri10ge patch that changed the default value for tx/rx interrupt coalescing to 75 usec claiming it was an optimum value for maximum throughput (and is also mentioned in their external README documentation). I also did some empirical testing to determine the effect of different values of TX/RX interrupt coalescing on 10-GigE network performance, both with TSO enabled and with TSO disabled. The actual test runs are attached at the end of this message, but the results are summarized in the following table (network performance in Mbps). TX/RX interrupt coalescing in usec (both sides) 0 15 30 45 60 75 90 105 TSO enabled 8909 9682 9716 9725 9739 9745 9688 9648 TSO disabled 9113 9910 9910 9910 9910 9910 9910 9910 TSO disabled performance is always better than equivalent TSO enabled performance. With TSO enabled, the optimum performance is indeed at a TX/RX interrupt coalescing value of 75 usec. With TSO disabled, performance is the full 10-GigE line rate of 9910 Mbps for any value of TX/RX interrupt coalescing from 15 usec to 105 usec. > BTW, thanks for the finding the energy to run those tests and a very > refreshing perspective. I dont mean to add more work, but i had some > queries; > On your earlier tests, i think that Reno showed some significant > differences on the lower MTU case over BIC. I wonder if this is > consistent? 
Here's a retest (5 tests each):

TSO enabled:

TCP Cubic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5007.6295 MB / 10.06 sec = 4176.1807 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4950.9279 MB / 10.06 sec = 4130.2528 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4917.1742 MB / 10.05 sec = 4102.5772 Mbps 35 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4948.7920 MB / 10.05 sec = 4128.7990 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4937.5765 MB / 10.05 sec = 4120.6460 Mbps 35 %TX 99 %RX

TCP Bic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5005.5335 MB / 10.06 sec = 4172.9571 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.0625 MB / 10.06 sec = 4169.2960 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.7500 MB / 10.06 sec = 4135.7355 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.3777 MB / 10.06 sec = 4135.6252 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5059.1815 MB / 10.05 sec = 4221.3546 Mbps 37 %TX 99 %RX

TCP Reno:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4973.3532 MB / 10.06 sec = 4147.3589 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4984.4375 MB / 10.06 sec = 4155.2131 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4995.6841 MB / 10.06 sec = 4166.2734 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4982.2500 MB / 10.05 sec = 4156.7586 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4989.9796 MB / 10.05 sec = 4163.0949 Mbps 36 %TX 99 %RX

TSO disabled:

TCP Cubic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5075.8125 MB / 10.02 sec = 4247.3408 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5056.0000 MB / 10.03 sec = 4229.9621 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5047.4375 MB / 10.03 sec = 4223.1203 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5066.1875 MB / 10.03 sec = 4239.1659 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4986.3750 MB / 10.03 sec = 4171.9906 Mbps 99 %TX 100 %RX

TCP Bic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5040.5625 MB / 10.03 sec = 4217.3521 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5049.7500 MB / 10.03 sec = 4225.4585 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5076.5000 MB / 10.03 sec = 4247.6632 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5017.2500 MB / 10.03 sec = 4197.4990 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5013.3125 MB / 10.03 sec = 4194.8851 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5036.0625 MB / 10.03 sec = 4213.9195 Mbps 100 %TX 100 %RX

TCP Reno:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5006.8750 MB / 10.02 sec = 4189.6051 Mbps 99 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5028.1250 MB / 10.02 sec = 4207.4553 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5021.9375 MB / 10.02 sec = 4202.2668 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5000.5625 MB / 10.03 sec = 4184.3109 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5025.1250 MB / 10.03 sec = 4204.7378 Mbps 99 %TX 100 %RX

Not too much variation here, and not quite as high results as
previously.  Some further testing reveals that while this time I
mainly get results like (here for TCP Bic with TSO disabled):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX

I also sometimes get results like:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX

The higher performing results seem to correspond to when there's a
somewhat lower receiver CPU utilization.  I'm not sure, but there
could also have been an effect from running the "-M1460" test after
the 9000 byte jumbo frame test (no jumbo tests were done at all prior
to running the above sets of 5 tests, although I did always discard
an initial "warmup" test, and now that I think about it, some of
those initial discarded "warmup" tests did have somewhat anomalously
high results).

> A side note: Although the experimentation reduces the variables (eg
> tying all to CPU0), it would be more exciting to see multi-cpu and
> multi-flow sender effect (which IMO is more real world).

These systems are intended as test systems for 10-GigE networks,
and as such it's important to get as consistently close to full
10-GigE line rate as possible, and that's why the interrupts and
nuttcp application are tied to CPU0, with almost all other system
applications tied to CPU1.

Now on another system that's intended as a 10-GigE firewall system,
it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to
CPU0 and the interrupts for the other NIC tied to CPU1.  In IP
forwarding tests of this system, I have basically achieved full
bidirectional 10-GigE line rate IP forwarding with 9000 byte jumbo
frames.

chance4 -> chance6 -> chance9   4.85 Gbps rate limited TCP stream
chance5 -> chance6 -> chance9   4.85 Gbps rate limited TCP stream
chance7 <- chance6 <- chance8   10.0 Gbps non-rate limited TCP stream

[root@chance7 ~]# nuttcp -Ic4tc9 -Ri4.85g -w10m 192.168.88.8 192.168.89.16 & \
    nuttcp -Ic5tc9 -Ri4.85g -w10m -P5100 -p5101 192.168.88.9 192.168.89.16 & \
    nuttcp -Ic7rc8 -r -w10m 192.168.89.15
c4tc9:  5778.6875 MB / 10.01 sec = 4842.7158 Mbps 100 %TX 42 %RX
c5tc9:  5778.9375 MB / 10.01 sec = 4843.1595 Mbps 100 %TX 40 %RX
c7rc8: 11509.1875 MB / 10.00 sec = 9650.8009 Mbps 99 %TX 74 %RX

If there's some other specific test you'd like to see, and it's not
too difficult to set up and I have some spare time, I'll see what I
can do.
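For reference, the per-run setup for the above was along these lines
(a rough sketch rather than the exact script used; the IRQ number and
the use of taskset for CPU binding are assumptions, and the
initial_ssthresh path assumes the cubic/bic modules expose it under
/sys/module):

# select the congestion control algorithm for the run
sysctl -w net.ipv4.tcp_congestion_control=cubic   # or bic, reno

# zero the hard-coded initial slow-start threshold (cubic/bic module parameter)
echo 0 > /sys/module/tcp_cubic/parameters/initial_ssthresh
echo 0 > /sys/module/tcp_bic/parameters/initial_ssthresh

# tie the NIC interrupt and the nuttcp sender to CPU0
echo 1 > /proc/irq/<myri10ge-irq>/smp_affinity    # <myri10ge-irq> is a placeholder
taskset -c 0 nuttcp -M1460 -w10m 192.168.88.16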
						-Bill

Testing of effect of RX/TX interrupt coalescing on 10-GigE network
performance (both with TSO enabled and with TSO disabled):
--------------------------------------------------------------------------------

No RX/TX interrupt coalescing (either side):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10649.8750 MB / 10.03 sec = 8908.9806 Mbps 97 %TX 100 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10879.5000 MB / 10.02 sec = 9112.5141 Mbps 99 %TX 99 %RX

RX/TX interrupt coalescing set to 15 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11546.7500 MB / 10.00 sec = 9682.0785 Mbps 99 %TX 90 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.9375 MB / 10.00 sec = 9910.3702 Mbps 100 %TX 92 %RX

RX/TX interrupt coalescing set to 30 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11587.1250 MB / 10.00 sec = 9715.9489 Mbps 99 %TX 81 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.8125 MB / 10.00 sec = 9910.3040 Mbps 100 %TX 81 %RX

RX/TX interrupt coalescing set to 45 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11597.8750 MB / 10.00 sec = 9724.9902 Mbps 99 %TX 76 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.6250 MB / 10.00 sec = 9910.0933 Mbps 100 %TX 77 %RX

RX/TX interrupt coalescing set to 60 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11614.7500 MB / 10.00 sec = 9739.1323 Mbps 100 %TX 74 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9909.9995 Mbps 100 %TX 76 %RX

RX/TX interrupt coalescing set to 75 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11621.7500 MB / 10.00 sec = 9745.0993 Mbps 100 %TX 72 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.0625 MB / 10.00 sec = 9909.7881 Mbps 100 %TX 75 %RX

RX/TX interrupt coalescing set to 90 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11553.1250 MB / 10.00 sec = 9687.6458 Mbps 100 %TX 71 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9910.0837 Mbps 100 %TX 73 %RX

RX/TX interrupt coalescing set to 105 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11505.7500 MB / 10.00 sec = 9647.8558 Mbps 99 %TX 69 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB / 10.00 sec = 9910.0530 Mbps 100 %TX 74 %RX

^ permalink raw reply	[flat|nested] 37+ messages in thread
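Coalescing values like those in the runs above would typically be set
per interface with ethtool on both hosts; a minimal sketch (the
interface name is a placeholder, and whether this driver honors
tx-usecs separately from rx-usecs is an assumption):

ethtool -C eth2 rx-usecs 75 tx-usecs 75   # coalescing delay under test
ethtool -K eth2 tso off                   # or "tso on" for the TSO-enabled runs
nuttcp -w10m 192.168.88.16                # then re-run the test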
* TSO, TCP Cong control etc
  2007-09-14  7:20 ` [ofa-general] " Bill Fink
@ 2007-09-14 13:44 ` jamal
  2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: jamal @ 2007-09-14 13:44 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, jheffner, rick.jones2, krkumar2, gaagaan, general,
	herbert, jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
	tgraf, xma

I've changed the subject to match content..

On Fri, 2007-14-09 at 03:20 -0400, Bill Fink wrote:
> On Mon, 27 Aug 2007, jamal wrote:
>
> > Bill:
> > who suggested (as per your email) the 75usec value and what was it based
> > on measurement-wise?
>
> Belatedly getting back to this thread.  There was a recent myri10ge
> patch that changed the default value for tx/rx interrupt coalescing
> to 75 usec claiming it was an optimum value for maximum throughput
> (and is also mentioned in their external README documentation).

I would think such a value would be very specific to the ring size and
maybe even the machine in use.

> I also did some empirical testing to determine the effect of different
> values of TX/RX interrupt coalescing on 10-GigE network performance,
> both with TSO enabled and with TSO disabled.  The actual test runs
> are attached at the end of this message, but the results are summarized
> in the following table (network performance in Mbps).
>
>                TX/RX interrupt coalescing in usec (both sides)
>                   0     15     30     45     60     75     90    105
>
> TSO enabled    8909   9682   9716   9725   9739   9745   9688   9648
> TSO disabled   9113   9910   9910   9910   9910   9910   9910   9910
>
> TSO disabled performance is always better than equivalent TSO enabled
> performance.  With TSO enabled, the optimum performance is indeed at
> a TX/RX interrupt coalescing value of 75 usec.  With TSO disabled,
> performance is the full 10-GigE line rate of 9910 Mbps for any value
> of TX/RX interrupt coalescing from 15 usec to 105 usec.

Interesting results.
I think J Heffner made a very compelling description the other day
based on your netstat results at the receiver as to what is going on
(refer to the comments on stretch ACKs). If the receiver is fixed,
then you'd see better numbers from TSO.
The 75 microsecs is very benchmarky in my opinion. If I was to pick a
different app or different NIC or run on many cpus with many apps
doing TSO, I highly doubt that will be the right number.

> Here's a retest (5 tests each):
>
> TSO enabled:
>
> TCP Cubic (initial_ssthresh set to 0):
[..]
> TCP Bic (initial_ssthresh set to 0):
[..]
>
> TCP Reno:
> [..]
> TSO disabled:
>
> TCP Cubic (initial_ssthresh set to 0):
> [..]
> TCP Bic (initial_ssthresh set to 0):
> [..]
> TCP Reno:
> [..]
> Not too much variation here, and not quite as high results
> as previously.

BIC seems to be on average better, followed by CUBIC, followed by Reno.
The difference this time may be because you set the ssthresh to 0
(hopefully every run), and so Reno is definitely going to perform less
well since it is a lot less aggressive in comparison to the other two.

> Some further testing reveals that while this
> time I mainly get results like (here for TCP Bic with TSO
> disabled):
>
> [root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
>  4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX
>
> I also sometimes get results like:
>
> [root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
>  5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX

not good.
> The higher performing results seem to correspond to when there's a
> somewhat lower receiver CPU utilization.  I'm not sure, but there
> could also have been an effect from running the "-M1460" test after
> the 9000 byte jumbo frame test (no jumbo tests were done at all prior
> to running the above sets of 5 tests, although I did always discard
> an initial "warmup" test, and now that I think about it, some of
> those initial discarded "warmup" tests did have somewhat anomalously
> high results).

If you didn't reset the ssthresh on every run, could it have been
cached and used on subsequent runs?

> > A side note: Although the experimentation reduces the variables (eg
> > tying all to CPU0), it would be more exciting to see multi-cpu and
> > multi-flow sender effect (which IMO is more real world).
>
> These systems are intended as test systems for 10-GigE networks,
> and as such it's important to get as consistently close to full
> 10-GigE line rate as possible, and that's why the interrupts and
> nuttcp application are tied to CPU0, with almost all other system
> applications tied to CPU1.

Sure, good benchmark. You get to know how well you can do.

> Now on another system that's intended as a 10-GigE firewall system,
> it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to
> CPU0 and the interrupts for the other NIC tied to CPU1.  In IP
> forwarding tests of this system, I have basically achieved full
> bidirectional 10-GigE line rate IP forwarding with 9000 byte jumbo
> frames.

In forwarding, a more meaningful metric would be pps. The cost per
packet tends to dominate the results over the cost per byte.
9K jumbo frames at 10G is less than 500Kpps - so I don't see that
machine you are using sweating at all. To give you a comparison: on a
lower-end Opteron, with a single CPU, I can generate 1Mpps with
batching pktgen; Robert says he can do that even without batching on
an Opteron closer to what you are using.
So if you want to run that test, you'd need to use incrementally
smaller packets.

> If there's some other specific test you'd like to see, and it's not
> too difficult to set up and I have some spare time, I'll see what I
> can do.

Well, the more interesting tests would be to go full throttle on all
the CPUs you have and target one (or more) receivers, i.e. you
simulate a real server.
Can the utility you have be bound to a cpu? If yes, you should be able
to achieve this without much effort.

Thanks a lot Bill for the effort.

cheers,
jamal

^ permalink raw reply	[flat|nested] 37+ messages in thread
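For reference, the rough arithmetic behind the "less than 500Kpps"
figure, ignoring Ethernet preamble/IFG/CRC overhead for simplicity (a
back-of-the-envelope sketch, not an exact wire-rate calculation):

# approximate packets/sec needed to saturate 10 Gbps at a given frame size
echo $(( 10000000000 / (9000 * 8) ))   # ~139 Kpps with 9000-byte jumbo frames
echo $(( 10000000000 / (1500 * 8) ))   # ~833 Kpps with 1500-byte frames
echo $(( 10000000000 / (64 * 8) ))     # ~19.5 Mpps with minimum-size frames

which is why exercising the per-packet forwarding cost requires
stepping down to much smaller packet sizes.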
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-09-14  7:20 ` [ofa-general] " Bill Fink
  2007-09-14 13:44 ` TSO, TCP Cong control etc jamal
@ 2007-09-14 17:24 ` David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: David Miller @ 2007-09-14 17:24 UTC (permalink / raw)
  To: billfink
  Cc: hadi, jheffner, rick.jones2, krkumar2, gaagaan, general, herbert,
	jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
	tgraf, xma

From: Bill Fink <billfink@mindspring.com>
Date: Fri, 14 Sep 2007 03:20:55 -0400

> TSO disabled performance is always better than equivalent TSO enabled
> performance.  With TSO enabled, the optimum performance is indeed at
> a TX/RX interrupt coalescing value of 75 usec.  With TSO disabled,
> performance is the full 10-GigE line rate of 9910 Mbps for any value
> of TX/RX interrupt coalescing from 15 usec to 105 usec.

Note that the systems where the coalescing tweaking is often necessary
are the heavily NUMA'd systems where cpu to device latency can be
huge, like the big SGI ones, which is where we had to tweak things in
the tg3 driver in the first place.

On most systems, as you saw mostly in the non-TSO case, the value
chosen for the most part is arbitrary and not critical.

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:04 ` [ofa-general] " jamal
  2007-08-23 22:25 ` jamal
@ 2007-08-23 22:30 ` David Miller
  2007-08-23 22:38 ` [ofa-general] " jamal
  1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2007-08-23 22:30 UTC (permalink / raw)
  To: hadi
  Cc: rick.jones2, krkumar2, gaagaan, general, herbert, jagana, jeff,
	johnpol, kaber, kumarkr, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, shemminger, sri,
	tgraf, xma

From: jamal <hadi@cyberus.ca>
Date: Thu, 23 Aug 2007 18:04:10 -0400

> Possibly a bug - but you really should turn off TSO if you are doing
> huge interactive transactions (which is fair because there is a clear
> demarcation).

I don't see how this can matter.

TSO only ever does anything if you accumulate more than one MSS
worth of data.

And when that does happen, all it does is take what's in the send
queue and send as much as possible at once.  The packets are already
built in big chunks, so there is no extra work to do.

The card is going to send the things back to back and as fast as in
the non-TSO case as well.

It doesn't change application scheduling, and it absolutely does not
penalize small sends by the application unless we have a bug
somewhere.

So I see no reason to disable TSO for any reason other than hardware
implementation deficiencies.  And for the drivers I am familiar with,
they do make smart default TSO enabling decisions based upon how well
the chip does TSO.

^ permalink raw reply	[flat|nested] 37+ messages in thread
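A quick way to see which default the driver picked, and to flip it for
a comparison run (a sketch; the interface name is a placeholder, and
different ethtool versions print the offload line slightly
differently):

ethtool -k eth2 | grep -i segmentation   # show whether TSO is currently on
ethtool -K eth2 tso off                  # disable it for a test run
ethtool -K eth2 tso on                   # re-enable it afterwards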
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:30 ` David Miller
@ 2007-08-23 22:38 ` jamal
  2007-08-24  3:34 ` Stephen Hemminger
  0 siblings, 1 reply; 37+ messages in thread
From: jamal @ 2007-08-23 22:38 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, jeff, general, mchan, tgraf, johnpol,
	shemminger, kaber, sri

On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Thu, 23 Aug 2007 18:04:10 -0400
>
> > Possibly a bug - but you really should turn off TSO if you are doing
> > huge interactive transactions (which is fair because there is a clear
> > demarcation).
>
> I don't see how this can matter.
>
> TSO only ever does anything if you accumulate more than one MSS
> worth of data.

I stand corrected then.

cheers,
jamal

^ permalink raw reply	[flat|nested] 37+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:38 ` [ofa-general] " jamal
@ 2007-08-24  3:34 ` Stephen Hemminger
  2007-08-24 12:36 ` jamal
  2007-08-24 16:25 ` Rick Jones
  0 siblings, 2 replies; 37+ messages in thread
From: Stephen Hemminger @ 2007-08-24  3:34 UTC (permalink / raw)
  To: hadi
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
	johnpol, David Miller, sri

On Thu, 23 Aug 2007 18:38:22 -0400
jamal <hadi@cyberus.ca> wrote:

> On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> > From: jamal <hadi@cyberus.ca>
> > Date: Thu, 23 Aug 2007 18:04:10 -0400
> >
> > > Possibly a bug - but you really should turn off TSO if you are doing
> > > huge interactive transactions (which is fair because there is a clear
> > > demarcation).
> >
> > I don't see how this can matter.
> >
> > TSO only ever does anything if you accumulate more than one MSS
> > worth of data.
>
> I stand corrected then.
>
> cheers,
> jamal
>

For most normal Internet TCP connections, you will see only 2 or 3
packets per TSO because of ACK clocking.  If you turn off delayed ACK
on the receiver it will be even less.

A current hot topic of research is reducing the number of ACK's to
make TCP work better over asymmetric links like 3G.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24  3:34 ` Stephen Hemminger
@ 2007-08-24 12:36 ` jamal
  2007-08-24 16:25 ` Rick Jones
  1 sibling, 0 replies; 37+ messages in thread
From: jamal @ 2007-08-24 12:36 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, rick.jones2, krkumar2, gaagaan, general, herbert,
	jagana, jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, sri, tgraf, xma

On Thu, 2007-23-08 at 20:34 -0700, Stephen Hemminger wrote:

> A current hot topic of research is reducing the number of ACK's to make TCP
> work better over asymmetric links like 3G.

One other good reason to reduce ACKs to battery powered (3G) terminals
is that it reduces power consumption, i.e. you have longer battery
life.

cheers,
jamal

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24  3:34 ` Stephen Hemminger
  2007-08-24 12:36 ` jamal
@ 2007-08-24 16:25 ` Rick Jones
  1 sibling, 0 replies; 37+ messages in thread
From: Rick Jones @ 2007-08-24 16:25 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hadi, David Miller, krkumar2, gaagaan, general, herbert, jagana,
	jeff, johnpol, kaber, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, Robert.Olsson, sri, tgraf, xma

>
> A current hot topic of research is reducing the number of ACK's to make TCP
> work better over asymmetric links like 3G.

Oy.  People running Solaris and HP-UX have been "researching" ACK
reductions since 1997 if not earlier.

rick jones

^ permalink raw reply	[flat|nested] 37+ messages in thread
end of thread, other threads: [~2007-09-14 17:25 UTC | newest]

Thread overview: 37+ messages

2007-08-17  6:06 [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar2
2007-08-21  7:18 ` David Miller
2007-08-21 12:30 ` [ofa-general] " jamal
2007-08-21 18:51 ` David Miller
2007-08-21 21:09 ` jamal
2007-08-21 22:50 ` David Miller
2007-08-22  4:11 ` [ofa-general] " Krishna Kumar2
2007-08-22  4:22 ` David Miller
2007-08-22  7:03 ` Krishna Kumar2
2007-08-22  9:14 ` David Miller
2007-08-23  2:43 ` Krishna Kumar2
2007-08-22 17:09 ` [ofa-general] " Rick Jones
2007-08-22 20:21 ` David Miller
2007-08-23 22:04 ` [ofa-general] " jamal
2007-08-23 22:25 ` jamal
2007-08-23 22:35 ` [ofa-general] " Rick Jones
2007-08-23 22:41 ` jamal
2007-08-24  3:18 ` Bill Fink
2007-08-24 12:14 ` jamal
2007-08-24 18:08 ` Bill Fink
2007-08-24 21:25 ` David Miller
2007-08-24 23:11 ` Herbert Xu
2007-08-25 23:45 ` Bill Fink
2007-08-24 18:46 ` [ofa-general] " Rick Jones
2007-08-25  0:42 ` John Heffner
2007-08-26  8:41 ` [ofa-general] " Bill Fink
2007-08-27  1:32 ` John Heffner
2007-08-27  2:04 ` David Miller
2007-08-27 23:23 ` jamal
2007-09-14  7:20 ` [ofa-general] " Bill Fink
2007-09-14 13:44 ` TSO, TCP Cong control etc jamal
2007-09-14 17:24 ` [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
2007-08-23 22:30 ` David Miller
2007-08-23 22:38 ` [ofa-general] " jamal
2007-08-24  3:34 ` Stephen Hemminger
2007-08-24 12:36 ` jamal
2007-08-24 16:25 ` Rick Jones