netdev.vger.kernel.org archive mirror
* on the wire behaviour of TSO on/off is supposed to be the same yes?
@ 2005-01-21 19:01 Rick Jones
  2005-01-21 19:58 ` Jon Mason
  0 siblings, 1 reply; 18+ messages in thread
From: Rick Jones @ 2005-01-21 19:01 UTC (permalink / raw)
  To: netdev

Is the on-the-wire behaviour of TCP indeed supposed to be the same with TSO on 
and off?  I'm seeing some cases (netperf TCP_STREAM, 2.6.10 sending to HP-UX 11.23) 
where the throughput difference is >= 10 MB/s on an e1000 card (TSO being slower). 
I've got lots of netperf and tcpdump data but thought I'd ask first before dumping 
it onto the list.  Actually, I'll probably have to put it up on the net somewhere 
since it is O(200MB) in total (although I have a smaller example that makes the 
point).

sincerely,

rick jones

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 19:01 on the wire behaviour of TSO on/off is supposed to be the same yes? Rick Jones
@ 2005-01-21 19:58 ` Jon Mason
  2005-01-21 20:18   ` Rick Jones
  0 siblings, 1 reply; 18+ messages in thread
From: Jon Mason @ 2005-01-21 19:58 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

The benefit of TSO is not throughput, but CPU utilization.  Any throughput 
increase is usually a side effect of better PCI DMA behavior (e.g., large PCI 
transfers are more efficient).  Are you seeing a CPU utilization decrease?

On Friday 21 January 2005 01:01 pm, Rick Jones wrote:
> Is the on-the-wire behaviour of TCP indeed supposed to be the same with TSO on 
> and off?  I'm seeing some cases (netperf TCP_STREAM, 2.6.10 sending to HP-UX 11.23) 
> where the throughput difference is >= 10 MB/s on an e1000 card (TSO being slower). 
> I've got lots of netperf and tcpdump data but thought I'd ask first before dumping 
> it onto the list.  Actually, I'll probably have to put it up on the net somewhere 
> since it is O(200MB) in total (although I have a smaller example that makes the 
> point).
> 
> sincerely,
> 
> rick jones
> 
> 
> 

-- 
Jon Mason
jdmason@us.ibm.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 19:58 ` Jon Mason
@ 2005-01-21 20:18   ` Rick Jones
  2005-01-21 20:44     ` David S. Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Rick Jones @ 2005-01-21 20:18 UTC (permalink / raw)
  To: netdev

Jon Mason wrote:
> The benefit of TSO is not throughput, but CPU utilization.  Throughput 
> increase is usually a side effect because of better PCI DMA behavior (eg. 
> large PCI transfers are better). 

I'm not looking for a throughput increase; the systems involved are already 
capable of link rate without TSO.  I'm just looking for throughput not to drop 
with TSO.  If it were a throughput drop because the NIC CPU (bletch) saturated, 
that would be one thing (as happened with the Tigon2 in another context), but 
this drop stems from what appears to be a failure to fill cwnd, with ACKs then 
being delayed by a timer.

> Are you seeing a CPU utilization decrease?

Yes, at the cost of occasional pauses in the data stream.

TSO is on
TCP STREAM TEST to 192.168.13.1
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

131072 262142 262142    10.00       843.86   12.77    -1.00    2.480   -1.000
TSO is off
TCP STREAM TEST to 192.168.13.1
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

131072 262142 262142    10.00       941.13   22.82    -1.00    3.972   -1.000

This is with tcp_tso_win_divisor set to 1, so TSO kicks in before 200-some-odd K 
are transferred.   The service demand drop (modulo the accuracy of CPU util 
measurements via the -DUSE_PROC_STAT stuff) is rather nice.

rick jones
subscribed, so mail to netdev will reach me - no need for separate cc

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 20:18   ` Rick Jones
@ 2005-01-21 20:44     ` David S. Miller
  2005-01-21 22:00       ` Rick Jones
  0 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-01-21 20:44 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Fri, 21 Jan 2005 12:18:53 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> This is with tcp_tso_win_divisor set to 1, so TSO kicks in before 200-some-odd K 
> are transferred.   The service demand drop (modulo the accuracy of CPU util 
> measurements via the -DUSE_PROC_STAT stuff) is rather nice.

Don't set tcp_tso_win_divisor to such a low value; that's why
TCP is being so bursty in your case.  The default value
of "8" keeps TCP reasonably well ACK-clocked, thus avoiding
the throughput lossage you are seeing with it set to "1".

With a value of "1", TCP will wait for the entire congestion
window to be ACK'd before it will spit out a huge TSO frame.
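
Concretely, the idea is that each TSO frame is capped at roughly cwnd divided
by the divisor.  A loose sketch of such a cap, with invented names rather than
the actual 2.6 code:

/* Loose sketch only, with invented names -- not the actual kernel code.
 * The divisor caps how many MSS-sized segments one TSO frame may carry:
 * with a divisor of 8 several frames (and their ACKs) stay in flight,
 * with a divisor of 1 a single frame can consume the whole window. */
static unsigned int tso_frame_segs(unsigned int snd_cwnd,     /* in segments */
                                   unsigned int win_divisor)  /* sysctl knob */
{
    unsigned int limit = snd_cwnd;

    if (win_divisor)
        limit = snd_cwnd / win_divisor;   /* e.g. cwnd/8 by default */

    return limit ? limit : 1;             /* always allow at least one MSS */
}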

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 20:44     ` David S. Miller
@ 2005-01-21 22:00       ` Rick Jones
  2005-01-21 22:18         ` David S. Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Rick Jones @ 2005-01-21 22:00 UTC (permalink / raw)
  To: netdev

David S. Miller wrote:
> Don't set tcp_tso_win_divisor to such a low value; that's why
> TCP is being so bursty in your case.  The default value
> of "8" keeps TCP reasonably well ACK-clocked, thus avoiding
> the throughput lossage you are seeing with it set to "1".

If my only interest were bulk throughput then that would be fine, but I'm also concerned about shorter-lived, request/response sorts 
of workloads.  The netperf TCP_STREAM test was simply a convenient vehicle.  If it would be better, I could switch to a different 
netperf test.

> With a value of "1", TCP will wait for the entire congestion
> window to be ACK'd before it will spit out a huge TSO frame.

It looks, though, like it is then not spitting out a full congestion window.  Here is the opening from the TSO-on case:

000031 IP 192.168.13.223.33287 > 192.168.13.1.64632: S 2243249440:2243249440(0) win 5840 <mss 1460,sackOK,timestamp 168858934 
0,nop,wscale 2>
000095 IP 192.168.13.1.64632 > 192.168.13.223.33287: S 3684332982:3684332982(0) ack 2243249441 win 65535 <mss 
1460,nop,nop,sackOK,wscale 2,nop,nop,nop,timestamp 960528547 168858934>
000014 IP 192.168.13.223.33287 > 192.168.13.1.64632: . ack 1 win 1460 <nop,nop,timestamp 168858934 960528547>
000118 IP 192.168.13.223.33287 > 192.168.13.1.64632: . 1:4345(4344) ack 1 win 1460 <nop,nop,timestamp 168858934 960528547>
000117 IP 192.168.13.1.64632 > 192.168.13.223.33287: . ack 1449 win 32768 <nop,nop,timestamp 960528547 168858934>
000002 IP 192.168.13.1.64632 > 192.168.13.223.33287: . ack 4345 win 32768 <nop,nop,timestamp 960528547 168858934>
000248 IP 192.168.13.223.33287 > 192.168.13.1.64632: . 4345:8689(4344) ack 1 win 1460 <nop,nop,timestamp 168858935 960528547>

Indeed, it waited for the ACK of 4345, but then shouldn't it have emitted 4344+1448 = 5792 bytes, or perhaps 7240 (since there were two 
ACKs)?

(This is from a hacked tcpdump that treats an IP length field of zero as a TSO segment and uses the other reported length - a patch went to 
tcpdump-workers; not sure whether they will like it or not...)
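
Roughly, and only as an illustration of the idea rather than the actual patch,
the check amounts to something like:

/* Illustration only -- not the actual patch.  On a sender-side capture of
 * a TSO connection the IP total-length field can show up as zero (the NIC
 * fills in the per-segment lengths later), so fall back to the frame
 * length that libpcap itself reports. */
#include <netinet/ip.h>
#include <arpa/inet.h>

static unsigned int ip_datagram_len(const struct ip *iph,
                                    unsigned int pcap_reported_len)
{
    unsigned int len = ntohs(iph->ip_len);

    if (len == 0)                  /* treat as a TSO super-segment */
        len = pcap_reported_len;   /* use the other reported length */

    return len;
}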

In the TSO off case it does send a full cwnd:

000031 IP 192.168.13.223.33289 > 192.168.13.1.64633: S 2252401705:2252401705(0) win 5840 <mss 1460,sackOK,timestamp 168870470 
0,nop,wscale 2>
000099 IP 192.168.13.1.64633 > 192.168.13.223.33289: S 3685848941:3685848941(0) ack 2252401706 win 65535 <mss 
1460,nop,nop,sackOK,wscale 2,nop,nop,nop,timestamp 960529700 168870470>
000014 IP 192.168.13.223.33289 > 192.168.13.1.64633: . ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000080 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 1:1449(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000009 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 1449:2897(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000010 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 2897:4345(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000145 IP 192.168.13.1.64633 > 192.168.13.223.33289: . ack 1449 win 32768 <nop,nop,timestamp 960529700 168870470>
000001 IP 192.168.13.1.64633 > 192.168.13.223.33289: . ack 4345 win 32768 <nop,nop,timestamp 960529700 168870470>
000190 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 4345:5793(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000006 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 5793:7241(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000013 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 7241:8689(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000005 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 8689:10137(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>
000004 IP 192.168.13.223.33289 > 192.168.13.1.64633: . 10137:11585(1448) ack 1 win 1460 <nop,nop,timestamp 168870470 960529700>

Given the relative timestamps (tcpdump -ttt... taken on the sender) it _seems_ that even in the TSO-off case it was waiting for the 
full cwnd to be ACKed, but then once ACKed, it sent the full five-segment cwnd. (Although that apparent waiting would really need to be 
confirmed by an intra-stack trace, I suppose...)

rick jones

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 22:00       ` Rick Jones
@ 2005-01-21 22:18         ` David S. Miller
  2005-01-21 22:48           ` Rick Jones
  0 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-01-21 22:18 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Fri, 21 Jan 2005 14:00:30 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> Indeed, it waited for the ACK of 4345, but then shouldn't it have emitted 4344+1448 = 5792 bytes, or perhaps 7240 (since there were two 
> ACKs)?

The tcp_tso_win_divisor calculation occurs on the congestion window at the
time of the user request, not at the time of the ACK.

That's an interesting observation actually, thanks for showing it.

It means that ideally we might want to try and find a way to either:

1) defer the TSO window size calculation to some later moment, i.e.
   at tcp_write_xmit() time

2) use an optimistic TSO size calculation at the same moment we compute
   it now, and later if it is found to be too aggressive we chop up the
   TSO frame and resegment the transmit queue to accommodate

Neither is easy to implement as far as I can tell, but it should fix
all the problems IBM and others are trying to work around by setting
the tcp_tso_win_divisor really small.
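
For the sake of illustration, option 1 would take roughly the following shape
(invented types and names, not a patch): the segment count of each TSO frame is
derived from cwnd as it stands when the frame is about to hit the wire, rather
than from cwnd as it stood at sendmsg() time.

/* Sketch only, with made-up types -- not kernel code. */
struct tso_conn {
    unsigned int snd_cwnd;      /* congestion window, in segments */
    unsigned int in_flight;     /* segments sent but not yet ACKed */
    unsigned int win_divisor;   /* tcp_tso_win_divisor */
};

/* How many MSS-sized segments the next TSO frame may carry *right now*,
 * i.e. evaluated at transmit time instead of at user-send time. */
static unsigned int next_tso_frame_segs(const struct tso_conn *c)
{
    unsigned int room = 0;

    if (c->snd_cwnd > c->in_flight)
        room = c->snd_cwnd - c->in_flight;          /* space left in cwnd */
    if (c->win_divisor && room > c->snd_cwnd / c->win_divisor)
        room = c->snd_cwnd / c->win_divisor;        /* divisor cap */

    return room ? room : 1;
}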

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 22:18         ` David S. Miller
@ 2005-01-21 22:48           ` Rick Jones
  2005-01-21 22:58             ` Rick Jones
  2005-01-22  4:49             ` David S. Miller
  0 siblings, 2 replies; 18+ messages in thread
From: Rick Jones @ 2005-01-21 22:48 UTC (permalink / raw)
  To: netdev

David S. Miller wrote:
> On Fri, 21 Jan 2005 14:00:30 -0800 Rick Jones <rick.jones2@hp.com> wrote:
> 
> 
>> Indeed, it waited for the ACK of 4345, but then shouldn't it have emitted
>> 4344+1448 = 5792 bytes, or perhaps 7240 (since there were two ACKs)?
> 
> 
> The tcp_tso_win_divisor calculation occurs on the congestion window at the 
> time of the user request, not at the time of the ACK.

Ah, _that_ explains why in so many of my traces it stays at one value for sooo 
long, and why in some places it seemed to jump by fairly large quantities.  I 
thought it was related to the window size, but in a netperf TCP_STREAM test, 
unless the sender sets the -m option, the send size is based on the getsockopt() 
that follows the setsockopt() from the -s; since -S was 128K, and since Linux 
doubles that on the getsockopt(), that explains the O(200K) bit before >1448-byte 
sends when the divisor was set to 8.

> That's an interesting observation actually, thanks for showing it.

My pleasure.

> It means that ideally we might want to try and find a way to either:
> 
> 1) defer the TSO window size calculation to some later moment, i.e. at
> tcp_write_xmit() time
> 
> 2) use an optimistic TSO size calculation at the same moment we compute it
> now, and later if it is found to be too aggressive we chop up the TSO frame
> and resegment the transmit queue to accommodate
> 
> Neither is easy to implement as far as I can tell, but it should fix all the
> problems IBM and others are trying to work around by setting the
> tcp_tso_win_divisor really small.

Indeed, it seems that one would want to decide about TSO when one is about to 
transmit, not when the user does a send, since otherwise you penalize users 
doing larger sends.  Someone doing, say, a sendfile() of a large file would be 
pretty much precluded from getting any benefit from TSO the way things are now, right?

(There is a netperf TCP_SENDFILE test, but it defaults the send size to the 
socket buffer size, just like TCP_STREAM.)

And I suspect that is the case for some of the (un)spoken workloads of interest 
among the system vendors.  That's not to say that we still won't have incentive 
to set tcp_tso_win_divisor (shouldn't that really be tcp_tso_cwnd_divisor?) to 1 
:)  I suspect we will still want that initial "4380" cwnd bytes to be a single 
TSO transmission... every cycle's sacred, every cycle's great... :)

rick jones

BTW, has the whole "reply-to" question already been thrashed about on this list? 
  Is it an open or closed list?  I ask because I keep getting two copies of 
everyone's replies - one to me, one to the list... just a nit...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 22:48           ` Rick Jones
@ 2005-01-21 22:58             ` Rick Jones
  2005-01-22  4:44               ` David S. Miller
  2005-01-22  4:49             ` David S. Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Rick Jones @ 2005-01-21 22:58 UTC (permalink / raw)
  To: netdev

Rick Jones wrote:
> That's not to say that we still won't have incentive to set
> tcp_tso_win_divisor (shouldn't that really be tcp_tso_cwnd_divisor?) to 1 :)

Speaking of divisor values... is zero (0) supposed to be a legal value?  The 
sysctl seems to allow it, but it does seem to behave a trifle strangely.  The 
initial TSO size appeared to be 2*MSS.

It might be rather interesting if a value of zero were to have the effect of 
ignoring initial cwnd entirely :)  It wouldn't be "legal" in the RFC sense, but 
I suspect it would make for some interesting experimental opportunities.  Rather 
far down on the list though.

rick jones

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 22:58             ` Rick Jones
@ 2005-01-22  4:44               ` David S. Miller
  2005-01-22 18:58                 ` rick jones
  0 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-01-22  4:44 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Fri, 21 Jan 2005 14:58:47 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> Speaking of divisor values... is zero (0) supposed to be a legal value?  The 
> sysctl seems to allow it, but it does seem to behave a trifle strangely.  The 
> initial TSO size appeared to be 2*MSS.

The value "0" behaves the same as "1".

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-21 22:48           ` Rick Jones
  2005-01-21 22:58             ` Rick Jones
@ 2005-01-22  4:49             ` David S. Miller
  2005-01-22 19:05               ` rick jones
  2005-01-24 20:33               ` Rick Jones
  1 sibling, 2 replies; 18+ messages in thread
From: David S. Miller @ 2005-01-22  4:49 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Fri, 21 Jan 2005 14:48:08 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> Indeed, it seems that one would want to decide about TSO when one is about to 
> transmit, not when the user does a send, since otherwise you penalize users 
> doing larger sends.  Someone doing, say, a sendfile() of a large file would be 
> pretty much precluded from getting any benefit from TSO the way things are now, right?

Not really; if the file is larger than the send buffer limit, we'll
have a large enough cwnd once the later packets get built.

The cwnd is sampled when we build packets onto the send queue.
Each socket has a send buffer limit on that, so we only build
until we reach that limit.

At user sendmsg/sendfile time is when we do the segmentation and thus
the packet sizing, because it is the most logical place to do this.

The code can potentially get really messy and ugly if we start
preemptively building larger frames "hoping" the cwnd will be
large enough by the time we push it onto the wire.  Segmenting
at send time is completely upside down to the way packets are
built currently for transmission.  A bad guess also means that
we'll spend significant cycles chopping up TSO packets and
resegmenting the queue.
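
As a rough picture of the current flow being described (invented names, not the
real code): packets are carved out of the user's buffer at sendmsg()/sendfile()
time, each one sized from the cwnd sampled at that moment, and building stops
once the socket's send-buffer allowance is used up.

/* Rough picture only -- made-up names, not the actual implementation. */
struct send_sock {
    unsigned int snd_cwnd;      /* in segments, sampled at build time */
    unsigned int win_divisor;
    unsigned int mss;
    unsigned int wmem_queued;   /* bytes already sitting on the send queue */
    unsigned int sndbuf;        /* send-buffer limit, in bytes */
};

static unsigned int queue_from_user(struct send_sock *sk,
                                    const char *buf, unsigned int len)
{
    unsigned int queued = 0;

    (void)buf;   /* the actual data copy is elided in this sketch */

    while (queued < len && sk->wmem_queued < sk->sndbuf) {
        unsigned int segs = sk->win_divisor ?
                sk->snd_cwnd / sk->win_divisor : sk->snd_cwnd;
        unsigned int goal = (segs ? segs : 1) * sk->mss;   /* TSO frame size */
        unsigned int take = len - queued < goal ? len - queued : goal;

        /* build_packet(sk, buf + queued, take);  -- append to send queue */
        sk->wmem_queued += take;
        queued += take;
    }

    return queued;   /* the rest waits until send-buffer space frees up */
}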

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-22  4:44               ` David S. Miller
@ 2005-01-22 18:58                 ` rick jones
  0 siblings, 0 replies; 18+ messages in thread
From: rick jones @ 2005-01-22 18:58 UTC (permalink / raw)
  To: netdev


On Jan 21, 2005, at 8:44 PM, David S. Miller wrote:

> On Fri, 21 Jan 2005 14:58:47 -0800
> Rick Jones <rick.jones2@hp.com> wrote:
>
>> Speaking of divisor values... is zero (0) supposed to be a legal value?  The
>> sysctl seems to allow it, but it does seem to behave a trifle strangely.  The
>> initial TSO size appeared to be 2*MSS.
>
> The value "0" behaves the same as "1".

Alas, I'm away from my traces at the moment, but I do recall seeing 
different behaviour with 0 than with 1: with 1, the TSO sends 
started at 3*1448; with the divisor at zero, they started at what 
appeared to be 2*1448.

I'll see if I can get to the traces before Monday and send them along.

rick jones
there is no rest for the wicked, yet the virtuous have no pillows

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-22  4:49             ` David S. Miller
@ 2005-01-22 19:05               ` rick jones
  2005-01-24 20:33               ` Rick Jones
  1 sibling, 0 replies; 18+ messages in thread
From: rick jones @ 2005-01-22 19:05 UTC (permalink / raw)
  To: netdev


On Jan 21, 2005, at 8:49 PM, David S. Miller wrote:
> The code can potentially get really messy and ugly if we start
> preemptively building larger frames "hoping" the cwnd will be
> large enough by the time we push it onto the wire.  Segmenting
> at send time is completely upside down to the way packets are
> built currently for transmission.  A bad guess also means that
> we'll spend significant cycles chopping up TSO packets and
> resegmenting the queue.

So if I'm parsing correctly: with TSO enabled, at user send time the 
send code builds "TSO-sized" segments based on the value of cwnd at the 
time of the send.

With TSO disabled, at user send time the code will build a string of 
MSS-sized segments and queue them.

Then at transmit time, cwnd is consulted and either a group of 
MSS-sized segments is sent, or a TSO-sized segment is sent down the 
stack.

Is it necessary to build TSO-sized segments at the time the user does 
the send?  Could a chain of MSS-sized segments simply be used as the 
TSO-sized segment?

Admittedly, that is more buffers for the NIC to manipulate, and the large 
DMAs don't happen, but it means that TSO can take full advantage of 
cwnd at transmit time without much resegmentation.

rick jones
there is no rest for the wicked, yet the virtuous have no pillows

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-22  4:49             ` David S. Miller
  2005-01-22 19:05               ` rick jones
@ 2005-01-24 20:33               ` Rick Jones
  2005-01-24 20:43                 ` David S. Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Rick Jones @ 2005-01-24 20:33 UTC (permalink / raw)
  To: netdev

> The code can potentially get really messy and ugly if we start
> preemptively building larger frames "hoping" the cwnd will be
> large enough by the time we push it onto the wire.  Segmenting
> at send time is completely upside down to the way packets are
> built currently for transmission.  A bad guess also means that
> we'll spend significant cycles chopping up TSO packets and
> resegmenting the queue.

Just some pseudo-random thoughts which may or may not hold true in the context 
of Linux TSO, and may or may not show my utter ignorance of current cwnd 
behaviour :)

*) At present the code executed at the time of user send knows about the cwnd at 
the time of user send.  It then builds some number of TSO-sized segments and 
queues them.

*) "Classic" cwnd behaviour is either:

    a) Increasing rapidly to ssthresh
    b) Increasing slowly after ssthresh
    c) Restarting from initial values after timeout

(IIRC there is other cwnd manipulation for fast rtxes and the like)

*) If there is a timeout, any previously queued TSO-sized segments are going to 
have to be resegmented anyway no matter what their size was before the timeout. 
So conservative versus optimistic guesses there would not seem to matter.

*) If there is a fast rtx and the like, any previously queued TSO-sized segments 
may very likely (though not certainly) have to be resegmented, particularly if 
they were queued when cwnd was >= ssthresh.

*) If cwnd is < ssthresh, the next cwnd is going to be > current cwnd unless 
there is a retransmission of some sort.

*) If cwnd is >= ssthresh, cwnd will remain cwnd for some time (compared to 
otherwise)

It would seem that the segmentation code, if it knew ssthresh as well as cwnd, 
_could_ make some reasonably optimistic guesses as to cwnd growth while doing 
its segmentation.  Such guesses wouldn't seem to be any worse than the present 
ones, at least when there is a full RTO.  And if tcp_tso_win_divisor is > 1, it 
is possible that even the onesey-twosies of fast rtx may not require all _that_ 
much resegmentation?
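
To make that hand-waving a bit more concrete, an "optimistic" guess of the kind
I mean might look something like this (invented names, assuming classic
Reno-style growth and no loss):

/* Sketch only.  Estimate what cwnd might have grown to by the time a
 * frame queued now actually reaches the wire, assuming no loss:
 * roughly doubling per RTT below ssthresh, +1 segment per RTT above. */
static unsigned int guess_future_cwnd(unsigned int cwnd,
                                      unsigned int ssthresh,
                                      unsigned int rtts_ahead)
{
    while (rtts_ahead--) {
        if (cwnd < ssthresh)
            cwnd *= 2;    /* slow start: exponential growth */
        else
            cwnd += 1;    /* congestion avoidance: linear growth */
    }

    return cwnd;
}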

Or have I just gone off into the deep end?

rick jones

BTW, speaking of tcp_tso_win_divisor, I've gone back through my traces and they 
do not support my recollection, so I must have been confused as to what I was 
looking at when I got that impression.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-24 20:33               ` Rick Jones
@ 2005-01-24 20:43                 ` David S. Miller
  2005-01-24 21:22                   ` Rick Jones
  2005-01-28  0:10                   ` Rick Jones
  0 siblings, 2 replies; 18+ messages in thread
From: David S. Miller @ 2005-01-24 20:43 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Mon, 24 Jan 2005 12:33:23 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> It would seem that the segmentation code, if it knew ssthresh as well as cwnd, 
> _could_ make some reasonably optimistic guesses as to cwnd growth while doing 
> its segmentation.

Because we disable TSO on any packet loss whatsoever, we can predict
exactly what the CWND will be at the time a packet is sent.

I've been quiet the past few days, but this is the kind of implementation
I've been thinking of.

When we take away that invariant, which we do want to do, we'll need
to tweak how this works.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-24 20:43                 ` David S. Miller
@ 2005-01-24 21:22                   ` Rick Jones
  2005-01-28  0:10                   ` Rick Jones
  1 sibling, 0 replies; 18+ messages in thread
From: Rick Jones @ 2005-01-24 21:22 UTC (permalink / raw)
  To: netdev

David S. Miller wrote:
> Because we disable TSO on any packet loss whatsoever, we can predict
> exactly what the CWND will be at the time a packet is sent.

I'd heard someone else mention that, but wasn't sure.  Now I know, I guess.  I can 
see how that would simplify things considerably, although it may have some 
non-technical implications...

> When we take away that invariant, which we do want to do, we'll need
> to tweak how this works.

OK.  BTW, how "hard" is it to reference chunks of a big buffer and send them, 
particularly in a situation where there is TSO, which implies SG and CKO?

rick

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-24 20:43                 ` David S. Miller
  2005-01-24 21:22                   ` Rick Jones
@ 2005-01-28  0:10                   ` Rick Jones
  2005-01-28  0:57                     ` David S. Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Rick Jones @ 2005-01-28  0:10 UTC (permalink / raw)
  To: netdev

David S. Miller wrote:
> Because we disable TSO on any packet loss whatsoever, we can predict
> exactly what the CWND will be at the time a packet is sent.
> 
> I've been quiet the past few days, but this is the kind of implementation
> I've been thinking of.
> 
> When we take away that invariant, which we do want to do, we'll need
> to tweak how this works.

Having thought about the topic a bit, it now seems that there were two benchmark 
run-rule compliance problems with TSO in 2.6.  One is the slow-start stuff that 
has been worked on and may (hopefully) get a couple of additional tweaks.

The other relates to the business of disabling TSO on a connection upon packet loss.

While the benchmark(s) that spring to mind are run over generally lossless LANs, 
the intent is that the solution be suitable for an Internet-connected system 
(yes, someone could probably punch big gaping holes there...).  Internet-connected 
systems experience non-trivial packet loss rates, and so if TSO is disabled upon 
packet loss, a given benchmark result using TSO deviates even more from reality 
than one without TSO.

I suspect it would be found to be a benchmark special and disallowed.

I do not know enough about what other stacks do with their TSO implementations 
to know whether they are in a similar state.  It would be good to find out, if 
anyone out there knows and is willing to say.

rick jones

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-28  0:10                   ` Rick Jones
@ 2005-01-28  0:57                     ` David S. Miller
  2005-01-28  1:36                       ` Rick Jones
  0 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-01-28  0:57 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Thu, 27 Jan 2005 16:10:46 -0800
Rick Jones <rick.jones2@hp.com> wrote:


> The other relates to the business of disabling TSO on a connection upon packet loss.

There cannot possibly be any compliance issues resulting from turning
off an optimization in the face of packet loss.

> Internet-connected systems experience non-trivial packet loss rates, and so if 
> TSO is disabled upon packet loss, a given benchmark result using TSO deviates 
> even more from reality than one without TSO.

And running the benchmark over a local gigabit subnet doesn't deviate
from what Internet-connected systems can expect to achieve how so?

Oh, you mean I really can get 60,000 web or database connections a second
when the users are over modems halfway across the planet?  Give me a
break...

Furthermore, all the tuning people do in each run optimizes specifically
for a local high-speed interconnect subnet: no limits on TCP or filesystem
memory use, and large cache sizes.  Nobody configures their machines this
way, unless they want remote users to be able to consume 90% or so of
their system memory with TCP socket memory.

Anyway, see my other posting: we'll be able to keep TSO enabled in
the face of packet loss, but that is an optimization, not a correctness
fix.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: on the wire behaviour of TSO on/off is supposed to be the same yes?
  2005-01-28  0:57                     ` David S. Miller
@ 2005-01-28  1:36                       ` Rick Jones
  0 siblings, 0 replies; 18+ messages in thread
From: Rick Jones @ 2005-01-28  1:36 UTC (permalink / raw)
  To: netdev

David S. Miller wrote:
> On Thu, 27 Jan 2005 16:10:46 -0800
> Rick Jones <rick.jones2@hp.com> wrote:
> 
> 
> 
>>The other relates to the business of disabling TSO on a connection upon packet loss.
> 
> 
> There cannot possibly be any compliance issues resulting from turning
> off an optimization in the face of packet loss.

I was a bit vague - compliance with the benchmark run and report rules, not with 
the RFCs.

>>Internet-connected systems experience non-trivial packet loss rates, and so if 
>>TSO is disabled upon packet loss, a given benchmark result using TSO deviates 
>>even more from reality than one without TSO.
> 
> 
> And running the benchmark over a local gigabit subnet doesn't deviate
> from what Internet-connected systems can expect to achieve how so?

Benchmarking, not logic...

> Oh, you mean I really can get 60,000 web or database connections a second
> when the users are over modems halfway across the planet?  Give me a
> break...

If there are enough users :)

> Anyway, see my other posting: we'll be able to keep TSO enabled in
> the face of packet loss, but that is an optimization, not a correctness
> fix.

Cool.

rick jones

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2005-01-28  1:36 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-21 19:01 on the wire behaviour of TSO on/off is supposed to be the same yes? Rick Jones
2005-01-21 19:58 ` Jon Mason
2005-01-21 20:18   ` Rick Jones
2005-01-21 20:44     ` David S. Miller
2005-01-21 22:00       ` Rick Jones
2005-01-21 22:18         ` David S. Miller
2005-01-21 22:48           ` Rick Jones
2005-01-21 22:58             ` Rick Jones
2005-01-22  4:44               ` David S. Miller
2005-01-22 18:58                 ` rick jones
2005-01-22  4:49             ` David S. Miller
2005-01-22 19:05               ` rick jones
2005-01-24 20:33               ` Rick Jones
2005-01-24 20:43                 ` David S. Miller
2005-01-24 21:22                   ` Rick Jones
2005-01-28  0:10                   ` Rick Jones
2005-01-28  0:57                     ` David S. Miller
2005-01-28  1:36                       ` Rick Jones
