* RFC: Nagle latency tuning @ 2008-09-08 21:56 Christopher Snook 2008-09-08 22:39 ` Rick Jones 2008-09-08 22:55 ` Andi Kleen 0 siblings, 2 replies; 44+ messages in thread From: Christopher Snook @ 2008-09-08 21:56 UTC (permalink / raw) To: Netdev Hey folks -- We frequently get requests from customers for a tunable to disable Nagle system-wide, to be bug-for-bug compatible with Solaris. We routinely reject these requests, as letting naive TCP apps accidentally flood the network is considered harmful. Still, it would be very nice if we could reduce Nagle-induced latencies system-wide, if we could do so without disabling Nagle completely. If you write a multi-threaded app that sends lots of small messages across TCP sockets, and you do not use TCP_NODELAY, you'll often see 40 ms latencies as the network stack waits for more senders to fill an MTU-sized packet before transmitting. Even worse, these apps may work fine across the LAN with a 1500 MTU and then counterintuitively perform much worse over loopback with a 16436 MTU. To combat this, many apps set TCP_NODELAY, often without the abundance of caution that option should entail. Other apps leave it alone, and suffer accordingly. If we could simply lower this latency, without changing the fundamental behavior of the TCP stack, it would be a great benefit to many latency-sensitive apps, and discourage the unnecessary use of TCP_NODELAY. I'm afraid I don't know the TCP stack intimately enough to understand what side effects this might have. Can someone more familiar with the nagle implementations please enlighten me on how this could be done, or why it shouldn't be? -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
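For reference, turning Nagle off from application code is a single setsockopt() call on the connected socket; a minimal sketch (the surrounding socket setup and error handling are assumed):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle on an already-connected TCP socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
static int set_nodelay(int sockfd)
{
        int one = 1;

        return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

Whether calling it is ever the right answer is exactly what the rest of the thread debates.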
* Re: RFC: Nagle latency tuning 2008-09-08 21:56 RFC: Nagle latency tuning Christopher Snook @ 2008-09-08 22:39 ` Rick Jones 2008-09-09 5:10 ` Chris Snook 2008-09-08 22:55 ` Andi Kleen 1 sibling, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-08 22:39 UTC (permalink / raw) To: Christopher Snook; +Cc: Netdev

Christopher Snook wrote:
> Hey folks --
>
> We frequently get requests from customers for a tunable to disable Nagle
> system-wide, to be bug-for-bug compatible with Solaris.

Which ndd setting is that in Solaris, and is it an explicit disabling of Nagle (which wouldn't be much better than arbitrary setting of TCP_NODELAY by apps anyway), or is it a tuning of the send size against which Nagle is comparing?

> We routinely reject these requests, as letting naive TCP apps
> accidentally flood the network is considered harmful. Still, it would
> be very nice if we could reduce Nagle-induced latencies system-wide,
> if we could do so without disabling Nagle completely.
>
> If you write a multi-threaded app that sends lots of small messages
> across TCP sockets, and you do not use TCP_NODELAY, you'll often see 40
> ms latencies as the network stack waits for more senders to fill an
> MTU-sized packet before transmitting.

How does an application being multi-threaded enter into it? IIRC, it is simply a matter of wanting to go "write, write, read" on the socket where the writes are sub-MSS.

> Even worse, these apps may work fine across the LAN with a 1500 MTU
> and then counterintuitively perform much worse over loopback with a
> 16436 MTU.

Without knowing if those apps were fundamentally broken and just got "lucky" at a 1500 byte MTU we cannot really say if it is truly counterintuitive :)

> To combat this, many apps set TCP_NODELAY, often without the abundance
> of caution that option should entail. Other apps leave it alone, and
> suffer accordingly.
>
> If we could simply lower this latency, without changing the fundamental
> behavior of the TCP stack, it would be a great benefit to many
> latency-sensitive apps, and discourage the unnecessary use of TCP_NODELAY.
>
> I'm afraid I don't know the TCP stack intimately enough to understand
> what side effects this might have. Can someone more familiar with the
> nagle implementations please enlighten me on how this could be done, or
> why it shouldn't be?

IIRC, the only way to lower the latency experienced by an application running into latencies associated with poor interaction with Nagle is to either start generating immediate ACKnowledgements at the receiver, lower the standalone ACK timer on the receiver, or start a very short timer on the sender. I doubt that (m)any of those are terribly palatable.

Below is some boilerplate I have on Nagle that isn't Linux-specific:

<begin>

$ cat usenet_replies/nagle_algorithm

> I'm not familiar with this issue, and I'm mostly ignorant about what
> tcp does below the sockets interface. Can anybody briefly explain what
> "nagle" is, and how and when to turn it off? Or point me to the
> appropriate manual.

In broad terms, whenever an application does a send() call, the logic of the Nagle algorithm is supposed to go something like this:

1) Is the quantity of data in this send, plus any queued, unsent data, greater than the MSS (Maximum Segment Size) for this connection? If yes, send the data in the user's send now (modulo any other constraints such as receiver's advertised window and the TCP congestion window). If no, go to 2.

2) Is the connection to the remote otherwise idle?
That is, is there no unACKed data outstanding on the network. If yes, send the data in the user's send now. If no, queue the data and wait. Either the application will continue to call send() with enough data to get to a full MSS-worth of data, or the remote will ACK all the currently sent, unACKed data, or our retransmission timer will expire.

Now, where applications run into trouble is when they have what might be described as "write, write, read" behaviour, where they present logically associated data to the transport in separate 'send' calls and those sends are typically less than the MSS for the connection. It isn't so much that they run afoul of Nagle as they run into issues with the interaction of Nagle and the other heuristics operating on the remote. In particular, the delayed ACK heuristics.

When a receiving TCP is deciding whether or not to send an ACK back to the sender, in broad handwaving terms it goes through logic similar to this:

a) is there data being sent back to the sender? if yes, piggy-back the ACK on the data segment.

b) is there a window update being sent back to the sender? if yes, piggy-back the ACK on the window update.

c) has the standalone ACK timer expired.

Window updates are generally triggered by the following heuristics:

i) would the window update be for a non-trivial fraction of the window - typically somewhere at or above 1/4 the window, that is, has the application "consumed" at least that much data? if yes, send a window update. if no, check ii.

ii) would the window update be for, the application "consumed," at least 2*MSS worth of data? if yes, send a window update, if no wait.

Now, going back to that write, write, read application, on the sending side, the first write will be transmitted by TCP via logic rule 2 - the connection is otherwise idle. However, the second small send will be delayed as there is at that point unACKnowledged data outstanding on the connection.

At the receiver, that small TCP segment will arrive and will be passed to the application. The application does not have the entire app-level message, so it will not send a reply (data to TCP) back. The typical TCP window is much much larger than the MSS, so no window update would be triggered by heuristic i. The data just arrived is < 2*MSS, so no window update from heuristic ii. Since there is no window update, no ACK is sent by heuristic b.

So, that leaves heuristic c - the standalone ACK timer. That ranges anywhere between 50 and 200 milliseconds depending on the TCP stack in use.

If you've read this far :) now we can take a look at the effect of various things touted as "fixes" to applications experiencing this interaction. We take as our example a client-server application where both the client and the server are implemented with a write of a small application header, followed by application data. First, the "default" case which is with Nagle enabled (TCP_NODELAY _NOT_ set) and with standard ACK behaviour:

Client                          Server
Req Header        ->
                  <-            Standalone ACK after Nms
Req Data          ->
                  <-            Possible standalone ACK
                  <-            Rsp Header
Standalone ACK    ->
                  <-            Rsp Data
Possible standalone ACK ->

For two "messages" we end-up with at least six segments on the wire. The possible standalone ACKs will depend on whether the server's response time, or client's think time is longer than the standalone ACK interval on their respective sides.
Now, if TCP_NODELAY is set we see:

Client                          Server
Req Header        ->
Req Data          ->
                  <-            Possible Standalone ACK after Nms
                  <-            Rsp Header
                  <-            Rsp Data
Possible Standalone ACK ->

In theory, we are down to four segments on the wire which seems good, but frankly we can do better. First though, consider what happens when someone disables delayed ACKs:

Client                          Server
Req Header        ->
                  <-            Immediate Standalone ACK
Req Data          ->
                  <-            Immediate Standalone ACK
                  <-            Rsp Header
Immediate Standalone ACK ->
                  <-            Rsp Data
Immediate Standalone ACK ->

Now we definitely see 8 segments on the wire. It will also be that way if both TCP_NODELAY is set and delayed ACKs are disabled.

How about if the application did the "right" thing in the first place? That is, sent the logically associated data at the same time:

Client                          Server
Request           ->
                  <-            Possible Standalone ACK
                  <-            Response
Possible Standalone ACK ->

We are down to two segments on the wire.

For "small" packets, the CPU cost is about the same regardless of data or ACK. This means that the application which is making the proper gathering send call will spend far fewer CPU cycles in the networking stack.

<end>

Now, there are further wrinkles :) Is that application trying to pipeline requests on the application - then we have paths that can look rather like the separate header from data cases above until the concurrent requests outstanding get above the MSS threshold.

My recollection of the original Nagle writeups is the intention is to optimize the ratio of data to data+headers. Back when those writeups were made, 536 byte MSSes were still considered pretty large, and 1460 would have been positively spacious. I doubt that anyone was considering the probability of a 16384 byte MTU. It could be argued that in such an environment of the time period, where stack tunables weren't all the rage and the MSS ranges were reasonably well bounded, it was a sufficient expedient to base the "is this enough data" decision off the MSS for the connection. You certainly couldn't do any better than an MSS's worth of data per segment and segment sizes weren't astronomical. Now that MTUs and MSSes can get so much larger, that expedient may indeed not be so worthwhile. An argument could be made that a ratio of data to data plus headers of say 0.97 (1448/1500) is "good enough" and that requiring a ratio of 16384/16436 = 0.9968 is taking things too far by default.

That said, I still don't like covering the backsides of poorly written applications doing two or more writes for logically associated data.

rick jones

^ permalink raw reply [flat|nested] 44+ messages in thread
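Rick's rules 1) and 2) can be restated as a small decision function. This is illustrative pseudo-C, not the kernel's implementation; the struct fields and helper name are made up for clarity:

/* Illustrative pseudo-C of the Nagle send decision described above. */
struct conn {
        unsigned int mss;               /* MSS for this connection */
        unsigned int unacked_bytes;     /* data in flight, not yet ACKed */
};

static int nagle_should_send_now(const struct conn *c, unsigned int queued_plus_new)
{
        /* Rule 1: a full MSS worth of data can always go out
         * (still subject to the receiver and congestion windows). */
        if (queued_plus_new >= c->mss)
                return 1;

        /* Rule 2: if nothing is in flight, send the small segment now. */
        if (c->unacked_bytes == 0)
                return 1;

        /* Otherwise queue it and wait for more data, an ACK from the
         * remote, or the retransmission timer. */
        return 0;
}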
* Re: RFC: Nagle latency tuning 2008-09-08 22:39 ` Rick Jones @ 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 5:10 UTC (permalink / raw) To: Rick Jones; +Cc: Netdev Rick Jones wrote: > Christopher Snook wrote: >> Hey folks -- >> >> We frequently get requests from customers for a tunable to disable >> Nagle system-wide, to be bug-for-bug compatible with Solaris. > > Which ndd setting is that in Solaris, and is it an explicit disabling of > Nagle (which wouldn't be much better than arbitrary setting of > TCP_NODELAY by apps anyway), or is it a tuning of the send size against > which Nagle is comparing? Dunno, but I'm told it effectively sets TCP_NODELAY on every socket on the box. >> We routinely reject these requests, as letting naive TCP apps >> accidentally flood the network is considered harmful. Still, it would >> be very nice if we could reduce Nagle-induced latencies system-wide, >> if we could do so without disabling Nagle completely. >> >> If you write a multi-threaded app that sends lots of small messages >> across TCP sockets, and you do not use TCP_NODELAY, you'll often see >> 40 ms latencies as the network stack waits for more senders to fill an >> MTU-sized packet before transmitting. > > How does an application being multi-threaded enter into it? IIRC, it is > simply a matter of wanting to go "write, write, read" on the socket > where the writes are sub-MSS. Sorry, I'm getting my problems confused. Being multi-threaded isn't the root problem, it just makes the behavior much less predictable. Instead of getting the latency every other write, you might get it once in every million writes on a highly-threaded workload, which masks the source of the problem. >> Even worse, these apps may work fine across the LAN with a 1500 MTU >> and then counterintuitively perform much worse over loopback with a >> 16436 MTU. > > Without knowing if those apps were fundamentally broken and just got > "lucky" at a 1500 byte MTU we cannot really say if it is truly > counterintuitive :) This is open to debate, but there are certainly a great many apps doing a great deal of very important business that are subject to this problem to some degree. >> To combat this, many apps set TCP_NODELAY, often without the abundance >> of caution that option should entail. Other apps leave it alone, and >> suffer accordingly. >> >> If we could simply lower this latency, without changing the >> fundamental behavior of the TCP stack, it would be a great benefit to >> many latency-sensitive apps, and discourage the unnecessary use of >> TCP_NODELAY. >> >> I'm afraid I don't know the TCP stack intimately enough to understand >> what side effects this might have. Can someone more familiar with the >> nagle implementations please enlighten me on how this could be done, >> or why it shouldn't be? > > > IIRC, the only way to lower the latency experienced by an application > running into latencies associated with poor interaction with Nagle is to > either start generating immediate ACKnowledgements at the reciever, > lower the standalone ACK timer on the receiver, or start a very short > timer on the sender. I doubt that (m)any of those are terribly palatable. I'd like to know where the 40 ms magic number comes from. That's the one that really hurts, and if we could lower that without doing horrible things elsewhere in the stack, as a non-default tuning option, a lot of people would be very happy. 
> Below is some boilerplate I have on Nagle that isn't Linux-specific: > > <begin> > > $ cat usenet_replies/nagle_algorithm > > > I'm not familiar with this issue, and I'm mostly ignorant about what > > tcp does below the sockets interface. Can anybody briefly explain what > > "nagle" is, and how and when to turn it off? Or point me to the > > appropriate manual. > > In broad terms, whenever an application does a send() call, the logic > of the Nagle algorithm is supposed to go something like this: > > 1) Is the quantity of data in this send, plus any queued, unsent data, > greater than the MSS (Maximum Segment Size) for this connection? If > yes, send the data in the user's send now (modulo any other > constraints such as receiver's advertised window and the TCP > congestion window). If no, go to 2. > > 2) Is the connection to the remote otherwise idle? That is, is there > no unACKed data outstanding on the network. If yes, send the data in > the user's send now. If no, queue the data and wait. Either the > application will continue to call send() with enough data to get to a > full MSS-worth of data, or the remote will ACK all the currently sent, > unACKed data, or our retransmission timer will expire. > > Now, where applications run into trouble is when they have what might > be described as "write, write, read" behaviour, where they present > logically associated data to the transport in separate 'send' calls > and those sends are typically less than the MSS for the connection. > It isn't so much that they run afoul of Nagle as they run into issues > with the interaction of Nagle and the other heuristics operating on > the remote. In particular, the delayed ACK heuristics. > > When a receiving TCP is deciding whether or not to send an ACK back to > the sender, in broad handwaving terms it goes through logic similar to > this: > > a) is there data being sent back to the sender? if yes, piggy-back the > ACK on the data segment. > > b) is there a window update being sent back to the sender? if yes, > piggy-back the ACK on the window update. > > c) has the standalone ACK timer expired. > > Window updates are generally triggered by the following heuristics: > > i) would the window update be for a non-trivial fraction of the window > - typically somewhere at or above 1/4 the window, that is, has the > application "consumed" at least that much data? if yes, send a > window update. if no, check ii. > > ii) would the window update be for, the application "consumed," at > least 2*MSS worth of data? if yes, send a window update, if no wait. > > Now, going back to that write, write, read application, on the sending > side, the first write will be transmitted by TCP via logic rule 2 - > the connection is otherwise idle. However, the second small send will > be delayed as there is at that point unACKnowledged data outstanding > on the connection. > > At the receiver, that small TCP segment will arrive and will be passed > to the application. The application does not have the entire app-level > message, so it will not send a reply (data to TCP) back. The typical > TCP window is much much larger than the MSS, so no window update would > be triggered by heuristic i. The data just arrived is < 2*MSS, so no > window update from heuristic ii. Since there is no window update, no > ACK is sent by heuristic b. > > So, that leaves heuristic c - the standalone ACK timer. That ranges > anywhere between 50 and 200 milliseconds depending on the TCP stack in > use. 
> > If you've read this far :) now we can take a look at the effect of > various things touted as "fixes" to applications experiencing this > interaction. We take as our example a client-server application where > both the client and the server are implemented with a write of a small > application header, followed by application data. First, the > "default" case which is with Nagle enabled (TCP_NODELAY _NOT_ set) and > with standard ACK behaviour: > > Client Server > Req Header -> > <- Standalone ACK after Nms > Req Data -> > <- Possible standalone ACK > <- Rsp Header > Standalone ACK -> > <- Rsp Data > Possible standalone ACK -> > > > For two "messages" we end-up with at least six segments on the wire. > The possible standalone ACKs will depend on whether the server's > response time, or client's think time is longer than the standalone > ACK interval on their respective sides. Now, if TCP_NODELAY is set we > see: > > > Client Server > Req Header -> > Req Data -> > <- Possible Standalone ACK after Nms > <- Rsp Header > <- Rsp Data > Possible Standalone ACK -> > > In theory, we are down two four segments on the wire which seems good, > but frankly we can do better. First though, consider what happens > when someone disables delayed ACKs > > Client Server > Req Header -> > <- Immediate Standalone ACK > Req Data -> > <- Immediate Standalone ACK > <- Rsp Header > Immediate Standalone ACK -> > <- Rsp Data > Immediate Standalone ACK -> > > Now we definitly see 8 segments on the wire. It will also be that way > if both TCP_NODELAY is set and delayed ACKs are disabled. > > How about if the application did the "right" think in the first place? > That is sent the logically associated data at the same time: > > > Client Server > Request -> > <- Possible Standalone ACK > <- Response > Possible Standalone ACK -> > > We are down to two segments on the wire. > > For "small" packets, the CPU cost is about the same regardless of data > or ACK. This means that the application which is making the propper > gathering send call will spend far fewer CPU cycles in the networking > stack. > > <end> > > Now, there are further wrinkles :) Is that application trying to > pipeline requests on the application - then we have paths that can look > rather like the separate header from data cases above until the > concurrent requests outstanding get above the MSS threshold. > > My recollection of the original Nagle writeups is the intention is to > optimize the ratio of data to data+headers. Back when those writeups > were made, 536 byte MSSes were still considered pretty large, and 1460 > would have been positively spacious. I doubt that anyone were > considering the probability of a 16384 byte MTU. It could be argued > that in such an environment of the timeperiod, where stack tunables > weren't all the rage and the MSS ranges were reasonably well bounded, it > was a sufficient expedient to base the "is this enough data" decision > off the MSS for the connection. You certainly couldn't do any better > than an MSS's worth of data per segment and segment sizes weren't > astronomical. Now that MTU's and MSS's can get so much larger, that > expedient may indeed not be so worthwhile. An argument could be made > that a ratio of data to data plus headers of say 0.97 (1448/1500) is > "good enough" and that requiring a ratio of 16384/16436 = 0.9968 is > taking things too far by default. 
> > That said, I still don't like covering the backsides of poorly written > applications doing two or more writes for logically associated data. > > rick jones Most of the apps where people care about this enough to complain to their vendor (the cases I hear about) are in messaging apps, where they're relaying a stream of events that have little to do with each other, and they want TCP to maintain the integrity of the connection and do a modicum of bandwidth management, but 40 ms stalls greatly exceed their latency tolerances. Using TCP_NODELAY is often the least bad option, but sometimes it's infeasible because of its effect on the network, and it certainly adds to the network stack overhead. A more tunable Nagle delay would probably serve many of these apps much better. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:10 ` Chris Snook @ 2008-09-09 5:17 ` David Miller 2008-09-09 5:56 ` Chris Snook 2008-09-09 14:36 ` Andi Kleen 2008-09-09 16:33 ` Rick Jones 2 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-09 5:17 UTC (permalink / raw) To: csnook; +Cc: rick.jones2, netdev

From: Chris Snook <csnook@redhat.com>
Date: Tue, 09 Sep 2008 01:10:05 -0400

> This is open to debate, but there are certainly a great many apps
> doing a great deal of very important business that are subject to
> this problem to some degree.

Let's be frank and be honest that we're talking about message passing financial service applications.

And I specifically know that the problem they run into is that the congestion window doesn't open up because of Nagle _AND_ the fact that congestion control is done using packet counts rather than data byte totals. So if you send lots of small stuff, the window doesn't open. Nagle just makes this problem worse, rather than create it.

And we have a workaround for them, which is a combination of the tcp_slow_start_after_idle sysctl and route metrics specifying the initial congestion window value to use.

I specifically added that sysctl for this specific situation.

Really, the situation here is well established and I highly encourage you to take a deeper look into the actual problems being hit, and show us some specific traces we can analyze properly if the problem is still there.

Otherwise we're just shooting into the wind without any specifics to work on whatsoever.

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:17 ` David Miller @ 2008-09-09 5:56 ` Chris Snook 2008-09-09 6:02 ` David Miller 2008-09-09 6:22 ` Evgeniy Polyakov 0 siblings, 2 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 5:56 UTC (permalink / raw) To: David Miller; +Cc: rick.jones2, netdev David Miller wrote: > From: Chris Snook <csnook@redhat.com> > Date: Tue, 09 Sep 2008 01:10:05 -0400 > >> This is open to debate, but there are certainly a great many apps >> doing a great deal of very important business that are subject to >> this problem to some degree. > > Let's be frank and be honest that we're talking about message passing > financial service applications. Mostly. > And I specifically know that the problem they run into is that the > congestion window doesn't open up because of Nagle _AND_ the fact that > congestion control is done using packet counts rather that data byte > totals. So if you send lots of small stuff, the window doesn't open. > Nagle just makes this problem worse, rather than create it. > > And we have a workaround for them, which is a combination of the > tcp_slow_start_after_idle sysctl in combination with route metrics > specifying the initial congestion window value to use. > > I specifically added that sysctl for this specific situation. That's not the problem I'm talking about here. The problem I'm seeing is that if your burst of messages is too small to fill the MTU, the network stack will just sit there and stare at you for precisely 40 ms (an eternity for a financial app) before transmitting. Andi may be correct that it's actually the delayed ACK we're seeing, but I can't figure out where that 40 ms magic number is coming from. The easiest way to see the problem is to open a TCP socket to an echo daemon on loopback, make a bunch of small writes totaling less than your loopback MTU (accounting for overhead), and see how long it takes to get your echoes. You can probably do this with netcat, though I haven't tried. People don't expect loopback to have 40 ms latency when the box is lightly loaded, so they'd really like to tweak that down when it's hurting them. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
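A minimal sketch of the kind of loopback measurement Chris describes, assuming an echo service is already listening on 127.0.0.1 port 7; note, as comes up later in the thread, that whether the stall actually shows up depends on how the echo daemon batches its replies:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Send one logical message as several sub-MSS writes to an echo server
 * on loopback and time how long the full echo takes to come back. */
int main(void)
{
        struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(7) };
        char chunk[100] = { 0 }, buf[1024];
        struct timeval t0, t1;
        int fd, i, got = 0;

        inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                return 1;

        gettimeofday(&t0, NULL);
        for (i = 0; i < 4; i++)                 /* four small writes, one message */
                write(fd, chunk, sizeof(chunk));
        while (got < 4 * (int)sizeof(chunk)) {  /* wait for the complete echo */
                int n = read(fd, buf, sizeof(buf));
                if (n <= 0)
                        break;
                got += n;
        }
        gettimeofday(&t1, NULL);

        printf("echo round trip: %ld us\n",
               (long)((t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec)));
        close(fd);
        return 0;
}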
* Re: RFC: Nagle latency tuning 2008-09-09 5:56 ` Chris Snook @ 2008-09-09 6:02 ` David Miller 2008-09-09 10:31 ` Mark Brown 2008-09-09 6:22 ` Evgeniy Polyakov 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-09 6:02 UTC (permalink / raw) To: csnook; +Cc: rick.jones2, netdev From: Chris Snook <csnook@redhat.com> Date: Tue, 09 Sep 2008 01:56:12 -0400 [ Please hit enter every 80 columns or so, your emails are unreadable until I reformat your text by hand, thanks. ] > That's not the problem I'm talking about here. The problem I'm > seeing is that if your burst of messages is too small to fill the > MTU, the network stack will just sit there and stare at you for > precisely 40 ms (an eternity for a financial app) before > transmitting. Andi may be correct that it's actually the delayed > ACK we're seeing, but I can't figure out where that 40 ms magic > number is coming from. > > The easiest way to see the problem is to open a TCP socket to an > echo daemon on loopback, make a bunch of small writes totaling less > than your loopback MTU (accounting for overhead), and see how long > it takes to get your echoes. You can probably do this with netcat, > though I haven't tried. People don't expect loopback to have 40 ms > latency when the box is lightly loaded, so they'd really like to > tweak that down when it's hurting them. That's informative, but please provide a specific test case and example trace so we can discuss something concrete. Thank you. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 6:02 ` David Miller @ 2008-09-09 10:31 ` Mark Brown 2008-09-09 12:05 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Mark Brown @ 2008-09-09 10:31 UTC (permalink / raw) To: David Miller; +Cc: csnook, rick.jones2, netdev

On Mon, Sep 08, 2008 at 11:02:21PM -0700, David Miller wrote:
> From: Chris Snook <csnook@redhat.com>
> Date: Tue, 09 Sep 2008 01:56:12 -0400

> [ Please hit enter every 80 columns or so, your emails are
> unreadable until I reformat your text by hand, thanks. ]

FWIW the problem with this and some other mails you raised this on recently is that they're sent with format=flowed, which tells the receiving MUA that it's OK to reflow the text to fit the display.

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 10:31 ` Mark Brown @ 2008-09-09 12:05 ` David Miller 2008-09-09 12:09 ` Mark Brown 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-09 12:05 UTC (permalink / raw) To: broonie; +Cc: csnook, rick.jones2, netdev From: Mark Brown <broonie@sirena.org.uk> Date: Tue, 9 Sep 2008 11:31:38 +0100 > On Mon, Sep 08, 2008 at 11:02:21PM -0700, David Miller wrote: > > From: Chris Snook <csnook@redhat.com> > > Date: Tue, 09 Sep 2008 01:56:12 -0400 > > > [ Please hit enter every 80 columns or so, your emails are > > unreadable until I reformat your text by hand, thanks. ] > > FWIW the problem with this and some other mails you raised this on > recently is that they're sent with format=folowed which tells the > receiving MUA that it's OK to reflow the text to fit the display. I'll have to see why MEW isn't handling that correctly then, it even claims to support this in the version I am using :) ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 12:05 ` David Miller @ 2008-09-09 12:09 ` Mark Brown 2008-09-09 12:19 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Mark Brown @ 2008-09-09 12:09 UTC (permalink / raw) To: David Miller; +Cc: csnook, rick.jones2, netdev On Tue, Sep 09, 2008 at 05:05:00AM -0700, David Miller wrote: > From: Mark Brown <broonie@sirena.org.uk> > > FWIW the problem with this and some other mails you raised this on > > recently is that they're sent with format=folowed which tells the > > receiving MUA that it's OK to reflow the text to fit the display. > I'll have to see why MEW isn't handling that correctly then, > it even claims to support this in the version I am using :) Handling it correctly is probably the problem - if it's working as expected then when your window is wide enough to display lines longer than 80 columns it'll go ahead and reflow the text to fill the window. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 12:09 ` Mark Brown @ 2008-09-09 12:19 ` David Miller 0 siblings, 0 replies; 44+ messages in thread From: David Miller @ 2008-09-09 12:19 UTC (permalink / raw) To: broonie; +Cc: csnook, rick.jones2, netdev From: Mark Brown <broonie@sirena.org.uk> Date: Tue, 9 Sep 2008 13:09:34 +0100 > On Tue, Sep 09, 2008 at 05:05:00AM -0700, David Miller wrote: > > From: Mark Brown <broonie@sirena.org.uk> > > > > FWIW the problem with this and some other mails you raised this on > > > recently is that they're sent with format=folowed which tells the > > > receiving MUA that it's OK to reflow the text to fit the display. > > > I'll have to see why MEW isn't handling that correctly then, > > it even claims to support this in the version I am using :) > > Handling it correctly is probably the problem - if it's working as > expected then when your window is wide enough to display lines longer > than 80 columns it'll go ahead and reflow the text to fill the window. I'm using a VC console screen ~160 characters wide, and many lines were wrapped instead of flowed. Emacs puts a special character at the end of the line when it has to be wrapped, and I saw those when viewing the mails in question. When I see that character when reading emails, my blood starts to boil :) ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:56 ` Chris Snook 2008-09-09 6:02 ` David Miller @ 2008-09-09 6:22 ` Evgeniy Polyakov 2008-09-09 6:28 ` Chris Snook 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 1 sibling, 2 replies; 44+ messages in thread From: Evgeniy Polyakov @ 2008-09-09 6:22 UTC (permalink / raw) To: Chris Snook; +Cc: David Miller, rick.jones2, netdev Hi. On Tue, Sep 09, 2008 at 01:56:12AM -0400, Chris Snook (csnook@redhat.com) wrote: > The easiest way to see the problem is to open a TCP socket to an echo > daemon on loopback, make a bunch of small writes totaling less than your > loopback MTU (accounting for overhead), and see how long it takes to get > your echoes. You can probably do this with netcat, though I haven't > tried. People don't expect loopback to have 40 ms latency when the box > is lightly loaded, so they'd really like to tweak that down when it's > hurting them. Isn't Nagle without corking a very bad idea? Or you can not change the application? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 6:22 ` Evgeniy Polyakov @ 2008-09-09 6:28 ` Chris Snook 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 6:28 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: David Miller, rick.jones2, netdev Evgeniy Polyakov wrote: > Hi. > > On Tue, Sep 09, 2008 at 01:56:12AM -0400, Chris Snook (csnook@redhat.com) wrote: >> The easiest way to see the problem is to open a TCP socket to an echo >> daemon on loopback, make a bunch of small writes totaling less than your >> loopback MTU (accounting for overhead), and see how long it takes to get >> your echoes. You can probably do this with netcat, though I haven't >> tried. People don't expect loopback to have 40 ms latency when the box >> is lightly loaded, so they'd really like to tweak that down when it's >> hurting them. > > Isn't Nagle without corking a very bad idea? Or you can not change the > application? > Yes, it is a bad idea. We want to make the corking tunable, so people don't disable it completely to avoid these 40 ms latencies. Also, we often can't change the application, so tuning this system-wide would be nice, and would be a lot less dangerous than turning on TCP_NODELAY system-wide the way people often do with solaris. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
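For comparison, explicit corking around the two writes looks roughly like this; the header/data buffers and the already-connected socket are assumed, and TCP_CORK is Linux-specific:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send a logically associated header + payload as one segment by
 * corking around the two writes; uncorking flushes whatever is queued,
 * so the pair goes out without waiting on Nagle/delayed-ACK timers. */
static void send_corked(int fd, const void *hdr, size_t hdrlen,
                        const void *data, size_t datalen)
{
        int on = 1, off = 0;

        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        write(fd, hdr, hdrlen);
        write(fd, data, datalen);
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}

The libautocork library mentioned in the next message automates this idea from an LD_PRELOAD shim, so the application itself does not have to change.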
* Re: RFC: Nagle latency tuning 2008-09-09 6:22 ` Evgeniy Polyakov 2008-09-09 6:28 ` Chris Snook @ 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 44+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-09-09 13:00 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: Chris Snook, David Miller, rick.jones2, netdev

Em Tue, Sep 09, 2008 at 10:22:08AM +0400, Evgeniy Polyakov escreveu:
> Hi.
>
> On Tue, Sep 09, 2008 at 01:56:12AM -0400, Chris Snook (csnook@redhat.com) wrote:
> > The easiest way to see the problem is to open a TCP socket to an echo
> > daemon on loopback, make a bunch of small writes totaling less than your
> > loopback MTU (accounting for overhead), and see how long it takes to get
> > your echoes. You can probably do this with netcat, though I haven't
> > tried. People don't expect loopback to have 40 ms latency when the box
> > is lightly loaded, so they'd really like to tweak that down when it's
> > hurting them.
>
> Isn't Nagle without corking a very bad idea? Or you can not change the
> application?

In one such case, a financial app building logical packets via several small buffer send calls, I got it working with an "autocorking" LD_PRELOAD library, libautocork:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git
git://git.kernel.org/pub/scm/linux/kernel/git/acme/libautocork.git

Details/test cases:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob_plain;f=tcp_nodelay.txt

How to use it, what you get from using it:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob_plain;f=libautocork.txt

- Arnaldo

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller @ 2008-09-09 14:36 ` Andi Kleen 2008-09-09 18:40 ` Chris Snook 2008-09-09 16:33 ` Rick Jones 2 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-09 14:36 UTC (permalink / raw) To: Chris Snook; +Cc: Rick Jones, Netdev

Chris Snook <csnook@redhat.com> writes:
>
> I'd like to know where the 40 ms magic number comes from.

From TCP_ATO_MIN

#define TCP_ATO_MIN ((unsigned)(HZ/25))

> That's the
> one that really hurts, and if we could lower that without doing
> horrible things elsewhere in the stack,

You can lower it (with likely some bad side effects), but I don't think it would make these apps very happy in the end because they likely want no delay at all.

-Andi

^ permalink raw reply [flat|nested] 44+ messages in thread
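The arithmetic behind that constant, for completeness: a jiffy is 1/HZ seconds, so HZ/25 jiffies is the same wall-clock interval no matter what HZ is configured (for HZ >= 100):

        (HZ / 25) jiffies * (1 / HZ) seconds per jiffy = 1/25 s = 40 ms

TCP_DELACK_MIN is defined to the same value, which is why the delayed-ACK floor and the quick-ACK timeout both come out at 40 ms.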
* Re: RFC: Nagle latency tuning 2008-09-09 14:36 ` Andi Kleen @ 2008-09-09 18:40 ` Chris Snook 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:59 ` David Miller 0 siblings, 2 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 18:40 UTC (permalink / raw) To: Andi Kleen; +Cc: Rick Jones, Netdev Andi Kleen wrote: > Chris Snook <csnook@redhat.com> writes: >> I'd like to know where the 40 ms magic number comes from. > > From TCP_ATO_MIN > > #define TCP_ATO_MIN ((unsigned)(HZ/25)) > >> That's the >> one that really hurts, and if we could lower that without doing >> horrible things elsewhere in the stack, > > You can lower it (with likely some bad side effects), but I don't think it > would make these apps very happy in the end because they likely want > no delay at all. > > -Andi These apps have a love/hate relationship with TCP. They'll probably love SCTP 5 years from now, but it's not mature enough for them yet. They do want to minimize all latencies, and many of the apps explicitly set TCP_NODELAY. The goal here is to improve latencies on the supporting apps that aren't quite as carefully optimized as the main message daemons themselves. If we can give them a knob that bounds their worst-case latency to 2-3 times their average latency, without risking network floods that won't show up in testing, they'll be much happier. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 18:40 ` Chris Snook @ 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:21 ` Arnaldo Carvalho de Melo 2008-09-11 4:08 ` Chris Snook 2008-09-09 19:59 ` David Miller 1 sibling, 2 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-09 19:07 UTC (permalink / raw) To: Chris Snook; +Cc: Andi Kleen, Rick Jones, Netdev > These apps have a love/hate relationship with TCP. They'll probably love > SCTP 5 years from now, but it's not mature enough for them yet. They do > want to minimize all latencies, Then they should just TCP_NODELAY. > and many of the apps explicitly set > TCP_NODELAY. That's the right thing for them. > The goal here is to improve latencies on the supporting apps > that aren't quite as carefully optimized as the main message daemons > themselves. If we can give them a knob that bounds their worst-case > latency to 2-3 times their average latency, without risking network floods > that won't show up in testing, they'll be much happier. Hmm in theory I don't see a big drawback in making the these defaults sysctls. As in this untested patch. It's probably not the right solution for this problem. Still if you want to experiment. This makes both the ato default and the delack default tunable. You'll have to restart sockets for it to take effect. -Andi --- Make ato min and delack min tunable This might potentially help with some programs which have problems with nagle. Sockets have to be restarted TBD documentation for the new sysctls Signed-off-by: Andi Kleen <ak@linux.intel.com> Index: linux-2.6.27-rc4-misc/include/net/tcp.h =================================================================== --- linux-2.6.27-rc4-misc.orig/include/net/tcp.h +++ linux-2.6.27-rc4-misc/include/net/tcp.h @@ -118,12 +118,16 @@ extern void tcp_time_wait(struct sock *s #define TCP_DELACK_MAX ((unsigned)(HZ/5)) /* maximal time to delay before sending an ACK */ #if HZ >= 100 -#define TCP_DELACK_MIN ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ -#define TCP_ATO_MIN ((unsigned)(HZ/25)) +#define TCP_DELACK_MIN_DEFAULT ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ +#define TCP_ATO_MIN_DEFAULT ((unsigned)(HZ/25)) #else -#define TCP_DELACK_MIN 4U -#define TCP_ATO_MIN 4U +#define TCP_DELACK_MIN_DEFAULT 4U +#define TCP_ATO_MIN_DEFAULT 4U #endif + +#define TCP_DELACK_MIN sysctl_tcp_delack_min +#define TCP_ATO_MIN sysctl_tcp_ato_min + #define TCP_RTO_MAX ((unsigned)(120*HZ)) #define TCP_RTO_MIN ((unsigned)(HZ/5)) #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value */ @@ -236,6 +240,8 @@ extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_max_ssthresh; +extern int sysctl_tcp_ato_min; +extern int sysctl_tcp_delack_min; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; Index: linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c =================================================================== --- linux-2.6.27-rc4-misc.orig/net/ipv4/sysctl_net_ipv4.c +++ linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c @@ -717,6 +717,24 @@ static struct ctl_table ipv4_table[] = { }, { .ctl_name = CTL_UNNUMBERED, + .procname = "tcp_delack_min", + .data = &sysctl_tcp_delack_min, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_jiffies, + .strategy = &sysctl_jiffies + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "tcp_ato_min", + .data = &sysctl_tcp_ato_min, + .maxlen = sizeof(int), + 
.mode = 0644, + .proc_handler = &proc_dointvec_jiffies, + .strategy = &sysctl_jiffies + }, + { + .ctl_name = CTL_UNNUMBERED, .procname = "udp_mem", .data = &sysctl_udp_mem, .maxlen = sizeof(sysctl_udp_mem), Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c =================================================================== --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_timer.c +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c @@ -29,6 +29,8 @@ int sysctl_tcp_keepalive_intvl __read_mo int sysctl_tcp_retries1 __read_mostly = TCP_RETR1; int sysctl_tcp_retries2 __read_mostly = TCP_RETR2; int sysctl_tcp_orphan_retries __read_mostly; +int sysctl_tcp_delack_min __read_mostly = TCP_DELACK_MIN_DEFAULT; +int sysctl_tcp_ato_min __read_mostly = TCP_ATO_MIN_DEFAULT; static void tcp_write_timer(unsigned long); static void tcp_delack_timer(unsigned long); Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c =================================================================== --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_output.c +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c @@ -2436,7 +2436,7 @@ void tcp_send_delayed_ack(struct sock *s * directly. */ if (tp->srtt) { - int rtt = max(tp->srtt >> 3, TCP_DELACK_MIN); + int rtt = max_t(unsigned, tp->srtt >> 3, TCP_DELACK_MIN); if (rtt < max_ato) max_ato = rtt; -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 19:07 ` Andi Kleen @ 2008-09-09 19:21 ` Arnaldo Carvalho de Melo 2008-09-11 4:08 ` Chris Snook 1 sibling, 0 replies; 44+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-09-09 19:21 UTC (permalink / raw) To: Andi Kleen; +Cc: Chris Snook, Rick Jones, Netdev Em Tue, Sep 09, 2008 at 09:07:37PM +0200, Andi Kleen escreveu: > > These apps have a love/hate relationship with TCP. They'll probably love > > SCTP 5 years from now, but it's not mature enough for them yet. They do > > want to minimize all latencies, > > Then they should just TCP_NODELAY. > > > and many of the apps explicitly set > > TCP_NODELAY. > > That's the right thing for them. But please ask them to use writev or build the logical packet in userspace, sending it as just one buffer, or they will start asking for nagle tunables ;-) - Arnaldo ^ permalink raw reply [flat|nested] 44+ messages in thread
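The writev() variant of that advice, sketched for a header + payload pair (names are illustrative; a production version would also loop on short writes):

#include <sys/uio.h>

/* Hand the header and the payload to TCP in a single gathering call, so
 * Nagle sees one send of hdrlen + datalen bytes instead of two sub-MSS sends. */
static ssize_t send_message(int fd, const void *hdr, size_t hdrlen,
                            const void *data, size_t datalen)
{
        struct iovec iov[2] = {
                { .iov_base = (void *)hdr,  .iov_len = hdrlen  },
                { .iov_base = (void *)data, .iov_len = datalen },
        };

        return writev(fd, iov, 2);
}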
* Re: RFC: Nagle latency tuning 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:21 ` Arnaldo Carvalho de Melo @ 2008-09-11 4:08 ` Chris Snook 1 sibling, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-11 4:08 UTC (permalink / raw) To: Andi Kleen; +Cc: Rick Jones, Netdev Andi Kleen wrote: >> These apps have a love/hate relationship with TCP. They'll probably love >> SCTP 5 years from now, but it's not mature enough for them yet. They do >> want to minimize all latencies, > > Then they should just TCP_NODELAY. > >> and many of the apps explicitly set >> TCP_NODELAY. > > That's the right thing for them. > >> The goal here is to improve latencies on the supporting apps >> that aren't quite as carefully optimized as the main message daemons >> themselves. If we can give them a knob that bounds their worst-case >> latency to 2-3 times their average latency, without risking network floods >> that won't show up in testing, they'll be much happier. > > Hmm in theory I don't see a big drawback in making the these defaults sysctls. > As in this untested patch. It's probably not the right solution > for this problem. Still if you want to experiment. This makes both > the ato default and the delack default tunable. You'll have to restart > sockets for it to take effect. > > -Andi > > --- > > > Make ato min and delack min tunable > > This might potentially help with some programs which have problems with nagle. > > Sockets have to be restarted > > TBD documentation for the new sysctls > > Signed-off-by: Andi Kleen <ak@linux.intel.com> It needs the changed constants replaced with the _DEFAULT versions in net/dccp/timer.c and net/dccp/output.c to build with DCCP enabled. I did that, and tested it (over loopback). The tunables come up at 0, not the expected default values, and when that happens, latencies are extremely low, as would be expected with a value of 0, but when I set net.ipv4.tcp_delack_min to *any* non-zero value, the old 40 ms magic number becomes 200 ms. I haven't yet figured out why. Tweaking net.ipv4.tcp_ato_min isn't having any observable effect on my loopback latencies. I think there may be something worth pursuing with a tcp_delack_min tunable. Any suggestions on where I should look to debug this? 
-- Chris > Index: linux-2.6.27-rc4-misc/include/net/tcp.h > =================================================================== > --- linux-2.6.27-rc4-misc.orig/include/net/tcp.h > +++ linux-2.6.27-rc4-misc/include/net/tcp.h > @@ -118,12 +118,16 @@ extern void tcp_time_wait(struct sock *s > > #define TCP_DELACK_MAX ((unsigned)(HZ/5)) /* maximal time to delay before sending an ACK */ > #if HZ >= 100 > -#define TCP_DELACK_MIN ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ > -#define TCP_ATO_MIN ((unsigned)(HZ/25)) > +#define TCP_DELACK_MIN_DEFAULT ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ > +#define TCP_ATO_MIN_DEFAULT ((unsigned)(HZ/25)) > #else > -#define TCP_DELACK_MIN 4U > -#define TCP_ATO_MIN 4U > +#define TCP_DELACK_MIN_DEFAULT 4U > +#define TCP_ATO_MIN_DEFAULT 4U > #endif > + > +#define TCP_DELACK_MIN sysctl_tcp_delack_min > +#define TCP_ATO_MIN sysctl_tcp_ato_min > + > #define TCP_RTO_MAX ((unsigned)(120*HZ)) > #define TCP_RTO_MIN ((unsigned)(HZ/5)) > #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value */ > @@ -236,6 +240,8 @@ extern int sysctl_tcp_base_mss; > extern int sysctl_tcp_workaround_signed_windows; > extern int sysctl_tcp_slow_start_after_idle; > extern int sysctl_tcp_max_ssthresh; > +extern int sysctl_tcp_ato_min; > +extern int sysctl_tcp_delack_min; > > extern atomic_t tcp_memory_allocated; > extern atomic_t tcp_sockets_allocated; > Index: linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c > =================================================================== > --- linux-2.6.27-rc4-misc.orig/net/ipv4/sysctl_net_ipv4.c > +++ linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c > @@ -717,6 +717,24 @@ static struct ctl_table ipv4_table[] = { > }, > { > .ctl_name = CTL_UNNUMBERED, > + .procname = "tcp_delack_min", > + .data = &sysctl_tcp_delack_min, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec_jiffies, > + .strategy = &sysctl_jiffies > + }, > + { > + .ctl_name = CTL_UNNUMBERED, > + .procname = "tcp_ato_min", > + .data = &sysctl_tcp_ato_min, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec_jiffies, > + .strategy = &sysctl_jiffies > + }, > + { > + .ctl_name = CTL_UNNUMBERED, > .procname = "udp_mem", > .data = &sysctl_udp_mem, > .maxlen = sizeof(sysctl_udp_mem), > Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c > =================================================================== > --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_timer.c > +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c > @@ -29,6 +29,8 @@ int sysctl_tcp_keepalive_intvl __read_mo > int sysctl_tcp_retries1 __read_mostly = TCP_RETR1; > int sysctl_tcp_retries2 __read_mostly = TCP_RETR2; > int sysctl_tcp_orphan_retries __read_mostly; > +int sysctl_tcp_delack_min __read_mostly = TCP_DELACK_MIN_DEFAULT; > +int sysctl_tcp_ato_min __read_mostly = TCP_ATO_MIN_DEFAULT; > > static void tcp_write_timer(unsigned long); > static void tcp_delack_timer(unsigned long); > Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c > =================================================================== > --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_output.c > +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c > @@ -2436,7 +2436,7 @@ void tcp_send_delayed_ack(struct sock *s > * directly. > */ > if (tp->srtt) { > - int rtt = max(tp->srtt >> 3, TCP_DELACK_MIN); > + int rtt = max_t(unsigned, tp->srtt >> 3, TCP_DELACK_MIN); > > if (rtt < max_ato) > max_ato = rtt; > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 18:40 ` Chris Snook 2008-09-09 19:07 ` Andi Kleen @ 2008-09-09 19:59 ` David Miller 2008-09-09 20:25 ` Chris Snook 2008-09-22 10:49 ` David Miller 1 sibling, 2 replies; 44+ messages in thread From: David Miller @ 2008-09-09 19:59 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev Still waiting for your test program and example traces Chris. Until I see that there really isn't anything more I can contribute concretely to this discussion. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 19:59 ` David Miller @ 2008-09-09 20:25 ` Chris Snook 2008-09-22 10:49 ` David Miller 1 sibling, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 20:25 UTC (permalink / raw) To: David Miller; +Cc: andi, rick.jones2, netdev David Miller wrote: > Still waiting for your test program and example traces Chris. > > Until I see that there really isn't anything more I can contribute > concretely to this discussion. No problem. Right now I'm working on testing the target workloads with the tunables Andi posted, but I'll also work up a trivial test case to demonstrate it. The apps where these problems are being reported are not exactly the sort of thing one posts to netdev. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 19:59 ` David Miller 2008-09-09 20:25 ` Chris Snook @ 2008-09-22 10:49 ` David Miller 2008-09-22 11:09 ` David Miller 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 10:49 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev From: David Miller <davem@davemloft.net> Date: Tue, 09 Sep 2008 12:59:34 -0700 (PDT) > > Still waiting for your test program and example traces Chris. > > Until I see that there really isn't anything more I can contribute > concretely to this discussion. Ping, still waiting for this... Can you provide the test case, perhaps sometime this year? :-) I'll try to figure out why Andi's patch doesn't behave as expected. I suspect you may have a bum build if the sysctl values are coming up as zero as that's completely impossible as far as I can tell. If something so fundamental as that isn't behaving properly, all bets are off for anything else you try to use when having that patch applied. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 10:49 ` David Miller @ 2008-09-22 11:09 ` David Miller 2008-09-22 20:30 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 11:09 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev

From: David Miller <davem@davemloft.net>
Date: Mon, 22 Sep 2008 03:49:33 -0700 (PDT)

> I'll try to figure out why Andi's patch doesn't behave as expected.

Andi's patch uses proc_dointvec_jiffies, which is for sysctl values stored as seconds, whereas these things record values with smaller granularity and are stored in jiffies, and that's why we get zero on read and writes have crazy effects.

Also, as Andi stated, this is not the way to deal with this problem.

So we have a broken patch, which even if implemented properly isn't the way forward, so I consider this discussion dead in the water until we have some test cases.

Don't you think? :-)

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 11:09 ` David Miller @ 2008-09-22 20:30 ` Andi Kleen 2008-09-22 22:22 ` Chris Snook 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-22 20:30 UTC (permalink / raw) To: David Miller; +Cc: csnook, andi, rick.jones2, netdev On Mon, Sep 22, 2008 at 04:09:12AM -0700, David Miller wrote: > From: David Miller <davem@davemloft.net> > Date: Mon, 22 Sep 2008 03:49:33 -0700 (PDT) > > > I'll try to figure out why Andi's patch doesn't behave as expected. > > Andi's patch uses proc_dointvec_jiffies, which is for sysctl values > stored as seconds, whereas these things are used to record values with > smaller granulatiry, are stored in jiffies, and that's why we get zero > on read and writes have crazy effects. Oops. Assume me with brown paper bag etc.etc. It was a typo for proc_dointvec_ms_jiffies > > Also, as Andi stated, this is not the way to deal with this problem. > > So we have a broken patch, which even if implemented properly isn't the > way forward, so I consider this discussion dead in the water until we > have some test cases. The patch is easy to fix with a s/_jiffies/_ms_jiffies/g Also it was more intended for him to play around and get some data points. I guess for that it's still useful. Also while for that it's probably not the right solution, but I could imagine in some other situations where it might be useful to tune these values. After all they are not written down in stone. I wonder if it would even make sense to consider hr timers for TCP now. =Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
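Applied to Andi's earlier patch, the corrected sysctl table entries would look something like the following, with the same caveats as the original (experimental, untested, sockets must be restarted for new values to take effect):

	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "tcp_delack_min",
		.data		= &sysctl_tcp_delack_min,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_ms_jiffies,
		.strategy	= &sysctl_ms_jiffies
	},
	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "tcp_ato_min",
		.data		= &sysctl_tcp_ato_min,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_ms_jiffies,
		.strategy	= &sysctl_ms_jiffies
	},

With the ms_jiffies handlers the values are read and written as milliseconds rather than seconds, while still being stored internally in jiffies.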
* Re: RFC: Nagle latency tuning 2008-09-22 20:30 ` Andi Kleen @ 2008-09-22 22:22 ` Chris Snook 2008-09-22 22:26 ` David Miller 2008-09-22 22:47 ` Rick Jones 0 siblings, 2 replies; 44+ messages in thread From: Chris Snook @ 2008-09-22 22:22 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, rick.jones2, netdev Andi Kleen wrote: > On Mon, Sep 22, 2008 at 04:09:12AM -0700, David Miller wrote: >> From: David Miller <davem@davemloft.net> >> Date: Mon, 22 Sep 2008 03:49:33 -0700 (PDT) >> >>> I'll try to figure out why Andi's patch doesn't behave as expected. >> Andi's patch uses proc_dointvec_jiffies, which is for sysctl values >> stored as seconds, whereas these things are used to record values with >> smaller granulatiry, are stored in jiffies, and that's why we get zero >> on read and writes have crazy effects. > > Oops. Assume me with brown paper bag etc.etc. > > It was a typo for proc_dointvec_ms_jiffies > > >> Also, as Andi stated, this is not the way to deal with this problem. >> >> So we have a broken patch, which even if implemented properly isn't the >> way forward, so I consider this discussion dead in the water until we >> have some test cases. It's proven a little harder than anticipated to create a trivial test case, but I should be able to post some traces from a freely-available app soon. > The patch is easy to fix with a s/_jiffies/_ms_jiffies/g Thanks, will try. > Also it was more intended for him to play around and get some data > points. I guess for that it's still useful. Indeed. Setting tcp_delack_min to 0 completely eliminated the undesired latencies, though of course that would be a bit dangerous with naive apps talking across the network. Changing tcp_ato_min didn't do anything interesting for this case. > Also while for that it's probably not the right solution, but > I could imagine in some other situations where it might be useful > to tune these values. After all they are not written down in stone. The problem is that we're trying to use one set of values for links with extremely different performance characteristics. We need to initialize TCP sockets with min/default/max values that are safe and perform well. How horrendous of a layering violation would it be to attach TCP performance parameters (either user-supplied or based on interface stats) to route table entries, like route metrics but intended to guide TCP autotuning? It seems like it shouldn't be that hard to teach TCP that it doesn't need to optimize my lo connections much, and that it should be optimizing my eth0 subnet connections for lower latency and higher bandwidth than the connections that go through my gateway into the great beyond. > I wonder if it would even make sense to consider hr timers for TCP > now. > > =Andi As long as we have hardcoded minimum delays > 10ms, I don't think there's much of a point, but it's something to keep in mind for the future. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:22 ` Chris Snook @ 2008-09-22 22:26 ` David Miller 2008-09-22 23:00 ` Chris Snook 2008-09-22 22:47 ` Rick Jones 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 22:26 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev From: Chris Snook <csnook@redhat.com> Date: Mon, 22 Sep 2008 18:22:13 -0400 > How horrendous of a layering violation would it be to attach TCP > performance parameters (either user-supplied or based on interface > stats) to route table entries, like route metrics but intended to > guide TCP autotuning? It seems like it shouldn't be that hard to > teach TCP that it doesn't need to optimize my lo connections much, > and that it should be optimizing my eth0 subnet connections for > lower latency and higher bandwidth than the connections that go > through my gateway into the great beyond. We already do this for other TCP connection parameters, and I tend to think these delack/ato values belong there too. If we add a global knob, people are just going to turn it on even if they are also connected to the real internet on the system rather than only internal networks they completely control. That tends to cause problems. It gets an entry on all of these bogus "Linux performance tuning" sections administrators read from the various financial messaging products. So everyone does it without thinking and using their brains. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:26 ` David Miller @ 2008-09-22 23:00 ` Chris Snook 2008-09-22 23:13 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Chris Snook @ 2008-09-22 23:00 UTC (permalink / raw) To: David Miller; +Cc: andi, rick.jones2, netdev David Miller wrote: > From: Chris Snook <csnook@redhat.com> > Date: Mon, 22 Sep 2008 18:22:13 -0400 > >> How horrendous of a layering violation would it be to attach TCP >> performance parameters (either user-supplied or based on interface >> stats) to route table entries, like route metrics but intended to >> guide TCP autotuning? It seems like it shouldn't be that hard to >> teach TCP that it doesn't need to optimize my lo connections much, >> and that it should be optimizing my eth0 subnet connections for >> lower latency and higher bandwidth than the connections that go >> through my gateway into the great beyond. > > We already do this for other TCP connection parameters, and I tend to > think these delack/ato values belong there too. > > If we add a global knob, people are just going to turn it on even if > they are also connected to the real internet on the system rather than > only internal networks they completely control. > > That tends to cause problems. It gets an entry on all of these bogus > "Linux performance tuning" sections administrators read from the > various financial messaging products. So everyone does it without > thinking and using their brains. I agree 100%. Could you please point me to an example of a connection parameter that gets tuned and cached this way, so I can experiment with it? -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 23:00 ` Chris Snook @ 2008-09-22 23:13 ` David Miller 2008-09-22 23:24 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 23:13 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev From: Chris Snook <csnook@redhat.com> Date: Mon, 22 Sep 2008 19:00:08 -0400 > Could you please point me to an example of a connection parameter > that gets tuned and cached this way, so I can experiment with it? You'll find tons of them in tcp_update_metrics(). ^ permalink raw reply [flat|nested] 44+ messages in thread
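The values David points at are the per-destination metrics cached on the dst_entry: tcp_update_metrics() saves them when a connection closes and tcp_init_metrics() reads them back when a new connection to the same destination starts. Below is a rough, from-memory sketch of that read-back pattern, not the literal kernel source (see tcp_init_metrics() for the real, more careful code); a per-route delack or ato floor would presumably be wired up the same way.

#include <net/tcp.h>

/* Rough sketch of how a new connection is seeded from the metrics cached
 * on its route; the real logic lives in tcp_init_metrics(). */
static void sketch_init_from_route_metrics(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct dst_entry *dst = __sk_dst_get(sk);

	if (dst == NULL)
		return;

	if (dst_metric(dst, RTAX_SSTHRESH))	/* cached slow start threshold */
		tp->snd_ssthresh = dst_metric(dst, RTAX_SSTHRESH);
	if (dst_metric(dst, RTAX_RTT))		/* cached smoothed RTT */
		tp->srtt = dst_metric(dst, RTAX_RTT);
}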
* Re: RFC: Nagle latency tuning 2008-09-22 23:13 ` David Miller @ 2008-09-22 23:24 ` Andi Kleen 2008-09-22 23:21 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-22 23:24 UTC (permalink / raw) To: David Miller; +Cc: csnook, andi, rick.jones2, netdev On Mon, Sep 22, 2008 at 04:13:23PM -0700, David Miller wrote: > From: Chris Snook <csnook@redhat.com> > Date: Mon, 22 Sep 2008 19:00:08 -0400 > > > Could you please point me to an example of a connection parameter > > that gets tuned and cached this way, so I can experiment with it? > > You'll find tons of them in tcp_update_metrics(). IMHO that is actually obsolete because it does not take NAT into account. Hosts behind one IP address do not necessarily share link characteristics. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 23:24 ` Andi Kleen @ 2008-09-22 23:21 ` David Miller 2008-09-23 0:14 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 23:21 UTC (permalink / raw) To: andi; +Cc: csnook, rick.jones2, netdev From: Andi Kleen <andi@firstfloor.org> Date: Tue, 23 Sep 2008 01:24:28 +0200 > On Mon, Sep 22, 2008 at 04:13:23PM -0700, David Miller wrote: > > From: Chris Snook <csnook@redhat.com> > > Date: Mon, 22 Sep 2008 19:00:08 -0400 > > > > > Could you please point me to an example of a connection parameter > > > that gets tuned and cached this way, so I can experiment with it? > > > > You'll find tons of them in tcp_update_metrics(). > > IMHO that is actually obsolete because it does not take NAT > into account. Hosts behind one IP address do not necessarily share link characteristics. It is not an invalid estimate even in the NAT case, and it's not as illegal as TCP timewait recycling would be. Andi, don't rain on the party for something that might be terribly useful for many people. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 23:21 ` David Miller @ 2008-09-23 0:14 ` Andi Kleen 2008-09-23 0:33 ` Rick Jones 2008-09-23 1:40 ` David Miller 0 siblings, 2 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-23 0:14 UTC (permalink / raw) To: David Miller; +Cc: andi, csnook, rick.jones2, netdev > It is not an invalid estimate even in the NAT case, Typical case: you've got a large company network behind a NAT. First user has a very crappy wireless connection behind a slow intercontinental link talking to the outgoing NAT router. He connects to your internet server first and the window, slow start etc. parameters for him are saved in the dst_entry. The next guy behind the same NAT is in the same building as the router that connects the company to the internet. He has a much faster line. He connects to the same server. They will share the same dst and inetpeer entries. The parameters saved earlier for the same IP are clearly invalid for the second case. The link characteristics are completely different. Also, did you know there are whole countries behind NAT? E.g. I was told that all of Saudi Arabia comes from only a small handful of IP addresses. It would surprise me if all of KSA had the same link characteristics. :) Ah, I see there's a sysctl now to disable this. How about setting it by default? -Andi ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 0:14 ` Andi Kleen @ 2008-09-23 0:33 ` Rick Jones 2008-09-23 2:12 ` Andi Kleen 2008-09-23 1:40 ` David Miller 1 sibling, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-23 0:33 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, csnook, netdev Andi Kleen wrote: >>It is not an invalid estimate even in the NAT case, > > > Typical case: you've got a large company network behind a NAT. > First user has a very crappy wireless connection behind a slow > intercontinental link talking to the outgoing NAT router. He connects to > your internet server first and the window, slow start etc. parameters > for him are saved in the dst_entry. > > The next guy behind the same NAT is in the same building > as the router that connects the company to the internet. He > has a much faster line. He connects to the same server. > They will share the same dst and inetpeer entries. > > The parameters saved earlier for the same IP are clearly invalid > for the second case. The link characteristics are completely > different. > > Also, did you know there are whole countries behind > NAT? E.g. I was told that all of Saudi Arabia comes from > only a small handful of IP addresses. It would surprise me if > all of KSA had the same link characteristics. :) That seems as much of a case against NAT as per-destination attribute caching. If my experience at "a large company" is any indication, for 99 connections out of 100 I'm going through a proxy rather than NAT, so all the remote server sees are the characteristics of the connection between it and the proxy. And even if I were not, how is per-destination caching the possibly non-optimal characteristics based on one user behind a NAT really functionally different from having to tune the system-wide defaults to cover that corner-case user? Seems that caching per-destination characteristics is actually limiting the alleged brokenness to that destination rather than all destinations? rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 0:33 ` Rick Jones @ 2008-09-23 2:12 ` Andi Kleen 0 siblings, 0 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-23 2:12 UTC (permalink / raw) To: Rick Jones; +Cc: Andi Kleen, David Miller, csnook, netdev > That seems as much of a case against NAT as per-destination attribute > caching. Sure, in an ideal world NAT wouldn't exist. Unfortunately, we're not in an ideal world. Also, in general my impression is that NAT is becoming more common; e.g. a lot of the mobile networks seem to be NATed. > > If my experience at "a large company" is any indication, for 99 My experience at a large company was different. Also see my second example. > > And even if I were not, how is per-destination caching the possibly > non-optimal characteristics based on one user behind a NAT really > functionally different from having to tune the system-wide defaults to > cover that corner-case user? It's just wasteful of network resources. E.g. if you start talking to the slow link with too large a congestion window, a lot of packets are going to be dropped. Yes, TCP will eventually adapt, but user performance suffers and the network is used inefficiently. > Seems that caching per-destination > characteristics is actually limiting the alleged brokenness to that > destination rather than all destinations? Not sure what you're talking about. There's no real brokenness in having a slow link. And with default startup metrics Linux TCP has no trouble talking to a slow link. The brokenness is using the dst_entry TCP metrics of a fast link to talk to a slow link, and that happens with NAT. -Andi ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 0:14 ` Andi Kleen 2008-09-23 0:33 ` Rick Jones @ 2008-09-23 1:40 ` David Miller 2008-09-23 2:23 ` Andi Kleen 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-23 1:40 UTC (permalink / raw) To: andi; +Cc: csnook, rick.jones2, netdev From: Andi Kleen <andi@firstfloor.org> Date: Tue, 23 Sep 2008 02:14:09 +0200 > Typical case: you've got a large company network behind a NAT. > First user has a very crappy wireless connection behind a slow > intercontinental link talking to the outgoing NAT router. He connects to > your internet server first and the window, slow start etc. parameters > for him are saved in the dst_entry. > > The next guy behind the same NAT is in the same building > as the router that connects the company to the internet. He > has a much faster line. He connects to the same server. > They will share the same dst and inetpeer entries. > > The parameters saved earlier for the same IP are clearly invalid > for the second case. The link characteristics are completely > different. Just as typical are setups where the NAT clients are 1 or 2 fast hops behind the firewall. There are many cases where perfectly acceptable heuristics don't perform optimally; this doesn't mean we disable them by default. > Also, did you know there are whole countries behind > NAT? I am more than aware of this; however, this doesn't mean it is sane, and this kind of setup makes many useful internet services inaccessible. > Ah, I see there's a sysctl now to disable this. How about > setting it by default? It's for debugging. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 1:40 ` David Miller @ 2008-09-23 2:23 ` Andi Kleen 2008-09-23 2:28 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-23 2:23 UTC (permalink / raw) To: David Miller; +Cc: andi, csnook, rick.jones2, netdev > There are many cases where perfectly acceptable heuristics For very low values of "perfect" :) > don't perform optimally; this doesn't mean we disable them > by default. Well, they're just broken on a larger and larger fraction of the internet. Router technology more and more often knows something about ports these days and handles flows differently. Assuming it does not is more and more wrong. It's one of those things that looks cool at first glance but, when you dig deeper, is just a bad idea. What should we call a heuristic that is often wrong? A "wrongistic"? :) > > NAT? > > I am more than aware of this; however, this doesn't mean it is > sane, I agree with you that they're not sane, but they should still be supported. Technically at least they don't violate any standards, AFAIK. Linux should work well even with such setups. In fact it has no other choice because they're so common. "Be liberal in what you accept, be conservative in what you send." This violates the second principle. > and this kind of setup makes many useful internet services > inaccessible. Sure, in fact that's usually why they were done in the first place, but Linux shouldn't make it unnecessarily worse. Anyway, enough said. I guess we have to agree to disagree on this. For completeness, I'll still send the patch to set the sysctl by default though just in case you reconsider. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 2:23 ` Andi Kleen @ 2008-09-23 2:28 ` David Miller 2008-09-23 2:41 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-23 2:28 UTC (permalink / raw) To: andi; +Cc: csnook, rick.jones2, netdev From: Andi Kleen <andi@firstfloor.org> Date: Tue, 23 Sep 2008 04:23:29 +0200 > I'll still send the patch to set the sysctl > by default though just in case you reconsider. Please don't poop in my inbox like that, I already said I'm not making the change. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 2:28 ` David Miller @ 2008-09-23 2:41 ` Andi Kleen 0 siblings, 0 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-23 2:41 UTC (permalink / raw) To: David Miller; +Cc: andi, csnook, rick.jones2, netdev On Mon, Sep 22, 2008 at 07:28:59PM -0700, David Miller wrote: > From: Andi Kleen <andi@firstfloor.org> > Date: Tue, 23 Sep 2008 04:23:29 +0200 > > > I'll still send the patch to set the sysctl > > by default though just in case you reconsider. > > Please don't poop in my inbox like that, I already said I'm > not making the change. Thank you for the choice words to describe patches. But netdev is not for your use alone. It will then be more for the benefit of other list readers who might have a less prejudiced view of the usefulness of certain wrongistics. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:22 ` Chris Snook 2008-09-22 22:26 ` David Miller @ 2008-09-22 22:47 ` Rick Jones 2008-09-22 22:57 ` Chris Snook 1 sibling, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-22 22:47 UTC (permalink / raw) To: Chris Snook; +Cc: Andi Kleen, David Miller, netdev > Indeed. Setting tcp_delack_min to 0 completely eliminated the undesired > latencies, though of course that would be a bit dangerous with naive > apps talking across the network. What did it do to the packets per second or per unit of work? Depending on the nature of the race between the ACK returning from the remote and the application pushing more bytes into the socket, I'd think that setting the delayed ack timer to zero could result in more traffic on the network (those bare ACKs) than simply setting TCP_NODELAY at the source. And since with small packets and/or copy avoidance an ACK is (handwaving) just as many CPU cycles at either end as a data segment that also means a bump in CPU utilization. rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:47 ` Rick Jones @ 2008-09-22 22:57 ` Chris Snook 0 siblings, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-22 22:57 UTC (permalink / raw) To: Rick Jones; +Cc: Andi Kleen, David Miller, netdev Rick Jones wrote: >> Indeed. Setting tcp_delack_min to 0 completely eliminated the >> undesired latencies, though of course that would be a bit dangerous >> with naive apps talking across the network. > > What did it do to the packets per second or per unit of work? Depending > on the nature of the race between the ACK returning from the remote and > the application pushing more bytes into the socket, I'd think that > setting the delayed ack timer to zero could result in more traffic on > the network (those bare ACKs) than simply setting TCP_NODELAY at the > source. > > And since with small packets and/or copy avoidance an ACK is > (handwaving) just as many CPU cycles at either end as a data segment > that also means a bump in CPU utilization. > > rick jones I never saw performance go down, but I was always using low latency/high bandwidth loopback or LAN connection, with only one socket per CPU. I agree though, that turning this off is suboptimal. I'm going to pursue David's idea of making delack_min and ato_min dynamically calculated by the kernel. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
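Purely to illustrate the direction Chris mentions, nothing here is from the thread or from the kernel: a dynamically calculated floor could be derived from the connection's measured RTT and clamped, so loopback and LAN sockets ACK almost immediately while long-haul sockets keep today's behaviour.

/* Illustrative sketch only: derive a delayed-ACK floor from the smoothed
 * RTT instead of using one fixed 40 ms value.  The fraction and the bounds
 * are arbitrary assumptions chosen for the example. */
static unsigned int delack_min_from_rtt(unsigned int srtt_ms)
{
	unsigned int floor_ms = srtt_ms / 4;	/* some fraction of the RTT */

	if (floor_ms < 1)
		floor_ms = 1;			/* loopback/LAN: ACK quickly */
	if (floor_ms > 40)
		floor_ms = 40;			/* never exceed today's 40 ms floor */
	return floor_ms;
}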
* Re: RFC: Nagle latency tuning 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller 2008-09-09 14:36 ` Andi Kleen @ 2008-09-09 16:33 ` Rick Jones 2008-09-09 16:54 ` Chuck Lever 2 siblings, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-09 16:33 UTC (permalink / raw) To: Chris Snook; +Cc: Netdev > > Most of the apps where people care about this enough to complain to > their vendor (the cases I hear about) are in messaging apps, where > they're relaying a stream of events that have little to do with each > other, and they want TCP to maintain the integrity of the connection and > do a modicum of bandwidth management, but 40 ms stalls greatly exceed > their latency tolerances. What _are_ their latency tolerances? How often are they willing to tolerate a modicum of TCP bandwidth management? Do they go ape when TCP sits waiting not just for 40ms, but for an entire RTO timer? > Using TCP_NODELAY is often the least bad option, but sometimes it's > infeasible because of its effect on the network, and it certainly > adds to the network stack overhead. A more tunable Nagle delay would > probably serve many of these apps much better. If the applications are sending streams of logically unrelated sends down the same socket, then setting TCP_NODELAY is IMO fine. Where it isn't fine is where these applications are generating their logically associated data in two or more small sends. One send per message good. Two sends per message bad. BTW, is this magical mystery Solaris ndd setting "tcp_naglim_def?" FWIW I believe there is a similar setting by the same name in HP-UX. rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
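To make the "one send per message good, two sends per message bad" point concrete, here is an illustrative userspace sketch; the function and message layout are invented, not taken from any application in the thread. Two sub-MSS send() calls can leave the second one queued by Nagle until the peer's delayed ACK arrives, while a single writev() hands the whole logical message to TCP at once, so Nagle has nothing to hold back.

#include <sys/types.h>
#include <sys/uio.h>
#include <sys/socket.h>

/* One send per message: the header and body reach TCP together. */
static ssize_t send_message(int fd, const void *hdr, size_t hdr_len,
			    const void *body, size_t body_len)
{
	struct iovec iov[2] = {
		{ .iov_base = (void *)hdr,  .iov_len = hdr_len  },
		{ .iov_base = (void *)body, .iov_len = body_len },
	};

	return writev(fd, iov, 2);

	/* Two sends per message (the pattern that interacts badly with
	 * Nagle plus delayed ACK) would instead be:
	 *
	 *	send(fd, hdr,  hdr_len,  0);
	 *	send(fd, body, body_len, 0);
	 */
}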
* Re: RFC: Nagle latency tuning 2008-09-09 16:33 ` Rick Jones @ 2008-09-09 16:54 ` Chuck Lever 2008-09-09 17:21 ` Arnaldo Carvalho de Melo 2008-09-09 17:54 ` Rick Jones 0 siblings, 2 replies; 44+ messages in thread From: Chuck Lever @ 2008-09-09 16:54 UTC (permalink / raw) To: Rick Jones; +Cc: Chris Snook, Netdev On Sep 9, 2008, at Sep 9, 2008, 12:33 PM, Rick Jones wrote: >> Most of the apps where people care about this enough to complain to >> their vendor (the cases I hear about) are in messaging apps, where >> they're relaying a stream of events that have little to do with >> each other, and they want TCP to maintain the integrity of the >> connection and do a modicum of bandwidth management, but 40 ms >> stalls greatly exceed their latency tolerances. > > What _are_ their latency tolerances? How often are they willing to > tolerate a modicum of TCP bandwidth management? Do they go ape when > TCP sits waiting not just for 40ms, but for an entire RTO timer? > >> Using TCP_NODELAY is often the least bad option, but sometimes it's >> infeasible because of its effect on the network, and it certainly >> adds to the network stack overhead. A more tunable Nagle delay would >> probably serve many of these apps much better. > > If the applications are sending streams of logically unrelated sends > down the same socket, then setting TCP_NODELAY is IMO fine. Where > it isn't fine is where these applications are generating their > logically associated data in two or more small sends. One send per > message good. Two sends per message bad. Can the same be said of the Linux kernel's RPC client, which uses MSG_MORE and multiple sends to construct a single RPC request on a TCP socket? See net/sunrpc/xprtsock.c:xs_send_pagedata() for details. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 16:54 ` Chuck Lever @ 2008-09-09 17:21 ` Arnaldo Carvalho de Melo 2008-09-09 17:54 ` Rick Jones 1 sibling, 0 replies; 44+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-09-09 17:21 UTC (permalink / raw) To: Chuck Lever; +Cc: Rick Jones, Chris Snook, Netdev Em Tue, Sep 09, 2008 at 12:54:31PM -0400, Chuck Lever escreveu: > On Sep 9, 2008, at Sep 9, 2008, 12:33 PM, Rick Jones wrote: >>> Most of the apps where people care about this enough to complain to >>> their vendor (the cases I hear about) are in messaging apps, where >>> they're relaying a stream of events that have little to do with each >>> other, and they want TCP to maintain the integrity of the connection >>> and do a modicum of bandwidth management, but 40 ms stalls greatly >>> exceed their latency tolerances. >> >> What _are_ their latency tolerances? How often are they willing to >> tolerate a modicum of TCP bandwidth management? Do they go ape when >> TCP sits waiting not just for 40ms, but for an entire RTO timer? >> >>> Using TCP_NODELAY is often the least bad option, but sometimes it's >>> infeasible because of its effect on the network, and it certainly >>> adds to the network stack overhead. A more tunable Nagle delay would >>> probably serve many of these apps much better. >> >> If the applications are sending streams of logically unrelated sends >> down the same socket, then setting TCP_NODELAY is IMO fine. Where it >> isn't fine is where these applications are generating their logically >> associated data in two or more small sends. One send per message good. >> Two sends per message bad. > > Can the same be said of the Linux kernel's RPC client, which uses > MSG_MORE and multiple sends to construct a single RPC request on a TCP > socket? > > See net/sunrpc/xprtsock.c:xs_send_pagedata() for details. That is not a problem, it should be equivalent to corking the socket. I.e. the uncorking operation will be the last part of the buffer, where 'more' will be 0. - Arnaldo ^ permalink raw reply [flat|nested] 44+ messages in thread
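A userspace sketch of the pattern Arnaldo describes; the function and its arguments are invented for illustration, not taken from the RPC client. MSG_MORE on every piece except the last tells the kernel that more data is coming, so the pieces are coalesced, and the final plain send() effectively uncorks the socket and pushes the packet out, much like bracketing the writes with the TCP_CORK socket option.

#include <sys/types.h>
#include <sys/socket.h>

/* Build one logical record out of two pieces without sending a runt first
 * segment: only the last piece omits MSG_MORE. */
static int send_record(int fd, const void *marker, size_t marker_len,
		       const void *payload, size_t payload_len)
{
	if (send(fd, marker, marker_len, MSG_MORE) < 0)
		return -1;
	/* Final piece: no MSG_MORE, which lets the kernel transmit. */
	if (send(fd, payload, payload_len, 0) < 0)
		return -1;
	return 0;
}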
* Re: RFC: Nagle latency tuning 2008-09-09 16:54 ` Chuck Lever 2008-09-09 17:21 ` Arnaldo Carvalho de Melo @ 2008-09-09 17:54 ` Rick Jones 1 sibling, 0 replies; 44+ messages in thread From: Rick Jones @ 2008-09-09 17:54 UTC (permalink / raw) To: Chuck Lever; +Cc: Chris Snook, Netdev >> If the applications are sending streams of logically unrelated sends >> down the same socket, then setting TCP_NODELAY is IMO fine. Where it >> isn't fine is where these applications are generating their logically >> associated data in two or more small sends. One send per message >> good. Two sends per message bad. > > > Can the same be said of the Linux kernel's RPC client, which uses > MSG_MORE and multiple sends to construct a single RPC request on a TCP > socket? That drifts away from Nagle and into my (perhaps old fuddy-duddy) belief in minimizing the number of system/other calls to do a unit of work, and degrees of "badness:)" rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-08 21:56 RFC: Nagle latency tuning Christopher Snook 2008-09-08 22:39 ` Rick Jones @ 2008-09-08 22:55 ` Andi Kleen 2008-09-09 5:22 ` Chris Snook 1 sibling, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-08 22:55 UTC (permalink / raw) To: Christopher Snook; +Cc: Netdev Christopher Snook <csnook@redhat.com> writes: > > I'm afraid I don't know the TCP stack intimately enough to understand > what side effects this might have. Can someone more familiar with the > nagle implementations please enlighten me on how this could be done, > or why it shouldn't be? The nagle delay you're seeing is really the delayed ack delay which is variable on Linux (unlike a lot of other stacks). Unfortunately due to the way delayed ack works on other stacks (especially traditional BSD with its fixed 200ms delay) there are nasty interactions with that. Making it too short could lead to a lot more packets even in non nagle situations. Ok in theory you could split the two, but that would likely have other issues and also make nagle be a lot less useful. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-08 22:55 ` Andi Kleen @ 2008-09-09 5:22 ` Chris Snook 0 siblings, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 5:22 UTC (permalink / raw) To: Andi Kleen; +Cc: Netdev Andi Kleen wrote: > Christopher Snook <csnook@redhat.com> writes: >> I'm afraid I don't know the TCP stack intimately enough to understand >> what side effects this might have. Can someone more familiar with the >> nagle implementations please enlighten me on how this could be done, >> or why it shouldn't be? > > The nagle delay you're seeing is really the delayed ack delay which > is variable on Linux (unlike a lot of other stacks). Unfortunately > due to the way delayed ack works on other stacks (especially traditional > BSD with its fixed 200ms delay) there are nasty interactions with that. > Making it too short could lead to a lot more packets even in non nagle > situations. How variable is it? I've never seen any value other than 40 ms, from 2.4.21 to the latest rt kernel. I've tweaked every TCP tunable in /proc/sys/net/ipv4, to no effect. The people who would care enough to tweak this would be more than happy to accept an increase in the number of packets. They're usually asking us to disable the behavior completely, so if we can let them tune the middle-ground, they can test in their environments to decide what values their network peers will tolerate. I have no interest in foisting this on the unsuspecting public. > Ok in theory you could split the two, but that would likely have > other issues and also make nagle be a lot less useful. Perhaps a messaging-optimized non-default congestion control algorithm would be a suitable way of addressing this? -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
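For reference, the 40 ms Chris keeps seeing is the hardcoded floor in the TCP header file. Paraphrased from memory from the 2.6-era include/net/tcp.h (worth checking against the exact tree), the relevant constants look roughly like this:

/* Paraphrased from include/net/tcp.h: with HZ >= 100 the delayed-ACK floor
 * works out to HZ/25 jiffies, i.e. the 40 ms observed above, and the cap
 * to HZ/5, i.e. 200 ms. */
#define TCP_DELACK_MAX	((unsigned)(HZ / 5))	/* ~200 ms */
#define TCP_DELACK_MIN	((unsigned)(HZ / 25))	/* ~40 ms */
#define TCP_ATO_MIN	((unsigned)(HZ / 25))	/* ~40 ms */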
end of thread, other threads:[~2008-09-23 2:37 UTC | newest] Thread overview: 44+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-09-08 21:56 RFC: Nagle latency tuning Christopher Snook 2008-09-08 22:39 ` Rick Jones 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller 2008-09-09 5:56 ` Chris Snook 2008-09-09 6:02 ` David Miller 2008-09-09 10:31 ` Mark Brown 2008-09-09 12:05 ` David Miller 2008-09-09 12:09 ` Mark Brown 2008-09-09 12:19 ` David Miller 2008-09-09 6:22 ` Evgeniy Polyakov 2008-09-09 6:28 ` Chris Snook 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 2008-09-09 14:36 ` Andi Kleen 2008-09-09 18:40 ` Chris Snook 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:21 ` Arnaldo Carvalho de Melo 2008-09-11 4:08 ` Chris Snook 2008-09-09 19:59 ` David Miller 2008-09-09 20:25 ` Chris Snook 2008-09-22 10:49 ` David Miller 2008-09-22 11:09 ` David Miller 2008-09-22 20:30 ` Andi Kleen 2008-09-22 22:22 ` Chris Snook 2008-09-22 22:26 ` David Miller 2008-09-22 23:00 ` Chris Snook 2008-09-22 23:13 ` David Miller 2008-09-22 23:24 ` Andi Kleen 2008-09-22 23:21 ` David Miller 2008-09-23 0:14 ` Andi Kleen 2008-09-23 0:33 ` Rick Jones 2008-09-23 2:12 ` Andi Kleen 2008-09-23 1:40 ` David Miller 2008-09-23 2:23 ` Andi Kleen 2008-09-23 2:28 ` David Miller 2008-09-23 2:41 ` Andi Kleen 2008-09-22 22:47 ` Rick Jones 2008-09-22 22:57 ` Chris Snook 2008-09-09 16:33 ` Rick Jones 2008-09-09 16:54 ` Chuck Lever 2008-09-09 17:21 ` Arnaldo Carvalho de Melo 2008-09-09 17:54 ` Rick Jones 2008-09-08 22:55 ` Andi Kleen 2008-09-09 5:22 ` Chris Snook