* RFC: Nagle latency tuning @ 2008-09-08 21:56 Christopher Snook 2008-09-08 22:39 ` Rick Jones 2008-09-08 22:55 ` Andi Kleen 0 siblings, 2 replies; 44+ messages in thread From: Christopher Snook @ 2008-09-08 21:56 UTC (permalink / raw) To: Netdev Hey folks -- We frequently get requests from customers for a tunable to disable Nagle system-wide, to be bug-for-bug compatible with Solaris. We routinely reject these requests, as letting naive TCP apps accidentally flood the network is considered harmful. Still, it would be very nice if we could reduce Nagle-induced latencies system-wide, if we could do so without disabling Nagle completely. If you write a multi-threaded app that sends lots of small messages across TCP sockets, and you do not use TCP_NODELAY, you'll often see 40 ms latencies as the network stack waits for more senders to fill an MTU-sized packet before transmitting. Even worse, these apps may work fine across the LAN with a 1500 MTU and then counterintuitively perform much worse over loopback with a 16436 MTU. To combat this, many apps set TCP_NODELAY, often without the abundance of caution that option should entail. Other apps leave it alone, and suffer accordingly. If we could simply lower this latency, without changing the fundamental behavior of the TCP stack, it would be a great benefit to many latency-sensitive apps, and discourage the unnecessary use of TCP_NODELAY. I'm afraid I don't know the TCP stack intimately enough to understand what side effects this might have. Can someone more familiar with the nagle implementations please enlighten me on how this could be done, or why it shouldn't be? -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
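For reference, turning Nagle off from application code is a single setsockopt() call on the connected socket; a minimal sketch (the surrounding socket setup and error handling are assumed):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle on an already-connected TCP socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
static int set_nodelay(int sockfd)
{
        int one = 1;

        return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

Whether calling it is ever the right answer is exactly what the rest of the thread debates.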
* Re: RFC: Nagle latency tuning 2008-09-08 21:56 RFC: Nagle latency tuning Christopher Snook @ 2008-09-08 22:39 ` Rick Jones 2008-09-09 5:10 ` Chris Snook 2008-09-08 22:55 ` Andi Kleen 1 sibling, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-08 22:39 UTC (permalink / raw) To: Christopher Snook; +Cc: Netdev

Christopher Snook wrote:
> Hey folks --
>
> We frequently get requests from customers for a tunable to disable Nagle
> system-wide, to be bug-for-bug compatible with Solaris.

Which ndd setting is that in Solaris, and is it an explicit disabling of Nagle (which wouldn't be much better than arbitrary setting of TCP_NODELAY by apps anyway), or is it a tuning of the send size against which Nagle is comparing?

> We routinely reject these requests, as letting naive TCP apps
> accidentally flood the network is considered harmful. Still, it would
> be very nice if we could reduce Nagle-induced latencies system-wide,
> if we could do so without disabling Nagle completely.
>
> If you write a multi-threaded app that sends lots of small messages
> across TCP sockets, and you do not use TCP_NODELAY, you'll often see 40
> ms latencies as the network stack waits for more senders to fill an
> MTU-sized packet before transmitting.

How does an application being multi-threaded enter into it? IIRC, it is simply a matter of wanting to go "write, write, read" on the socket where the writes are sub-MSS.

> Even worse, these apps may work fine across the LAN with a 1500 MTU
> and then counterintuitively perform much worse over loopback with a
> 16436 MTU.

Without knowing if those apps were fundamentally broken and just got "lucky" at a 1500 byte MTU we cannot really say if it is truly counterintuitive :)

> To combat this, many apps set TCP_NODELAY, often without the abundance
> of caution that option should entail. Other apps leave it alone, and
> suffer accordingly.
>
> If we could simply lower this latency, without changing the fundamental
> behavior of the TCP stack, it would be a great benefit to many
> latency-sensitive apps, and discourage the unnecessary use of TCP_NODELAY.
>
> I'm afraid I don't know the TCP stack intimately enough to understand
> what side effects this might have. Can someone more familiar with the
> nagle implementations please enlighten me on how this could be done, or
> why it shouldn't be?

IIRC, the only way to lower the latency experienced by an application running into latencies associated with poor interaction with Nagle is to either start generating immediate ACKnowledgements at the receiver, lower the standalone ACK timer on the receiver, or start a very short timer on the sender. I doubt that (m)any of those are terribly palatable.

Below is some boilerplate I have on Nagle that isn't Linux-specific:

<begin>

$ cat usenet_replies/nagle_algorithm

> I'm not familiar with this issue, and I'm mostly ignorant about what
> tcp does below the sockets interface. Can anybody briefly explain what
> "nagle" is, and how and when to turn it off? Or point me to the
> appropriate manual.

In broad terms, whenever an application does a send() call, the logic of the Nagle algorithm is supposed to go something like this:

1) Is the quantity of data in this send, plus any queued, unsent data, greater than the MSS (Maximum Segment Size) for this connection? If yes, send the data in the user's send now (modulo any other constraints such as receiver's advertised window and the TCP congestion window). If no, go to 2.

2) Is the connection to the remote otherwise idle?
That is, is there no unACKed data outstanding on the network. If yes, send the data in the user's send now. If no, queue the data and wait. Either the application will continue to call send() with enough data to get to a full MSS-worth of data, or the remote will ACK all the currently sent, unACKed data, or our retransmission timer will expire.

Now, where applications run into trouble is when they have what might be described as "write, write, read" behaviour, where they present logically associated data to the transport in separate 'send' calls and those sends are typically less than the MSS for the connection. It isn't so much that they run afoul of Nagle as they run into issues with the interaction of Nagle and the other heuristics operating on the remote. In particular, the delayed ACK heuristics.

When a receiving TCP is deciding whether or not to send an ACK back to the sender, in broad handwaving terms it goes through logic similar to this:

a) is there data being sent back to the sender? if yes, piggy-back the ACK on the data segment.

b) is there a window update being sent back to the sender? if yes, piggy-back the ACK on the window update.

c) has the standalone ACK timer expired.

Window updates are generally triggered by the following heuristics:

i) would the window update be for a non-trivial fraction of the window - typically somewhere at or above 1/4 the window, that is, has the application "consumed" at least that much data? if yes, send a window update. if no, check ii.

ii) would the window update be for, the application "consumed," at least 2*MSS worth of data? if yes, send a window update, if no wait.

Now, going back to that write, write, read application, on the sending side, the first write will be transmitted by TCP via logic rule 2 - the connection is otherwise idle. However, the second small send will be delayed as there is at that point unACKnowledged data outstanding on the connection.

At the receiver, that small TCP segment will arrive and will be passed to the application. The application does not have the entire app-level message, so it will not send a reply (data to TCP) back. The typical TCP window is much much larger than the MSS, so no window update would be triggered by heuristic i. The data just arrived is < 2*MSS, so no window update from heuristic ii. Since there is no window update, no ACK is sent by heuristic b.

So, that leaves heuristic c - the standalone ACK timer. That ranges anywhere between 50 and 200 milliseconds depending on the TCP stack in use.

If you've read this far :) now we can take a look at the effect of various things touted as "fixes" to applications experiencing this interaction. We take as our example a client-server application where both the client and the server are implemented with a write of a small application header, followed by application data. First, the "default" case which is with Nagle enabled (TCP_NODELAY _NOT_ set) and with standard ACK behaviour:

Client                          Server
Req Header        ->
                  <-            Standalone ACK after Nms
Req Data          ->
                  <-            Possible standalone ACK
                  <-            Rsp Header
Standalone ACK    ->
                  <-            Rsp Data
Possible standalone ACK ->

For two "messages" we end-up with at least six segments on the wire. The possible standalone ACKs will depend on whether the server's response time, or client's think time is longer than the standalone ACK interval on their respective sides.
Now, if TCP_NODELAY is set we see:

Client                          Server
Req Header        ->
Req Data          ->
                  <-            Possible Standalone ACK after Nms
                  <-            Rsp Header
                  <-            Rsp Data
Possible Standalone ACK ->

In theory, we are down to four segments on the wire which seems good, but frankly we can do better. First though, consider what happens when someone disables delayed ACKs:

Client                          Server
Req Header        ->
                  <-            Immediate Standalone ACK
Req Data          ->
                  <-            Immediate Standalone ACK
                  <-            Rsp Header
Immediate Standalone ACK ->
                  <-            Rsp Data
Immediate Standalone ACK ->

Now we definitely see 8 segments on the wire. It will also be that way if both TCP_NODELAY is set and delayed ACKs are disabled.

How about if the application did the "right" thing in the first place? That is, sent the logically associated data at the same time:

Client                          Server
Request           ->
                  <-            Possible Standalone ACK
                  <-            Response
Possible Standalone ACK ->

We are down to two segments on the wire.

For "small" packets, the CPU cost is about the same regardless of data or ACK. This means that the application which is making the proper gathering send call will spend far fewer CPU cycles in the networking stack.

<end>

Now, there are further wrinkles :) Is that application trying to pipeline requests on the application - then we have paths that can look rather like the separate header from data cases above until the concurrent requests outstanding get above the MSS threshold.

My recollection of the original Nagle writeups is the intention is to optimize the ratio of data to data+headers. Back when those writeups were made, 536 byte MSSes were still considered pretty large, and 1460 would have been positively spacious. I doubt that anyone was considering the probability of a 16384 byte MTU. It could be argued that in such an environment of the time period, where stack tunables weren't all the rage and the MSS ranges were reasonably well bounded, it was a sufficient expedient to base the "is this enough data" decision off the MSS for the connection. You certainly couldn't do any better than an MSS's worth of data per segment and segment sizes weren't astronomical. Now that MTUs and MSSes can get so much larger, that expedient may indeed not be so worthwhile. An argument could be made that a ratio of data to data plus headers of say 0.97 (1448/1500) is "good enough" and that requiring a ratio of 16384/16436 = 0.9968 is taking things too far by default.

That said, I still don't like covering the backsides of poorly written applications doing two or more writes for logically associated data.

rick jones

^ permalink raw reply [flat|nested] 44+ messages in thread
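Rick's rules 1) and 2) can be restated as a small decision function. This is illustrative pseudo-C, not the kernel's implementation; the struct fields and helper name are made up for clarity:

/* Illustrative pseudo-C of the Nagle send decision described above. */
struct conn {
        unsigned int mss;               /* MSS for this connection */
        unsigned int unacked_bytes;     /* data in flight, not yet ACKed */
};

static int nagle_should_send_now(const struct conn *c, unsigned int queued_plus_new)
{
        /* Rule 1: a full MSS worth of data can always go out
         * (still subject to the receiver and congestion windows). */
        if (queued_plus_new >= c->mss)
                return 1;

        /* Rule 2: if nothing is in flight, send the small segment now. */
        if (c->unacked_bytes == 0)
                return 1;

        /* Otherwise queue it and wait for more data, an ACK from the
         * remote, or the retransmission timer. */
        return 0;
}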
* Re: RFC: Nagle latency tuning 2008-09-08 22:39 ` Rick Jones @ 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 5:10 UTC (permalink / raw) To: Rick Jones; +Cc: Netdev Rick Jones wrote: > Christopher Snook wrote: >> Hey folks -- >> >> We frequently get requests from customers for a tunable to disable >> Nagle system-wide, to be bug-for-bug compatible with Solaris. > > Which ndd setting is that in Solaris, and is it an explicit disabling of > Nagle (which wouldn't be much better than arbitrary setting of > TCP_NODELAY by apps anyway), or is it a tuning of the send size against > which Nagle is comparing? Dunno, but I'm told it effectively sets TCP_NODELAY on every socket on the box. >> We routinely reject these requests, as letting naive TCP apps >> accidentally flood the network is considered harmful. Still, it would >> be very nice if we could reduce Nagle-induced latencies system-wide, >> if we could do so without disabling Nagle completely. >> >> If you write a multi-threaded app that sends lots of small messages >> across TCP sockets, and you do not use TCP_NODELAY, you'll often see >> 40 ms latencies as the network stack waits for more senders to fill an >> MTU-sized packet before transmitting. > > How does an application being multi-threaded enter into it? IIRC, it is > simply a matter of wanting to go "write, write, read" on the socket > where the writes are sub-MSS. Sorry, I'm getting my problems confused. Being multi-threaded isn't the root problem, it just makes the behavior much less predictable. Instead of getting the latency every other write, you might get it once in every million writes on a highly-threaded workload, which masks the source of the problem. >> Even worse, these apps may work fine across the LAN with a 1500 MTU >> and then counterintuitively perform much worse over loopback with a >> 16436 MTU. > > Without knowing if those apps were fundamentally broken and just got > "lucky" at a 1500 byte MTU we cannot really say if it is truly > counterintuitive :) This is open to debate, but there are certainly a great many apps doing a great deal of very important business that are subject to this problem to some degree. >> To combat this, many apps set TCP_NODELAY, often without the abundance >> of caution that option should entail. Other apps leave it alone, and >> suffer accordingly. >> >> If we could simply lower this latency, without changing the >> fundamental behavior of the TCP stack, it would be a great benefit to >> many latency-sensitive apps, and discourage the unnecessary use of >> TCP_NODELAY. >> >> I'm afraid I don't know the TCP stack intimately enough to understand >> what side effects this might have. Can someone more familiar with the >> nagle implementations please enlighten me on how this could be done, >> or why it shouldn't be? > > > IIRC, the only way to lower the latency experienced by an application > running into latencies associated with poor interaction with Nagle is to > either start generating immediate ACKnowledgements at the reciever, > lower the standalone ACK timer on the receiver, or start a very short > timer on the sender. I doubt that (m)any of those are terribly palatable. I'd like to know where the 40 ms magic number comes from. That's the one that really hurts, and if we could lower that without doing horrible things elsewhere in the stack, as a non-default tuning option, a lot of people would be very happy. 
> Below is some boilerplate I have on Nagle that isn't Linux-specific: > > <begin> > > $ cat usenet_replies/nagle_algorithm > > > I'm not familiar with this issue, and I'm mostly ignorant about what > > tcp does below the sockets interface. Can anybody briefly explain what > > "nagle" is, and how and when to turn it off? Or point me to the > > appropriate manual. > > In broad terms, whenever an application does a send() call, the logic > of the Nagle algorithm is supposed to go something like this: > > 1) Is the quantity of data in this send, plus any queued, unsent data, > greater than the MSS (Maximum Segment Size) for this connection? If > yes, send the data in the user's send now (modulo any other > constraints such as receiver's advertised window and the TCP > congestion window). If no, go to 2. > > 2) Is the connection to the remote otherwise idle? That is, is there > no unACKed data outstanding on the network. If yes, send the data in > the user's send now. If no, queue the data and wait. Either the > application will continue to call send() with enough data to get to a > full MSS-worth of data, or the remote will ACK all the currently sent, > unACKed data, or our retransmission timer will expire. > > Now, where applications run into trouble is when they have what might > be described as "write, write, read" behaviour, where they present > logically associated data to the transport in separate 'send' calls > and those sends are typically less than the MSS for the connection. > It isn't so much that they run afoul of Nagle as they run into issues > with the interaction of Nagle and the other heuristics operating on > the remote. In particular, the delayed ACK heuristics. > > When a receiving TCP is deciding whether or not to send an ACK back to > the sender, in broad handwaving terms it goes through logic similar to > this: > > a) is there data being sent back to the sender? if yes, piggy-back the > ACK on the data segment. > > b) is there a window update being sent back to the sender? if yes, > piggy-back the ACK on the window update. > > c) has the standalone ACK timer expired. > > Window updates are generally triggered by the following heuristics: > > i) would the window update be for a non-trivial fraction of the window > - typically somewhere at or above 1/4 the window, that is, has the > application "consumed" at least that much data? if yes, send a > window update. if no, check ii. > > ii) would the window update be for, the application "consumed," at > least 2*MSS worth of data? if yes, send a window update, if no wait. > > Now, going back to that write, write, read application, on the sending > side, the first write will be transmitted by TCP via logic rule 2 - > the connection is otherwise idle. However, the second small send will > be delayed as there is at that point unACKnowledged data outstanding > on the connection. > > At the receiver, that small TCP segment will arrive and will be passed > to the application. The application does not have the entire app-level > message, so it will not send a reply (data to TCP) back. The typical > TCP window is much much larger than the MSS, so no window update would > be triggered by heuristic i. The data just arrived is < 2*MSS, so no > window update from heuristic ii. Since there is no window update, no > ACK is sent by heuristic b. > > So, that leaves heuristic c - the standalone ACK timer. That ranges > anywhere between 50 and 200 milliseconds depending on the TCP stack in > use. 
> > If you've read this far :) now we can take a look at the effect of > various things touted as "fixes" to applications experiencing this > interaction. We take as our example a client-server application where > both the client and the server are implemented with a write of a small > application header, followed by application data. First, the > "default" case which is with Nagle enabled (TCP_NODELAY _NOT_ set) and > with standard ACK behaviour: > > Client Server > Req Header -> > <- Standalone ACK after Nms > Req Data -> > <- Possible standalone ACK > <- Rsp Header > Standalone ACK -> > <- Rsp Data > Possible standalone ACK -> > > > For two "messages" we end-up with at least six segments on the wire. > The possible standalone ACKs will depend on whether the server's > response time, or client's think time is longer than the standalone > ACK interval on their respective sides. Now, if TCP_NODELAY is set we > see: > > > Client Server > Req Header -> > Req Data -> > <- Possible Standalone ACK after Nms > <- Rsp Header > <- Rsp Data > Possible Standalone ACK -> > > In theory, we are down two four segments on the wire which seems good, > but frankly we can do better. First though, consider what happens > when someone disables delayed ACKs > > Client Server > Req Header -> > <- Immediate Standalone ACK > Req Data -> > <- Immediate Standalone ACK > <- Rsp Header > Immediate Standalone ACK -> > <- Rsp Data > Immediate Standalone ACK -> > > Now we definitly see 8 segments on the wire. It will also be that way > if both TCP_NODELAY is set and delayed ACKs are disabled. > > How about if the application did the "right" think in the first place? > That is sent the logically associated data at the same time: > > > Client Server > Request -> > <- Possible Standalone ACK > <- Response > Possible Standalone ACK -> > > We are down to two segments on the wire. > > For "small" packets, the CPU cost is about the same regardless of data > or ACK. This means that the application which is making the propper > gathering send call will spend far fewer CPU cycles in the networking > stack. > > <end> > > Now, there are further wrinkles :) Is that application trying to > pipeline requests on the application - then we have paths that can look > rather like the separate header from data cases above until the > concurrent requests outstanding get above the MSS threshold. > > My recollection of the original Nagle writeups is the intention is to > optimize the ratio of data to data+headers. Back when those writeups > were made, 536 byte MSSes were still considered pretty large, and 1460 > would have been positively spacious. I doubt that anyone were > considering the probability of a 16384 byte MTU. It could be argued > that in such an environment of the timeperiod, where stack tunables > weren't all the rage and the MSS ranges were reasonably well bounded, it > was a sufficient expedient to base the "is this enough data" decision > off the MSS for the connection. You certainly couldn't do any better > than an MSS's worth of data per segment and segment sizes weren't > astronomical. Now that MTU's and MSS's can get so much larger, that > expedient may indeed not be so worthwhile. An argument could be made > that a ratio of data to data plus headers of say 0.97 (1448/1500) is > "good enough" and that requiring a ratio of 16384/16436 = 0.9968 is > taking things too far by default. 
> > That said, I still don't like covering the backsides of poorly written > applications doing two or more writes for logically associated data. > > rick jones Most of the apps where people care about this enough to complain to their vendor (the cases I hear about) are in messaging apps, where they're relaying a stream of events that have little to do with each other, and they want TCP to maintain the integrity of the connection and do a modicum of bandwidth management, but 40 ms stalls greatly exceed their latency tolerances. Using TCP_NODELAY is often the least bad option, but sometimes it's infeasible because of its effect on the network, and it certainly adds to the network stack overhead. A more tunable Nagle delay would probably serve many of these apps much better. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:10 ` Chris Snook @ 2008-09-09 5:17 ` David Miller 2008-09-09 5:56 ` Chris Snook 2008-09-09 14:36 ` Andi Kleen 2008-09-09 16:33 ` Rick Jones 2 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-09 5:17 UTC (permalink / raw) To: csnook; +Cc: rick.jones2, netdev

From: Chris Snook <csnook@redhat.com>
Date: Tue, 09 Sep 2008 01:10:05 -0400

> This is open to debate, but there are certainly a great many apps
> doing a great deal of very important business that are subject to
> this problem to some degree.

Let's be frank and be honest that we're talking about message passing financial service applications.

And I specifically know that the problem they run into is that the congestion window doesn't open up because of Nagle _AND_ the fact that congestion control is done using packet counts rather than data byte totals. So if you send lots of small stuff, the window doesn't open. Nagle just makes this problem worse, rather than create it.

And we have a workaround for them, which is a combination of the tcp_slow_start_after_idle sysctl and route metrics specifying the initial congestion window value to use.

I specifically added that sysctl for this specific situation.

Really, the situation here is well established and I highly encourage you to take a deeper look into the actual problems being hit, and show us some specific traces we can analyze properly if the problem is still there.

Otherwise we're just shooting into the wind without any specifics to work on whatsoever.

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:17 ` David Miller @ 2008-09-09 5:56 ` Chris Snook 2008-09-09 6:02 ` David Miller 2008-09-09 6:22 ` Evgeniy Polyakov 0 siblings, 2 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 5:56 UTC (permalink / raw) To: David Miller; +Cc: rick.jones2, netdev David Miller wrote: > From: Chris Snook <csnook@redhat.com> > Date: Tue, 09 Sep 2008 01:10:05 -0400 > >> This is open to debate, but there are certainly a great many apps >> doing a great deal of very important business that are subject to >> this problem to some degree. > > Let's be frank and be honest that we're talking about message passing > financial service applications. Mostly. > And I specifically know that the problem they run into is that the > congestion window doesn't open up because of Nagle _AND_ the fact that > congestion control is done using packet counts rather that data byte > totals. So if you send lots of small stuff, the window doesn't open. > Nagle just makes this problem worse, rather than create it. > > And we have a workaround for them, which is a combination of the > tcp_slow_start_after_idle sysctl in combination with route metrics > specifying the initial congestion window value to use. > > I specifically added that sysctl for this specific situation. That's not the problem I'm talking about here. The problem I'm seeing is that if your burst of messages is too small to fill the MTU, the network stack will just sit there and stare at you for precisely 40 ms (an eternity for a financial app) before transmitting. Andi may be correct that it's actually the delayed ACK we're seeing, but I can't figure out where that 40 ms magic number is coming from. The easiest way to see the problem is to open a TCP socket to an echo daemon on loopback, make a bunch of small writes totaling less than your loopback MTU (accounting for overhead), and see how long it takes to get your echoes. You can probably do this with netcat, though I haven't tried. People don't expect loopback to have 40 ms latency when the box is lightly loaded, so they'd really like to tweak that down when it's hurting them. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
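A minimal sketch of the kind of loopback measurement Chris describes, assuming an echo service is already listening on 127.0.0.1 port 7; note, as comes up later in the thread, that whether the stall actually shows up depends on how the echo daemon batches its replies:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Send one logical message as several sub-MSS writes to an echo server
 * on loopback and time how long the full echo takes to come back. */
int main(void)
{
        struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(7) };
        char chunk[100] = { 0 }, buf[1024];
        struct timeval t0, t1;
        int fd, i, got = 0;

        inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                return 1;

        gettimeofday(&t0, NULL);
        for (i = 0; i < 4; i++)                 /* four small writes, one message */
                write(fd, chunk, sizeof(chunk));
        while (got < 4 * (int)sizeof(chunk)) {  /* wait for the complete echo */
                int n = read(fd, buf, sizeof(buf));
                if (n <= 0)
                        break;
                got += n;
        }
        gettimeofday(&t1, NULL);

        printf("echo round trip: %ld us\n",
               (long)((t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec)));
        close(fd);
        return 0;
}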
* Re: RFC: Nagle latency tuning 2008-09-09 5:56 ` Chris Snook @ 2008-09-09 6:02 ` David Miller 2008-09-09 10:31 ` Mark Brown 2008-09-09 6:22 ` Evgeniy Polyakov 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-09 6:02 UTC (permalink / raw) To: csnook; +Cc: rick.jones2, netdev From: Chris Snook <csnook@redhat.com> Date: Tue, 09 Sep 2008 01:56:12 -0400 [ Please hit enter every 80 columns or so, your emails are unreadable until I reformat your text by hand, thanks. ] > That's not the problem I'm talking about here. The problem I'm > seeing is that if your burst of messages is too small to fill the > MTU, the network stack will just sit there and stare at you for > precisely 40 ms (an eternity for a financial app) before > transmitting. Andi may be correct that it's actually the delayed > ACK we're seeing, but I can't figure out where that 40 ms magic > number is coming from. > > The easiest way to see the problem is to open a TCP socket to an > echo daemon on loopback, make a bunch of small writes totaling less > than your loopback MTU (accounting for overhead), and see how long > it takes to get your echoes. You can probably do this with netcat, > though I haven't tried. People don't expect loopback to have 40 ms > latency when the box is lightly loaded, so they'd really like to > tweak that down when it's hurting them. That's informative, but please provide a specific test case and example trace so we can discuss something concrete. Thank you. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 6:02 ` David Miller @ 2008-09-09 10:31 ` Mark Brown 2008-09-09 12:05 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Mark Brown @ 2008-09-09 10:31 UTC (permalink / raw) To: David Miller; +Cc: csnook, rick.jones2, netdev

On Mon, Sep 08, 2008 at 11:02:21PM -0700, David Miller wrote:
> From: Chris Snook <csnook@redhat.com>
> Date: Tue, 09 Sep 2008 01:56:12 -0400

> [ Please hit enter every 80 columns or so, your emails are
> unreadable until I reformat your text by hand, thanks. ]

FWIW the problem with this and some other mails you raised this on recently is that they're sent with format=flowed, which tells the receiving MUA that it's OK to reflow the text to fit the display.

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 10:31 ` Mark Brown @ 2008-09-09 12:05 ` David Miller 2008-09-09 12:09 ` Mark Brown 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-09 12:05 UTC (permalink / raw) To: broonie; +Cc: csnook, rick.jones2, netdev From: Mark Brown <broonie@sirena.org.uk> Date: Tue, 9 Sep 2008 11:31:38 +0100 > On Mon, Sep 08, 2008 at 11:02:21PM -0700, David Miller wrote: > > From: Chris Snook <csnook@redhat.com> > > Date: Tue, 09 Sep 2008 01:56:12 -0400 > > > [ Please hit enter every 80 columns or so, your emails are > > unreadable until I reformat your text by hand, thanks. ] > > FWIW the problem with this and some other mails you raised this on > recently is that they're sent with format=folowed which tells the > receiving MUA that it's OK to reflow the text to fit the display. I'll have to see why MEW isn't handling that correctly then, it even claims to support this in the version I am using :) ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 12:05 ` David Miller @ 2008-09-09 12:09 ` Mark Brown 2008-09-09 12:19 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Mark Brown @ 2008-09-09 12:09 UTC (permalink / raw) To: David Miller; +Cc: csnook, rick.jones2, netdev On Tue, Sep 09, 2008 at 05:05:00AM -0700, David Miller wrote: > From: Mark Brown <broonie@sirena.org.uk> > > FWIW the problem with this and some other mails you raised this on > > recently is that they're sent with format=folowed which tells the > > receiving MUA that it's OK to reflow the text to fit the display. > I'll have to see why MEW isn't handling that correctly then, > it even claims to support this in the version I am using :) Handling it correctly is probably the problem - if it's working as expected then when your window is wide enough to display lines longer than 80 columns it'll go ahead and reflow the text to fill the window. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 12:09 ` Mark Brown @ 2008-09-09 12:19 ` David Miller 0 siblings, 0 replies; 44+ messages in thread From: David Miller @ 2008-09-09 12:19 UTC (permalink / raw) To: broonie; +Cc: csnook, rick.jones2, netdev From: Mark Brown <broonie@sirena.org.uk> Date: Tue, 9 Sep 2008 13:09:34 +0100 > On Tue, Sep 09, 2008 at 05:05:00AM -0700, David Miller wrote: > > From: Mark Brown <broonie@sirena.org.uk> > > > > FWIW the problem with this and some other mails you raised this on > > > recently is that they're sent with format=folowed which tells the > > > receiving MUA that it's OK to reflow the text to fit the display. > > > I'll have to see why MEW isn't handling that correctly then, > > it even claims to support this in the version I am using :) > > Handling it correctly is probably the problem - if it's working as > expected then when your window is wide enough to display lines longer > than 80 columns it'll go ahead and reflow the text to fill the window. I'm using a VC console screen ~160 characters wide, and many lines were wrapped instead of flowed. Emacs puts a special character at the end of the line when it has to be wrapped, and I saw those when viewing the mails in question. When I see that character when reading emails, my blood starts to boil :) ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:56 ` Chris Snook 2008-09-09 6:02 ` David Miller @ 2008-09-09 6:22 ` Evgeniy Polyakov 2008-09-09 6:28 ` Chris Snook 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 1 sibling, 2 replies; 44+ messages in thread From: Evgeniy Polyakov @ 2008-09-09 6:22 UTC (permalink / raw) To: Chris Snook; +Cc: David Miller, rick.jones2, netdev Hi. On Tue, Sep 09, 2008 at 01:56:12AM -0400, Chris Snook (csnook@redhat.com) wrote: > The easiest way to see the problem is to open a TCP socket to an echo > daemon on loopback, make a bunch of small writes totaling less than your > loopback MTU (accounting for overhead), and see how long it takes to get > your echoes. You can probably do this with netcat, though I haven't > tried. People don't expect loopback to have 40 ms latency when the box > is lightly loaded, so they'd really like to tweak that down when it's > hurting them. Isn't Nagle without corking a very bad idea? Or you can not change the application? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 6:22 ` Evgeniy Polyakov @ 2008-09-09 6:28 ` Chris Snook 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 6:28 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: David Miller, rick.jones2, netdev Evgeniy Polyakov wrote: > Hi. > > On Tue, Sep 09, 2008 at 01:56:12AM -0400, Chris Snook (csnook@redhat.com) wrote: >> The easiest way to see the problem is to open a TCP socket to an echo >> daemon on loopback, make a bunch of small writes totaling less than your >> loopback MTU (accounting for overhead), and see how long it takes to get >> your echoes. You can probably do this with netcat, though I haven't >> tried. People don't expect loopback to have 40 ms latency when the box >> is lightly loaded, so they'd really like to tweak that down when it's >> hurting them. > > Isn't Nagle without corking a very bad idea? Or you can not change the > application? > Yes, it is a bad idea. We want to make the corking tunable, so people don't disable it completely to avoid these 40 ms latencies. Also, we often can't change the application, so tuning this system-wide would be nice, and would be a lot less dangerous than turning on TCP_NODELAY system-wide the way people often do with solaris. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
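For comparison, explicit corking around the two writes looks roughly like this; the header/data buffers and the already-connected socket are assumed, and TCP_CORK is Linux-specific:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send a logically associated header + payload as one segment by
 * corking around the two writes; uncorking flushes whatever is queued,
 * so the pair goes out without waiting on Nagle/delayed-ACK timers. */
static void send_corked(int fd, const void *hdr, size_t hdrlen,
                        const void *data, size_t datalen)
{
        int on = 1, off = 0;

        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        write(fd, hdr, hdrlen);
        write(fd, data, datalen);
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}

The libautocork library mentioned in the next message automates this idea from an LD_PRELOAD shim, so the application itself does not have to change.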
* Re: RFC: Nagle latency tuning 2008-09-09 6:22 ` Evgeniy Polyakov 2008-09-09 6:28 ` Chris Snook @ 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 44+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-09-09 13:00 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: Chris Snook, David Miller, rick.jones2, netdev

Em Tue, Sep 09, 2008 at 10:22:08AM +0400, Evgeniy Polyakov escreveu:
> Hi.
>
> On Tue, Sep 09, 2008 at 01:56:12AM -0400, Chris Snook (csnook@redhat.com) wrote:
> > The easiest way to see the problem is to open a TCP socket to an echo
> > daemon on loopback, make a bunch of small writes totaling less than your
> > loopback MTU (accounting for overhead), and see how long it takes to get
> > your echoes. You can probably do this with netcat, though I haven't
> > tried. People don't expect loopback to have 40 ms latency when the box
> > is lightly loaded, so they'd really like to tweak that down when it's
> > hurting them.
>
> Isn't Nagle without corking a very bad idea? Or you can not change the
> application?

In one such case, a financial app building logical packets via several small buffer send calls, I got it working with an "autocorking" LD_PRELOAD library, libautocork:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git
git://git.kernel.org/pub/scm/linux/kernel/git/acme/libautocork.git

Details/test cases:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob_plain;f=tcp_nodelay.txt

How to use it, what you get from using it:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob_plain;f=libautocork.txt

- Arnaldo

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller @ 2008-09-09 14:36 ` Andi Kleen 2008-09-09 18:40 ` Chris Snook 2008-09-09 16:33 ` Rick Jones 2 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-09 14:36 UTC (permalink / raw) To: Chris Snook; +Cc: Rick Jones, Netdev

Chris Snook <csnook@redhat.com> writes:
>
> I'd like to know where the 40 ms magic number comes from.

From TCP_ATO_MIN

#define TCP_ATO_MIN ((unsigned)(HZ/25))

> That's the
> one that really hurts, and if we could lower that without doing
> horrible things elsewhere in the stack,

You can lower it (with likely some bad side effects), but I don't think it would make these apps very happy in the end because they likely want no delay at all.

-Andi

^ permalink raw reply [flat|nested] 44+ messages in thread
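The arithmetic behind that constant, for completeness: a jiffy is 1/HZ seconds, so HZ/25 jiffies is the same wall-clock interval no matter what HZ is configured (for HZ >= 100):

        (HZ / 25) jiffies * (1 / HZ) seconds per jiffy = 1/25 s = 40 ms

TCP_DELACK_MIN is defined to the same value, which is why the delayed-ACK floor and the quick-ACK timeout both come out at 40 ms.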
* Re: RFC: Nagle latency tuning 2008-09-09 14:36 ` Andi Kleen @ 2008-09-09 18:40 ` Chris Snook 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:59 ` David Miller 0 siblings, 2 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 18:40 UTC (permalink / raw) To: Andi Kleen; +Cc: Rick Jones, Netdev Andi Kleen wrote: > Chris Snook <csnook@redhat.com> writes: >> I'd like to know where the 40 ms magic number comes from. > > From TCP_ATO_MIN > > #define TCP_ATO_MIN ((unsigned)(HZ/25)) > >> That's the >> one that really hurts, and if we could lower that without doing >> horrible things elsewhere in the stack, > > You can lower it (with likely some bad side effects), but I don't think it > would make these apps very happy in the end because they likely want > no delay at all. > > -Andi These apps have a love/hate relationship with TCP. They'll probably love SCTP 5 years from now, but it's not mature enough for them yet. They do want to minimize all latencies, and many of the apps explicitly set TCP_NODELAY. The goal here is to improve latencies on the supporting apps that aren't quite as carefully optimized as the main message daemons themselves. If we can give them a knob that bounds their worst-case latency to 2-3 times their average latency, without risking network floods that won't show up in testing, they'll be much happier. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 18:40 ` Chris Snook @ 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:21 ` Arnaldo Carvalho de Melo 2008-09-11 4:08 ` Chris Snook 2008-09-09 19:59 ` David Miller 1 sibling, 2 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-09 19:07 UTC (permalink / raw) To: Chris Snook; +Cc: Andi Kleen, Rick Jones, Netdev > These apps have a love/hate relationship with TCP. They'll probably love > SCTP 5 years from now, but it's not mature enough for them yet. They do > want to minimize all latencies, Then they should just TCP_NODELAY. > and many of the apps explicitly set > TCP_NODELAY. That's the right thing for them. > The goal here is to improve latencies on the supporting apps > that aren't quite as carefully optimized as the main message daemons > themselves. If we can give them a knob that bounds their worst-case > latency to 2-3 times their average latency, without risking network floods > that won't show up in testing, they'll be much happier. Hmm in theory I don't see a big drawback in making the these defaults sysctls. As in this untested patch. It's probably not the right solution for this problem. Still if you want to experiment. This makes both the ato default and the delack default tunable. You'll have to restart sockets for it to take effect. -Andi --- Make ato min and delack min tunable This might potentially help with some programs which have problems with nagle. Sockets have to be restarted TBD documentation for the new sysctls Signed-off-by: Andi Kleen <ak@linux.intel.com> Index: linux-2.6.27-rc4-misc/include/net/tcp.h =================================================================== --- linux-2.6.27-rc4-misc.orig/include/net/tcp.h +++ linux-2.6.27-rc4-misc/include/net/tcp.h @@ -118,12 +118,16 @@ extern void tcp_time_wait(struct sock *s #define TCP_DELACK_MAX ((unsigned)(HZ/5)) /* maximal time to delay before sending an ACK */ #if HZ >= 100 -#define TCP_DELACK_MIN ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ -#define TCP_ATO_MIN ((unsigned)(HZ/25)) +#define TCP_DELACK_MIN_DEFAULT ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ +#define TCP_ATO_MIN_DEFAULT ((unsigned)(HZ/25)) #else -#define TCP_DELACK_MIN 4U -#define TCP_ATO_MIN 4U +#define TCP_DELACK_MIN_DEFAULT 4U +#define TCP_ATO_MIN_DEFAULT 4U #endif + +#define TCP_DELACK_MIN sysctl_tcp_delack_min +#define TCP_ATO_MIN sysctl_tcp_ato_min + #define TCP_RTO_MAX ((unsigned)(120*HZ)) #define TCP_RTO_MIN ((unsigned)(HZ/5)) #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value */ @@ -236,6 +240,8 @@ extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_max_ssthresh; +extern int sysctl_tcp_ato_min; +extern int sysctl_tcp_delack_min; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; Index: linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c =================================================================== --- linux-2.6.27-rc4-misc.orig/net/ipv4/sysctl_net_ipv4.c +++ linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c @@ -717,6 +717,24 @@ static struct ctl_table ipv4_table[] = { }, { .ctl_name = CTL_UNNUMBERED, + .procname = "tcp_delack_min", + .data = &sysctl_tcp_delack_min, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_jiffies, + .strategy = &sysctl_jiffies + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "tcp_ato_min", + .data = &sysctl_tcp_ato_min, + .maxlen = sizeof(int), + 
.mode = 0644, + .proc_handler = &proc_dointvec_jiffies, + .strategy = &sysctl_jiffies + }, + { + .ctl_name = CTL_UNNUMBERED, .procname = "udp_mem", .data = &sysctl_udp_mem, .maxlen = sizeof(sysctl_udp_mem), Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c =================================================================== --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_timer.c +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c @@ -29,6 +29,8 @@ int sysctl_tcp_keepalive_intvl __read_mo int sysctl_tcp_retries1 __read_mostly = TCP_RETR1; int sysctl_tcp_retries2 __read_mostly = TCP_RETR2; int sysctl_tcp_orphan_retries __read_mostly; +int sysctl_tcp_delack_min __read_mostly = TCP_DELACK_MIN_DEFAULT; +int sysctl_tcp_ato_min __read_mostly = TCP_ATO_MIN_DEFAULT; static void tcp_write_timer(unsigned long); static void tcp_delack_timer(unsigned long); Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c =================================================================== --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_output.c +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c @@ -2436,7 +2436,7 @@ void tcp_send_delayed_ack(struct sock *s * directly. */ if (tp->srtt) { - int rtt = max(tp->srtt >> 3, TCP_DELACK_MIN); + int rtt = max_t(unsigned, tp->srtt >> 3, TCP_DELACK_MIN); if (rtt < max_ato) max_ato = rtt; -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 19:07 ` Andi Kleen @ 2008-09-09 19:21 ` Arnaldo Carvalho de Melo 2008-09-11 4:08 ` Chris Snook 1 sibling, 0 replies; 44+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-09-09 19:21 UTC (permalink / raw) To: Andi Kleen; +Cc: Chris Snook, Rick Jones, Netdev Em Tue, Sep 09, 2008 at 09:07:37PM +0200, Andi Kleen escreveu: > > These apps have a love/hate relationship with TCP. They'll probably love > > SCTP 5 years from now, but it's not mature enough for them yet. They do > > want to minimize all latencies, > > Then they should just TCP_NODELAY. > > > and many of the apps explicitly set > > TCP_NODELAY. > > That's the right thing for them. But please ask them to use writev or build the logical packet in userspace, sending it as just one buffer, or they will start asking for nagle tunables ;-) - Arnaldo ^ permalink raw reply [flat|nested] 44+ messages in thread
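The writev() variant of that advice, sketched for a header + payload pair (names are illustrative; a production version would also loop on short writes):

#include <sys/uio.h>

/* Hand the header and the payload to TCP in a single gathering call, so
 * Nagle sees one send of hdrlen + datalen bytes instead of two sub-MSS sends. */
static ssize_t send_message(int fd, const void *hdr, size_t hdrlen,
                            const void *data, size_t datalen)
{
        struct iovec iov[2] = {
                { .iov_base = (void *)hdr,  .iov_len = hdrlen  },
                { .iov_base = (void *)data, .iov_len = datalen },
        };

        return writev(fd, iov, 2);
}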
* Re: RFC: Nagle latency tuning 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:21 ` Arnaldo Carvalho de Melo @ 2008-09-11 4:08 ` Chris Snook 1 sibling, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-11 4:08 UTC (permalink / raw) To: Andi Kleen; +Cc: Rick Jones, Netdev Andi Kleen wrote: >> These apps have a love/hate relationship with TCP. They'll probably love >> SCTP 5 years from now, but it's not mature enough for them yet. They do >> want to minimize all latencies, > > Then they should just TCP_NODELAY. > >> and many of the apps explicitly set >> TCP_NODELAY. > > That's the right thing for them. > >> The goal here is to improve latencies on the supporting apps >> that aren't quite as carefully optimized as the main message daemons >> themselves. If we can give them a knob that bounds their worst-case >> latency to 2-3 times their average latency, without risking network floods >> that won't show up in testing, they'll be much happier. > > Hmm in theory I don't see a big drawback in making the these defaults sysctls. > As in this untested patch. It's probably not the right solution > for this problem. Still if you want to experiment. This makes both > the ato default and the delack default tunable. You'll have to restart > sockets for it to take effect. > > -Andi > > --- > > > Make ato min and delack min tunable > > This might potentially help with some programs which have problems with nagle. > > Sockets have to be restarted > > TBD documentation for the new sysctls > > Signed-off-by: Andi Kleen <ak@linux.intel.com> It needs the changed constants replaced with the _DEFAULT versions in net/dccp/timer.c and net/dccp/output.c to build with DCCP enabled. I did that, and tested it (over loopback). The tunables come up at 0, not the expected default values, and when that happens, latencies are extremely low, as would be expected with a value of 0, but when I set net.ipv4.tcp_delack_min to *any* non-zero value, the old 40 ms magic number becomes 200 ms. I haven't yet figured out why. Tweaking net.ipv4.tcp_ato_min isn't having any observable effect on my loopback latencies. I think there may be something worth pursuing with a tcp_delack_min tunable. Any suggestions on where I should look to debug this? 
-- Chris > Index: linux-2.6.27-rc4-misc/include/net/tcp.h > =================================================================== > --- linux-2.6.27-rc4-misc.orig/include/net/tcp.h > +++ linux-2.6.27-rc4-misc/include/net/tcp.h > @@ -118,12 +118,16 @@ extern void tcp_time_wait(struct sock *s > > #define TCP_DELACK_MAX ((unsigned)(HZ/5)) /* maximal time to delay before sending an ACK */ > #if HZ >= 100 > -#define TCP_DELACK_MIN ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ > -#define TCP_ATO_MIN ((unsigned)(HZ/25)) > +#define TCP_DELACK_MIN_DEFAULT ((unsigned)(HZ/25)) /* minimal time to delay before sending an ACK */ > +#define TCP_ATO_MIN_DEFAULT ((unsigned)(HZ/25)) > #else > -#define TCP_DELACK_MIN 4U > -#define TCP_ATO_MIN 4U > +#define TCP_DELACK_MIN_DEFAULT 4U > +#define TCP_ATO_MIN_DEFAULT 4U > #endif > + > +#define TCP_DELACK_MIN sysctl_tcp_delack_min > +#define TCP_ATO_MIN sysctl_tcp_ato_min > + > #define TCP_RTO_MAX ((unsigned)(120*HZ)) > #define TCP_RTO_MIN ((unsigned)(HZ/5)) > #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value */ > @@ -236,6 +240,8 @@ extern int sysctl_tcp_base_mss; > extern int sysctl_tcp_workaround_signed_windows; > extern int sysctl_tcp_slow_start_after_idle; > extern int sysctl_tcp_max_ssthresh; > +extern int sysctl_tcp_ato_min; > +extern int sysctl_tcp_delack_min; > > extern atomic_t tcp_memory_allocated; > extern atomic_t tcp_sockets_allocated; > Index: linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c > =================================================================== > --- linux-2.6.27-rc4-misc.orig/net/ipv4/sysctl_net_ipv4.c > +++ linux-2.6.27-rc4-misc/net/ipv4/sysctl_net_ipv4.c > @@ -717,6 +717,24 @@ static struct ctl_table ipv4_table[] = { > }, > { > .ctl_name = CTL_UNNUMBERED, > + .procname = "tcp_delack_min", > + .data = &sysctl_tcp_delack_min, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec_jiffies, > + .strategy = &sysctl_jiffies > + }, > + { > + .ctl_name = CTL_UNNUMBERED, > + .procname = "tcp_ato_min", > + .data = &sysctl_tcp_ato_min, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec_jiffies, > + .strategy = &sysctl_jiffies > + }, > + { > + .ctl_name = CTL_UNNUMBERED, > .procname = "udp_mem", > .data = &sysctl_udp_mem, > .maxlen = sizeof(sysctl_udp_mem), > Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c > =================================================================== > --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_timer.c > +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_timer.c > @@ -29,6 +29,8 @@ int sysctl_tcp_keepalive_intvl __read_mo > int sysctl_tcp_retries1 __read_mostly = TCP_RETR1; > int sysctl_tcp_retries2 __read_mostly = TCP_RETR2; > int sysctl_tcp_orphan_retries __read_mostly; > +int sysctl_tcp_delack_min __read_mostly = TCP_DELACK_MIN_DEFAULT; > +int sysctl_tcp_ato_min __read_mostly = TCP_ATO_MIN_DEFAULT; > > static void tcp_write_timer(unsigned long); > static void tcp_delack_timer(unsigned long); > Index: linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c > =================================================================== > --- linux-2.6.27-rc4-misc.orig/net/ipv4/tcp_output.c > +++ linux-2.6.27-rc4-misc/net/ipv4/tcp_output.c > @@ -2436,7 +2436,7 @@ void tcp_send_delayed_ack(struct sock *s > * directly. > */ > if (tp->srtt) { > - int rtt = max(tp->srtt >> 3, TCP_DELACK_MIN); > + int rtt = max_t(unsigned, tp->srtt >> 3, TCP_DELACK_MIN); > > if (rtt < max_ato) > max_ato = rtt; > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 18:40 ` Chris Snook 2008-09-09 19:07 ` Andi Kleen @ 2008-09-09 19:59 ` David Miller 2008-09-09 20:25 ` Chris Snook 2008-09-22 10:49 ` David Miller 1 sibling, 2 replies; 44+ messages in thread From: David Miller @ 2008-09-09 19:59 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev Still waiting for your test program and example traces Chris. Until I see that there really isn't anything more I can contribute concretely to this discussion. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 19:59 ` David Miller @ 2008-09-09 20:25 ` Chris Snook 2008-09-22 10:49 ` David Miller 1 sibling, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 20:25 UTC (permalink / raw) To: David Miller; +Cc: andi, rick.jones2, netdev David Miller wrote: > Still waiting for your test program and example traces Chris. > > Until I see that there really isn't anything more I can contribute > concretely to this discussion. No problem. Right now I'm working on testing the target workloads with the tunables Andi posted, but I'll also work up a trivial test case to demonstrate it. The apps where these problems are being reported are not exactly the sort of thing one posts to netdev. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 19:59 ` David Miller 2008-09-09 20:25 ` Chris Snook @ 2008-09-22 10:49 ` David Miller 2008-09-22 11:09 ` David Miller 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 10:49 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev From: David Miller <davem@davemloft.net> Date: Tue, 09 Sep 2008 12:59:34 -0700 (PDT) > > Still waiting for your test program and example traces Chris. > > Until I see that there really isn't anything more I can contribute > concretely to this discussion. Ping, still waiting for this... Can you provide the test case, perhaps sometime this year? :-) I'll try to figure out why Andi's patch doesn't behave as expected. I suspect you may have a bum build if the sysctl values are coming up as zero as that's completely impossible as far as I can tell. If something so fundamental as that isn't behaving properly, all bets are off for anything else you try to use when having that patch applied. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 10:49 ` David Miller @ 2008-09-22 11:09 ` David Miller 2008-09-22 20:30 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 11:09 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev

From: David Miller <davem@davemloft.net>
Date: Mon, 22 Sep 2008 03:49:33 -0700 (PDT)

> I'll try to figure out why Andi's patch doesn't behave as expected.

Andi's patch uses proc_dointvec_jiffies, which is for sysctl values stored as seconds, whereas these things record values with smaller granularity and are stored in jiffies, and that's why we get zero on read and writes have crazy effects.

Also, as Andi stated, this is not the way to deal with this problem.

So we have a broken patch, which even if implemented properly isn't the way forward, so I consider this discussion dead in the water until we have some test cases.

Don't you think? :-)

^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 11:09 ` David Miller @ 2008-09-22 20:30 ` Andi Kleen 2008-09-22 22:22 ` Chris Snook 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-22 20:30 UTC (permalink / raw) To: David Miller; +Cc: csnook, andi, rick.jones2, netdev On Mon, Sep 22, 2008 at 04:09:12AM -0700, David Miller wrote: > From: David Miller <davem@davemloft.net> > Date: Mon, 22 Sep 2008 03:49:33 -0700 (PDT) > > > I'll try to figure out why Andi's patch doesn't behave as expected. > > Andi's patch uses proc_dointvec_jiffies, which is for sysctl values > stored as seconds, whereas these things are used to record values with > smaller granulatiry, are stored in jiffies, and that's why we get zero > on read and writes have crazy effects. Oops. Assume me with brown paper bag etc.etc. It was a typo for proc_dointvec_ms_jiffies > > Also, as Andi stated, this is not the way to deal with this problem. > > So we have a broken patch, which even if implemented properly isn't the > way forward, so I consider this discussion dead in the water until we > have some test cases. The patch is easy to fix with a s/_jiffies/_ms_jiffies/g Also it was more intended for him to play around and get some data points. I guess for that it's still useful. Also while for that it's probably not the right solution, but I could imagine in some other situations where it might be useful to tune these values. After all they are not written down in stone. I wonder if it would even make sense to consider hr timers for TCP now. =Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
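Applied to Andi's earlier patch, the corrected sysctl table entries would look something like the following, with the same caveats as the original (experimental, untested, sockets must be restarted for new values to take effect):

	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "tcp_delack_min",
		.data		= &sysctl_tcp_delack_min,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_ms_jiffies,
		.strategy	= &sysctl_ms_jiffies
	},
	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "tcp_ato_min",
		.data		= &sysctl_tcp_ato_min,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_ms_jiffies,
		.strategy	= &sysctl_ms_jiffies
	},

With the ms_jiffies handlers the values are read and written as milliseconds rather than seconds, while still being stored internally in jiffies.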
* Re: RFC: Nagle latency tuning 2008-09-22 20:30 ` Andi Kleen @ 2008-09-22 22:22 ` Chris Snook 2008-09-22 22:26 ` David Miller 2008-09-22 22:47 ` Rick Jones 0 siblings, 2 replies; 44+ messages in thread From: Chris Snook @ 2008-09-22 22:22 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, rick.jones2, netdev Andi Kleen wrote: > On Mon, Sep 22, 2008 at 04:09:12AM -0700, David Miller wrote: >> From: David Miller <davem@davemloft.net> >> Date: Mon, 22 Sep 2008 03:49:33 -0700 (PDT) >> >>> I'll try to figure out why Andi's patch doesn't behave as expected. >> Andi's patch uses proc_dointvec_jiffies, which is for sysctl values >> stored as seconds, whereas these things are used to record values with >> smaller granulatiry, are stored in jiffies, and that's why we get zero >> on read and writes have crazy effects. > > Oops. Assume me with brown paper bag etc.etc. > > It was a typo for proc_dointvec_ms_jiffies > > >> Also, as Andi stated, this is not the way to deal with this problem. >> >> So we have a broken patch, which even if implemented properly isn't the >> way forward, so I consider this discussion dead in the water until we >> have some test cases. It's proven a little harder than anticipated to create a trivial test case, but I should be able to post some traces from a freely-available app soon. > The patch is easy to fix with a s/_jiffies/_ms_jiffies/g Thanks, will try. > Also it was more intended for him to play around and get some data > points. I guess for that it's still useful. Indeed. Setting tcp_delack_min to 0 completely eliminated the undesired latencies, though of course that would be a bit dangerous with naive apps talking across the network. Changing tcp_ato_min didn't do anything interesting for this case. > Also while for that it's probably not the right solution, but > I could imagine in some other situations where it might be useful > to tune these values. After all they are not written down in stone. The problem is that we're trying to use one set of values for links with extremely different performance characteristics. We need to initialize TCP sockets with min/default/max values that are safe and perform well. How horrendous of a layering violation would it be to attach TCP performance parameters (either user-supplied or based on interface stats) to route table entries, like route metrics but intended to guide TCP autotuning? It seems like it shouldn't be that hard to teach TCP that it doesn't need to optimize my lo connections much, and that it should be optimizing my eth0 subnet connections for lower latency and higher bandwidth than the connections that go through my gateway into the great beyond. > I wonder if it would even make sense to consider hr timers for TCP > now. > > =Andi As long as we have hardcoded minimum delays > 10ms, I don't think there's much of a point, but it's something to keep in mind for the future. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:22 ` Chris Snook @ 2008-09-22 22:26 ` David Miller 2008-09-22 23:00 ` Chris Snook 2008-09-22 22:47 ` Rick Jones 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 22:26 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev From: Chris Snook <csnook@redhat.com> Date: Mon, 22 Sep 2008 18:22:13 -0400 > How horrendous of a layering violation would it be to attach TCP > performance parameters (either user-supplied or based on interface > stats) to route table entries, like route metrics but intended to > guide TCP autotuning? It seems like it shouldn't be that hard to > teach TCP that it doesn't need to optimize my lo connections much, > and that it should be optimizing my eth0 subnet connections for > lower latency and higher bandwidth than the connections that go > through my gateway into the great beyond. We already do this for other TCP connection parameters, and I tend to think these delack/ato values belong there too. If we add a global knob, people are just going to turn it on even if they are also connected to the real internet on the system rather than only internal networks they completely control. That tends to cause problems. It gets an entry on all of these bogus "Linux performance tuning" sections administrators read from the various financial messaging products. So everyone does it without thinking and using their brains. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:26 ` David Miller @ 2008-09-22 23:00 ` Chris Snook 2008-09-22 23:13 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Chris Snook @ 2008-09-22 23:00 UTC (permalink / raw) To: David Miller; +Cc: andi, rick.jones2, netdev David Miller wrote: > From: Chris Snook <csnook@redhat.com> > Date: Mon, 22 Sep 2008 18:22:13 -0400 > >> How horrendous of a layering violation would it be to attach TCP >> performance parameters (either user-supplied or based on interface >> stats) to route table entries, like route metrics but intended to >> guide TCP autotuning? It seems like it shouldn't be that hard to >> teach TCP that it doesn't need to optimize my lo connections much, >> and that it should be optimizing my eth0 subnet connections for >> lower latency and higher bandwidth than the connections that go >> through my gateway into the great beyond. > > We already do this for other TCP connection parameters, and I tend to > think these delack/ato values belong there too. > > If we add a global knob, people are just going to turn it on even if > they are also connected to the real internet on the system rather than > only internal networks they completely control. > > That tends to cause problems. It gets an entry on all of these bogus > "Linux performance tuning" sections administrators read from the > various financial messaging products. So everyone does it without > thinking and using their brains. I agree 100%. Could you please point me to an example of a connection parameter that gets tuned and cached this way, so I can experiment with it? -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 23:00 ` Chris Snook @ 2008-09-22 23:13 ` David Miller 2008-09-22 23:24 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 23:13 UTC (permalink / raw) To: csnook; +Cc: andi, rick.jones2, netdev From: Chris Snook <csnook@redhat.com> Date: Mon, 22 Sep 2008 19:00:08 -0400 > Could you please point me to an example of a connection parameter > that gets tuned and cached this way, so I can experiment with it? You'll find tons of them in tcp_update_metrics(). ^ permalink raw reply [flat|nested] 44+ messages in thread
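The values David points at are the per-destination metrics cached on the dst_entry: tcp_update_metrics() saves them when a connection closes and tcp_init_metrics() reads them back when a new connection to the same destination starts. Below is a rough, from-memory sketch of that read-back pattern, not the literal kernel source (see tcp_init_metrics() for the real, more careful code); a per-route delack or ato floor would presumably be wired up the same way.

#include <net/tcp.h>

/* Rough sketch of how a new connection is seeded from the metrics cached
 * on its route; the real logic lives in tcp_init_metrics(). */
static void sketch_init_from_route_metrics(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct dst_entry *dst = __sk_dst_get(sk);

	if (dst == NULL)
		return;

	if (dst_metric(dst, RTAX_SSTHRESH))	/* cached slow start threshold */
		tp->snd_ssthresh = dst_metric(dst, RTAX_SSTHRESH);
	if (dst_metric(dst, RTAX_RTT))		/* cached smoothed RTT */
		tp->srtt = dst_metric(dst, RTAX_RTT);
}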
* Re: RFC: Nagle latency tuning 2008-09-22 23:13 ` David Miller @ 2008-09-22 23:24 ` Andi Kleen 2008-09-22 23:21 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-22 23:24 UTC (permalink / raw) To: David Miller; +Cc: csnook, andi, rick.jones2, netdev On Mon, Sep 22, 2008 at 04:13:23PM -0700, David Miller wrote: > From: Chris Snook <csnook@redhat.com> > Date: Mon, 22 Sep 2008 19:00:08 -0400 > > > Could you please point me to an example of a connection parameter > > that gets tuned and cached this way, so I can experiment with it? > > You'll find tons of them in tcp_update_metrics(). IMHO that is actually obsolete because it does not take NAT into account. Hosts behind one IP address do not necessarily share link characteristics. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 23:24 ` Andi Kleen @ 2008-09-22 23:21 ` David Miller 2008-09-23 0:14 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-22 23:21 UTC (permalink / raw) To: andi; +Cc: csnook, rick.jones2, netdev From: Andi Kleen <andi@firstfloor.org> Date: Tue, 23 Sep 2008 01:24:28 +0200 > On Mon, Sep 22, 2008 at 04:13:23PM -0700, David Miller wrote: > > From: Chris Snook <csnook@redhat.com> > > Date: Mon, 22 Sep 2008 19:00:08 -0400 > > > > > Could you please point me to an example of a connection parameter > > > that gets tuned and cached this way, so I can experiment with it? > > > > You'll find tons of them in tcp_update_metrics(). > > IMHO that is actually obsolete because it does not take NAT > into account. Hosts behind one IP address do not necessarily share link characteristics. It is not an invalid estimate even in the NAT case, and it's not as illegal as TCP timewait recycling would be. Andi, don't rain on the party for something that might be terribly useful for many people. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 23:21 ` David Miller @ 2008-09-23 0:14 ` Andi Kleen 2008-09-23 0:33 ` Rick Jones 2008-09-23 1:40 ` David Miller 0 siblings, 2 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-23 0:14 UTC (permalink / raw) To: David Miller; +Cc: andi, csnook, rick.jones2, netdev > It is not an invalid estimate even in the NAT case, Typical case: you've got a large company network behind a NAT. First user has a very crappy wireless connection behind a slow intercontinental link talking to the outgoing NAT router. He connects to your internet server first and the window, slow start etc. parameters for him are saved in the dst_entry. The next guy behind the same NAT is in the same building as the router that connects the company to the internet. He has a much faster line. He connects to the same server. They will share the same dst and inetpeer entries. The parameters saved earlier for the same IP are clearly invalid for the second case. The link characteristics are completely different. Also, did you know there are whole countries behind NAT? E.g. I was told that all of Saudi Arabia comes from only a small handful of IP addresses. It would surprise me if all of KSA had the same link characteristics. :) Ah, I see there's a sysctl now to disable this. How about setting it by default? -Andi ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 0:14 ` Andi Kleen @ 2008-09-23 0:33 ` Rick Jones 2008-09-23 2:12 ` Andi Kleen 2008-09-23 1:40 ` David Miller 1 sibling, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-23 0:33 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, csnook, netdev Andi Kleen wrote: >>It is not an invalid estimate even in the NAT case, > > > Typical case: you've got a large company network behind a NAT. > First user has a very crappy wireless connection behind a slow > intercontinental link talking to the outgoing NAT router. He connects to > your internet server first and the window, slow start etc. parameters > for him are saved in the dst_entry. > > The next guy behind the same NAT is in the same building > as the router that connects the company to the internet. He > has a much faster line. He connects to the same server. > They will share the same dst and inetpeer entries. > > The parameters saved earlier for the same IP are clearly invalid > for the second case. The link characteristics are completely > different. > > Also, did you know there are whole countries behind > NAT? E.g. I was told that all of Saudi Arabia comes from > only a small handful of IP addresses. It would surprise me if > all of KSA had the same link characteristics. :) That seems as much of a case against NAT as per-destination attribute caching. If my experience at "a large company" is any indication, for 99 connections out of 100 I'm going through a proxy rather than NAT, so all the remote server sees are the characteristics of the connection between it and the proxy. And even if I were not, how is per-destination caching the possibly non-optimal characteristics based on one user behind a NAT really functionally different from having to tune the system-wide defaults to cover that corner-case user? Seems that caching per-destination characteristics is actually limiting the alleged brokenness to that destination rather than all destinations? rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 0:33 ` Rick Jones @ 2008-09-23 2:12 ` Andi Kleen 0 siblings, 0 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-23 2:12 UTC (permalink / raw) To: Rick Jones; +Cc: Andi Kleen, David Miller, csnook, netdev > That seems as much of a case against NAT as per-destination attribute > caching. Sure, in an ideal world NAT wouldn't exist. Unfortunately, we're not in an ideal world. Also, in general my impression is that NAT is becoming more common; e.g. a lot of the mobile networks seem to be NATed. > > If my experience at "a large company" is any indication, for 99 My experience at a large company was different. Also see my second example. > > And even if I were not, how is per-destination caching the possibly > non-optimal characteristics based on one user behind a NAT really > functionally different from having to tune the system-wide defaults to > cover that corner-case user? It's just wasteful of network resources. E.g. if you start talking to the slow link with too large a congestion window, a lot of packets are going to be dropped. Yes, TCP will eventually adapt, but user performance suffers and the network is used inefficiently. > Seems that caching per-destination > characteristics is actually limiting the alleged brokenness to that > destination rather than all destinations? Not sure what you're talking about. There's no real brokenness in having a slow link. And with default startup metrics Linux TCP has no trouble talking to a slow link. The brokenness is using the dst_entry TCP metrics of a fast link to talk to a slow link, and that happens with NAT. -Andi ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 0:14 ` Andi Kleen 2008-09-23 0:33 ` Rick Jones @ 2008-09-23 1:40 ` David Miller 2008-09-23 2:23 ` Andi Kleen 1 sibling, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-23 1:40 UTC (permalink / raw) To: andi; +Cc: csnook, rick.jones2, netdev From: Andi Kleen <andi@firstfloor.org> Date: Tue, 23 Sep 2008 02:14:09 +0200 > Typical case: you've got a large company network behind a NAT. > First user has a very crappy wireless connection behind a slow > intercontinental link talking to the outgoing NAT router. He connects to > your internet server first and the window, slow start etc. parameters > for him are saved in the dst_entry. > > The next guy behind the same NAT is in the same building > as the router that connects the company to the internet. He > has a much faster line. He connects to the same server. > They will share the same dst and inetpeer entries. > > The parameters saved earlier for the same IP are clearly invalid > for the second case. The link characteristics are completely > different. Just as typical are setups where the NAT clients are 1 or 2 fast hops behind the firewall. There are many cases where perfectly acceptable heuristics don't perform optimally; this doesn't mean we disable them by default. > Also, did you know there are whole countries behind > NAT? I am more than aware of this; however, this doesn't mean it is sane, and this kind of setup makes many useful internet services inaccessible. > Ah, I see there's a sysctl now to disable this. How about > setting it by default? It's for debugging. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 1:40 ` David Miller @ 2008-09-23 2:23 ` Andi Kleen 2008-09-23 2:28 ` David Miller 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-23 2:23 UTC (permalink / raw) To: David Miller; +Cc: andi, csnook, rick.jones2, netdev > There are many cases where perfectly acceptable heuristics For very low values of "perfect" :) > don't perform optimally; this doesn't mean we disable them > by default. Well, they're just broken on a larger and larger fraction of the internet. Router technology more and more often knows something about ports these days and handles flows differently. Assuming it does not is more and more wrong. It's one of those things that looks cool at first glance but, when you dig deeper, is just a bad idea. What should we call a heuristic that is often wrong? A "wrongistic"? :) > > NAT? > > I am more than aware of this; however, this doesn't mean it is > sane, I agree with you that they're not sane, but they should still be supported. Technically at least they don't violate any standards, AFAIK. Linux should work well even with such setups. In fact it has no other choice because they're so common. "Be liberal in what you accept, be conservative in what you send." This violates the second principle. > and this kind of setup makes many useful internet services > inaccessible. Sure, in fact that's usually why they were done in the first place, but Linux shouldn't make it unnecessarily worse. Anyway, enough said. I guess we have to agree to disagree on this. For completeness, I'll still send the patch to set the sysctl by default though just in case you reconsider. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 2:23 ` Andi Kleen @ 2008-09-23 2:28 ` David Miller 2008-09-23 2:41 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: David Miller @ 2008-09-23 2:28 UTC (permalink / raw) To: andi; +Cc: csnook, rick.jones2, netdev From: Andi Kleen <andi@firstfloor.org> Date: Tue, 23 Sep 2008 04:23:29 +0200 > I'll still send the patch to set the sysctl > by default though just in case you reconsider. Please don't poop in my inbox like that, I already said I'm not making the change. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-23 2:28 ` David Miller @ 2008-09-23 2:41 ` Andi Kleen 0 siblings, 0 replies; 44+ messages in thread From: Andi Kleen @ 2008-09-23 2:41 UTC (permalink / raw) To: David Miller; +Cc: andi, csnook, rick.jones2, netdev On Mon, Sep 22, 2008 at 07:28:59PM -0700, David Miller wrote: > From: Andi Kleen <andi@firstfloor.org> > Date: Tue, 23 Sep 2008 04:23:29 +0200 > > > I'll still send the patch to set the sysctl > > by default though just in case you reconsider. > > Please don't poop in my inbox like that, I already said I'm > not making the change. Thank you for the choice words to describe patches. But netdev is not for your use alone. It will then be more for the benefit of other list readers who might have a less prejudiced view of the usefulness of certain wrongistics. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:22 ` Chris Snook 2008-09-22 22:26 ` David Miller @ 2008-09-22 22:47 ` Rick Jones 2008-09-22 22:57 ` Chris Snook 1 sibling, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-22 22:47 UTC (permalink / raw) To: Chris Snook; +Cc: Andi Kleen, David Miller, netdev > Indeed. Setting tcp_delack_min to 0 completely eliminated the undesired > latencies, though of course that would be a bit dangerous with naive > apps talking across the network. What did it do to the packets per second or per unit of work? Depending on the nature of the race between the ACK returning from the remote and the application pushing more bytes into the socket, I'd think that setting the delayed ack timer to zero could result in more traffic on the network (those bare ACKs) than simply setting TCP_NODELAY at the source. And since with small packets and/or copy avoidance an ACK is (handwaving) just as many CPU cycles at either end as a data segment that also means a bump in CPU utilization. rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-22 22:47 ` Rick Jones @ 2008-09-22 22:57 ` Chris Snook 0 siblings, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-22 22:57 UTC (permalink / raw) To: Rick Jones; +Cc: Andi Kleen, David Miller, netdev Rick Jones wrote: >> Indeed. Setting tcp_delack_min to 0 completely eliminated the >> undesired latencies, though of course that would be a bit dangerous >> with naive apps talking across the network. > > What did it do to the packets per second or per unit of work? Depending > on the nature of the race between the ACK returning from the remote and > the application pushing more bytes into the socket, I'd think that > setting the delayed ack timer to zero could result in more traffic on > the network (those bare ACKs) than simply setting TCP_NODELAY at the > source. > > And since with small packets and/or copy avoidance an ACK is > (handwaving) just as many CPU cycles at either end as a data segment > that also means a bump in CPU utilization. > > rick jones I never saw performance go down, but I was always using low latency/high bandwidth loopback or LAN connection, with only one socket per CPU. I agree though, that turning this off is suboptimal. I'm going to pursue David's idea of making delack_min and ato_min dynamically calculated by the kernel. -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
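Purely to illustrate the direction Chris mentions, nothing here is from the thread or from the kernel: a dynamically calculated floor could be derived from the connection's measured RTT and clamped, so loopback and LAN sockets ACK almost immediately while long-haul sockets keep today's behaviour.

/* Illustrative sketch only: derive a delayed-ACK floor from the smoothed
 * RTT instead of using one fixed 40 ms value.  The fraction and the bounds
 * are arbitrary assumptions chosen for the example. */
static unsigned int delack_min_from_rtt(unsigned int srtt_ms)
{
	unsigned int floor_ms = srtt_ms / 4;	/* some fraction of the RTT */

	if (floor_ms < 1)
		floor_ms = 1;			/* loopback/LAN: ACK quickly */
	if (floor_ms > 40)
		floor_ms = 40;			/* never exceed today's 40 ms floor */
	return floor_ms;
}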
* Re: RFC: Nagle latency tuning 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller 2008-09-09 14:36 ` Andi Kleen @ 2008-09-09 16:33 ` Rick Jones 2008-09-09 16:54 ` Chuck Lever 2 siblings, 1 reply; 44+ messages in thread From: Rick Jones @ 2008-09-09 16:33 UTC (permalink / raw) To: Chris Snook; +Cc: Netdev > > Most of the apps where people care about this enough to complain to > their vendor (the cases I hear about) are in messaging apps, where > they're relaying a stream of events that have little to do with each > other, and they want TCP to maintain the integrity of the connection and > do a modicum of bandwidth management, but 40 ms stalls greatly exceed > their latency tolerances. What _are_ their latency tolerances? How often are they willing to tolerate a modicum of TCP bandwidth management? Do they go ape when TCP sits waiting not just for 40ms, but for an entire RTO timer? > Using TCP_NODELAY is often the least bad option, but sometimes it's > infeasible because of its effect on the network, and it certainly > adds to the network stack overhead. A more tunable Nagle delay would > probably serve many of these apps much better. If the applications are sending streams of logically unrelated sends down the same socket, then setting TCP_NODELAY is IMO fine. Where it isn't fine is where these applications are generating their logically associated data in two or more small sends. One send per message good. Two sends per message bad. BTW, is this magical mystery Solaris ndd setting "tcp_naglim_def?" FWIW I believe there is a similar setting by the same name in HP-UX. rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
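To make the "one send per message good, two sends per message bad" point concrete, here is an illustrative userspace sketch; the function and message layout are invented, not taken from any application in the thread. Two sub-MSS send() calls can leave the second one queued by Nagle until the peer's delayed ACK arrives, while a single writev() hands the whole logical message to TCP at once, so Nagle has nothing to hold back.

#include <sys/types.h>
#include <sys/uio.h>
#include <sys/socket.h>

/* One send per message: the header and body reach TCP together. */
static ssize_t send_message(int fd, const void *hdr, size_t hdr_len,
			    const void *body, size_t body_len)
{
	struct iovec iov[2] = {
		{ .iov_base = (void *)hdr,  .iov_len = hdr_len  },
		{ .iov_base = (void *)body, .iov_len = body_len },
	};

	return writev(fd, iov, 2);

	/* Two sends per message (the pattern that interacts badly with
	 * Nagle plus delayed ACK) would instead be:
	 *
	 *	send(fd, hdr,  hdr_len,  0);
	 *	send(fd, body, body_len, 0);
	 */
}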
* Re: RFC: Nagle latency tuning 2008-09-09 16:33 ` Rick Jones @ 2008-09-09 16:54 ` Chuck Lever 2008-09-09 17:21 ` Arnaldo Carvalho de Melo 2008-09-09 17:54 ` Rick Jones 0 siblings, 2 replies; 44+ messages in thread From: Chuck Lever @ 2008-09-09 16:54 UTC (permalink / raw) To: Rick Jones; +Cc: Chris Snook, Netdev On Sep 9, 2008, at Sep 9, 2008, 12:33 PM, Rick Jones wrote: >> Most of the apps where people care about this enough to complain to >> their vendor (the cases I hear about) are in messaging apps, where >> they're relaying a stream of events that have little to do with >> each other, and they want TCP to maintain the integrity of the >> connection and do a modicum of bandwidth management, but 40 ms >> stalls greatly exceed their latency tolerances. > > What _are_ their latency tolerances? How often are they willing to > tolerate a modicum of TCP bandwidth management? Do they go ape when > TCP sits waiting not just for 40ms, but for an entire RTO timer? > >> Using TCP_NODELAY is often the least bad option, but sometimes it's >> infeasible because of its effect on the network, and it certainly >> adds to the network stack overhead. A more tunable Nagle delay would >> probably serve many of these apps much better. > > If the applications are sending streams of logically unrelated sends > down the same socket, then setting TCP_NODELAY is IMO fine. Where > it isn't fine is where these applications are generating their > logically associated data in two or more small sends. One send per > message good. Two sends per message bad. Can the same be said of the Linux kernel's RPC client, which uses MSG_MORE and multiple sends to construct a single RPC request on a TCP socket? See net/sunrpc/xprtsock.c:xs_send_pagedata() for details. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-09 16:54 ` Chuck Lever @ 2008-09-09 17:21 ` Arnaldo Carvalho de Melo 2008-09-09 17:54 ` Rick Jones 1 sibling, 0 replies; 44+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-09-09 17:21 UTC (permalink / raw) To: Chuck Lever; +Cc: Rick Jones, Chris Snook, Netdev Em Tue, Sep 09, 2008 at 12:54:31PM -0400, Chuck Lever escreveu: > On Sep 9, 2008, at Sep 9, 2008, 12:33 PM, Rick Jones wrote: >>> Most of the apps where people care about this enough to complain to >>> their vendor (the cases I hear about) are in messaging apps, where >>> they're relaying a stream of events that have little to do with each >>> other, and they want TCP to maintain the integrity of the connection >>> and do a modicum of bandwidth management, but 40 ms stalls greatly >>> exceed their latency tolerances. >> >> What _are_ their latency tolerances? How often are they willing to >> tolerate a modicum of TCP bandwidth management? Do they go ape when >> TCP sits waiting not just for 40ms, but for an entire RTO timer? >> >>> Using TCP_NODELAY is often the least bad option, but sometimes it's >>> infeasible because of its effect on the network, and it certainly >>> adds to the network stack overhead. A more tunable Nagle delay would >>> probably serve many of these apps much better. >> >> If the applications are sending streams of logically unrelated sends >> down the same socket, then setting TCP_NODELAY is IMO fine. Where it >> isn't fine is where these applications are generating their logically >> associated data in two or more small sends. One send per message good. >> Two sends per message bad. > > Can the same be said of the Linux kernel's RPC client, which uses > MSG_MORE and multiple sends to construct a single RPC request on a TCP > socket? > > See net/sunrpc/xprtsock.c:xs_send_pagedata() for details. That is not a problem, it should be equivalent to corking the socket. I.e. the uncorking operation will be the last part of the buffer, where 'more' will be 0. - Arnaldo ^ permalink raw reply [flat|nested] 44+ messages in thread
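A userspace sketch of the pattern Arnaldo describes; the function and its arguments are invented for illustration, not taken from the RPC client. MSG_MORE on every piece except the last tells the kernel that more data is coming, so the pieces are coalesced, and the final plain send() effectively uncorks the socket and pushes the packet out, much like bracketing the writes with the TCP_CORK socket option.

#include <sys/types.h>
#include <sys/socket.h>

/* Build one logical record out of two pieces without sending a runt first
 * segment: only the last piece omits MSG_MORE. */
static int send_record(int fd, const void *marker, size_t marker_len,
		       const void *payload, size_t payload_len)
{
	if (send(fd, marker, marker_len, MSG_MORE) < 0)
		return -1;
	/* Final piece: no MSG_MORE, which lets the kernel transmit. */
	if (send(fd, payload, payload_len, 0) < 0)
		return -1;
	return 0;
}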
* Re: RFC: Nagle latency tuning 2008-09-09 16:54 ` Chuck Lever 2008-09-09 17:21 ` Arnaldo Carvalho de Melo @ 2008-09-09 17:54 ` Rick Jones 1 sibling, 0 replies; 44+ messages in thread From: Rick Jones @ 2008-09-09 17:54 UTC (permalink / raw) To: Chuck Lever; +Cc: Chris Snook, Netdev >> If the applications are sending streams of logically unrelated sends >> down the same socket, then setting TCP_NODELAY is IMO fine. Where it >> isn't fine is where these applications are generating their logically >> associated data in two or more small sends. One send per message >> good. Two sends per message bad. > > > Can the same be said of the Linux kernel's RPC client, which uses > MSG_MORE and multiple sends to construct a single RPC request on a TCP > socket? That drifts away from Nagle and into my (perhaps old fuddy-duddy) belief in minimizing the number of system/other calls to do a unit of work, and degrees of "badness:)" rick jones ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-08 21:56 RFC: Nagle latency tuning Christopher Snook 2008-09-08 22:39 ` Rick Jones @ 2008-09-08 22:55 ` Andi Kleen 2008-09-09 5:22 ` Chris Snook 1 sibling, 1 reply; 44+ messages in thread From: Andi Kleen @ 2008-09-08 22:55 UTC (permalink / raw) To: Christopher Snook; +Cc: Netdev Christopher Snook <csnook@redhat.com> writes: > > I'm afraid I don't know the TCP stack intimately enough to understand > what side effects this might have. Can someone more familiar with the > nagle implementations please enlighten me on how this could be done, > or why it shouldn't be? The nagle delay you're seeing is really the delayed ack delay which is variable on Linux (unlike a lot of other stacks). Unfortunately due to the way delayed ack works on other stacks (especially traditional BSD with its fixed 200ms delay) there are nasty interactions with that. Making it too short could lead to a lot more packets even in non nagle situations. Ok in theory you could split the two, but that would likely have other issues and also make nagle be a lot less useful. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RFC: Nagle latency tuning 2008-09-08 22:55 ` Andi Kleen @ 2008-09-09 5:22 ` Chris Snook 0 siblings, 0 replies; 44+ messages in thread From: Chris Snook @ 2008-09-09 5:22 UTC (permalink / raw) To: Andi Kleen; +Cc: Netdev Andi Kleen wrote: > Christopher Snook <csnook@redhat.com> writes: >> I'm afraid I don't know the TCP stack intimately enough to understand >> what side effects this might have. Can someone more familiar with the >> nagle implementations please enlighten me on how this could be done, >> or why it shouldn't be? > > The nagle delay you're seeing is really the delayed ack delay which > is variable on Linux (unlike a lot of other stacks). Unfortunately > due to the way delayed ack works on other stacks (especially traditional > BSD with its fixed 200ms delay) there are nasty interactions with that. > Making it too short could lead to a lot more packets even in non nagle > situations. How variable is it? I've never seen any value other than 40 ms, from 2.4.21 to the latest rt kernel. I've tweaked every TCP tunable in /proc/sys/net/ipv4, to no effect. The people who would care enough to tweak this would be more than happy to accept an increase in the number of packets. They're usually asking us to disable the behavior completely, so if we can let them tune the middle-ground, they can test in their environments to decide what values their network peers will tolerate. I have no interest in foisting this on the unsuspecting public. > Ok in theory you could split the two, but that would likely have > other issues and also make nagle be a lot less useful. Perhaps a messaging-optimized non-default congestion control algorithm would be a suitable way of addressing this? -- Chris ^ permalink raw reply [flat|nested] 44+ messages in thread
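For reference, the 40 ms Chris keeps seeing is the hardcoded floor in the TCP header file. Paraphrased from memory from the 2.6-era include/net/tcp.h (worth checking against the exact tree), the relevant constants look roughly like this:

/* Paraphrased from include/net/tcp.h: with HZ >= 100 the delayed-ACK floor
 * works out to HZ/25 jiffies, i.e. the 40 ms observed above, and the cap
 * to HZ/5, i.e. 200 ms. */
#define TCP_DELACK_MAX	((unsigned)(HZ / 5))	/* ~200 ms */
#define TCP_DELACK_MIN	((unsigned)(HZ / 25))	/* ~40 ms */
#define TCP_ATO_MIN	((unsigned)(HZ / 25))	/* ~40 ms */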
end of thread, other threads:[~2008-09-23 2:37 UTC | newest] Thread overview: 44+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-09-08 21:56 RFC: Nagle latency tuning Christopher Snook 2008-09-08 22:39 ` Rick Jones 2008-09-09 5:10 ` Chris Snook 2008-09-09 5:17 ` David Miller 2008-09-09 5:56 ` Chris Snook 2008-09-09 6:02 ` David Miller 2008-09-09 10:31 ` Mark Brown 2008-09-09 12:05 ` David Miller 2008-09-09 12:09 ` Mark Brown 2008-09-09 12:19 ` David Miller 2008-09-09 6:22 ` Evgeniy Polyakov 2008-09-09 6:28 ` Chris Snook 2008-09-09 13:00 ` Arnaldo Carvalho de Melo 2008-09-09 14:36 ` Andi Kleen 2008-09-09 18:40 ` Chris Snook 2008-09-09 19:07 ` Andi Kleen 2008-09-09 19:21 ` Arnaldo Carvalho de Melo 2008-09-11 4:08 ` Chris Snook 2008-09-09 19:59 ` David Miller 2008-09-09 20:25 ` Chris Snook 2008-09-22 10:49 ` David Miller 2008-09-22 11:09 ` David Miller 2008-09-22 20:30 ` Andi Kleen 2008-09-22 22:22 ` Chris Snook 2008-09-22 22:26 ` David Miller 2008-09-22 23:00 ` Chris Snook 2008-09-22 23:13 ` David Miller 2008-09-22 23:24 ` Andi Kleen 2008-09-22 23:21 ` David Miller 2008-09-23 0:14 ` Andi Kleen 2008-09-23 0:33 ` Rick Jones 2008-09-23 2:12 ` Andi Kleen 2008-09-23 1:40 ` David Miller 2008-09-23 2:23 ` Andi Kleen 2008-09-23 2:28 ` David Miller 2008-09-23 2:41 ` Andi Kleen 2008-09-22 22:47 ` Rick Jones 2008-09-22 22:57 ` Chris Snook 2008-09-09 16:33 ` Rick Jones 2008-09-09 16:54 ` Chuck Lever 2008-09-09 17:21 ` Arnaldo Carvalho de Melo 2008-09-09 17:54 ` Rick Jones 2008-09-08 22:55 ` Andi Kleen 2008-09-09 5:22 ` Chris Snook