* TCP delayed ACK heuristic [not found] <270756364.27707018.1355842632348.JavaMail.root@redhat.com> @ 2012-12-18 15:11 ` Cong Wang 2012-12-18 16:30 ` Eric Dumazet 2012-12-18 16:39 ` David Laight 0 siblings, 2 replies; 13+ messages in thread From: Cong Wang @ 2012-12-18 15:11 UTC (permalink / raw) To: netdev Cc: Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger, Rick Jones, Thomas Graf Hello, TCP experts! Some time ago, Ben sent a patch [1] to add some knobs for tuning TCP delayed ACK, but it was rejected by David. David's point is that we can do some heuristics for TCP delayed ACK, so the question is: what kind of heuristics can we use? RFC1122 explicitly mentions: A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment. so this prevents us from using any heuristic for the number of coalesced delayed ACKs. For the timeout of a delayed ACK, my idea is to guess how many packets we would expect to receive if the TCP stream were fully utilized, something like below: +static inline u32 tcp_expect_packets(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + int rtt = tp->srtt >> 3; + u32 idle = tcp_time_stamp - inet_csk(sk)->icsk_ack.lrcvtime; + + return idle * 2 / rtt; +} ... + ato -= tcp_expect_packets(sk) * delta; The more packets we expect, the less we should delay. However, this is not accurate due to congestion control. Meanwhile, we can also check how many packets are pending in the TCP send queue: the more we have pending, the more we can piggyback onto a single ACK, but not beyond how much we are able to send at that time. Comments? Ideas? Thanks. 1. http://thread.gmane.org/gmane.linux.network/233859 ^ permalink raw reply [flat|nested] 13+ messages in thread
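As a sanity check on the proposal, the core arithmetic of the suggested helper can be sketched in user space (the function name and the zero-srtt guard are illustrative additions, not part of the posted patch):

```c
/* User-space sketch of the arithmetic in the proposed tcp_expect_packets():
 * given the idle time since the last received segment and the smoothed RTT
 * (both in the same clock units), estimate how many segments a fully
 * utilized stream would have delivered in that window. */
static unsigned int expected_packets(unsigned int idle, unsigned int srtt)
{
	/* The posted inline divides by srtt unconditionally; guard the
	 * srtt == 0 case that can occur before the first RTT sample. */
	if (srtt == 0)
		return 0;
	return idle * 2 / srtt;
}
```

With a smoothed RTT of 20 units and 40 units of idle time this yields 4 expected segments, so the ato reduction grows with idle time, matching the "the more we expect, the less we should delay" intent above.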
* Re: TCP delayed ACK heuristic 2012-12-18 15:11 ` TCP delayed ACK heuristic Cong Wang @ 2012-12-18 16:30 ` Eric Dumazet 2012-12-19 6:54 ` Cong Wang 2012-12-18 16:39 ` David Laight 1 sibling, 1 reply; 13+ messages in thread From: Eric Dumazet @ 2012-12-18 16:30 UTC (permalink / raw) To: Cong Wang Cc: netdev, Ben Greear, David Miller, Stephen Hemminger, Rick Jones, Thomas Graf On Tue, 2012-12-18 at 10:11 -0500, Cong Wang wrote: > Hello, TCP experts! > > Some time ago, Ben sent a patch [1] to add some knobs for > tuning TCP delayed ACK, but it was rejected by David. > > David's point is that we can do some heuristics for TCP > delayed ACK, so the question is: what kind of heuristics > can we use? > > RFC1122 explicitly mentions: > > A TCP SHOULD implement a delayed ACK, but an ACK should not > be excessively delayed; in particular, the delay MUST be > less than 0.5 seconds, and in a stream of full-sized > segments there SHOULD be an ACK for at least every second > segment. > > so this prevents us from using any heuristic for the number > of coalesced delayed ACKs. > > For the timeout of a delayed ACK, my idea is to guess how many > packets we would expect to receive if the TCP stream were fully utilized, > something like below: > > +static inline u32 tcp_expect_packets(struct sock *sk) > +{ > + struct tcp_sock *tp = tcp_sk(sk); > + int rtt = tp->srtt >> 3; > + u32 idle = tcp_time_stamp - inet_csk(sk)->icsk_ack.lrcvtime; > + > + return idle * 2 / rtt; > +} > ... > + ato -= tcp_expect_packets(sk) * delta; > > > The more packets we expect, the less we should delay. However, this is > not accurate due to congestion control. > > Meanwhile, we can also check how many packets are pending in the TCP > send queue: the more we have pending, the more we can piggyback onto > a single ACK, but not beyond how much we are able to send at > that time. > > Comments? Ideas? > ACKs might also be delayed because of bidirectional traffic, and that is controlled more by the application response time. 
The TCP stack cannot easily estimate it. If you focus on bulk receive, LRO/GRO should already lower the number of ACKs to an acceptable level without major disruption. Stretch ACKs are not only a receiver concern; there are issues for the sender that you cannot always control/change. I recommend reading RFC2525, section 2.13 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: TCP delayed ACK heuristic 2012-12-18 16:30 ` Eric Dumazet @ 2012-12-19 6:54 ` Cong Wang 0 siblings, 0 replies; 13+ messages in thread From: Cong Wang @ 2012-12-19 6:54 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, Ben Greear, David Miller, Stephen Hemminger, Rick Jones, Thomas Graf On Tue, 2012-12-18 at 08:30 -0800, Eric Dumazet wrote: > > > ACKs might also be delayed because of bidirectional traffic, and that is > controlled more by the application response time. The TCP stack cannot easily > estimate it. So we still need a knob? > > If you focus on bulk receive, LRO/GRO should already lower the number of > ACKs to an acceptable level without major disruption. Indeed. > > Stretch ACKs are not only a receiver concern; there are issues for the > sender that you cannot always control/change. > > I recommend reading RFC2525, section 2.13 > Very helpful information! On the sender's side, it needs to "notify" the receiver not to send stretch ACKs when it is in slow-start. But I think the receiver can detect slow-start too on its own side (based on the window size?). Thanks! ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: TCP delayed ACK heuristic 2012-12-18 15:11 ` TCP delayed ACK heuristic Cong Wang 2012-12-18 16:30 ` Eric Dumazet @ 2012-12-18 16:39 ` David Laight 2012-12-18 17:54 ` Rick Jones 2012-12-19 7:00 ` Cong Wang 1 sibling, 2 replies; 13+ messages in thread From: David Laight @ 2012-12-18 16:39 UTC (permalink / raw) To: Cong Wang, netdev Cc: Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger, Rick Jones, Thomas Graf > David's point is that we can do some heuristics for TCP > delayed ACK, so the question is: what kind of heuristics > can we use? > > RFC1122 explicitly mentions: > > A TCP SHOULD implement a delayed ACK, but an ACK should not > be excessively delayed; in particular, the delay MUST be > less than 0.5 seconds, and in a stream of full-sized > segments there SHOULD be an ACK for at least every second > segment. > > so this prevents us from using any heuristic for the number > of coalesced delayed ACKs. There are problems with only implementing the acks specified by RFC1122. I've seen problems when the sending side is doing (I think) 'slow start' with Nagle disabled. The sender would only send 4 segments before waiting for an ACK - even when it had more than a full-sized segment waiting. Sender was Linux 2.6.something (probably low 20s). I changed the application flow to send data in the reverse direction to avoid the problem. That was on a ~0 delay local connection - which means that there is almost never outstanding data, and the 'slow start' happened almost all the time. Nagle is completely the wrong algorithm for the data flow. David ^ permalink raw reply [flat|nested] 13+ messages in thread
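For context, "Nagle disabled" here means the application has set TCP_NODELAY on the socket. A minimal sketch (the helper name is illustrative):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the Nagle algorithm on a TCP socket so small writes are sent
 * immediately instead of being coalesced while an ACK is outstanding.
 * Returns 0 on success, -1 with errno set on failure. */
static int disable_nagle(int fd)
{
	int one = 1;
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

Note that even with Nagle off, the congestion window still limits how many segments can be in flight, which is the four-segment stall described above.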
* Re: TCP delayed ACK heuristic 2012-12-18 16:39 ` David Laight @ 2012-12-18 17:54 ` Rick Jones 2012-12-19 9:52 ` David Laight 2012-12-19 7:00 ` Cong Wang 1 sibling, 1 reply; 13+ messages in thread From: Rick Jones @ 2012-12-18 17:54 UTC (permalink / raw) To: David Laight Cc: Cong Wang, netdev, Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger, Thomas Graf On 12/18/2012 08:39 AM, David Laight wrote: > There are problems with only implementing the acks > specified by RFC1122. > > I've seen problems when the sending side is doing (I think) > 'slow start' with Nagle disabled. > The sender would only send 4 segments before waiting for an > ACK - even when it had more than a full sized segment waiting. > Sender was Linux 2.6.something (probably low 20s). > I changed the application flow to send data in the reverse > direction to avoid the problem. > That was on a ~0 delay local connection - which means that > there is almost never outstanding data, and the 'slow start' > happened almost all the time. > Nagle is completely the wrong algorithm for the data flow. If Nagle was already disabled, why the last sentence? And from your description, even if Nagle were enabled, I would think that it was remote ACK+cwnd behaviour getting in your way, not Nagle, given that Nagle is to be decided on a user-send by user-send basis and release queued data (to the mercies of other heuristics) when it gets to be an MSS-worth. The joys of intertwined heuristics I suppose. Personally, I would love for there to be a way to have a cwnd's byte-limit's-worth of small segments outstanding at one time - it would make my netperf-life much easier as I could get rid of the netperf-level congestion window intended to keep successive requests (with Nagle already disabled) from getting coalesced by cwnd in a "burst-mode" test. * And perhaps make things nicer for the test when there is the occasional retransmission. 
I used to think that netperf was just "unique" in that regard, but it sounds like you have an actual application looking to do that?? rick jones * because I am trying to (ab)use the burst-mode TCP_RR test for a maximum packets per second through the stack+NIC measurement that isn't also a context switching benchmark. But I cannot really come up with a real-world rationale to support further cwnd behaviour changes. Allowing a byte-limit-cwnd's worth of single-byte-payload TCP segments could easily be seen as being rather anti-social :) And forcing/maintaining the original segment boundaries in retransmissions for small packets isn't such a hot idea either. ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: TCP delayed ACK heuristic 2012-12-18 17:54 ` Rick Jones @ 2012-12-19 9:52 ` David Laight 0 siblings, 0 replies; 13+ messages in thread From: David Laight @ 2012-12-19 9:52 UTC (permalink / raw) To: Rick Jones Cc: Cong Wang, netdev, Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger, Thomas Graf > > I've seen problems when the sending side is doing (I think) > > 'slow start' with Nagle disabled. > > The sender would only send 4 segments before waiting for an > > ACK - even when it had more than a full sized segment waiting. > > Sender was Linux 2.6.something (probably low 20s). > > I changed the application flow to send data in the reverse > > direction to avoid the problem. > > That was on a ~0 delay local connection - which means that > > there is almost never outstanding data, and the 'slow start' > > happened almost all the time. > > Nagle is completely the wrong algorithm for the data flow. > > If Nagle was already disabled, why the last sentence? And from your > description, even if Nagle were enabled, I would think that it was > remote ACK+cwnd behaviour getting in your way, not Nagle, given that > Nagle is to be decided on a user-send by user-send basis and release > queued data (to the mercies of other heuristics) when it gets to be an > MSS-worth. With Nagle enabled the first segment is sent, the following ones get buffered until full segments can be sent. Although (probably) only 4 segments will be sent (1 small and 3 full) the 3rd of these does generate an ack. > ... but it sounds like you have an actual > application looking to do that?? We are relaying data packets received over multiple SS7 signalling links (64k hdlc) over a TCP connection. The connection will be local, in some cases the host ethernet MAC, switch, and target cpu are all on the same PCI(e) card (MII crossover links). 
While a delay of a millisecond or two wouldn't matter (1ms is 8 byte times), the Nagle delay is far too long - and since the data isn't command/response, the Nagle delay would happen a lot. One of the conformance tests managed to make the system 'busy'. Since all it does is make one 64k channel busy it shouldn't have been able to generate a backlog of receive data - but it managed to get over 100 data packets unacked (app level ack). > Allowing a byte-limit-cwnd's worth of single-byte-payload TCP segments > could easily be seen as being rather anti-social :) If the actual RTT is almost zero (as in our case) and the network really shouldn't be dropping packets, then it doesn't matter. I suspect that if the tx rate is faster than the RTT then the 'slow start' turns off and you can get a lot of small segments in flight. But when the RTT is zero 'slow start' almost always applies and you only send 4. > And forcing/maintaining the original segment boundaries in > retransmissions for small packets isn't such a hot idea either. True, not splitting them might be useful, but you need to avoid merges. David ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: TCP delayed ACK heuristic 2012-12-18 16:39 ` David Laight 2012-12-18 17:54 ` Rick Jones @ 2012-12-19 7:00 ` Cong Wang 2012-12-19 18:39 ` Rick Jones 1 sibling, 1 reply; 13+ messages in thread From: Cong Wang @ 2012-12-19 7:00 UTC (permalink / raw) To: David Laight Cc: netdev, Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger, Rick Jones, Thomas Graf On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote: > There are problems with only implementing the acks > specified by RFC1122. Yeah, the question is whether we can violate this RFC to get better performance, or whether it is just a no-no. Although RFC 2525 mentions this as "Stretch ACK Violation", I am still not sure if that means we can violate RFC1122 legally. Thanks. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: TCP delayed ACK heuristic 2012-12-19 7:00 ` Cong Wang @ 2012-12-19 18:39 ` Rick Jones 2012-12-19 20:59 ` David Miller 2012-12-19 23:08 ` Eric Dumazet 0 siblings, 2 replies; 13+ messages in thread From: Rick Jones @ 2012-12-19 18:39 UTC (permalink / raw) To: Cong Wang Cc: David Laight, netdev, Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger, Thomas Graf On 12/18/2012 11:00 PM, Cong Wang wrote: > On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote: >> There are problems with only implementing the acks >> specified by RFC1122. > > Yeah, the problem is if we can violate this RFC for getting better > performance. Or it is just a no-no? > > Although RFC 2525 mentions this as "Stretch ACK Violation", I am still > not sure if that means we can violate RFC1122 legally. The term used in RFC1122 is "SHOULD" not "MUST." Same for RFC2525 when it talks about "Stretch ACK Violation." A TCP stack may have behaviour which differs from a SHOULD so long as there is a reasonable reason for it. rick jones ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: TCP delayed ACK heuristic 2012-12-19 18:39 ` Rick Jones @ 2012-12-19 20:59 ` David Miller 2012-12-20 3:23 ` Cong Wang 2012-12-19 23:08 ` Eric Dumazet 1 sibling, 1 reply; 13+ messages in thread From: David Miller @ 2012-12-19 20:59 UTC (permalink / raw) To: rick.jones2 Cc: amwang, David.Laight, netdev, greearb, eric.dumazet, shemminger, tgraf From: Rick Jones <rick.jones2@hp.com> Date: Wed, 19 Dec 2012 10:39:37 -0800 > On 12/18/2012 11:00 PM, Cong Wang wrote: >> On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote: >>> There are problems with only implementing the acks >>> specified by RFC1122. >> >> Yeah, the question is whether we can violate this RFC to get better >> performance, or whether it is just a no-no. >> >> Although RFC 2525 mentions this as "Stretch ACK Violation", I am still >> not sure if that means we can violate RFC1122 legally. > > The term used in RFC1122 is "SHOULD" not "MUST." Same for RFC2525 > when it talks about "Stretch ACK Violation." A TCP stack may have > behaviour which differs from a SHOULD so long as there is a reasonable > reason for it. Yes, but RFC2525 makes it very clear why we should not even consider doing crap like this. ACKs are the only information we have to detect loss. And, for the same reasons that TCP VEGAS is fundamentally broken, we cannot measure the pipe or some other receiver-side-visible piece of information to determine when it's "safe" to stretch ACK. And even if it's "safe", we should not do it so that losses are accurately detected and we don't spuriously retransmit. The only way to know when the bandwidth increases is to "test" it, by sending more and more packets until drops happen. That's why all successful congestion control algorithms must operate on explicitly tested pieces of information. Similarly, it's not really possible to universally know if it's safe to stretch ACK or not. Can we please drop this idea? It has zero value and all downside as far as I'm concerned. Thanks. 
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: TCP delayed ACK heuristic 2012-12-19 20:59 ` David Miller @ 2012-12-20 3:23 ` Cong Wang 2012-12-20 9:57 ` David Laight 0 siblings, 1 reply; 13+ messages in thread From: Cong Wang @ 2012-12-20 3:23 UTC (permalink / raw) To: David Miller Cc: rick.jones2, David.Laight, netdev, greearb, eric.dumazet, shemminger, tgraf On Wed, 2012-12-19 at 12:59 -0800, David Miller wrote: > > Yes, but RFC2525 makes it very clear why we should not even > consider doing crap like this. > > ACKs are the only information we have to detect loss. > > And, for the same reasons that TCP VEGAS is fundamentally broken, we > cannot measure the pipe or some other receiver-side-visible piece of > information to determine when it's "safe" to stretch ACK. > > And even if it's "safe", we should not do it so that losses are > accurately detected and we don't spuriously retransmit. > > The only way to know when the bandwidth increases is to "test" it, by > sending more and more packets until drops happen. That's why all > successful congestion control algorithms must operate on explicitly > tested pieces of information. > > Similarly, it's not really possible to universally know if it's safe > to stretch ACK or not. Sounds reasonable. Thanks for your explanation. > > Can we please drop this idea? It has zero value and all downside as > far as I'm concerned. > Yeah, I am just trying to see if there is any way to get a reasonable heuristic. So, can we at least have a sysctl to control the timeout of the delayed ACK? I mean the minimum 40ms. TCP_QUICKACK can help too, but it requires the receiver to modify the application and has to be set every time recv() is called. Thanks! ^ permalink raw reply [flat|nested] 13+ messages in thread
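For reference, the TCP_QUICKACK workaround mentioned here looks roughly like this in application code (Linux-specific; the helper name is illustrative):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the stack to ACK immediately instead of delaying.  On Linux this
 * is a transient hint, not a sticky flag, so a receiver that always
 * wants immediate ACKs must set it again after every recv(). */
static int quickack_once(int fd)
{
	int one = 1;
	return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}
```

A typical receive loop would call quickack_once(fd) right after each recv(), which is exactly the per-call overhead and application change being objected to above.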
* RE: TCP delayed ACK heuristic 2012-12-20 3:23 ` Cong Wang @ 2012-12-20 9:57 ` David Laight 2012-12-20 12:41 ` Cong Wang 0 siblings, 1 reply; 13+ messages in thread From: David Laight @ 2012-12-20 9:57 UTC (permalink / raw) To: Cong Wang, David Miller Cc: rick.jones2, netdev, greearb, eric.dumazet, shemminger, tgraf > So, can we at least have a sysctl to control the timeout of the delayed > ACK? I mean the minimum 40ms. TCP_QUICKACK can help too, but it requires > the receiver to modify the application and has to be set every time > recv() is called. A sysctl is inappropriate - it affects the entire TCP protocol stack. You want different behaviour for different remote hosts (probably different subnets). In particular your local subnet is unlikely to have packet loss and very likely to have a very low RTT. AFAICT a lot of the recent 'tuning' has been done for web/ftp servers that are very remote from the client. These connections are also request-response ones - quite often with large responses. IMHO this has been to the detriment of local connections. David ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: TCP delayed ACK heuristic 2012-12-20 9:57 ` David Laight @ 2012-12-20 12:41 ` Cong Wang 0 siblings, 0 replies; 13+ messages in thread From: Cong Wang @ 2012-12-20 12:41 UTC (permalink / raw) To: David Laight Cc: David Miller, rick.jones2, netdev, greearb, eric.dumazet, shemminger, tgraf On Thu, 2012-12-20 at 09:57 +0000, David Laight wrote: > > So, can we at least have a sysctl to control the timeout of the delayed > > ACK? I mean the minimum 40ms. TCP_QUICKACK can help too, but it requires > > the receiver to modify the application and has to be set every time > > recv() is called. > > A sysctl is inappropriate - it affects the entire TCP protocol stack. > > You want different behaviour for different remote hosts (probably > different subnets). > In particular your local subnet is unlikely to have packet loss > and very likely to have a very low RTT. > > AFAICT a lot of the recent 'tuning' has been done for web/ftp > servers that are very remote from the client. These connections > are also request-response ones - quite often with large responses. > > IMHO this has been to the detriment of local connections. > A customer prefers faster response in their low-loss environment; 40ms is not good. Of course, they are supposed to know their environment when they tune this. Or maybe a sysctl equivalent to TCP_QUICKACK? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: TCP delayed ACK heuristic 2012-12-19 18:39 ` Rick Jones 2012-12-19 20:59 ` David Miller @ 2012-12-19 23:08 ` Eric Dumazet 1 sibling, 0 replies; 13+ messages in thread From: Eric Dumazet @ 2012-12-19 23:08 UTC (permalink / raw) To: Rick Jones Cc: Cong Wang, David Laight, netdev, Ben Greear, David Miller, Stephen Hemminger, Thomas Graf On Wed, 2012-12-19 at 10:39 -0800, Rick Jones wrote: > On 12/18/2012 11:00 PM, Cong Wang wrote: > > On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote: > >> There are problems with only implementing the acks > >> specified by RFC1122. > > > > Yeah, the problem is if we can violate this RFC for getting better > > performance. Or it is just a no-no? > > > > Although RFC 2525 mentions this as "Stretch ACK Violation", I am still > > not sure if that means we can violate RFC1122 legally. > > The term used in RFC1122 is "SHOULD" not "MUST." Same for RFC2525 when > it talks about "Stretch ACK Violation." A TCP stack may have behaviour > which differs from a SHOULD so long as there is a reasonable reason for it. Generally speaking, there are no reasonable reasons, unless you control both sender and receiver, and the path between them. ACKs can be incredibly useful to recover from losses in a short time. The vast majority of TCP sessions are short-lived, and we send one ACK per received segment anyway at the beginning [1] or on retransmits to let the sender smoothly increase its cwnd, so an auto-tuning facility won't help them that much. For long and fast sessions, we have the LRO/GRO heuristic. This leaves a fraction of flows where the ACK rate should not really matter. [1] This refers to the quickack mode ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2012-12-20 12:41 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <270756364.27707018.1355842632348.JavaMail.root@redhat.com> 2012-12-18 15:11 ` TCP delayed ACK heuristic Cong Wang 2012-12-18 16:30 ` Eric Dumazet 2012-12-19 6:54 ` Cong Wang 2012-12-18 16:39 ` David Laight 2012-12-18 17:54 ` Rick Jones 2012-12-19 9:52 ` David Laight 2012-12-19 7:00 ` Cong Wang 2012-12-19 18:39 ` Rick Jones 2012-12-19 20:59 ` David Miller 2012-12-20 3:23 ` Cong Wang 2012-12-20 9:57 ` David Laight 2012-12-20 12:41 ` Cong Wang 2012-12-19 23:08 ` Eric Dumazet