* TCP_DEFER_ACCEPT is missing counter update
@ 2009-10-13 5:07 Willy Tarreau
2009-10-13 7:11 ` David Miller
2009-10-13 7:23 ` Eric Dumazet
0 siblings, 2 replies; 41+ messages in thread
From: Willy Tarreau @ 2009-10-13 5:07 UTC (permalink / raw)
To: netdev
Hello,
I was trying to use TCP_DEFER_ACCEPT and noticed that if the
client does not talk, the connection is never accepted and
remains in SYN_RECV state until the retransmits expire, where
it finally is deleted. This is bad when some firewall such as
netfilter sits between the client and the server because the
firewall sees the connection in ESTABLISHED state while the
server will finally silently drop it without sending an RST.
This behaviour contradicts the man page which says it should
wait only for some time :
TCP_DEFER_ACCEPT (since Linux 2.4)
Allows a listener to be awakened only when data arrives
on the socket. Takes an integer value (seconds), this
can bound the maximum number of attempts TCP will
make to complete the connection. This option should not
be used in code intended to be portable.
Also, looking at ipv4/tcp.c, a retransmit counter is correctly
computed :
case TCP_DEFER_ACCEPT:
icsk->icsk_accept_queue.rskq_defer_accept = 0;
if (val > 0) {
/* Translate value in seconds to number of
* retransmits */
while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
val > ((TCP_TIMEOUT_INIT / HZ) <<
icsk->icsk_accept_queue.rskq_defer_accept))
icsk->icsk_accept_queue.rskq_defer_accept++;
icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;
==> rskq_defer_accept is used as a counter of retransmits.
But in tcp_minisocks.c, this counter is only checked. And in
fact, I have found no location which updates it. So I think
that what was intended was to decrease it in tcp_minisocks
whenever it is checked, which the trivial patch below does :
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f8d67cc..1b443b0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -645,6 +645,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
+ inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--;
inet_rsk(req)->acked = 1;
return NULL;
}
Is there anything I am missing ?
Regards,
Willy
^ permalink raw reply related [flat|nested] 41+ messages in thread* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 5:07 TCP_DEFER_ACCEPT is missing counter update Willy Tarreau @ 2009-10-13 7:11 ` David Miller 2009-10-13 7:19 ` Willy Tarreau 2009-10-13 7:23 ` Eric Dumazet 1 sibling, 1 reply; 41+ messages in thread From: David Miller @ 2009-10-13 7:11 UTC (permalink / raw) To: w; +Cc: netdev From: Willy Tarreau <w@1wt.eu> Date: Tue, 13 Oct 2009 07:07:06 +0200 > But in tcp_minisocks.c, this counter is only checked. And in > fact, I have found no location which updates it. So I think > that what was intended was to decrease it in tcp_minisocks > whenever it is checked, which the trivial patch below does : > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index f8d67cc..1b443b0 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -645,6 +645,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, > if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && > TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { > + inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--; > inet_rsk(req)->acked = 1; > return NULL; > } > > Is there anything I am missing ? Your logic looks sound and I can't come to any other conclusion after going over this code, even going back to 2.4.x Feel free to make a formal patch submission, thanks Willy. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:11 ` David Miller @ 2009-10-13 7:19 ` Willy Tarreau 2009-10-13 7:27 ` David Miller 2009-10-13 21:27 ` Julian Anastasov 0 siblings, 2 replies; 41+ messages in thread From: Willy Tarreau @ 2009-10-13 7:19 UTC (permalink / raw) To: David Miller; +Cc: netdev On Tue, Oct 13, 2009 at 12:11:06AM -0700, David Miller wrote: > Your logic looks sound and I can't come to any other conclusion > after going over this code, even going back to 2.4.x > > Feel free to make a formal patch submission, thanks Willy. Ok, here it comes (I used my explanation as the commit message). Thanks David, Willy >From da80c99a503bab1256706ed8d967e2ab3f71afe0 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w@1wt.eu> Date: Tue, 13 Oct 2009 07:26:54 +0200 Subject: tcp: fix tcp_defer_accept to consider the timeout I was trying to use TCP_DEFER_ACCEPT and noticed that if the client does not talk, the connection is never accepted and remains in SYN_RECV state until the retransmits expire, where it finally is deleted. This is bad when some firewall such as netfilter sits between the client and the server because the firewall sees the connection in ESTABLISHED state while the server will finally silently drop it without sending an RST. This behaviour contradicts the man page which says it should wait only for some time : TCP_DEFER_ACCEPT (since Linux 2.4) Allows a listener to be awakened only when data arrives on the socket. Takes an integer value (seconds), this can bound the maximum number of attempts TCP will make to complete the connection. This option should not be used in code intended to be portable. Also, looking at ipv4/tcp.c, a retransmit counter is correctly computed : case TCP_DEFER_ACCEPT: icsk->icsk_accept_queue.rskq_defer_accept = 0; if (val > 0) { /* Translate value in seconds to number of * retransmits */ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && val > ((TCP_TIMEOUT_INIT / HZ) << icsk->icsk_accept_queue.rskq_defer_accept)) icsk->icsk_accept_queue.rskq_defer_accept++; icsk->icsk_accept_queue.rskq_defer_accept++; } break; ==> rskq_defer_accept is used as a counter of retransmits. But in tcp_minisocks.c, this counter is only checked. And in fact, I have found no location which updates it. So I think that what was intended was to decrease it in tcp_minisocks whenever it is checked, which the trivial patch below does. Signed-off-by: Willy Tarreau <w@1wt.eu> --- net/ipv4/tcp_minisocks.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index f8d67cc..2f676f3 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -644,6 +644,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */ if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { + inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--; inet_rsk(req)->acked = 1; return NULL; } -- 1.5.3.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:19 ` Willy Tarreau @ 2009-10-13 7:27 ` David Miller 2009-10-13 21:27 ` Julian Anastasov 1 sibling, 0 replies; 41+ messages in thread From: David Miller @ 2009-10-13 7:27 UTC (permalink / raw) To: w; +Cc: netdev From: Willy Tarreau <w@1wt.eu> Date: Tue, 13 Oct 2009 09:19:55 +0200 > On Tue, Oct 13, 2009 at 12:11:06AM -0700, David Miller wrote: >> Your logic looks sound and I can't come to any other conclusion >> after going over this code, even going back to 2.4.x >> >> Feel free to make a formal patch submission, thanks Willy. > > Ok, here it comes (I used my explanation as the commit message). Applied and queued up for -stable, thanks! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:19 ` Willy Tarreau 2009-10-13 7:27 ` David Miller @ 2009-10-13 21:27 ` Julian Anastasov 2009-10-14 4:52 ` Willy Tarreau 1 sibling, 1 reply; 41+ messages in thread From: Julian Anastasov @ 2009-10-13 21:27 UTC (permalink / raw) To: Willy Tarreau; +Cc: David Miller, netdev Hello, On Tue, 13 Oct 2009, Willy Tarreau wrote: > >From da80c99a503bab1256706ed8d967e2ab3f71afe0 Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Tue, 13 Oct 2009 07:26:54 +0200 > Subject: tcp: fix tcp_defer_accept to consider the timeout > > I was trying to use TCP_DEFER_ACCEPT and noticed that if the > client does not talk, the connection is never accepted and > remains in SYN_RECV state until the retransmits expire, where > it finally is deleted. This is bad when some firewall such as I think, this is by design, there is big comment in tcp_check_req(). > netfilter sits between the client and the server because the > firewall sees the connection in ESTABLISHED state while the > server will finally silently drop it without sending an RST. Client can stay ESTABLISHED for long time but RST will be sent when client sends DATA or FIN. > This behaviour contradicts the man page which says it should > wait only for some time : > > TCP_DEFER_ACCEPT (since Linux 2.4) > Allows a listener to be awakened only when data arrives > on the socket. Takes an integer value (seconds), this > can bound the maximum number of attempts TCP will > make to complete the connection. This option should not > be used in code intended to be portable. This works properly in 2.6.31.3, I set TCP_SYNCNT=1 and TCP_DEFER_ACCEPT then only 2 SYN-ACKs are sent. > Also, looking at ipv4/tcp.c, a retransmit counter is correctly > computed : rskq_defer_accept is threshold, not counter > case TCP_DEFER_ACCEPT: > icsk->icsk_accept_queue.rskq_defer_accept = 0; > if (val > 0) { > /* Translate value in seconds to number of > * retransmits */ > while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && > val > ((TCP_TIMEOUT_INIT / HZ) << > icsk->icsk_accept_queue.rskq_defer_accept)) > icsk->icsk_accept_queue.rskq_defer_accept++; > icsk->icsk_accept_queue.rskq_defer_accept++; > } > break; > > ==> rskq_defer_accept is used as a counter of retransmits. as limit for retransmits, not as counter > But in tcp_minisocks.c, this counter is only checked. And in > fact, I have found no location which updates it. So I think > that what was intended was to decrease it in tcp_minisocks > whenever it is checked, which the trivial patch below does. You can check net/ipv4/inet_connection_sock.c, inet_csk_reqsk_queue_prune() where TCP_DEFER_ACCEPT can extend the retransmission threshold for acked sockets above the applied 'thresh'. So, there are 2 options: a) TCP_DEFER_ACCEPT is used as flag (eg. 1) or the period is below the TCP_SYNCNT period. In this case TCP_DEFER_ACCEPT does not extend the period for DATA (DATA must come before TCP_SYNCNT). Application is notified only when DATA comes. or b) TCP_DEFER_ACCEPT is set with seconds above the TCP_SYNCNT retrans limit and the first ACK extends the period up to TCP_DEFER_ACCEPT seconds (converted as retrans). By this way we provide more time for DATA after the empty ACKs. ACK again can come before TCP_SYNCNT but DATA after ACK can come even after TCP_SYNCNT but before TCP_DEFER_ACCEPT timeout. Again, application is notified only when DATA comes. > Signed-off-by: Willy Tarreau <w@1wt.eu> > --- > net/ipv4/tcp_minisocks.c | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index f8d67cc..2f676f3 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -644,6 +644,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, > /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */ > if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && > TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { This is wrong considering inet_csk_reqsk_queue_prune() and patch should not be applied except if I'm missing something: > + inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--; > inet_rsk(req)->acked = 1; > return NULL; > } Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 21:27 ` Julian Anastasov @ 2009-10-14 4:52 ` Willy Tarreau 2009-10-14 7:27 ` Julian Anastasov 0 siblings, 1 reply; 41+ messages in thread From: Willy Tarreau @ 2009-10-14 4:52 UTC (permalink / raw) To: Julian Anastasov; +Cc: David Miller, netdev, Eric Dumazet Hello Julian, On Wed, Oct 14, 2009 at 12:27:41AM +0300, Julian Anastasov wrote: > > Hello, > > On Tue, 13 Oct 2009, Willy Tarreau wrote: > > > >From da80c99a503bab1256706ed8d967e2ab3f71afe0 Mon Sep 17 00:00:00 2001 > > From: Willy Tarreau <w@1wt.eu> > > Date: Tue, 13 Oct 2009 07:26:54 +0200 > > Subject: tcp: fix tcp_defer_accept to consider the timeout > > > > I was trying to use TCP_DEFER_ACCEPT and noticed that if the > > client does not talk, the connection is never accepted and > > remains in SYN_RECV state until the retransmits expire, where > > it finally is deleted. This is bad when some firewall such as > > I think, this is by design, there is big comment in > tcp_check_req(). I'm not sure. That would considerably reduce the usefulness of the feature. The comment I see there is just a one line explaining why we drop the ACK. It does not indicate any strategy on what to do when the counter expires. > > netfilter sits between the client and the server because the > > firewall sees the connection in ESTABLISHED state while the > > server will finally silently drop it without sending an RST. > > Client can stay ESTABLISHED for long time but > RST will be sent when client sends DATA or FIN. Yes you're right. In fact, this only weakens firewalls in case of pure scans, but attacks on SYN cookies do that too, as well as TTL-based attacks. > > This behaviour contradicts the man page which says it should > > wait only for some time : > > > > TCP_DEFER_ACCEPT (since Linux 2.4) > > Allows a listener to be awakened only when data arrives > > on the socket. Takes an integer value (seconds), this > > can bound the maximum number of attempts TCP will > > make to complete the connection. This option should not > > be used in code intended to be portable. > > This works properly in 2.6.31.3, I set TCP_SYNCNT=1 > and TCP_DEFER_ACCEPT then only 2 SYN-ACKs are sent. That's what I observe too, but the connection is silently dropped afterwards and I'm clearly not sure this was the intended behaviour. > > Also, looking at ipv4/tcp.c, a retransmit counter is correctly > > computed : > > rskq_defer_accept is threshold, not counter > > > case TCP_DEFER_ACCEPT: > > icsk->icsk_accept_queue.rskq_defer_accept = 0; > > if (val > 0) { > > /* Translate value in seconds to number of > > * retransmits */ > > while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && > > val > ((TCP_TIMEOUT_INIT / HZ) << > > icsk->icsk_accept_queue.rskq_defer_accept)) > > icsk->icsk_accept_queue.rskq_defer_accept++; > > icsk->icsk_accept_queue.rskq_defer_accept++; > > } > > break; > > > > ==> rskq_defer_accept is used as a counter of retransmits. > > as limit for retransmits, not as counter yes if you want, that's what I mean. > > But in tcp_minisocks.c, this counter is only checked. And in > > fact, I have found no location which updates it. So I think > > that what was intended was to decrease it in tcp_minisocks > > whenever it is checked, which the trivial patch below does. > > You can check net/ipv4/inet_connection_sock.c, > inet_csk_reqsk_queue_prune() where TCP_DEFER_ACCEPT can extend > the retransmission threshold for acked sockets above the > applied 'thresh'. So clearly this is in order to improve chances that the application will receive the connection, no ? > So, there are 2 options: > > a) TCP_DEFER_ACCEPT is used as flag (eg. 1) or the period is below > the TCP_SYNCNT period. In this case TCP_DEFER_ACCEPT does not > extend the period for DATA (DATA must come before TCP_SYNCNT). > Application is notified only when DATA comes. > > or > > b) TCP_DEFER_ACCEPT is set with seconds above the TCP_SYNCNT > retrans limit and the first ACK extends the period up to > TCP_DEFER_ACCEPT seconds (converted as retrans). By this > way we provide more time for DATA after the empty ACKs. > ACK again can come before TCP_SYNCNT but DATA after ACK > can come even after TCP_SYNCNT but before TCP_DEFER_ACCEPT > timeout. Again, application is notified only when DATA comes. Yes this is what happens right now, but reading the man again does not imply to me that the connection will not be accepted once we reach the retransmit limit. Maybe we have different usages and different interpretations of the man can satisfy either, but I don't see what this would be useful to in case we silently drop instead of finally accepting. Regards, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-14 4:52 ` Willy Tarreau @ 2009-10-14 7:27 ` Julian Anastasov 2009-10-14 20:17 ` Willy Tarreau 2009-10-16 10:08 ` Ilpo Järvinen 0 siblings, 2 replies; 41+ messages in thread From: Julian Anastasov @ 2009-10-14 7:27 UTC (permalink / raw) To: Willy Tarreau; +Cc: David Miller, netdev, Eric Dumazet Hello, On Wed, 14 Oct 2009, Willy Tarreau wrote: > > > I was trying to use TCP_DEFER_ACCEPT and noticed that if the > > > client does not talk, the connection is never accepted and > > > remains in SYN_RECV state until the retransmits expire, where > > > it finally is deleted. This is bad when some firewall such as > > > > I think, this is by design, there is big comment in > > tcp_check_req(). > > I'm not sure. That would considerably reduce the usefulness of > the feature. The comment I see there is just a one line explaining > why we drop the ACK. It does not indicate any strategy on what to > do when the counter expires. There were attempts to switch the server socket to established state, no SYN-ACK retransmissions after client ACK and send FIN (or RST) to client on TCP_DEFER_ACCEPT expiration but due to some locking problems it was reverted: http://marc.info/?t=121318919000001&r=1&w=2 Current handling is as before: drop client ACKs before DATA, stay in SYN_RECV state, send RST on outdated ACKs. > > > netfilter sits between the client and the server because the > > > firewall sees the connection in ESTABLISHED state while the > > > server will finally silently drop it without sending an RST. > > > > Client can stay ESTABLISHED for long time but > > RST will be sent when client sends DATA or FIN. > > Yes you're right. In fact, this only weakens firewalls in case of > pure scans, but attacks on SYN cookies do that too, as well as > TTL-based attacks. Hm, yes, may be firewalls should be better here. They should know which server ports send welcome and which do no not. > > This works properly in 2.6.31.3, I set TCP_SYNCNT=1 > > and TCP_DEFER_ACCEPT then only 2 SYN-ACKs are sent. > > That's what I observe too, but the connection is silently dropped > afterwards and I'm clearly not sure this was the intended behaviour. Yes, bad only for firewalls. Note that if TCP_SYNCNT/TCP_DEFER_ACCEPT is shorter the client still gets RST on next ACK. So, the problem is when client is silent after first ACK. > > > But in tcp_minisocks.c, this counter is only checked. And in > > > fact, I have found no location which updates it. So I think > > > that what was intended was to decrease it in tcp_minisocks > > > whenever it is checked, which the trivial patch below does. > > > > You can check net/ipv4/inet_connection_sock.c, > > inet_csk_reqsk_queue_prune() where TCP_DEFER_ACCEPT can extend > > the retransmission threshold for acked sockets above the > > applied 'thresh'. > > So clearly this is in order to improve chances that the application > will receive the connection, no ? Yes, it is not logical to configure TCP_DEFER_ACCEPT < TCP_SYNCNT with the idea TCP_DEFER_ACCEPT to reduce the SYN_RECV period. TCP_DEFER_ACCEPT does not reduce the SYN_RECV period (before ACK), may be this is lacking in man page. The idea to optionally give more time for DATA to come after ACK is more logical. If TCP_DEFER_ACCEPT switches state to ESTABLISHED may be then the TCP_DEFER_ACCEPT period should start after ACK, eg. "wait for DATA 10 seconds after ACK" sounds logical, it will need new name... Note that TCP_DEFER_ACCEPT is used also as flag for clients but you can do the same with TCP_QUICKACK=0 (to delay first ACK and to send DATA with first ACK). > Yes this is what happens right now, but reading the man again > does not imply to me that the connection will not be accepted > once we reach the retransmit limit. > > Maybe we have different usages and different interpretations of > the man can satisfy either, but I don't see what this would be > useful to in case we silently drop instead of finally accepting. May be we want the same but currently TCP_DEFER_ACCEPT works as ACK dropper which is easier for implementation. The semantic 'TCP_DEFER_ACCEPT extends the period after ACK' is good, you can tune it together with TCP_SYNCNT, to extend or not to extend the period. What happens on TCP_DEFER_ACCEPT expiration after ACK - we all prefer to see FIN, so we have to wait someone to come with new implementation. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-14 7:27 ` Julian Anastasov @ 2009-10-14 20:17 ` Willy Tarreau 2009-10-14 21:12 ` Olaf van der Spek 2009-10-14 22:43 ` David Miller 2009-10-16 10:08 ` Ilpo Järvinen 1 sibling, 2 replies; 41+ messages in thread From: Willy Tarreau @ 2009-10-14 20:17 UTC (permalink / raw) To: Julian Anastasov; +Cc: David Miller, netdev, Eric Dumazet Hello Julian, On Wed, Oct 14, 2009 at 10:27:50AM +0300, Julian Anastasov wrote: > There were attempts to switch the server socket to > established state, no SYN-ACK retransmissions after client ACK > and send FIN (or RST) to client on TCP_DEFER_ACCEPT expiration > but due to some locking problems it was reverted: > > http://marc.info/?t=121318919000001&r=1&w=2 interesting lecture, thanks for the link. > Current handling is as before: drop client ACKs before DATA, > stay in SYN_RECV state, send RST on outdated ACKs. OK, that's what I observed too. (...) > > > This works properly in 2.6.31.3, I set TCP_SYNCNT=1 > > > and TCP_DEFER_ACCEPT then only 2 SYN-ACKs are sent. > > > > That's what I observe too, but the connection is silently dropped > > afterwards and I'm clearly not sure this was the intended behaviour. > > Yes, bad only for firewalls. Note that if > TCP_SYNCNT/TCP_DEFER_ACCEPT is shorter the client still gets > RST on next ACK. So, the problem is when client is silent > after first ACK. Not only, we have the problem that the application layer has no way to be informed that someone connected and failed to send anything. I have added the defer_accept feature in haproxy (http load balancer), just as an optimization to avoid a EAGAIN upon first recv() attempt to read the HTTP request after an accept. But users rely on stats and logs a lot to check if their site works. If clients really connect and fail to send a request (most often because of PPPoE outgoing MTU issues), they want to see that in the stats in order to monitor their quality of service. Also, having the ability to return a "408 request timeout" to the users is a lot cleaner and easier to debug than not responding anything at all, especially when there are reverse-proxies between both ends. So this behaviour has visible effects to the end user, that are not expected at all when reading the doc. > > So clearly this is in order to improve chances that the application > > will receive the connection, no ? > > Yes, it is not logical to configure > TCP_DEFER_ACCEPT < TCP_SYNCNT with the idea TCP_DEFER_ACCEPT > to reduce the SYN_RECV period. TCP_DEFER_ACCEPT does not > reduce the SYN_RECV period (before ACK), may be this is lacking > in man page. The idea to optionally give more time for DATA > to come after ACK is more logical. If TCP_DEFER_ACCEPT > switches state to ESTABLISHED may be then the TCP_DEFER_ACCEPT > period should start after ACK, eg. "wait for DATA 10 seconds after > ACK" sounds logical, it will need new name... I don't even need to wait for 10 seconds after first ACK, being able to drop only the first ACK is already excellent IMHO. > Note that > TCP_DEFER_ACCEPT is used also as flag for clients but you can do > the same with TCP_QUICKACK=0 (to delay first ACK and to send > DATA with first ACK). Yes and I already use that, this gives substantial performance boosts on small HTTP requests ; unfortunately I have no control over the end-user's browser ! > > Yes this is what happens right now, but reading the man again > > does not imply to me that the connection will not be accepted > > once we reach the retransmit limit. > > > > Maybe we have different usages and different interpretations of > > the man can satisfy either, but I don't see what this would be > > useful to in case we silently drop instead of finally accepting. > > May be we want the same but currently TCP_DEFER_ACCEPT > works as ACK dropper which is easier for implementation. That's what I like too. No timer, very light requirements, just drop the first empty ACK I'm not interested in, and that's all. That's really what I understood from the man page. > The semantic 'TCP_DEFER_ACCEPT extends the period after ACK' > is good, you can tune it together with TCP_SYNCNT, to > extend or not to extend the period. What happens on > TCP_DEFER_ACCEPT expiration after ACK - we all prefer to > see FIN, so we have to wait someone to come with new > implementation. Well, too much complicated for very little gain IMHO. Regards, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-14 20:17 ` Willy Tarreau @ 2009-10-14 21:12 ` Olaf van der Spek 2009-10-14 22:43 ` David Miller 1 sibling, 0 replies; 41+ messages in thread From: Olaf van der Spek @ 2009-10-14 21:12 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Wed, Oct 14, 2009 at 10:17 PM, Willy Tarreau <w@1wt.eu> wrote: > Yes and I already use that, this gives substantial performance > boosts on small HTTP requests ; unfortunately I have no control > over the end-user's browser ! Most accept feature requests, don't they? Olaf ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-14 20:17 ` Willy Tarreau 2009-10-14 21:12 ` Olaf van der Spek @ 2009-10-14 22:43 ` David Miller 2009-10-15 6:08 ` Willy Tarreau 2009-10-15 7:59 ` Julian Anastasov 1 sibling, 2 replies; 41+ messages in thread From: David Miller @ 2009-10-14 22:43 UTC (permalink / raw) To: w; +Cc: ja, netdev, eric.dumazet From: Willy Tarreau <w@1wt.eu> Date: Wed, 14 Oct 2009 22:17:06 +0200 > Hello Julian, > > On Wed, Oct 14, 2009 at 10:27:50AM +0300, Julian Anastasov wrote: >> The semantic 'TCP_DEFER_ACCEPT extends the period after ACK' >> is good, you can tune it together with TCP_SYNCNT, to >> extend or not to extend the period. What happens on >> TCP_DEFER_ACCEPT expiration after ACK - we all prefer to >> see FIN, so we have to wait someone to come with new >> implementation. > > Well, too much complicated for very little gain IMHO. For now I'm pushing Willy's change into Linus's tree. After more discussion we can revert if necessary. I won't submit this to -stable until the discussion is fully resolved. Thanks! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-14 22:43 ` David Miller @ 2009-10-15 6:08 ` Willy Tarreau 2009-10-15 8:47 ` Julian Anastasov 2009-10-15 7:59 ` Julian Anastasov 1 sibling, 1 reply; 41+ messages in thread From: Willy Tarreau @ 2009-10-15 6:08 UTC (permalink / raw) To: David Miller; +Cc: ja, netdev, eric.dumazet On Wed, Oct 14, 2009 at 03:43:49PM -0700, David Miller wrote: > For now I'm pushing Willy's change into Linus's tree. > > After more discussion we can revert if necessary. > > I won't submit this to -stable until the discussion is fully resolved. Makes sense, thanks David. BTW, I found a use case I didn't think about where current behaviour causes trouble : https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/134274 http://lkml.indiana.edu/hypermail/linux/kernel/0711.0/0461.html In summary, when front proxies establish pools of connections to an apache server making use of TCP_DEFER_ACCEPT, the connection never establishes on the apache server but silently expires in SYN_RECV state. The front proxy sees lots of SYN/ACKs and sends many ACKs trying to complete this connection and finally believes it got it since the server eventually becomes silent. However, when trying to send data over such a socket, the server immediately returns an RST. Such a problem would not happen if we would only drop the first X packets (X >= 1 is already fine), because the front proxy would establish the connection, send a second ACK in response to the second SYN/ACK and the connection would then really be established and would not have to expire early in SYN_RECV state. If we really want to behave as it does today, well, let's not fix it, but obviously, I fail to see what real world use it has, except causing random and hard to debug issues :-/ Reading the articles below clearly make it think it was designed to help with HTTP connections by skipping the first expected and useless ACK packet before waking up the task : http://httpd.apache.org/docs/1.3/misc/perf-bsd44.html http://articles.techrepublic.com.com/5100-10878_11-1050771.html and people still get caught : http://lkml.indiana.edu/hypermail/linux/kernel/0711.0/0416.html Maybe it was a bit over-engineered, in the end causing it to fail to satisfy the primary goal ? Regards, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-15 6:08 ` Willy Tarreau @ 2009-10-15 8:47 ` Julian Anastasov 2009-10-15 12:41 ` Willy Tarreau 0 siblings, 1 reply; 41+ messages in thread From: Julian Anastasov @ 2009-10-15 8:47 UTC (permalink / raw) To: Willy Tarreau; +Cc: David Miller, netdev, eric.dumazet Hello, On Thu, 15 Oct 2009, Willy Tarreau wrote: > BTW, I found a use case I didn't think about where current behaviour > causes trouble : > > https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/134274 > http://lkml.indiana.edu/hypermail/linux/kernel/0711.0/0461.html > > In summary, when front proxies establish pools of connections to > an apache server making use of TCP_DEFER_ACCEPT, the connection > never establishes on the apache server but silently expires in > SYN_RECV state. The front proxy sees lots of SYN/ACKs and sends > many ACKs trying to complete this connection and finally believes > it got it since the server eventually becomes silent. However, > when trying to send data over such a socket, the server immediately > returns an RST. Such proxies using open connections for insane long time should be prepared to retry idempotent methods such as GET and to send POST methods to fresh connections. They can close idle connections, say, after 5 seconds. Even if server runs with TCP_DEFER_ACCEPT=OFF there is possibility server to send FIN while request is flying (servers are configured with some period to wait for first request). > Such a problem would not happen if we would only drop the first > X packets (X >= 1 is already fine), because the front proxy would > establish the connection, send a second ACK in response to the > second SYN/ACK and the connection would then really be established > and would not have to expire early in SYN_RECV state. > > If we really want to behave as it does today, well, let's not fix > it, but obviously, I fail to see what real world use it has, except > causing random and hard to debug issues :-/ The reason is that in SYN_RECV state the server saves resources. Socket and FD are created on DATA and possibly for the short time while response is sent. If server is lucky such resources will live miliseconds. Short responses can be sent together with FIN. OTOH, servers running with TCP_DEFER_ACCEPT=OFF can live with some wakeups (epoll is fast enough) but the problem is that they have sockets for longer time (the difference between first ACK and first DATA). > Reading the articles below clearly make it think it was designed > to help with HTTP connections by skipping the first expected and > useless ACK packet before waking up the task : > > http://httpd.apache.org/docs/1.3/misc/perf-bsd44.html > http://articles.techrepublic.com.com/5100-10878_11-1050771.html > > and people still get caught : > > http://lkml.indiana.edu/hypermail/linux/kernel/0711.0/0416.html They wait for minutes because they do not configure TCP_SYNCNT. TCP_DEFER_ACCEPT works as expected if configured properly. > Maybe it was a bit over-engineered, in the end causing it to fail > to satisfy the primary goal ? If one changes TCP_DEFER_ACCEPT to create socket it will save wakeups but not resources. I'm wondering if the behavior should be changed at all. For me the options are two: a) you want to save resources: use TCP_DEFER_ACCEPT. To help proxies use large values for TCP_SYNCNT and TCP_DEFER_ACCEPT. b) you can live with wakeups and many sockets: do not use TCP_DEFER_ACCEPT. Suitable for servers using short timeouts for first request. > Regards, > Willy Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-15 8:47 ` Julian Anastasov @ 2009-10-15 12:41 ` Willy Tarreau 2009-10-15 22:44 ` Julian Anastasov 0 siblings, 1 reply; 41+ messages in thread From: Willy Tarreau @ 2009-10-15 12:41 UTC (permalink / raw) To: Julian Anastasov; +Cc: David Miller, netdev, eric.dumazet Hello Julian, On Thu, Oct 15, 2009 at 11:47:51AM +0300, Julian Anastasov wrote: (...) > If one changes TCP_DEFER_ACCEPT to create socket it > will save wakeups but not resources. I'm wondering if the > behavior should be changed at all. For me the options are two: > > a) you want to save resources: use TCP_DEFER_ACCEPT. To help > proxies use large values for TCP_SYNCNT and TCP_DEFER_ACCEPT. > > b) you can live with wakeups and many sockets: do not use > TCP_DEFER_ACCEPT. Suitable for servers using short timeouts > for first request. and c) you want to avoid wakeups as much as possible and you'd like to drop just one empty ACK packet, so that as soon as you accept a an HTTP connection, you can read the request without polling at all. Right now I'm able to process a complete HTTP request without registering the any FD in epoll *at all* for most requests if the first two ACKs are close enough and the server responds quickly. This saves a substantial amount of CPU cycles. Epoll is fast, but calling epoll_ctl() 100000 times a second still has a measurable cost. Doing an accept() on an empty connection implies this cost. Waiting for data always saves this cost, but causes the undesirable side effects that have been reported. Waiting for data just a few milliseconds is enough to save this cost 99.99% of the time, just as skipping the first empty packet. Since you're saying that updating the value is wrong when it's used as a flag, would a patch to implement a specific option for this usage be accepted ? Either by passing a negative value to TCP_DEFER_ACCEPT, or by using another flag ? Thanks, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-15 12:41 ` Willy Tarreau @ 2009-10-15 22:44 ` Julian Anastasov 2009-10-16 3:51 ` Eric Dumazet 2009-10-16 5:03 ` Willy Tarreau 0 siblings, 2 replies; 41+ messages in thread From: Julian Anastasov @ 2009-10-15 22:44 UTC (permalink / raw) To: Willy Tarreau; +Cc: David Miller, netdev, eric.dumazet Hello, On Thu, 15 Oct 2009, Willy Tarreau wrote: > Hello Julian, > > On Thu, Oct 15, 2009 at 11:47:51AM +0300, Julian Anastasov wrote: > (...) > > If one changes TCP_DEFER_ACCEPT to create socket it > > will save wakeups but not resources. I'm wondering if the > > behavior should be changed at all. For me the options are two: > > > > a) you want to save resources: use TCP_DEFER_ACCEPT. To help > > proxies use large values for TCP_SYNCNT and TCP_DEFER_ACCEPT. > > > > b) you can live with wakeups and many sockets: do not use > > TCP_DEFER_ACCEPT. Suitable for servers using short timeouts > > for first request. > > and c) you want to avoid wakeups as much as possible and you'd like > to drop just one empty ACK packet, so that as soon as you accept a > an HTTP connection, you can read the request without polling at all. > > Right now I'm able to process a complete HTTP request without > registering the any FD in epoll *at all* for most requests if the first > two ACKs are close enough and the server responds quickly. This saves a > substantial amount of CPU cycles. Epoll is fast, but calling epoll_ctl() > 100000 times a second still has a measurable cost. Doing an accept() on > an empty connection implies this cost. Waiting for data always saves this > cost, but causes the undesirable side effects that have been reported. > Waiting for data just a few milliseconds is enough to save this cost > 99.99% of the time, just as skipping the first empty packet. > > Since you're saying that updating the value is wrong when it's used as > a flag, would a patch to implement a specific option for this usage be > accepted ? Either by passing a negative value to TCP_DEFER_ACCEPT, or > by using another flag ? Sorry for the long mail ... I was not clear enough in previous email. Your goal is to decrease period per client while the actually decreased threshold is on the listener's socket. 256 conns will be enough to completely disable TCP_DEFER_ACCEPT on the listener (u8). I'm not sure that you tested what happens after Nth client (where N matches your TCP_DEFER_ACCEPT value as retransmissions), do you still see accept deferring for next clients? Now if your patch is applied the deferring will be disabled soon after server start. The open requests are just 64 bytes for me: cat /proc/slabinfo | grep request_sock_TCP and there is no rskq_defer_accept flag in struct tcp_request_sock, struct inet_request_sock or struct request_sock. It is present only for normal sockets. These open connection requests have only 'retrans' and 'acked' flag. Please, check again what your patch does and test it with some simple client that sleeps after connect(). As for new flags, may be we should not change TCP_DEFER_ACCEPT values because current applications can depend on it. There is some free space in struct request_sock_queue just after u8 rskq_defer_accept. May be new flags/modes can go there to define another behavior but it means also changes in applications to support it. Because using TCP_DEFER_ACCEPT as flag is not documented one solution can be the following change, may be it matches your idea but implemented correctly: /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */ - if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && + if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { inet_rsk(req)->acked = 1; return NULL; } with the meaning "When TCP_DEFER_ACCEPT retransmission period is shorter than SYN-ACK retrans period (eg. TCP_SYNCNT or sysctl_tcp_synack_retries) move the open request as established on client's packet after the TCP_DEFER_ACCEPT period has expired". Then if you set TCP_SYNCNT=5 and TCP_DEFER_ACCEPT=1retrans first ACK will set inet_rsk(req)->acked=1 because req->retrans is 0 and rskq_defer_accept is 1, later timer will send one SYN-ACK (which marks the end of TCP_DEFER_ACCEPT period), then client will send DATA or it will be forced by our SYN-ACK retransmission to send 2nd ACK packet for which we will create established socket. Such change will affect all servers that use TCP_DEFER_ACCEPT retransmissions less than TCP_SYNCNT. They will start to see wakeups without data after the TCP_DEFER_ACCEPT period. To summarize: SECOND CLIENT SERVER --------------------------------------------------------- 0 SYN SYN-ACK if DATA => ESTABLISH if ACK => acked=1 3 SYN-ACK (set retrans=1) if ACK and TCP_DEFER_ACCEPT=1retrans => ESTABLISH if DATA => ESTABLISH if ACK => acked=1 9 if TCP_SYNCNT=1 => expire else SYN-ACK (set retrans=2) if ACK and TCP_DEFER_ACCEPT=2retrans => ESTABLISH if DATA => ESTABLISH if ACK => acked=1 ... PRO: - if TCP_DEFER_ACCEPT<TCP_SYNCNT and client properly resends ACK on every SYN-ACK retransmission then we always will switch to established state on TCP_DEFER_ACCEPT expiration. Such conns will never expire in SYN_RECV state. They will be terminated by client's FIN or will be accepted by server application and terminated properly. Of course, there is some chance if client delays its ACKs or if SYN-ACK is lost the open request to expire in SYN_RECV state. CON: - if client refuses to send DATA we still need these SYN-ACKs to trigger ACK retransmissions from client because the only way to switch to established state is when packet is received, I don't know how TCP_DEFER_ACCEPT expiration can directly change the open request to established state. May be it is possible to send first SYN-ACK and if one ACK is received to send more SYN-ACKs after TCP_DEFER_ACCEPT period expires. Then client still has chance to send ACK or DATA that will switch open request to established socket. So, our timer will be silent when acked=1 while TCP_DEFER_ACCEPT period is active, for example: SYN SYN-ACK ACK ... acked=1 => no SYN-ACKs retrans (assume they are sent and lost) ... TCP_DEFER_ACCEPT expires => send 2nd SYN-ACK ... If no client's ACK then resend SYN-ACK while retrans<TCP_SYNCNT ... ACK or DATA => ESTABLISHED This will need little change in inet_csk_reqsk_queue_prune() but it saves SYN-ACK traffic during deferring period in the common case when client sends ACK. If such compromise is acceptable I can prepare and test some patch. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-15 22:44 ` Julian Anastasov @ 2009-10-16 3:51 ` Eric Dumazet 2009-10-16 5:00 ` Eric Dumazet 2009-10-16 5:03 ` Willy Tarreau 1 sibling, 1 reply; 41+ messages in thread From: Eric Dumazet @ 2009-10-16 3:51 UTC (permalink / raw) To: Julian Anastasov; +Cc: Willy Tarreau, David Miller, netdev Julian Anastasov a écrit : > Sorry for the long mail ... > > I was not clear enough in previous email. Your goal > is to decrease period per client while the actually decreased > threshold is on the listener's socket. 256 conns will be enough > to completely disable TCP_DEFER_ACCEPT on the listener (u8). I'm not > sure that you tested what happens after Nth client (where > N matches your TCP_DEFER_ACCEPT value as retransmissions), do you > still see accept deferring for next clients? Now if your patch > is applied the deferring will be disabled soon after server start. So, it appears defer_accept value is not an inherited attribute, but shared by all embryons. Therefore we should not touch it. No need to type so much text Julian. Of course it should be done, or add a new connection field to count number of pure ACKS received on each SYN_RECV embryon. Do you volunter for this patch ? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 3:51 ` Eric Dumazet @ 2009-10-16 5:00 ` Eric Dumazet 2009-10-16 5:29 ` Willy Tarreau 0 siblings, 1 reply; 41+ messages in thread From: Eric Dumazet @ 2009-10-16 5:00 UTC (permalink / raw) Cc: Julian Anastasov, Willy Tarreau, David Miller, netdev Eric Dumazet a écrit : > > > So, it appears defer_accept value is not an inherited attribute, > but shared by all embryons. Therefore we should not touch it. > > Of course it should be done, or add a new connection field to count number > of pure ACKS received on each SYN_RECV embryon. > Could be something like this ? (on top of net-next-2.6 of course) 7 bits is more than enough, we could take 5 bits IMHO. Thanks [PATCH net-next-2.6] tcp: TCP_DEFER_ACCEPT fix Julian Anastasov pointed out defer_accept was an attribute of listening socket, shared by all embryons. We therefore need a new per connection attribute. We can split current u8 used by cookie_ts into 7 bits used to store number of pure ACKS received by the embryon, and 1 bit to store cookie_ts. Note, I use an unsigned int field so that kmemcheck doesnt shoot, so patch also touches cookie_v4_init_sequence()/cookie_v6_init_sequence() implementations. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- include/net/request_sock.h | 9 ++++++--- include/net/tcp.h | 6 ++++-- net/ipv4/syncookies.c | 7 ++++--- net/ipv4/tcp_ipv4.c | 2 +- net/ipv4/tcp_minisocks.c | 4 ++-- net/ipv6/syncookies.c | 6 +++--- net/ipv6/tcp_ipv6.c | 2 +- 7 files changed, 21 insertions(+), 15 deletions(-) diff --git a/include/net/request_sock.h b/include/net/request_sock.h index c719084..3464d90 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -45,9 +45,12 @@ struct request_sock_ops { */ struct request_sock { struct request_sock *dl_next; /* Must be first member! */ - u16 mss; - u8 retrans; - u8 cookie_ts; /* syncookie: encode tcpopts in timestamp */ + kmemcheck_bitfield_begin(flags); + unsigned int mss : 16, + retrans : 8, + req_acks : 7, + cookie_ts : 1; /* syncookie: encode tcpopts in timestamp */ + kmemcheck_bitfield_end(flags); /* The following two fields can be easily recomputed I think -AK */ u32 window_clamp; /* window clamp at creation time */ u32 rcv_wnd; /* rcv_wnd offered first time */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 03a49c7..a2c0439 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -453,7 +453,7 @@ extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS]; extern struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb, struct ip_options *opt); extern __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, - __u16 *mss); + struct request_sock *req); extern __u32 cookie_init_timestamp(struct request_sock *req); extern void cookie_check_timestamp(struct tcp_options_received *tcp_opt); @@ -461,7 +461,7 @@ extern void cookie_check_timestamp(struct tcp_options_received *tcp_opt); /* From net/ipv6/syncookies.c */ extern struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb); extern __u32 cookie_v6_init_sequence(struct sock *sk, struct sk_buff *skb, - __u16 *mss); + struct request_sock *req); /* tcp_output.c */ @@ -1000,7 +1000,9 @@ static inline void tcp_openreq_init(struct request_sock *req, struct inet_request_sock *ireq = inet_rsk(req); req->rcv_wnd = 0; /* So that tcp_send_synack() knows! */ + kmemcheck_annotate_bitfield(req, flags); req->cookie_ts = 0; + req->req_acks = 0; tcp_rsk(req)->rcv_isn = TCP_SKB_CB(skb)->seq; req->mss = rx_opt->mss_clamp; req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0; diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index 5ec678a..8af9261 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -160,19 +160,20 @@ static __u16 const msstab[] = { * Generate a syncookie. mssp points to the mss, which is returned * rounded down to the value encoded in the cookie. */ -__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp) +__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, + struct request_sock *req) { const struct iphdr *iph = ip_hdr(skb); const struct tcphdr *th = tcp_hdr(skb); int mssind; - const __u16 mss = *mssp; + const __u16 mss = req->mss; tcp_synq_overflow(sk); /* XXX sort msstab[] by probability? Binary search? */ for (mssind = 0; mss > msstab[mssind + 1]; mssind++) ; - *mssp = msstab[mssind] + 1; + req->mss = msstab[mssind] + 1; NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9971870..3a2dfc5 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1286,7 +1286,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) syn_flood_warning(skb); req->cookie_ts = tmp_opt.tstamp_ok; #endif - isn = cookie_v4_init_sequence(sk, skb, &req->mss); + isn = cookie_v4_init_sequence(sk, skb, req); } else if (!isn) { struct inet_peer *peer = NULL; diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index e320afe..92778e6 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -642,9 +642,9 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, return NULL; /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */ - if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && + if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept > req->req_acks && TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { - inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--; + req->req_acks++; inet_rsk(req)->acked = 1; return NULL; } diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c index cbe55e5..60d024d 100644 --- a/net/ipv6/syncookies.c +++ b/net/ipv6/syncookies.c @@ -125,18 +125,18 @@ static __u32 check_tcp_syn_cookie(__u32 cookie, struct in6_addr *saddr, & COOKIEMASK; } -__u32 cookie_v6_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp) +__u32 cookie_v6_init_sequence(struct sock *sk, struct sk_buff *skb, struct request_sock *req) { struct ipv6hdr *iph = ipv6_hdr(skb); const struct tcphdr *th = tcp_hdr(skb); int mssind; - const __u16 mss = *mssp; + const __u16 mss = req->mss; tcp_synq_overflow(sk); for (mssind = 0; mss > msstab[mssind + 1]; mssind++) ; - *mssp = msstab[mssind] + 1; + req->mss = msstab[mssind] + 1; NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 4517630..2717bcb 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1219,7 +1219,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb) TCP_ECN_create_request(req, tcp_hdr(skb)); if (want_cookie) { - isn = cookie_v6_init_sequence(sk, skb, &req->mss); + isn = cookie_v6_init_sequence(sk, skb, req); req->cookie_ts = tmp_opt.tstamp_ok; } else if (!isn) { if (ipv6_opt_accepted(sk, skb) || ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 5:00 ` Eric Dumazet @ 2009-10-16 5:29 ` Willy Tarreau 2009-10-16 6:05 ` Eric Dumazet 0 siblings, 1 reply; 41+ messages in thread From: Willy Tarreau @ 2009-10-16 5:29 UTC (permalink / raw) To: Eric Dumazet; +Cc: Julian Anastasov, David Miller, netdev On Fri, Oct 16, 2009 at 07:00:49AM +0200, Eric Dumazet wrote: > Eric Dumazet a écrit : > > > > > > So, it appears defer_accept value is not an inherited attribute, > > but shared by all embryons. Therefore we should not touch it. > > > > Of course it should be done, or add a new connection field to count number > > of pure ACKS received on each SYN_RECV embryon. > > > > Could be something like this ? (on top of net-next-2.6 of course) > > 7 bits is more than enough, we could take 5 bits IMHO. Couldn't we just rely on the retrans vs rskq_defer_accept comparison ? Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 5:29 ` Willy Tarreau @ 2009-10-16 6:05 ` Eric Dumazet 2009-10-16 6:18 ` Willy Tarreau 0 siblings, 1 reply; 41+ messages in thread From: Eric Dumazet @ 2009-10-16 6:05 UTC (permalink / raw) To: Willy Tarreau; +Cc: Julian Anastasov, David Miller, netdev Willy Tarreau a écrit : > On Fri, Oct 16, 2009 at 07:00:49AM +0200, Eric Dumazet wrote: >> Eric Dumazet a écrit : >>> >>> So, it appears defer_accept value is not an inherited attribute, >>> but shared by all embryons. Therefore we should not touch it. >>> >>> Of course it should be done, or add a new connection field to count number >>> of pure ACKS received on each SYN_RECV embryon. >>> >> Could be something like this ? (on top of net-next-2.6 of course) >> >> 7 bits is more than enough, we could take 5 bits IMHO. > > Couldn't we just rely on the retrans vs rskq_defer_accept comparison ? > In this case, we lose TCP_DEFER_ACCEPT advantage in case one SYN-ACK was dropped by the network : We wakeup the listening server when first ACK comes from client, instead of really wait the request. I think being able to count pure-acks would be slighly better, and cost nothing. retrans is the number of SYN-RECV (re)sent, while req_acks would count number of pure ACK received. Those numbers, in an ideal world should be related, but could differ in real world ? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 6:05 ` Eric Dumazet @ 2009-10-16 6:18 ` Willy Tarreau 2009-10-16 7:08 ` Eric Dumazet 0 siblings, 1 reply; 41+ messages in thread From: Willy Tarreau @ 2009-10-16 6:18 UTC (permalink / raw) To: Eric Dumazet; +Cc: Julian Anastasov, David Miller, netdev On Fri, Oct 16, 2009 at 08:05:19AM +0200, Eric Dumazet wrote: > > Couldn't we just rely on the retrans vs rskq_defer_accept comparison ? > > > > In this case, we lose TCP_DEFER_ACCEPT advantage in case one SYN-ACK was dropped > by the network : We wakeup the listening server when first ACK comes from client, > instead of really wait the request. > > I think being able to count pure-acks would be slighly better, and cost nothing. > > > retrans is the number of SYN-RECV (re)sent, while req_acks would count number of > pure ACK received. > > Those numbers, in an ideal world should be related, but could differ in real world ? Yes it could differ if a pure ACK is lost between the client and the server, but in my opinion what is important is not to precisely account the number of ACKs to ensure we wake up exactly after XXX ACKs received, but that in most common situations we avoid to wake up too early. Also, keep in mind that the TCP_DEFER_ACCEPT parameter is passed in number of seconds by the application, which are in turn converted to a number of retransmits based on our own timer, which means that our SYN-ACK counter is what most closely matches the application's expected delay, even if an ACK from the client gets lost in between or if a client's stack retransmits pure ACKs very fast for any implementation-specific reason. Regards, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 6:18 ` Willy Tarreau @ 2009-10-16 7:08 ` Eric Dumazet 2009-10-16 7:19 ` Willy Tarreau 0 siblings, 1 reply; 41+ messages in thread From: Eric Dumazet @ 2009-10-16 7:08 UTC (permalink / raw) To: Willy Tarreau; +Cc: Julian Anastasov, David Miller, netdev Willy Tarreau a écrit : > On Fri, Oct 16, 2009 at 08:05:19AM +0200, Eric Dumazet wrote: >>> Couldn't we just rely on the retrans vs rskq_defer_accept comparison ? >>> >> In this case, we lose TCP_DEFER_ACCEPT advantage in case one SYN-ACK was dropped >> by the network : We wakeup the listening server when first ACK comes from client, >> instead of really wait the request. >> >> I think being able to count pure-acks would be slighly better, and cost nothing. >> >> >> retrans is the number of SYN-RECV (re)sent, while req_acks would count number of >> pure ACK received. >> >> Those numbers, in an ideal world should be related, but could differ in real world ? > > Yes it could differ if a pure ACK is lost between the client and the server, > but in my opinion what is important is not to precisely account the number > of ACKs to ensure we wake up exactly after XXX ACKs received, but that in > most common situations we avoid to wake up too early. > We basically same thing, but you misundertood me. I was concerning about one lost (server -> client SYN-ACK), not a lost (client -> server ACK) which is fine (even without playing with TCP_DEFER_ACCEPT at all) In this case, if we do the retrans test, we'll accept the first (client -> server ACK) and wakeup the application, while most probably we'll receive the client request few milli second later. > Also, keep in mind that the TCP_DEFER_ACCEPT parameter is passed in number > of seconds by the application, which are in turn converted to a number of > retransmits based on our own timer, which means that our SYN-ACK counter > is what most closely matches the application's expected delay, even if an > ACK from the client gets lost in between or if a client's stack retransmits > pure ACKs very fast for any implementation-specific reason. > Well, this is why converting application delay (sockopt() argument) in second units to a number of SYN-ACK counter is subobptimal and error prone. This might be changed to be mapped to what documentation states : a number of seconds, or even better a number of milli seconds (new TCP_DEFER_ACCEPT_MS setsockopt cmd), because a high performance server wont play with > 1 sec values anyway. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 7:08 ` Eric Dumazet @ 2009-10-16 7:19 ` Willy Tarreau 0 siblings, 0 replies; 41+ messages in thread From: Willy Tarreau @ 2009-10-16 7:19 UTC (permalink / raw) To: Eric Dumazet; +Cc: Julian Anastasov, David Miller, netdev On Fri, Oct 16, 2009 at 09:08:59AM +0200, Eric Dumazet wrote: (...) > > Yes it could differ if a pure ACK is lost between the client and the server, > > but in my opinion what is important is not to precisely account the number > > of ACKs to ensure we wake up exactly after XXX ACKs received, but that in > > most common situations we avoid to wake up too early. > > > > We basically same thing, but you misundertood me. I was concerning about > one lost (server -> client SYN-ACK), not a lost (client -> server ACK) which is fine > (even without playing with TCP_DEFER_ACCEPT at all) > > In this case, if we do the retrans test, we'll accept the first (client -> server ACK) > and wakeup the application, while most probably we'll receive the client request > few milli second later. OK I get your point. We can detect that though, as Julian explained it, with the ->acked field. It indicates we got an ACK, which proves the SYN-ACK was received. At first glance, I think that Julian's algorithm explained at the end of his mail exactly covers all cases without using any additional field, though this is not an issue anyway. > > Also, keep in mind that the TCP_DEFER_ACCEPT parameter is passed in number > > of seconds by the application, which are in turn converted to a number of > > retransmits based on our own timer, which means that our SYN-ACK counter > > is what most closely matches the application's expected delay, even if an > > ACK from the client gets lost in between or if a client's stack retransmits > > pure ACKs very fast for any implementation-specific reason. > > > > Well, this is why converting application delay (sockopt() argument) in second units > to a number of SYN-ACK counter is subobptimal and error prone. I agree, but it allows the application to be unware of retransmit timers. > This might be changed to be mapped to what documentation states : a number of seconds, > or even better a number of milli seconds (new TCP_DEFER_ACCEPT_MS setsockopt cmd), > because a high performance server wont play with > 1 sec values anyway. It would be nice but it would require a new timer. Current implementation does not need any and is efficient enough for most common cases. In fact it would have been better to simply be able to specify that we want to skip one empty ACK (or X empty ACKs). But let's make use of what we currently have, with your (or Julian's) changes, it should cover almost all usages without changing semantics for applications. Regards, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-15 22:44 ` Julian Anastasov 2009-10-16 3:51 ` Eric Dumazet @ 2009-10-16 5:03 ` Willy Tarreau 2009-10-16 8:49 ` Julian Anastasov 1 sibling, 1 reply; 41+ messages in thread From: Willy Tarreau @ 2009-10-16 5:03 UTC (permalink / raw) To: Julian Anastasov; +Cc: David Miller, netdev, eric.dumazet Hello Julian, On Fri, Oct 16, 2009 at 01:44:34AM +0300, Julian Anastasov wrote: (...) > I was not clear enough in previous email. Your goal > is to decrease period per client while the actually decreased > threshold is on the listener's socket. 256 conns will be enough > to completely disable TCP_DEFER_ACCEPT on the listener (u8). Damn! Now that you're explaining this, it seems obvious and makes so much sense ! Of course you're right. We should not touch the listening socket, only the new socket being created ! Thanks for this clarification. David, Julian is right, please drop my patch. (...) > As for new flags, may be we should not change > TCP_DEFER_ACCEPT values because current applications can > depend on it. I have no problem with that, eventhough I'm really doubting that many applications depend on its current behaviour. > There is some free space in > struct request_sock_queue just after u8 rskq_defer_accept. > May be new flags/modes can go there to define another > behavior but it means also changes in applications to support > it. One idea I had was to make it signed, because there is currently no use for negative values. But let's see your proposal below. > /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */ > - if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && > + if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && > TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { > inet_rsk(req)->acked = 1; > return NULL; > } Yes, of course that looks a lot better ! (...) > Such change will affect all servers that use > TCP_DEFER_ACCEPT retransmissions less than TCP_SYNCNT. They > will start to see wakeups without data after the TCP_DEFER_ACCEPT > period. Indeed. But anyway a server must always be prepared to see wakeup without data, because there are some situations where it will still happen. For instance, if hardware RX cksum is not possible, a data packet will cause a wakeup and during the recv(), the copy_and_cksum will detect a cksum error and will not send anything to userland, so the application will get an empty read. > To summarize: > > SECOND CLIENT SERVER > --------------------------------------------------------- > 0 SYN SYN-ACK > if DATA => ESTABLISH > if ACK => acked=1 > 3 SYN-ACK (set retrans=1) > if ACK and TCP_DEFER_ACCEPT=1retrans => ESTABLISH > if DATA => ESTABLISH > if ACK => acked=1 > 9 if TCP_SYNCNT=1 => expire > else SYN-ACK (set retrans=2) > if ACK and TCP_DEFER_ACCEPT=2retrans => ESTABLISH > if DATA => ESTABLISH > if ACK => acked=1 > ... > > PRO: > > - if TCP_DEFER_ACCEPT<TCP_SYNCNT and client properly resends > ACK on every SYN-ACK retransmission then we always will > switch to established state on TCP_DEFER_ACCEPT expiration. That's what I wanted to achieve :-) > Such conns will never expire in SYN_RECV state. They > will be terminated by client's FIN or will be accepted > by server application and terminated properly. Of course, > there is some chance if client delays its ACKs or if SYN-ACK > is lost the open request to expire in SYN_RECV state. Yes, but that must be covered by both ends stacks. Network losses are normal and such events already happen in minor quantities everyday. > CON: > > - if client refuses to send DATA we still need these SYN-ACKs > to trigger ACK retransmissions from client because the only > way to switch to established state is when packet is received, > I don't know how TCP_DEFER_ACCEPT expiration can directly change > the open request to established state. Well anyway this is already better than the current situation where an apparently established connection silently dies. With this proposal, applications have a way to get a normal behaviour. > May be it is possible to send first SYN-ACK and > if one ACK is received to send more SYN-ACKs after > TCP_DEFER_ACCEPT period expires. Then client still has chance > to send ACK or DATA that will switch open request to established > socket. So, our timer will be silent when acked=1 while > TCP_DEFER_ACCEPT period is active, for example: > > SYN > SYN-ACK > ACK > ... > acked=1 => no SYN-ACKs retrans (assume they are sent and lost) > > ... > TCP_DEFER_ACCEPT expires => send 2nd SYN-ACK > ... If no client's ACK then resend SYN-ACK while retrans<TCP_SYNCNT > ... > ACK or DATA => ESTABLISHED On first reading I found it a bit dangerous because the client will assume an established connection when not seeing any new SYN-ACKs. And if it has no data to send, it will never retransmit its ACK. But after some thinking, it still matches the purpose of TCP_DEFER_ACCEPT, without the extra SYN-ACK / ACK dance that we currently observe. I was also worried about middle firewalls, but they will have no problem because they'd see an established connection (SYN,SYN-ACK,ACK), so their expiration timer will be large enough to cover the lack of SYN-ACK during TCP_DEFER_ACCEPT. > This will need little change in inet_csk_reqsk_queue_prune() > but it saves SYN-ACK traffic during deferring period in the > common case when client sends ACK. If such compromise is > acceptable I can prepare and test some patch. I would personally like this a lot ! This will satisfy people who expect it to establish at the end of the "TCP_DEFER_ACCEPT delay" as can be interpreted from the man page, will reduce the number of useless SYN-ACKs that annoy other people while still making no visible change for anyone who would rely on the current behaviour. Regards, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 5:03 ` Willy Tarreau @ 2009-10-16 8:49 ` Julian Anastasov 2009-10-16 10:40 ` Eric Dumazet 0 siblings, 1 reply; 41+ messages in thread From: Julian Anastasov @ 2009-10-16 8:49 UTC (permalink / raw) To: Willy Tarreau; +Cc: David Miller, netdev, eric.dumazet Hello, On Fri, 16 Oct 2009, Willy Tarreau wrote: > > This will need little change in inet_csk_reqsk_queue_prune() > > but it saves SYN-ACK traffic during deferring period in the > > common case when client sends ACK. If such compromise is > > acceptable I can prepare and test some patch. > > I would personally like this a lot ! This will satisfy people who > expect it to establish at the end of the "TCP_DEFER_ACCEPT delay" > as can be interpreted from the man page, will reduce the number of > useless SYN-ACKs that annoy other people while still making no > visible change for anyone who would rely on the current behaviour. OK, I don't have much time now, this is what I'm going to test later today and later can provide proper comments: Signed-off-by: Julian Anastasov <ja@ssi.bg> Patch 1: Accept connections after deferring period: diff -urp v2.6.31/linux/net/ipv4/tcp_minisocks.c linux/net/ipv4/tcp_minisocks.c --- v2.6.31/linux/net/ipv4/tcp_minisocks.c 2009-09-11 10:27:17.000000000 +0300 +++ linux/net/ipv4/tcp_minisocks.c 2009-10-16 10:29:19.000000000 +0300 @@ -641,8 +641,8 @@ struct sock *tcp_check_req(struct sock * if (!(flg & TCP_FLAG_ACK)) return NULL; - /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */ - if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && + /* While TCP_DEFER_ACCEPT is active, drop bare ACK. */ + if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { inet_rsk(req)->acked = 1; return NULL; Patch 2: Do not resend SYN-ACK while waiting for data after ACK. Also, do not drop acked request if sending of SYN-ACK fails. diff -urp v2.6.31/linux/net/ipv4/inet_connection_sock.c linux/net/ipv4/inet_connection_sock.c --- v2.6.31/linux/net/ipv4/inet_connection_sock.c 2009-06-13 10:53:58.000000000 +0300 +++ linux/net/ipv4/inet_connection_sock.c 2009-10-16 11:35:52.000000000 +0300 @@ -446,6 +446,28 @@ extern int sysctl_tcp_synack_retries; EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_hash_add); +/* Decide when to expire the request and when to resend SYN-ACK */ +static inline void syn_ack_recalc(struct request_sock *req, const int thresh, + const int max_retries, + const u8 rskq_defer_accept, + int *expire, int *resend) +{ + if (!rskq_defer_accept) { + *expire = req->retrans >= thresh; + *resend = 1; + return; + } + *expire = req->retrans >= thresh && + (!inet_rsk(req)->acked || req->retrans >= max_retries); + /* + * Do not resend while waiting for data after ACK, + * start to resend on end of deferring period to give + * last chance for data or ACK to create established socket. + */ + *resend = !inet_rsk(req)->acked || + req->retrans >= rskq_defer_accept - 1; +} + void inet_csk_reqsk_queue_prune(struct sock *parent, const unsigned long interval, const unsigned long timeout, @@ -501,9 +523,15 @@ void inet_csk_reqsk_queue_prune(struct s reqp=&lopt->syn_table[i]; while ((req = *reqp) != NULL) { if (time_after_eq(now, req->expires)) { - if ((req->retrans < thresh || - (inet_rsk(req)->acked && req->retrans < max_retries)) - && !req->rsk_ops->rtx_syn_ack(parent, req)) { + int expire = 0, resend = 0; + + syn_ack_recalc(req, thresh, max_retries, + queue->rskq_defer_accept, + &expire, &resend); + if (!expire && + (!resend || + !req->rsk_ops->rtx_syn_ack(parent, req) || + inet_rsk(req)->acked)) { unsigned long timeo; if (req->retrans++ == 0) Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 8:49 ` Julian Anastasov @ 2009-10-16 10:40 ` Eric Dumazet 2009-10-16 19:27 ` Willy Tarreau 2009-10-17 11:48 ` Julian Anastasov 0 siblings, 2 replies; 41+ messages in thread From: Eric Dumazet @ 2009-10-16 10:40 UTC (permalink / raw) To: Julian Anastasov; +Cc: Willy Tarreau, David Miller, netdev Julian Anastasov a écrit : > Hello, > > On Fri, 16 Oct 2009, Willy Tarreau wrote: > >>> This will need little change in inet_csk_reqsk_queue_prune() >>> but it saves SYN-ACK traffic during deferring period in the >>> common case when client sends ACK. If such compromise is >>> acceptable I can prepare and test some patch. >> I would personally like this a lot ! This will satisfy people who >> expect it to establish at the end of the "TCP_DEFER_ACCEPT delay" >> as can be interpreted from the man page, will reduce the number of >> useless SYN-ACKs that annoy other people while still making no >> visible change for anyone who would rely on the current behaviour. > > OK, I don't have much time now, this is what I'm > going to test later today and later can provide proper comments: > > Signed-off-by: Julian Anastasov <ja@ssi.bg> I tested both patches and they perform very well, thank you ! For the minimum 1 sec value, tcpdump looks like : 12:32:03.850456 IP 127.0.0.1.20000 > 127.0.0.1.2222: S 1879889239:1879889239(0) win 32792 <mss 16396,nop,nop,timestamp 952803 0,nop,wscale 6> 12:32:03.850463 IP 127.0.0.1.2222 > 127.0.0.1.20000: S 1890330616:1890330616(0) ack 1879889240 win 32768 <mss 16396,nop,nop,timestamp 952803 952803,nop,wscale 6> 12:32:03.850469 IP 127.0.0.1.20000 > 127.0.0.1.2222: . ack 1 win 513 <nop,nop,timestamp 952803 952803> 12:32:06.849989 IP 127.0.0.1.2222 > 127.0.0.1.20000: S 1890330616:1890330616(0) ack 1879889240 win 32768 <mss 16396,nop,nop,timestamp 955803 952803,nop,wscale 6> 12:32:06.849996 IP 127.0.0.1.20000 > 127.0.0.1.2222: . ack 1 win 513 <nop,nop,timestamp 955803 955803> So listening application gets the accept() 3 seconds after initial SYN # ss -emoian | grep SYN-RECV SYN-RECV 0 0 127.0.0.1:2222 127.0.0.1:20000 timer:(on,24sec,4) ino:0 sk:f6f0ec80 I wonder if tcp_diag should be extented a bit to reflect fact that the ACK was received from client (ie forward the inet_rsk(req)->acked information to idiag_rqueue) diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index cb73fde..c172bd4 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -589,7 +589,7 @@ static int inet_diag_fill_req(struct sk_buff *skb, struct sock *sk, r->id.idiag_src[0] = ireq->loc_addr; r->id.idiag_dst[0] = ireq->rmt_addr; r->idiag_expires = jiffies_to_msecs(tmo); - r->idiag_rqueue = 0; + r->idiag_rqueue = ireq->acked; r->idiag_wqueue = 0; r->idiag_uid = sock_i_uid(sk); r->idiag_inode = 0; -> # ss -emoian | grep SYN-RECV SYN-RECV 1 0 127.0.0.1:2222 127.0.0.1:20000 timer:(on,24sec,4) ino:0 sk:f6f0ec80 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 10:40 ` Eric Dumazet @ 2009-10-16 19:27 ` Willy Tarreau 2009-10-17 11:48 ` Julian Anastasov 1 sibling, 0 replies; 41+ messages in thread From: Willy Tarreau @ 2009-10-16 19:27 UTC (permalink / raw) To: Eric Dumazet; +Cc: Julian Anastasov, David Miller, netdev Hi, On Fri, Oct 16, 2009 at 12:40:11PM +0200, Eric Dumazet wrote: > Julian Anastasov a écrit : > > Hello, > > > > On Fri, 16 Oct 2009, Willy Tarreau wrote: > > > >>> This will need little change in inet_csk_reqsk_queue_prune() > >>> but it saves SYN-ACK traffic during deferring period in the > >>> common case when client sends ACK. If such compromise is > >>> acceptable I can prepare and test some patch. > >> I would personally like this a lot ! This will satisfy people who > >> expect it to establish at the end of the "TCP_DEFER_ACCEPT delay" > >> as can be interpreted from the man page, will reduce the number of > >> useless SYN-ACKs that annoy other people while still making no > >> visible change for anyone who would rely on the current behaviour. > > > > OK, I don't have much time now, this is what I'm > > going to test later today and later can provide proper comments: > > > > Signed-off-by: Julian Anastasov <ja@ssi.bg> > > I tested both patches and they perform very well, thank you ! > > For the minimum 1 sec value, tcpdump looks like : > 12:32:03.850456 IP 127.0.0.1.20000 > 127.0.0.1.2222: S 1879889239:1879889239(0) win 32792 <mss 16396,nop,nop,timestamp 952803 0,nop,wscale 6> > 12:32:03.850463 IP 127.0.0.1.2222 > 127.0.0.1.20000: S 1890330616:1890330616(0) ack 1879889240 win 32768 <mss 16396,nop,nop,timestamp 952803 952803,nop,wscale 6> > 12:32:03.850469 IP 127.0.0.1.20000 > 127.0.0.1.2222: . ack 1 win 513 <nop,nop,timestamp 952803 952803> > > 12:32:06.849989 IP 127.0.0.1.2222 > 127.0.0.1.20000: S 1890330616:1890330616(0) ack 1879889240 win 32768 <mss 16396,nop,nop,timestamp 955803 952803,nop,wscale 6> > 12:32:06.849996 IP 127.0.0.1.20000 > 127.0.0.1.2222: . ack 1 win 513 <nop,nop,timestamp 955803 955803> > > So listening application gets the accept() 3 seconds after initial SYN Excellent! Nice work guys. > # ss -emoian | grep SYN-RECV > SYN-RECV 0 0 127.0.0.1:2222 127.0.0.1:20000 timer:(on,24sec,4) ino:0 sk:f6f0ec80 > > I wonder if tcp_diag should be extented a bit to reflect fact that the ACK was received from client > (ie forward the inet_rsk(req)->acked information to idiag_rqueue) I personally have no opinion on this point. Thanks! Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-16 10:40 ` Eric Dumazet 2009-10-16 19:27 ` Willy Tarreau @ 2009-10-17 11:48 ` Julian Anastasov 2009-10-17 12:07 ` Eric Dumazet 1 sibling, 1 reply; 41+ messages in thread From: Julian Anastasov @ 2009-10-17 11:48 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev Hello, On Fri, 16 Oct 2009, Eric Dumazet wrote: > I wonder if tcp_diag should be extented a bit to reflect fact that the ACK was received from client > (ie forward the inet_rsk(req)->acked information to idiag_rqueue) It is a good idea. > diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c > index cb73fde..c172bd4 100644 > --- a/net/ipv4/inet_diag.c > +++ b/net/ipv4/inet_diag.c > @@ -589,7 +589,7 @@ static int inet_diag_fill_req(struct sk_buff *skb, struct sock *sk, > r->id.idiag_src[0] = ireq->loc_addr; > r->id.idiag_dst[0] = ireq->rmt_addr; > r->idiag_expires = jiffies_to_msecs(tmo); > - r->idiag_rqueue = 0; > + r->idiag_rqueue = ireq->acked; > r->idiag_wqueue = 0; > r->idiag_uid = sock_i_uid(sk); > r->idiag_inode = 0; I tested both patches. It seems the current algorithm to convert seconds to retransmissions does not match well the TCP SYN-ACK timer and sometimes can convert the seconds to retransmissions which are 1 above the expected. For example, you set 9 seconds (expecting 2 retrans) but you get 3 retrans, visible with TCP_SYNCNT=1. Also, it is limited to period of 32 retransmissions. The following patch changes the TCP_DEFER_ACCEPT period calculation to match TCP SYN-ACK retransmissions and to help those folks who select the seconds with TCP SYN-ACK timing in mind. It also allows the retransmission threshold to be up to 255. Signed-off-by: Julian Anastasov <ja@ssi.bg> diff -urp v2.6.31/linux/net/ipv4/tcp.c linux/net/ipv4/tcp.c --- v2.6.31/linux/net/ipv4/tcp.c 2009-09-11 10:27:17.000000000 +0300 +++ linux/net/ipv4/tcp.c 2009-10-17 12:34:38.000000000 +0300 @@ -2165,13 +2165,20 @@ static int do_tcp_setsockopt(struct sock case TCP_DEFER_ACCEPT: icsk->icsk_accept_queue.rskq_defer_accept = 0; if (val > 0) { + int timeout = TCP_TIMEOUT_INIT / HZ; + int period = timeout; + /* Translate value in seconds to number of * retransmits */ - while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && - val > ((TCP_TIMEOUT_INIT / HZ) << - icsk->icsk_accept_queue.rskq_defer_accept)) + icsk->icsk_accept_queue.rskq_defer_accept = 1; + while (icsk->icsk_accept_queue.rskq_defer_accept < 255 && + val > period) { icsk->icsk_accept_queue.rskq_defer_accept++; - icsk->icsk_accept_queue.rskq_defer_accept++; + timeout <<= 1; + if (timeout > TCP_RTO_MAX / HZ) + timeout = TCP_RTO_MAX / HZ; + period += timeout; + } } break; FYI, the old algorithm selects the following retransmissions for the configured seconds: defer_accept=1 retrans for 1-3 secs defer_accept=2 retrans for 4-6 secs defer_accept=3 retrans for 7-12 secs defer_accept=4 retrans for 13-24 secs defer_accept=5 retrans for 25-48 secs defer_accept=6 retrans for 49-96 secs defer_accept=7 retrans for 97-192 secs defer_accept=8 retrans for 193-384 secs While the new algorithm is as follows: defer_accept=1 retrans for 1-3 secs defer_accept=2 retrans for 4-9 secs defer_accept=3 retrans for 10-21 secs defer_accept=4 retrans for 22-45 secs defer_accept=5 retrans for 46-93 secs defer_accept=6 retrans for 94-189 secs defer_accept=7 retrans for 190-309 secs defer_accept=8 retrans for 310-429 secs Comments? Next step is to post the 3 patches separately for final review and applying. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-17 11:48 ` Julian Anastasov @ 2009-10-17 12:07 ` Eric Dumazet 2009-10-17 14:20 ` Julian Anastasov 0 siblings, 1 reply; 41+ messages in thread From: Eric Dumazet @ 2009-10-17 12:07 UTC (permalink / raw) To: Julian Anastasov; +Cc: Willy Tarreau, David Miller, netdev Julian Anastasov a écrit : > > I tested both patches. It seems the current algorithm to > convert seconds to retransmissions does not match well the TCP > SYN-ACK timer and sometimes can convert the seconds to > retransmissions which are 1 above the expected. For example, > you set 9 seconds (expecting 2 retrans) but you get 3 retrans, > visible with TCP_SYNCNT=1. > > Also, it is limited to period of 32 retransmissions. > > The following patch changes the TCP_DEFER_ACCEPT > period calculation to match TCP SYN-ACK retransmissions and to > help those folks who select the seconds with TCP SYN-ACK > timing in mind. It also allows the retransmission threshold > to be up to 255. > > Signed-off-by: Julian Anastasov <ja@ssi.bg> > > diff -urp v2.6.31/linux/net/ipv4/tcp.c linux/net/ipv4/tcp.c > --- v2.6.31/linux/net/ipv4/tcp.c 2009-09-11 10:27:17.000000000 +0300 > +++ linux/net/ipv4/tcp.c 2009-10-17 12:34:38.000000000 +0300 > @@ -2165,13 +2165,20 @@ static int do_tcp_setsockopt(struct sock > case TCP_DEFER_ACCEPT: > icsk->icsk_accept_queue.rskq_defer_accept = 0; > if (val > 0) { > + int timeout = TCP_TIMEOUT_INIT / HZ; > + int period = timeout; > + > /* Translate value in seconds to number of > * retransmits */ > - while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && > - val > ((TCP_TIMEOUT_INIT / HZ) << > - icsk->icsk_accept_queue.rskq_defer_accept)) > + icsk->icsk_accept_queue.rskq_defer_accept = 1; > + while (icsk->icsk_accept_queue.rskq_defer_accept < 255 && > + val > period) { > icsk->icsk_accept_queue.rskq_defer_accept++; > - icsk->icsk_accept_queue.rskq_defer_accept++; > + timeout <<= 1; > + if (timeout > TCP_RTO_MAX / HZ) > + timeout = TCP_RTO_MAX / HZ; > + period += timeout; > + } > } > break; > > > FYI, the old algorithm selects the following retransmissions > for the configured seconds: > > defer_accept=1 retrans for 1-3 secs > defer_accept=2 retrans for 4-6 secs > defer_accept=3 retrans for 7-12 secs > defer_accept=4 retrans for 13-24 secs > defer_accept=5 retrans for 25-48 secs > defer_accept=6 retrans for 49-96 secs > defer_accept=7 retrans for 97-192 secs > defer_accept=8 retrans for 193-384 secs > > While the new algorithm is as follows: > > defer_accept=1 retrans for 1-3 secs > defer_accept=2 retrans for 4-9 secs > defer_accept=3 retrans for 10-21 secs > defer_accept=4 retrans for 22-45 secs > defer_accept=5 retrans for 46-93 secs > defer_accept=6 retrans for 94-189 secs > defer_accept=7 retrans for 190-309 secs > defer_accept=8 retrans for 310-429 secs > > Comments? Next step is to post the 3 patches separately > for final review and applying. > I really like this, but please define helper functions out of do_tcp_setsockopt() /* Translate value in seconds to number of SYN-ACK retransmits */ static u8 secs_to_retrans(int seconds) { u8 res = 0; if (seconds > 0) { int timeout = TCP_TIMEOUT_INIT / HZ; int period = timeout; res = 1; while (res < 255 && seconds > period) { res++; timeout <<= 1; if (timeout > TCP_RTO_MAX / HZ) timeout = TCP_RTO_MAX / HZ; period += timeout; } } return res; } You also need the reverse function for getsockopt()... ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-17 12:07 ` Eric Dumazet @ 2009-10-17 14:20 ` Julian Anastasov 2009-10-19 20:01 ` Eric Dumazet 0 siblings, 1 reply; 41+ messages in thread From: Julian Anastasov @ 2009-10-17 14:20 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev Hello, On Sat, 17 Oct 2009, Eric Dumazet wrote: > I really like this, but please define helper functions out of do_tcp_setsockopt() > > /* Translate value in seconds to number of SYN-ACK retransmits */ > static u8 secs_to_retrans(int seconds) > { > u8 res = 0; > > if (seconds > 0) { > int timeout = TCP_TIMEOUT_INIT / HZ; > int period = timeout; > > res = 1; > while (res < 255 && seconds > period) { > res++; > timeout <<= 1; > if (timeout > TCP_RTO_MAX / HZ) > timeout = TCP_RTO_MAX / HZ; > period += timeout; > } > } > return res; > } > > You also need the reverse function for getsockopt()... Yes, I forgot that. Here is what I tested, should I split it later to 2 patches? May be it should go somewhere in net/core/sock.c as extern funcs with EXPORT_SYMBOL to allow other protocols one day to use it? Signed-off-by: Julian Anastasov <ja@ssi.bg> diff -urp v2.6.31/linux/net/ipv4/tcp.c linux/net/ipv4/tcp.c --- v2.6.31/linux/net/ipv4/tcp.c 2009-09-11 10:27:17.000000000 +0300 +++ linux/net/ipv4/tcp.c 2009-10-17 16:37:48.000000000 +0300 @@ -326,6 +326,43 @@ void tcp_enter_memory_pressure(struct so EXPORT_SYMBOL(tcp_enter_memory_pressure); +/* Convert seconds to retransmits based on initial and max timeout */ +static u8 secs_to_retrans(int seconds, int timeout, int rto_max) +{ + u8 res = 0; + + if (seconds > 0) { + int period = timeout; + + res = 1; + while (seconds > period && res < 255) { + res++; + timeout <<= 1; + if (timeout > rto_max) + timeout = rto_max; + period += timeout; + } + } + return res; +} + +/* Convert retransmits to seconds based on initial and max timeout */ +static int retrans_to_secs(u8 retrans, int timeout, int rto_max) +{ + int period = 0; + + if (retrans > 0) { + period = timeout; + while (--retrans) { + timeout <<= 1; + if (timeout > rto_max) + timeout = rto_max; + period += timeout; + } + } + return period; +} + /* * Wait for a TCP event. * @@ -2163,16 +2200,10 @@ static int do_tcp_setsockopt(struct sock break; case TCP_DEFER_ACCEPT: - icsk->icsk_accept_queue.rskq_defer_accept = 0; - if (val > 0) { - /* Translate value in seconds to number of - * retransmits */ - while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && - val > ((TCP_TIMEOUT_INIT / HZ) << - icsk->icsk_accept_queue.rskq_defer_accept)) - icsk->icsk_accept_queue.rskq_defer_accept++; - icsk->icsk_accept_queue.rskq_defer_accept++; - } + /* Translate value in seconds to number of retransmits */ + icsk->icsk_accept_queue.rskq_defer_accept = + secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ, + TCP_RTO_MAX / HZ); break; case TCP_WINDOW_CLAMP: @@ -2353,8 +2384,8 @@ static int do_tcp_getsockopt(struct sock val = (val ? : sysctl_tcp_fin_timeout) / HZ; break; case TCP_DEFER_ACCEPT: - val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 : - ((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1)); + val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept, + TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ); break; case TCP_WINDOW_CLAMP: val = tp->window_clamp; Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-17 14:20 ` Julian Anastasov @ 2009-10-19 20:01 ` Eric Dumazet 2009-10-19 20:11 ` Willy Tarreau 2009-10-20 2:23 ` David Miller 0 siblings, 2 replies; 41+ messages in thread From: Eric Dumazet @ 2009-10-19 20:01 UTC (permalink / raw) To: Julian Anastasov; +Cc: Willy Tarreau, David Miller, netdev Julian Anastasov a écrit : > Hello, > > On Sat, 17 Oct 2009, Eric Dumazet wrote: > >> I really like this, but please define helper functions out of do_tcp_setsockopt() >> >> /* Translate value in seconds to number of SYN-ACK retransmits */ >> static u8 secs_to_retrans(int seconds) >> { >> u8 res = 0; >> >> if (seconds > 0) { >> int timeout = TCP_TIMEOUT_INIT / HZ; >> int period = timeout; >> >> res = 1; >> while (res < 255 && seconds > period) { >> res++; >> timeout <<= 1; >> if (timeout > TCP_RTO_MAX / HZ) >> timeout = TCP_RTO_MAX / HZ; >> period += timeout; >> } >> } >> return res; >> } >> >> You also need the reverse function for getsockopt()... > > Yes, I forgot that. Here is what I tested, should > I split it later to 2 patches? May be it should go somewhere > in net/core/sock.c as extern funcs with EXPORT_SYMBOL to > allow other protocols one day to use it? > No need to EXPORT_SYMBOL it for the moment. We can add it later if really needed. David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea (tcp: fix tcp_defer_accept to consider the timeout) since we know its broken. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-19 20:01 ` Eric Dumazet @ 2009-10-19 20:11 ` Willy Tarreau 2009-10-19 20:17 ` Eric Dumazet 2009-10-20 2:23 ` David Miller 1 sibling, 1 reply; 41+ messages in thread From: Willy Tarreau @ 2009-10-19 20:11 UTC (permalink / raw) To: Eric Dumazet; +Cc: Julian Anastasov, David Miller, netdev On Mon, Oct 19, 2009 at 10:01:15PM +0200, Eric Dumazet wrote: > David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea > (tcp: fix tcp_defer_accept to consider the timeout) > since we know its broken. yes I agree with Eric, please revert it and wait for Julian's patch which gets it right. Sorry for the wrong fix, at least it will have brought the discussion on the table, but without a faulty patch it would have been better :-/ Thanks, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-19 20:11 ` Willy Tarreau @ 2009-10-19 20:17 ` Eric Dumazet 0 siblings, 0 replies; 41+ messages in thread From: Eric Dumazet @ 2009-10-19 20:17 UTC (permalink / raw) To: Willy Tarreau; +Cc: Julian Anastasov, David Miller, netdev Willy Tarreau a écrit : > On Mon, Oct 19, 2009 at 10:01:15PM +0200, Eric Dumazet wrote: >> David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea >> (tcp: fix tcp_defer_accept to consider the timeout) >> since we know its broken. > > yes I agree with Eric, please revert it and wait for Julian's > patch which gets it right. Sorry for the wrong fix, at least > it will have brought the discussion on the table, but without > a faulty patch it would have been better :-/ > Thanks again Willy, we made progress and this is nice :) ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-19 20:01 ` Eric Dumazet 2009-10-19 20:11 ` Willy Tarreau @ 2009-10-20 2:23 ` David Miller 1 sibling, 0 replies; 41+ messages in thread From: David Miller @ 2009-10-20 2:23 UTC (permalink / raw) To: eric.dumazet; +Cc: ja, w, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Mon, 19 Oct 2009 22:01:15 +0200 > David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea > (tcp: fix tcp_defer_accept to consider the timeout) > since we know its broken. I've reverted that patch and applied Julian's three tcp patches, thanks! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-14 22:43 ` David Miller 2009-10-15 6:08 ` Willy Tarreau @ 2009-10-15 7:59 ` Julian Anastasov 1 sibling, 0 replies; 41+ messages in thread From: Julian Anastasov @ 2009-10-15 7:59 UTC (permalink / raw) To: David Miller; +Cc: w, netdev, eric.dumazet Hello, On Wed, 14 Oct 2009, David Miller wrote: > For now I'm pushing Willy's change into Linus's tree. Better do not apply this change because it's reasoning is wrong - counter is increased at the same time when threshold decreases - what this change does is to half the TCP_DEFER_ACCEPT period (assuming first ACK comes immediately) with the side effect to switch off the TCP_DEFER_ACCEPT flag during SYN-ACK next retransmits. When one uses TCP_DEFER_ACCEPT=1 as flag the listener will see the new socket on the 2nd ACK which contradicts with the TCP_DEFER_ACCEPT semantic to wakeup only when data comes. > After more discussion we can revert if necessary. > > I won't submit this to -stable until the discussion is fully resolved. > > Thanks! Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-14 7:27 ` Julian Anastasov 2009-10-14 20:17 ` Willy Tarreau @ 2009-10-16 10:08 ` Ilpo Järvinen 1 sibling, 0 replies; 41+ messages in thread From: Ilpo Järvinen @ 2009-10-16 10:08 UTC (permalink / raw) To: Julian Anastasov; +Cc: Willy Tarreau, David Miller, Netdev, Eric Dumazet On Wed, 14 Oct 2009, Julian Anastasov wrote: > > Hello, > > On Wed, 14 Oct 2009, Willy Tarreau wrote: > > > > > I was trying to use TCP_DEFER_ACCEPT and noticed that if the > > > > client does not talk, the connection is never accepted and > > > > remains in SYN_RECV state until the retransmits expire, where > > > > it finally is deleted. This is bad when some firewall such as > > > > > > I think, this is by design, there is big comment in > > > tcp_check_req(). > > > > I'm not sure. That would considerably reduce the usefulness of > > the feature. The comment I see there is just a one line explaining > > why we drop the ACK. It does not indicate any strategy on what to > > do when the counter expires. > > There were attempts to switch the server socket to > established state, no SYN-ACK retransmissions after client ACK > and send FIN (or RST) to client on TCP_DEFER_ACCEPT expiration > but due to some locking problems it was reverted: > > http://marc.info/?t=121318919000001&r=1&w=2 For the record, I think you underestimate that particular change by putting it like that. After being there and figuring out all the what that change did I'd say it attempted even more ambitious goal, ie., to allow ofo data to be queued. And that was the cause for all the troubles as the full state was created without a wakeup resulting in plurality in locks that one had to acquire when waking up and all the havoc broke loose due to lack of locking (or deadlockish ordering)... ...So the problem is not in the "ESTABLISHED" (or like) state itself, but in the fact that full state was created before the wakeup which needs to access the listening socket state too. -- i. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 5:07 TCP_DEFER_ACCEPT is missing counter update Willy Tarreau 2009-10-13 7:11 ` David Miller @ 2009-10-13 7:23 ` Eric Dumazet 2009-10-13 7:34 ` Willy Tarreau 2009-10-13 7:35 ` David Miller 1 sibling, 2 replies; 41+ messages in thread From: Eric Dumazet @ 2009-10-13 7:23 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev Willy Tarreau a écrit : > Hello, > > I was trying to use TCP_DEFER_ACCEPT and noticed that if the > client does not talk, the connection is never accepted and > remains in SYN_RECV state until the retransmits expire, where > it finally is deleted. This is bad when some firewall such as > netfilter sits between the client and the server because the > firewall sees the connection in ESTABLISHED state while the > server will finally silently drop it without sending an RST. > > This behaviour contradicts the man page which says it should > wait only for some time : > > TCP_DEFER_ACCEPT (since Linux 2.4) > Allows a listener to be awakened only when data arrives > on the socket. Takes an integer value (seconds), this > can bound the maximum number of attempts TCP will > make to complete the connection. This option should not > be used in code intended to be portable. > > Also, looking at ipv4/tcp.c, a retransmit counter is correctly > computed : > > case TCP_DEFER_ACCEPT: > icsk->icsk_accept_queue.rskq_defer_accept = 0; > if (val > 0) { > /* Translate value in seconds to number of > * retransmits */ > while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && > val > ((TCP_TIMEOUT_INIT / HZ) << > icsk->icsk_accept_queue.rskq_defer_accept)) > icsk->icsk_accept_queue.rskq_defer_accept++; > icsk->icsk_accept_queue.rskq_defer_accept++; > } > break; > > ==> rskq_defer_accept is used as a counter of retransmits. > > But in tcp_minisocks.c, this counter is only checked. And in > fact, I have found no location which updates it. So I think > that what was intended was to decrease it in tcp_minisocks > whenever it is checked, which the trivial patch below does : > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index f8d67cc..1b443b0 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -645,6 +645,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, > if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && > TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { > + inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--; > inet_rsk(req)->acked = 1; > return NULL; > } > I dont understand why you want to decrement rskq_defer_accept here. We receive a pure ACK (wihout DATA). We should receive exactly one such ACK. (This is the third packet of the three way TCP handshake) How this can solve the problem you mention ? (ie, not sending an RST when we timeout the SYN_RECV session) ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:23 ` Eric Dumazet @ 2009-10-13 7:34 ` Willy Tarreau 2009-10-13 8:08 ` Olaf van der Spek 2009-10-13 8:29 ` Eric Dumazet 2009-10-13 7:35 ` David Miller 1 sibling, 2 replies; 41+ messages in thread From: Willy Tarreau @ 2009-10-13 7:34 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Tue, Oct 13, 2009 at 09:23:59AM +0200, Eric Dumazet wrote: > Willy Tarreau a écrit : > > Hello, > > > > I was trying to use TCP_DEFER_ACCEPT and noticed that if the > > client does not talk, the connection is never accepted and > > remains in SYN_RECV state until the retransmits expire, where > > it finally is deleted. This is bad when some firewall such as > > netfilter sits between the client and the server because the > > firewall sees the connection in ESTABLISHED state while the > > server will finally silently drop it without sending an RST. > > > > This behaviour contradicts the man page which says it should > > wait only for some time : > > > > TCP_DEFER_ACCEPT (since Linux 2.4) > > Allows a listener to be awakened only when data arrives > > on the socket. Takes an integer value (seconds), this > > can bound the maximum number of attempts TCP will > > make to complete the connection. This option should not > > be used in code intended to be portable. > > > > Also, looking at ipv4/tcp.c, a retransmit counter is correctly > > computed : > > > > case TCP_DEFER_ACCEPT: > > icsk->icsk_accept_queue.rskq_defer_accept = 0; > > if (val > 0) { > > /* Translate value in seconds to number of > > * retransmits */ > > while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && > > val > ((TCP_TIMEOUT_INIT / HZ) << > > icsk->icsk_accept_queue.rskq_defer_accept)) > > icsk->icsk_accept_queue.rskq_defer_accept++; > > icsk->icsk_accept_queue.rskq_defer_accept++; > > } > > break; > > > > ==> rskq_defer_accept is used as a counter of retransmits. > > > > But in tcp_minisocks.c, this counter is only checked. And in > > fact, I have found no location which updates it. So I think > > that what was intended was to decrease it in tcp_minisocks > > whenever it is checked, which the trivial patch below does : > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > index f8d67cc..1b443b0 100644 > > --- a/net/ipv4/tcp_minisocks.c > > +++ b/net/ipv4/tcp_minisocks.c > > @@ -645,6 +645,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, > > if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && > > TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { > > + inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--; > > inet_rsk(req)->acked = 1; > > return NULL; > > } > > > > > I dont understand why you want to decrement rskq_defer_accept here. Because the "timeout" as set by setsockopt() is converted into number of retransmits. > We receive a pure ACK (wihout DATA). > We should receive exactly one such ACK. No, we will receive other ones because the socket remains in SYN_RECV and since the local system ignores this ACK, it will send a SYN-ACK again, triggering a new ACK from the client. Although the concept looks strange at first, I think its implementation is in fact very smart because it manages to defer acceptation with an approximate timeout without using another timer. The most common requirement should only be to wait for an HTTP request to come in, and setting the timeout to anything non-zero is enough to just drop the first empty ACK and immediately accept on the data segment, so this method fits this purpose perfectly. Regards, Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:34 ` Willy Tarreau @ 2009-10-13 8:08 ` Olaf van der Spek 2009-10-13 8:29 ` Eric Dumazet 1 sibling, 0 replies; 41+ messages in thread From: Olaf van der Spek @ 2009-10-13 8:08 UTC (permalink / raw) To: netdev On Tue, Oct 13, 2009 at 9:34 AM, Willy Tarreau <w@1wt.eu> wrote: >> We receive a pure ACK (wihout DATA). >> We should receive exactly one such ACK. > > No, we will receive other ones because the socket remains in SYN_RECV > and since the local system ignores this ACK, it will send a SYN-ACK > again, triggering a new ACK from the client. Why does it ignore the ACK? Just because that's the simplest implementation of defer_accept? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:34 ` Willy Tarreau 2009-10-13 8:08 ` Olaf van der Spek @ 2009-10-13 8:29 ` Eric Dumazet 2009-10-13 8:35 ` David Miller 1 sibling, 1 reply; 41+ messages in thread From: Eric Dumazet @ 2009-10-13 8:29 UTC (permalink / raw) To: Willy Tarreau; +Cc: Eric Dumazet, netdev Willy Tarreau a écrit : > On Tue, Oct 13, 2009 at 09:23:59AM +0200, Eric Dumazet wrote: >> Willy Tarreau a écrit : >>> Hello, >>> >>> I was trying to use TCP_DEFER_ACCEPT and noticed that if the >>> client does not talk, the connection is never accepted and >>> remains in SYN_RECV state until the retransmits expire, where >>> it finally is deleted. This is bad when some firewall such as >>> netfilter sits between the client and the server because the >>> firewall sees the connection in ESTABLISHED state while the >>> server will finally silently drop it without sending an RST. >>> >>> This behaviour contradicts the man page which says it should >>> wait only for some time : >>> >>> TCP_DEFER_ACCEPT (since Linux 2.4) >>> Allows a listener to be awakened only when data arrives >>> on the socket. Takes an integer value (seconds), this >>> can bound the maximum number of attempts TCP will >>> make to complete the connection. This option should not >>> be used in code intended to be portable. >>> >>> Also, looking at ipv4/tcp.c, a retransmit counter is correctly >>> computed : >>> >>> case TCP_DEFER_ACCEPT: >>> icsk->icsk_accept_queue.rskq_defer_accept = 0; >>> if (val > 0) { >>> /* Translate value in seconds to number of >>> * retransmits */ >>> while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && >>> val > ((TCP_TIMEOUT_INIT / HZ) << >>> icsk->icsk_accept_queue.rskq_defer_accept)) >>> icsk->icsk_accept_queue.rskq_defer_accept++; >>> icsk->icsk_accept_queue.rskq_defer_accept++; >>> } >>> break; >>> >>> ==> rskq_defer_accept is used as a counter of retransmits. >>> >>> But in tcp_minisocks.c, this counter is only checked. And in >>> fact, I have found no location which updates it. So I think >>> that what was intended was to decrease it in tcp_minisocks >>> whenever it is checked, which the trivial patch below does : >>> >>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c >>> index f8d67cc..1b443b0 100644 >>> --- a/net/ipv4/tcp_minisocks.c >>> +++ b/net/ipv4/tcp_minisocks.c >>> @@ -645,6 +645,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, >>> if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept && >>> TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) { >>> + inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--; >>> inet_rsk(req)->acked = 1; >>> return NULL; >>> } >>> >> >> I dont understand why you want to decrement rskq_defer_accept here. > > Because the "timeout" as set by setsockopt() is converted into number > of retransmits. > >> We receive a pure ACK (wihout DATA). >> We should receive exactly one such ACK. > > No, we will receive other ones because the socket remains in SYN_RECV > and since the local system ignores this ACK, it will send a SYN-ACK > again, triggering a new ACK from the client. > > Although the concept looks strange at first, I think its implementation > is in fact very smart because it manages to defer acceptation with an > approximate timeout without using another timer. > > The most common requirement should only be to wait for an HTTP request > to come in, and setting the timeout to anything non-zero is enough to > just drop the first empty ACK and immediately accept on the data > segment, so this method fits this purpose perfectly. > Indeed, thanks for this detailed explanation, I missed the SYN-ACK timer and retransmits. I played some years ago with TCP_DEFER_ACCEPT and got some unexpected results on transmitted packets (server was consuming more bandwidth), and I know understand it was very broken until today ! Thanks again Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 8:29 ` Eric Dumazet @ 2009-10-13 8:35 ` David Miller 0 siblings, 0 replies; 41+ messages in thread From: David Miller @ 2009-10-13 8:35 UTC (permalink / raw) To: eric.dumazet; +Cc: w, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 13 Oct 2009 10:29:03 +0200 > Indeed, thanks for this detailed explanation, I missed the SYN-ACK timer > and retransmits. > > I played some years ago with TCP_DEFER_ACCEPT and got some unexpected results > on transmitted packets (server was consuming more bandwidth), and I know understand > it was very broken until today ! > > Thanks again Willy Ok, and with that I put Willy's patch back in. Thanks! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:23 ` Eric Dumazet 2009-10-13 7:34 ` Willy Tarreau @ 2009-10-13 7:35 ` David Miller 2009-10-13 8:12 ` Willy Tarreau 1 sibling, 1 reply; 41+ messages in thread From: David Miller @ 2009-10-13 7:35 UTC (permalink / raw) To: eric.dumazet; +Cc: w, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 13 Oct 2009 09:23:59 +0200 > I dont understand why you want to decrement rskq_defer_accept here. > We receive a pure ACK (wihout DATA). > We should receive exactly one such ACK. > (This is the third packet of the three way TCP handshake) > > How this can solve the problem you mention ? > (ie, not sending an RST when we timeout the SYN_RECV session) I'll hold off on this patch until Eric's comments are addressed, thanks. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: TCP_DEFER_ACCEPT is missing counter update 2009-10-13 7:35 ` David Miller @ 2009-10-13 8:12 ` Willy Tarreau 0 siblings, 0 replies; 41+ messages in thread From: Willy Tarreau @ 2009-10-13 8:12 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Tue, Oct 13, 2009 at 12:35:06AM -0700, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Tue, 13 Oct 2009 09:23:59 +0200 > > > I dont understand why you want to decrement rskq_defer_accept here. > > We receive a pure ACK (wihout DATA). > > We should receive exactly one such ACK. > > (This is the third packet of the three way TCP handshake) > > > > How this can solve the problem you mention ? > > (ie, not sending an RST when we timeout the SYN_RECV session) > > I'll hold off on this patch until Eric's comments are > addressed, thanks. I have replied, but let's wait for Eric's response. Willy ^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2009-10-20 2:23 UTC | newest] Thread overview: 41+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-10-13 5:07 TCP_DEFER_ACCEPT is missing counter update Willy Tarreau 2009-10-13 7:11 ` David Miller 2009-10-13 7:19 ` Willy Tarreau 2009-10-13 7:27 ` David Miller 2009-10-13 21:27 ` Julian Anastasov 2009-10-14 4:52 ` Willy Tarreau 2009-10-14 7:27 ` Julian Anastasov 2009-10-14 20:17 ` Willy Tarreau 2009-10-14 21:12 ` Olaf van der Spek 2009-10-14 22:43 ` David Miller 2009-10-15 6:08 ` Willy Tarreau 2009-10-15 8:47 ` Julian Anastasov 2009-10-15 12:41 ` Willy Tarreau 2009-10-15 22:44 ` Julian Anastasov 2009-10-16 3:51 ` Eric Dumazet 2009-10-16 5:00 ` Eric Dumazet 2009-10-16 5:29 ` Willy Tarreau 2009-10-16 6:05 ` Eric Dumazet 2009-10-16 6:18 ` Willy Tarreau 2009-10-16 7:08 ` Eric Dumazet 2009-10-16 7:19 ` Willy Tarreau 2009-10-16 5:03 ` Willy Tarreau 2009-10-16 8:49 ` Julian Anastasov 2009-10-16 10:40 ` Eric Dumazet 2009-10-16 19:27 ` Willy Tarreau 2009-10-17 11:48 ` Julian Anastasov 2009-10-17 12:07 ` Eric Dumazet 2009-10-17 14:20 ` Julian Anastasov 2009-10-19 20:01 ` Eric Dumazet 2009-10-19 20:11 ` Willy Tarreau 2009-10-19 20:17 ` Eric Dumazet 2009-10-20 2:23 ` David Miller 2009-10-15 7:59 ` Julian Anastasov 2009-10-16 10:08 ` Ilpo Järvinen 2009-10-13 7:23 ` Eric Dumazet 2009-10-13 7:34 ` Willy Tarreau 2009-10-13 8:08 ` Olaf van der Spek 2009-10-13 8:29 ` Eric Dumazet 2009-10-13 8:35 ` David Miller 2009-10-13 7:35 ` David Miller 2009-10-13 8:12 ` Willy Tarreau
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).