* Problem with implementation of TCP_DEFER_ACCEPT?
@ 2007-08-24 0:08 TJ
2007-08-24 4:40 ` John Heffner
2007-08-24 7:15 ` Lennert Buytenhek
0 siblings, 2 replies; 7+ messages in thread
From: TJ @ 2007-08-24 0:08 UTC (permalink / raw)
To: netdev
I'd welcome the views of those familiar with TCP_DEFER_ACCEPT on a
recent issue I've worked on where connections between a Juniper DX (aka
redline) load-balancer and Apache 2.2 cluster caused random connection
failures.
Today, after 2 weeks debugging the issue, we confirmed the problem was
related to TCP_DEFER_ACCEPT. Part of the issue is caused by Juniper's
implementation of persistent connections, but there remains a question
as to whether the Linux kernel is correctly handling handshakes when a
listening socket has TCP_DEFER_ACCEPT enabled.
Upon reflection, and after having worked with the RFCs these past few
weeks, I find myself doubting the kernel's TCP_DEFER_ACCEPT
implementation.
Also, I'm unable to locate an RFC or other specification for
TCP_DEFER_ACCEPT aka BSD's SO_ACCEPTFILTER - can you point me to one?
The complete background and observations of the original problem and the
workaround are available here:
https://bugs.launchpad.net/ubuntu/+bug/134274
My specific concerns are explained in the following comments, for which
I'd appreciate your views.
----------------------------------------------------
An RFC 793 standard TCP handshake requires three packets:

client SYN >                    server LISTENING
client < SYN ACK                server SYN_RECEIVED
client ACK >                    server ESTABLISHED

client PSH ACK + data >         server

TCP_DEFER_ACCEPT is designed to increase performance by reducing the
number of TCP packets exchanged before the client can pass data:

client SYN >                    server LISTENING
client < SYN ACK                server SYN_RECEIVED

client PSH ACK + data >         server ESTABLISHED
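
For reference, a minimal sketch of how a server application might turn the option on for a listening socket, assuming Linux and glibc's <netinet/tcp.h>. The helper name enable_defer_accept is hypothetical, not something Apache or the kernel provides:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to hold back accept() wake-ups
 * on listen_fd until data has arrived on the new connection.
 * timeout_secs is the caller's upper bound on how long to wait.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
int enable_defer_accept(int listen_fd, int timeout_secs)
{
	return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
			  &timeout_secs, sizeof(timeout_secs));
}
```

The option is set on the listening socket only, so a connecting client cannot tell whether it is in effect.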
At present, with TCP_DEFER_ACCEPT enabled, the kernel treats the RFC 793
handshake as invalid: it drops the ACK from the client without replying, so
the client doesn't know the server has in fact set its internal ACKed flag.
If the client doesn't send a packet containing data before the SYN ACK
time-outs finally expire, the connection will be dropped.
For a client obeying RFC 793 what we see is:

client SYN >                    server LISTENING
client < SYN ACK                server SYN_RECEIVED (time-out 3s)
                                server: inet_rsk(req)->acked = 1
client ACK >                    server (discarded)

client < SYN ACK (DUP)          server (time-out 6s)
client ACK (DUP) >              server (discarded)

client < SYN ACK (DUP)          server (time-out 12s)
client ACK (DUP) >              server (discarded)

client < SYN ACK (DUP)          server (time-out 24s)
client ACK (DUP) >              server (discarded)

client < SYN ACK (DUP)          server (time-out 48s)
client ACK (DUP) >              server (discarded)

client < SYN ACK (DUP)          server (time-out 96s)
client ACK (DUP) >              server (discarded)

                                server: half-open socket closed.
With each client ACK being dropped by the kernel's TCP_DEFER_ACCEPT
mechanism, the handshake eventually fails once the 'SYN ACK' retries and
time-outs expire.
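
To put a number on how long such a connection lingers, here is a small sketch that sums the doubling time-outs from the trace above. It assumes the timers are exactly 3s doubling six times, as shown. The function name is made up for illustration and real kernel timers may differ:

```c
/* Total seconds spent waiting if every SYN ACK timer expires,
 * starting at initial_secs and doubling after each expiry. */
unsigned int synack_total_wait(unsigned int initial_secs, unsigned int attempts)
{
	unsigned int total = 0;
	unsigned int timeout = initial_secs;
	unsigned int i;

	for (i = 0; i < attempts; i++) {
		total += timeout;	/* wait out this timer */
		timeout *= 2;		/* exponential backoff */
	}
	return total;
}
```

With the values in the trace, synack_total_wait(3, 6) comes to 3 + 6 + 12 + 24 + 48 + 96 = 189 seconds, a little over three minutes before the half-open socket is closed.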
There is a case for arguing the kernel should be operating in an
enhanced handshaking mode when TCP_DEFER_ACCEPT is enabled, not an
alternative mode, and therefore should accept *both* RFC 793 and
TCP_DEFER_ACCEPT. I've been unable to find a specification or RFC for
implementing TCP_DEFER_ACCEPT aka BSD's SO_ACCEPTFILTER to give me firm
guidance.
It seems incorrect to penalise a client that is trying to complete the
handshake according to the RFC 793 specification, especially as the
client has no way of knowing ahead of time whether or not the server is
operating deferred accept.
-------------------------------------------
net/ipv4/tcp_minisocks.c::tcp_check_req() implements the
TCP_DEFER_ACCEPT check:
	/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
		inet_rsk(req)->acked = 1;
		return NULL;
	}
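
The sequence-number test in that check can be modelled in user space. This is only a sketch: it ignores the SYN and FIN flags that the kernel also folds into end_seq, and is_bare_ack is a hypothetical name, not a kernel function:

```c
#include <stdbool.h>
#include <stdint.h>

/* User-space model of the "bare ACK" test (not kernel code).
 * rcv_isn:     initial sequence number taken from the client's SYN.
 * seq:         sequence number of the incoming segment.
 * payload_len: bytes of data carried by that segment.
 * The SYN consumes one sequence number, so a segment with no data
 * ends exactly at rcv_isn + 1; any data pushes end_seq past it. */
bool is_bare_ack(uint32_t rcv_isn, uint32_t seq, uint32_t payload_len)
{
	uint32_t end_seq = seq + payload_len;	/* wraps modulo 2^32, like TCP */

	return end_seq == (uint32_t)(rcv_isn + 1);
}
```

A client ACK with seq = rcv_isn + 1 and no payload matches and is dropped, while the same segment carrying even one byte of request data does not match and goes on to complete the handshake.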
--------------------------------------------
Thanks
TJ.
Ubuntu ACPI Kernel Team
* Re: Problem with implementation of TCP_DEFER_ACCEPT?
2007-08-24 0:08 Problem with implementation of TCP_DEFER_ACCEPT? TJ
@ 2007-08-24 4:40 ` John Heffner
2007-08-24 7:15 ` Lennert Buytenhek
1 sibling, 0 replies; 7+ messages in thread
From: John Heffner @ 2007-08-24 4:40 UTC (permalink / raw)
To: TJ; +Cc: netdev
[-- Attachment #1: Type: text/plain, Size: 2665 bytes --]
TJ wrote:
> client SYN > server LISTENING
> client < SYN ACK server SYN_RECEIVED (time-out 3s)
> server: inet_rsk(req)->acked = 1
>
> client ACK > server (discarded)
>
> client < SYN ACK (DUP) server (time-out 6s)
> client ACK (DUP) > server (discarded)
>
> client < SYN ACK (DUP) server (time-out 12s)
> client ACK (DUP) > server (discarded)
>
> client < SYN ACK (DUP) server (time-out 24s)
> client ACK (DUP) > server (discarded)
>
> client < SYN ACK (DUP) server (time-out 48s)
> client ACK (DUP) > server (discarded)
>
> client < SYN ACK (DUP) server (time-out 96s)
> client ACK (DUP) > server (discarded)
>
> server: half-open socket closed.
>
> With each client ACK being dropped by the kernel's TCP_DEFER_ACCEPT
> mechanism eventually the handshake fails after the 'SYN ACK' retries and
> time-outs expire.
>
> There is a case for arguing the kernel should be operating in an
> enhanced handshaking mode when TCP_DEFER_ACCEPT is enabled, not an
> alternative mode, and therefore should accept *both* RFC 793 and
> TCP_DEFER_ACCEPT. I've been unable to find a specification or RFC for
> implementing TCP_DEFER_ACCEPT aka BSD's SO_ACCEPTFILTER to give me firm
> guidance.
>
> It seems incorrect to penalise a client that is trying to complete the
> handshake according to the RFC 793 specification, especially as the
> client has no way of knowing ahead of time whether or not the server is
> operating deferred accept.
Interesting problem. TCP_DEFER_ACCEPT does not conform to any standard
I'm aware of. (In fact, I'd say it's in violation of RFC 793.) The
implementation does exactly what it claims, though -- it "allows a
listener to be awakened only when data arrives on the socket."
I think a more useful spec might have been "allows a listener to be
awakened only when data arrives on the socket, unless the specified
timeout has expired." Once the timeout expires, it should process the
embryonic connection as if TCP_DEFER_ACCEPT is not set. Unfortunately,
I don't think we can retroactively change this definition, as an
application might depend on data being available and do a non-blocking
read() after the accept(), expecting data to be there. Is this worth
trying to fix?
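
To make that compatibility concern concrete, here is a sketch of the kind of non-blocking first read an application might do straight after accept(). The helper name and its -2 return convention are assumptions for illustration only:

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: try to read the first request bytes without
 * blocking. Returns the number of bytes read (> 0), 0 if the peer
 * closed, -1 on a real error, and -2 if no data has arrived yet.
 * An application that relies on TCP_DEFER_ACCEPT guaranteeing data
 * may never have handled the -2 case. */
ssize_t try_read_request(int conn_fd, void *buf, size_t len)
{
	ssize_t n = recv(conn_fd, buf, len, MSG_DONTWAIT);

	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return -2;	/* connection accepted, but nothing to read yet */
	return n;
}
```

If accept() could return after the defer timeout without data, callers written like this would need to fall back to poll() or a blocking read instead of treating the empty read as an error.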
Also, a listen socket with a backlog and TCP_DEFER_ACCEPT will have reqs
sit in the backlog for the full defer timeout, even if they've received
data, which is not really the right thing to do.
I've attached a patch implementing this suggestion (compile tested only
-- I think I got the logic right but it's late ;). Kind of ugly, and
uses up a bit in struct inet_request_sock. Maybe can be done better...
-John
[-- Attachment #2: tcp_defer_accept.patch --]
[-- Type: text/plain, Size: 2737 bytes --]
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 62daf21..f9f64a5 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -72,7 +72,8 @@ struct inet_request_sock {
 				sack_ok : 1,
 				wscale_ok : 1,
 				ecn_ok : 1,
-				acked : 1;
+				acked : 1,
+				deferred : 1;
 	struct ip_options *opt;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 185c7ec..cad2490 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,6 +978,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->snd_wscale = rx_opt->snd_wscale;
 	ireq->wscale_ok = rx_opt->wscale_ok;
 	ireq->acked = 0;
+	ireq->deferred = 0;
 	ireq->ecn_ok = 0;
 	ireq->rmt_port = tcp_hdr(skb)->source;
 }
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fbe7714..1207fb8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -444,9 +444,6 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 		}
 	}
-	if (queue->rskq_defer_accept)
-		max_retries = queue->rskq_defer_accept;
-
 	budget = 2 * (lopt->nr_table_entries / (timeout / interval));
 	i = lopt->clock_hand;
@@ -455,7 +452,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 		while ((req = *reqp) != NULL) {
 			if (time_after_eq(now, req->expires)) {
 				if ((req->retrans < thresh ||
-				     (inet_rsk(req)->acked && req->retrans < max_retries))
+				     (inet_rsk(req)->acked && req->retrans < max_retries) ||
+				     (inet_rsk(req)->deferred && req->retrans <
+				      queue->rskq_defer_accept + max_retries))
 				    && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
 					unsigned long timeo;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index a12b08f..c4867f3 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -637,8 +637,10 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
 	/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
 	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
-	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-		inet_rsk(req)->acked = 1;
+	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1 &&
+	    !inet_rsk(req)->acked && req->retrans <
+	    inet_csk(sk)->icsk_accept_queue.rskq_defer_accept) {
+		inet_rsk(req)->deferred = 1;
 		return NULL;
 	}
@@ -686,6 +688,9 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
 listen_overflow:
 	if (!sysctl_tcp_abort_on_overflow) {
 		inet_rsk(req)->acked = 1;
+		/* If deferred, ACK must contain data. Shortcut defer. */
+		if (inet_rsk(req)->deferred)
+			req->retrans = inet_csk(sk)->icsk_accept_queue.rskq_defer_accept;
 		return NULL;
 	}
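
As a reading aid, the retransmit decision this patch introduces in inet_csk_reqsk_queue_prune() can be modelled as a plain predicate. This is a user-space sketch only: struct req_model and should_retransmit are hypothetical stand-ins, and the rtx_syn_ack() call that the kernel also requires to succeed is left out:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields of the request sock that the
 * patched condition looks at. */
struct req_model {
	int retrans;	/* SYN ACK retransmissions so far */
	bool acked;	/* set on the listen_overflow path */
	bool deferred;	/* bare ACK seen while TCP_DEFER_ACCEPT is set */
};

/* True if the patched condition would keep retransmitting the SYN ACK
 * for this request instead of dropping it. */
bool should_retransmit(const struct req_model *req, int thresh,
		       int max_retries, int defer_accept)
{
	return req->retrans < thresh ||
	       (req->acked && req->retrans < max_retries) ||
	       (req->deferred && req->retrans < defer_accept + max_retries);
}
```

Under this reading, a deferred request is kept for defer_accept extra retransmissions on top of max_retries, and a bare ACK arriving once retrans has reached defer_accept is no longer dropped.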
* Re: Problem with implementation of TCP_DEFER_ACCEPT?
2007-08-24 0:08 Problem with implementation of TCP_DEFER_ACCEPT? TJ
2007-08-24 4:40 ` John Heffner
@ 2007-08-24 7:15 ` Lennert Buytenhek
2007-08-24 8:40 ` Alexey Kuznetsov
1 sibling, 1 reply; 7+ messages in thread
From: Lennert Buytenhek @ 2007-08-24 7:15 UTC (permalink / raw)
To: TJ; +Cc: netdev, kuznet
On Fri, Aug 24, 2007 at 01:08:25AM +0100, TJ wrote:
> An RFC 793 standard TCP handshake requires three packets:
>
> client SYN > server LISTENING
> client < SYN ACK server SYN_RECEIVED
> client ACK > server ESTABLISHED
>
> client PSH ACK + data > server
>
> TCP_DEFER_ACCEPT is designed to increase performance by reducing the
> number of TCP packets exchanged before the client can pass data:
>
> client SYN > server LISTENING
> client < SYN ACK server SYN_RECEIVED
>
> client PSH ACK + data > server ESTABLISHED
>
> At present with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake
> as invalid; dropping the ACK from the client without replying so the
> client doesn't know the server has in fact set its internal ACKed flag.
>
> If the client doesn't send a packet containing data before the SYN_ACK
> time-outs finally expire the connection will be dropped.
I brought this up a long, long time ago, and I seem to remember
Alexey Kuznetsov explained to me at the time that this was intentional.
I can't find the thread in the mailing list archives anymore, though
-- and my memory might be failing me.
* Re: Problem with implementation of TCP_DEFER_ACCEPT?
2007-08-24 7:15 ` Lennert Buytenhek
@ 2007-08-24 8:40 ` Alexey Kuznetsov
2007-08-24 9:31 ` TJ
0 siblings, 1 reply; 7+ messages in thread
From: Alexey Kuznetsov @ 2007-08-24 8:40 UTC (permalink / raw)
To: Lennert Buytenhek; +Cc: TJ, netdev
Hello!
> > At present with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake
> > as invalid; dropping the ACK from the client without replying so the
> > client doesn't know the server has in fact set its internal ACKed flag.
> >
> > If the client doesn't send a packet containing data before the SYN_ACK
> > time-outs finally expire the connection will be dropped.
>
> I brought this up a long, long time ago, and I seem to remember
> Alexey Kuznetsov explained to me at the time that this was intentional.
Obviously, I said something like "it is exactly what TCP_DEFER_ACCEPT does".
There is no protocol violation here: the ACK from the client is treated as
lost, which is quite normal and happens all the time. The handshake is not
complete, so the server remains in SYN-RECV state and continues to
retransmit the SYN-ACK. If the client tried to cheat and is not going to
send its request, the connection will time out.
Alexey
* Re: Problem with implementation of TCP_DEFER_ACCEPT?
2007-08-24 8:40 ` Alexey Kuznetsov
@ 2007-08-24 9:31 ` TJ
2007-08-24 16:09 ` John Heffner
2007-09-02 7:30 ` Andi Kleen
0 siblings, 2 replies; 7+ messages in thread
From: TJ @ 2007-08-24 9:31 UTC (permalink / raw)
To: netdev
On Fri, 2007-08-24 at 12:40 +0400, Alexey Kuznetsov wrote:
> There is no protocol violation here, ACK from client is considered as lost,
> it is quite normal and happens all the time. Handshake is not complete,
> server remains in SYN-RECV state and continues to retransmit SYN-ACK.
> If client tried to cheat and is not going to send its request,
> connection will time out.
Thanks for the responses.
Do we have any authoritative references on this? Who implemented it
originally?
Right now Juniper are claiming the issue that brought this to the
surface (the bug linked to in my original post) is a problem with the
implementation of TCP_DEFER_ACCEPT.
My position so far is that the Juniper DX OS is not following the HTTP
standard because it doesn't send a request with the connection, and as I
read the end of section 1.4 of RFC2616, an HTTP connection should be
accompanied by a request.
Can anyone confirm my interpretation or provide references to firm it
up, or refute it?
There is also a very real practical problem here:
Since version 2.1.5, Apache enables TCP_DEFER_ACCEPT *by default*, without
any mention of it in the configuration file.
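
One way an administrator might switch the behaviour off is Apache's AcceptFilter directive. This is a sketch that assumes Apache 2.2 on Linux; the thread itself does not say whether this is the workaround recorded in the bug report:

```apache
# Assumed workaround: disable accept filters (TCP_DEFER_ACCEPT on Linux)
# for both protocols, so accept() returns as soon as the handshake completes.
AcceptFilter http none
AcceptFilter https none
```
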
As time goes on the number of Apache v2.1.5+ deployments is only going
to rise, and I'd hate for anyone else to go through the 5+ weeks of pain
the system admins at the e-commerce operation I was helping went
through, not to mention the last 2 weeks feeling like I was chasing
ghosts - it's an absolute pain to track down and identify!
Therefore, anyone deploying Apache web servers in a web-farm behind the
Juniper DX load-balancers and using TCP multiplexing (for which they pay
a hefty licence fee!) is liable to suffer the random drop effects
described in my bug report.
Because several other HTTP load-balancers deploy similar methods of
holding open connections to the servers and pipe-lining requests, this
could affect more than just Juniper.
Any other suggestions/reactions on the Linux kernel side? I intend to
post a comment to the apache-dev mailing list once I've gathered the
strands together.
Thanks again.
TJ.
Ubuntu ACPI Kernel Team.
* Re: Problem with implementation of TCP_DEFER_ACCEPT?
2007-08-24 9:31 ` TJ
@ 2007-08-24 16:09 ` John Heffner
2007-09-02 7:30 ` Andi Kleen
1 sibling, 0 replies; 7+ messages in thread
From: John Heffner @ 2007-08-24 16:09 UTC (permalink / raw)
To: TJ; +Cc: netdev
TJ wrote:
> Right now Juniper are claiming the issue that brought this to the
> surface (the bug linked to in my original post) is a problem with the
> implementation of TCP_DEFER_ACCEPT.
>
> My position so far is that the Juniper DX OS is not following the HTTP
> standard because it doesn't send a request with the connection, and as I
> read the end of section 1.4 of RFC2616, an HTTP connection should be
> accompanied by a request.
>
> Can anyone confirm my interpretation or provide references to firm it
> up, or refute it?
You can think of TCP_DEFER_ACCEPT as an implicit application close()
after a certain timeout, when not receiving a request. All HTTP servers
do this anyway (though I think technically they're supposed to send a
408 Request Timeout error, and it seems many do not). It's a very valid
question for Juniper as to why their box is failing to fill requests
when its back-end connection has gone away, instead of re-establishing
the connection and filling the request.
-John
* Re: Problem with implementation of TCP_DEFER_ACCEPT?
2007-08-24 9:31 ` TJ
2007-08-24 16:09 ` John Heffner
@ 2007-09-02 7:30 ` Andi Kleen
1 sibling, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2007-09-02 7:30 UTC (permalink / raw)
To: TJ; +Cc: netdev
TJ <linux@tjworld.net> writes:
>
> Therefore, anyone deploying apache web servers in a web-farm behind the
> Juniper DX load-balancers and using TCP multiplexing (for which they pay
> a hefty licence fee!)
If they ask for that much money they can surely fix it to work
properly too?
-Andi