From: Eric Dumazet <eric.dumazet@gmail.com>
To: kapil dakhane <kdakhane@gmail.com>
Cc: netdev@vger.kernel.org, netfilter@vger.kernel.org,
"David S. Miller" <davem@davemloft.net>,
Evgeniy Polyakov <zbr@ioremap.net>
Subject: [PATCH] tcp: Fix a connect() race with timewait sockets
Date: Tue, 01 Dec 2009 16:00:39 +0100
Message-ID: <4B152F97.1090409@gmail.com>
In-Reply-To: <99d458640911301802i4bde20f4wa314668d543e3170@mail.gmail.com>
kapil dakhane wrote:
> Hello,
>
> I am trying to analyze the capacity of the Linux network stack on an x6270,
> which has 16 hyper-threads on two 8-core Intel(r) Xeon(r) CPUs. I see
> that at around 150000 simultaneous connections, after around 1.6 Gbps,
> a CPU gets stuck in an infinite loop in inet_csk_bind_conflict, then
> other CPUs get locked up in spin_lock. Before the lockup, CPU usage
> was around 25%. It appears to be a bug, unless I am hitting some kind
> of resource limit. It would be good if someone familiar with the network
> code could confirm this, or point me in the right direction.
>
> Important details are:
>
> I am using kernel version 2.6.31.4, recompiled with the TPROXY-related
> options: NF_CONNTRACK, NETFILTER_TPROXY, NETFILTER_XT_MATCH_SOCKET, and
> NETFILTER_XT_TARGET_TPROXY.
>
>
> I have enabled transparent capture and transparent forwarding using
> iptables and ip rules. I have 10 instances of a single-threaded user
> space bit-forwarding proxy (fast), each bound to a different
> hyper-thread (CPU). The remaining 6 CPUs are dedicated to interrupt
> processing, each handling interrupts from one of the six network cards.
> A TCP flow (4-tuple) is always handled by the same proxy process,
> interrupt thread, and network card. In this way, network traffic is
> segregated as much as possible to achieve a high degree of parallelism.
>
> The first /var/log/message entry shows CPU#7 stuck in inet_csk_bind_conflict:
>
> Nov 17 23:02:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#7 stuck
> for 61s! [fast:20701]
After some more auditing and coffee, I finally found a subtle bug in our
connect() code that triggers periodically but was never tracked down.
Here is a patch cooked on top of the current linux-2.6 git tree; it should
probably apply to 2.6.31.6 as well...
Thanks
[PATCH] tcp: Fix a connect() race with timewait sockets
When we find a timewait connection in __inet_hash_connect() and reuse
it for a new connection request, we have a race window between releasing
the bind list lock and reacquiring it in __inet_twsk_kill() to remove the
timewait socket from the list.
Another thread might find the timewait socket we already chose, leading to
list corruption and crashes.
The fix is to remove the timewait socket from the bind list before releasing
the lock.
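Illustration only (not part of the patch): a rough sketch of the interleaving,
where the second column stands for any thread walking the same bind bucket:

	/*
	 * CPU A: __inet_hash_connect()        CPU B: same bind bucket
	 * --------------------------------    --------------------------------
	 * spin_lock(&head->lock);
	 * tw = timewait chosen for reuse;
	 * ... hash the new socket ...
	 * spin_unlock(&head->lock);
	 *                                     spin_lock(&head->lock);
	 *                                     can still find the timewait
	 *                                     socket we already chose
	 * inet_twsk_deschedule(tw, ...);
	 *   __inet_twsk_kill(tw, ...);
	 *     spin_lock(&bhead->lock);
	 *     __hlist_del(&tw->tw_bind_node);  <- two threads touching the
	 *                                         same node -> corruption
	 */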
Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/inet_timewait_sock.h | 4 +++
net/ipv4/inet_hashtables.c | 4 +++
net/ipv4/inet_timewait_sock.c | 37 ++++++++++++++++++++---------
3 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f93ad90..e18e5df 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -206,6 +206,10 @@ extern void __inet_twsk_hashdance(struct inet_timewait_sock *tw,
struct sock *sk,
struct inet_hashinfo *hashinfo);
+extern void inet_twsk_unhash(struct inet_timewait_sock *tw,
+ struct inet_hashinfo *hashinfo,
+ bool mustlock);
+
extern void inet_twsk_schedule(struct inet_timewait_sock *tw,
struct inet_timewait_death_row *twdr,
const int timeo, const int timewait_len);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 625cc5f..76d81e4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -488,6 +488,10 @@ ok:
inet_sk(sk)->sport = htons(port);
hash(sk);
}
+
+ if (tw)
+ inet_twsk_unhash(tw, hinfo, false);
+
spin_unlock(&head->lock);
if (tw) {
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 13f0781..2d6d543 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -14,12 +14,34 @@
#include <net/inet_timewait_sock.h>
#include <net/ip.h>
+
+void inet_twsk_unhash(struct inet_timewait_sock *tw,
+ struct inet_hashinfo *hashinfo,
+ bool mustlock)
+{
+ struct inet_bind_hashbucket *bhead;
+ struct inet_bind_bucket *tb = tw->tw_tb;
+
+ if (!tb)
+ return;
+
+ /* Disassociate with bind bucket. */
+ bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw),
+ tw->tw_num,
+ hashinfo->bhash_size)];
+ if (mustlock)
+ spin_lock(&bhead->lock);
+ __hlist_del(&tw->tw_bind_node);
+ tw->tw_tb = NULL;
+ inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+ if (mustlock)
+ spin_unlock(&bhead->lock);
+}
+
/* Must be called with locally disabled BHs. */
static void __inet_twsk_kill(struct inet_timewait_sock *tw,
struct inet_hashinfo *hashinfo)
{
- struct inet_bind_hashbucket *bhead;
- struct inet_bind_bucket *tb;
/* Unlink from established hashes. */
spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
@@ -32,15 +54,8 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
sk_nulls_node_init(&tw->tw_node);
spin_unlock(lock);
- /* Disassociate with bind bucket. */
- bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), tw->tw_num,
- hashinfo->bhash_size)];
- spin_lock(&bhead->lock);
- tb = tw->tw_tb;
- __hlist_del(&tw->tw_bind_node);
- tw->tw_tb = NULL;
- inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
- spin_unlock(&bhead->lock);
+ inet_twsk_unhash(tw, hashinfo, true);
+
#ifdef SOCK_REFCNT_DEBUG
if (atomic_read(&tw->tw_refcnt) != 1) {
printk(KERN_DEBUG "%s timewait_sock %p refcnt=%d\n",
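The mustlock argument encodes the locking convention at the two call sites.
A rough sketch of the intent (simplified, not the literal kernel code):

	/* connect() path: head->lock is already held, and the reused timewait
	 * has tw->tw_num == port, so inet_bhashfn() maps it to the very same
	 * bucket; taking the non-recursive spinlock again would deadlock.
	 */
	spin_lock(&head->lock);
	...
	if (tw)
		inet_twsk_unhash(tw, hinfo, false);	/* caller holds the bucket lock */
	spin_unlock(&head->lock);

	/* timewait expiry path: __inet_twsk_kill() runs without the bind
	 * bucket lock, so the helper takes and releases it itself.
	 */
	inet_twsk_unhash(tw, hashinfo, true);		/* helper locks bhead->lock */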
Thread overview: 31+ messages
2009-12-01 2:02 soft lockup in inet_csk_get_port kapil dakhane
2009-12-01 6:10 ` Eric Dumazet
2009-12-01 15:00 ` Eric Dumazet [this message]
2009-12-02 8:59 ` [PATCH] tcp: Fix a connect() race with timewait sockets David Miller
2009-12-02 9:23 ` Eric Dumazet
2009-12-02 10:33 ` Eric Dumazet
2009-12-02 11:32 ` Evgeniy Polyakov
2009-12-02 19:18 ` kapil dakhane
2009-12-03 2:43 ` kapil dakhane
2009-12-03 10:49 ` [PATCH] tcp: fix a timewait refcnt race Eric Dumazet
2009-12-04 0:19 ` David Miller
2009-12-04 3:20 ` kapil dakhane
2009-12-04 6:29 ` Eric Dumazet
2009-12-04 6:39 ` David Miller
2009-12-02 15:08 ` [PATCH net-next-2.6] tcp: connect() race with timewait reuse Eric Dumazet
2009-12-02 22:15 ` Evgeniy Polyakov
2009-12-03 6:44 ` Eric Dumazet
2009-12-03 8:31 ` Eric Dumazet
2009-12-03 23:22 ` Evgeniy Polyakov
2009-12-04 0:18 ` David Miller
2009-12-02 16:05 ` [PATCH] tcp: Fix a connect() race with timewait sockets Ashwani Wason
2009-12-03 6:38 ` David Miller
2009-12-04 13:45 ` [PATCH 0/2] tcp: Fix connect() races " Eric Dumazet
2009-12-04 13:46 ` [PATCH 1/2] tcp: Fix a connect() race " Eric Dumazet
2009-12-05 21:21 ` Evgeniy Polyakov
2009-12-07 9:59 ` [PATCH] tcp: documents timewait refcnt tricks Eric Dumazet
2009-12-07 16:06 ` Randy Dunlap
2009-12-09 4:20 ` David Miller
2009-12-09 4:18 ` [PATCH 1/2] tcp: Fix a connect() race with timewait sockets David Miller
2009-12-04 13:47 ` [PATCH 2/2] " Eric Dumazet
2009-12-09 4:19 ` David Miller