From: Eric Dumazet
Subject: [PATCH] tcp: Fix a connect() race with timewait sockets
Date: Tue, 01 Dec 2009 16:00:39 +0100
Message-ID: <4B152F97.1090409@gmail.com>
In-Reply-To: <99d458640911301802i4bde20f4wa314668d543e3170@mail.gmail.com>
To: kapil dakhane
Cc: netdev@vger.kernel.org, netfilter@vger.kernel.org,
    "David S. Miller", Evgeniy Polyakov

kapil dakhane wrote:
> Hello,
>
> I am trying to analyze the capacity of the Linux network stack on an
> x6270, which has 16 hyperthreads on two 8-core Intel(R) Xeon(R) CPUs.
> I see that at around 150000 simultaneous connections, after around
> 1.6 Gbit/s, a CPU gets stuck in an infinite loop in
> inet_csk_bind_conflict, and the other CPUs then lock up in
> spin_lock. Before the lockup, CPU usage was around 25%. It appears
> to be a bug, unless I am hitting some kind of resource limit. It
> would be good if someone familiar with the network code could
> confirm this, or point me in the right direction.
>
> Important details:
>
> I am using kernel version 2.6.31.4, recompiled with the TPROXY
> related options: NF_CONNTRACK, NETFILTER_TPROXY,
> NETFILTER_XT_MATCH_SOCKET, NETFILTER_XT_TARGET_TPROXY.
>
> I have enabled transparent capture and transparent forwarding using
> iptables and ip rules. I have 10 instances of a single-threaded
> user-space bit-forwarding proxy (fast), each bound to a different
> hyperthread (CPU). The remaining 6 CPUs are dedicated to interrupt
> processing, each handling interrupts from one of six network cards.
> A TCP flow for a given 4-tuple is always handled by the same proxy
> process, interrupt thread, and network card. In this way, network
> traffic is segregated as much as possible to achieve a high degree
> of parallelism.
>
> The first /var/log/messages entry shows CPU#7 is stuck in
> inet_csk_bind_conflict:
>
> Nov 17 23:02:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#7 stuck
> for 61s! [fast:20701]

After some more auditing and coffee, I finally found one subtle bug in
our connect() code that triggers periodically but never got tracked
down.

Here is a patch cooked on top of the current linux-2.6 git tree; it
should probably apply to 2.6.31.6 as well...

Thanks

[PATCH] tcp: Fix a connect() race with timewait sockets

When we find a timewait connection in __inet_hash_connect() and reuse
it for a new connection request, we have a race window: we release the
bind list lock, then reacquire it in __inet_twsk_kill() to remove the
timewait socket from the list.

Another thread might find the timewait socket we already chose during
that window, leading to list corruption and crashes.

The fix is to remove the timewait socket from the bind list before
releasing the lock.
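To make the window concrete, here is a rough sketch of the buggy
interleaving; this is illustrative pseudo-C following the functions
named above, not the literal kernel source (port selection and error
handling are elided):

        /* CPU A: __inet_hash_connect() picked a port and found tw,
         * a timewait socket it wants to recycle for this connect().
         */
        spin_lock(&head->lock);              /* bind hash bucket lock */
        /* ... port selection, inet_bind_hash(sk, tb, port) ... */
        spin_unlock(&head->lock);            /* <-- race window opens */

        /* tw is still linked in the bind bucket here, so CPU B,
         * scanning the same bucket under head->lock, can pick the
         * very same tw for another connection.
         */
        inet_twsk_deschedule(tw, death_row);
        /* -> __inet_twsk_kill() reacquires the bind bucket lock and
         *    only then unlinks tw: too late, the bucket list may
         *    already be corrupted.
         */
        inet_twsk_put(tw);

The patch below closes the window by unlinking tw from its bind bucket,
via a new inet_twsk_unhash() helper, while head->lock is still held.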
Reported-by: kapil dakhane
Signed-off-by: Eric Dumazet
---
 include/net/inet_timewait_sock.h |    4 +++
 net/ipv4/inet_hashtables.c       |    4 +++
 net/ipv4/inet_timewait_sock.c    |   37 ++++++++++++++++++++---------
 3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f93ad90..e18e5df 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -206,6 +206,10 @@ extern void __inet_twsk_hashdance(struct inet_timewait_sock *tw,
                                   struct sock *sk,
                                   struct inet_hashinfo *hashinfo);
 
+extern void inet_twsk_unhash(struct inet_timewait_sock *tw,
+                             struct inet_hashinfo *hashinfo,
+                             bool mustlock);
+
 extern void inet_twsk_schedule(struct inet_timewait_sock *tw,
                                struct inet_timewait_death_row *twdr,
                                const int timeo, const int timewait_len);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 625cc5f..76d81e4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -488,6 +488,10 @@ ok:
                         inet_sk(sk)->sport = htons(port);
                         hash(sk);
                 }
+
+                if (tw)
+                        inet_twsk_unhash(tw, hinfo, false);
+
                 spin_unlock(&head->lock);
 
                 if (tw) {
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 13f0781..2d6d543 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -14,12 +14,34 @@
 #include <net/inet_timewait_sock.h>
 #include <net/ip.h>
 
+
+void inet_twsk_unhash(struct inet_timewait_sock *tw,
+                      struct inet_hashinfo *hashinfo,
+                      bool mustlock)
+{
+        struct inet_bind_hashbucket *bhead;
+        struct inet_bind_bucket *tb = tw->tw_tb;
+
+        if (!tb)
+                return;
+
+        /* Disassociate with bind bucket. */
+        bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw),
+                                              tw->tw_num,
+                                              hashinfo->bhash_size)];
+        if (mustlock)
+                spin_lock(&bhead->lock);
+        __hlist_del(&tw->tw_bind_node);
+        tw->tw_tb = NULL;
+        inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+        if (mustlock)
+                spin_unlock(&bhead->lock);
+}
+
 /* Must be called with locally disabled BHs. */
 static void __inet_twsk_kill(struct inet_timewait_sock *tw,
                              struct inet_hashinfo *hashinfo)
 {
-        struct inet_bind_hashbucket *bhead;
-        struct inet_bind_bucket *tb;
         /* Unlink from established hashes. */
         spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
 
@@ -32,15 +54,8 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
         sk_nulls_node_init(&tw->tw_node);
         spin_unlock(lock);
 
-        /* Disassociate with bind bucket. */
-        bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), tw->tw_num,
-                        hashinfo->bhash_size)];
-        spin_lock(&bhead->lock);
-        tb = tw->tw_tb;
-        __hlist_del(&tw->tw_bind_node);
-        tw->tw_tb = NULL;
-        inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
-        spin_unlock(&bhead->lock);
+        inet_twsk_unhash(tw, hashinfo, true);
+
 #ifdef SOCK_REFCNT_DEBUG
         if (atomic_read(&tw->tw_refcnt) != 1) {
                 printk(KERN_DEBUG "%s timewait_sock %p refcnt=%d\n",
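A note on the mustlock parameter, as I read the two call sites above:
in __inet_twsk_kill() the bind bucket lock is not held (only the ehash
chain lock was, and it has already been dropped), so the helper must
take bhead->lock itself. In __inet_hash_connect() the caller still
holds head->lock, the bind bucket lock for the chosen port, which is
the very bucket tw->tw_num hashes to; taking it again would
self-deadlock on this non-recursive spinlock, hence mustlock = false
there. In short:

        inet_twsk_unhash(tw, hashinfo, true);  /* __inet_twsk_kill():
                                                * helper takes and
                                                * releases bhead->lock */
        inet_twsk_unhash(tw, hinfo, false);    /* __inet_hash_connect():
                                                * caller already holds
                                                * the bucket lock */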