soft lockup in inet_csk_get

netfilter.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* soft lockup in inet_csk_get_port
@ 2009-12-01  2:02 kapil dakhane
  2009-12-01  6:10 ` Eric Dumazet
  2009-12-01 15:00 ` [PATCH] tcp: Fix a connect() race with timewait sockets Eric Dumazet
  0 siblings, 2 replies; 31+ messages in thread
From: kapil dakhane @ 2009-12-01  2:02 UTC (permalink / raw)
  To: netdev; +Cc: netfilter

Hello,

I am trying to analyze the capacity of linux network stack on x6270
which has 16 Hyper threads on two 8-core Intel(r) Xeon(r) CPU. I see
that at around 150000 simultaneous connections, after around 1.6 gbps,
a cpu get stuck in an infinite loop in inet_csk_bind_conflict, then
other cpus get locked up doing spin_lock. Before the lockup cpu usage
was around 25%. It appears to be a bug, unless I am hitting some kind
of resource limit. It would be good if someone familiar with network
code would confirm this, or point me in the right direction.

Important details are:

I am using kernel version 2.6.31.4 recompiled with TPROXY related
options: NF_CONNTRACK, NETFILTER_TPROXY, NETFILTER_XT_MATCH_SOCKET,
NETFILTER_XT_TARGET_TPROXY.


I have enabled transparent capture and transparent forward using
iptables and ip rules.  I have 10 instances of a single threaded user
space bits-forwarding-proxy (fast), each bound to different
hyper-threads (CPUs). Rest 6 CPUs are dedicated to interrupt
processing, each handling interrupts from six different network cards.
TCP flow from a 4-tuple always get handled by the same proxy process,
interrupt thread, and network card. In this way, network traffic is
segregated as much as possible to achieve high degree of parallelism.

First /var/log/message entry shows CPU#7 is stuck in inet_csk_bind_conflict

Nov 17 23:02:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#7 stuck
for 61s! [fast:20701]
Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: xt_TPROXY
xt_MARK xt_socket nf_defrag_ipv4 nf_tproxy_core iptable_mangle ipv6
autofs4 hidp rfcomm l2cap bluetooth rfkill sunrpc 8021q xt_state
nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables
cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video
output sbs sbshc battery acpi_memhotplug ac parport_pc lp parport
joydev sg rtc_cmos serio_raw rtc_core button igb rtc_lib niu i2c_i801
i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log
dm_mod usb_storage ahci libata shpchp aacraid sd_mod scsi_mod ext3 jbd
uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Nov 17 23:02:04 cap-x6270-01 kernel: CPU 7:
Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: ...
Nov 17 23:02:04 cap-x6270-01 kernel: Pid: 20701, comm: fast Not
tainted 2.6.31.4 #1 SUN BLADE X6270 SERVER MODULE
Nov 17 23:02:04 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c53>]
[<ffffffff81285c53>] inet_csk_bind_conflict+0x99/0xa6
Nov 17 23:02:04 cap-x6270-01 kernel: RSP: 0018:ffff88095ac7fe30
EFLAGS: 00000202
Nov 17 23:02:04 cap-x6270-01 kernel: RAX: 000000003c0ba8c0 RBX:
ffff88097b14bae0 RCX: ffff8804b3d57820
Nov 17 23:02:04 cap-x6270-01 kernel: RDX: ffff8804b3d57800 RSI:
0000000000000000 RDI: ffff880940421840
Nov 17 23:02:04 cap-x6270-01 kernel: RBP: ffffffff8100c42e R08:
000000002d01960c R09: ffff880909832ea0
Nov 17 23:02:04 cap-x6270-01 kernel: R10: 00007fffc5d92700 R11:
ffff880940421840 R12: 0000000000000001
Nov 17 23:02:04 cap-x6270-01 kernel: R13: ffff88097c5e9400 R14:
ffffffff810b7a92 R15: 0000000000000001
Nov 17 23:02:04 cap-x6270-01 kernel: FS:  00007f20a08416e0(0000)
GS:ffffc90000e00000(0000) knlGS:0000000000000000
Nov 17 23:02:04 cap-x6270-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Nov 17 23:02:04 cap-x6270-01 kernel: CR2: 00000000081df408 CR3:
000000049ac01000 CR4: 00000000000006e0
Nov 17 23:02:04 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Nov 17 23:02:04 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Nov 17 23:02:04 cap-x6270-01 kernel: Call Trace:
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff812859e4>] ?
inet_csk_get_port+0x1b2/0x29e
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff812a1596>] ?
inet_bind+0x10c/0x1b7
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8124ef53>] ? sys_bind+0x6e/0x9e
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8106de14>] ?
audit_syscall_entry+0x1a4/0x1cf
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8100b92b>] ?
system_call_fastpath+0x16/0x1b

While other CPUs get stuck doing _spin_lock:


Nov 17 23:02:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#15 stuck
for 61s! [fast:20702]
Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: ...
Nov 17 23:02:04 cap-x6270-01 kernel: CPU 15:
Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: ...
Nov 17 23:02:04 cap-x6270-01 kernel: Pid: 20702, comm: fast Not
tainted 2.6.31.4 #1 SUN BLADE X6270 SERVER MODULE
Nov 17 23:02:04 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dedff>]
[<ffffffff812dedff>] _spin_lock+0x10/0x15
Nov 17 23:02:04 cap-x6270-01 kernel: RSP: 0018:ffff88090ecabe30
EFLAGS: 00000297
Nov 17 23:02:04 cap-x6270-01 kernel: RAX: 0000000000000504 RBX:
00000000ffffffea RCX: 0000000000000000
Nov 17 23:02:04 cap-x6270-01 kernel: RDX: 00000000000000a2 RSI:
0000000000000fa2 RDI: ffffc90019f82a20
Nov 17 23:02:04 cap-x6270-01 kernel: RBP: ffffffff8100c42e R08:
ffff880905d89840 R09: 0000000000000000
Nov 17 23:02:04 cap-x6270-01 kernel: R10: 00007fffd8db4574 R11:
ffff880905d89840 R12: ffffffff812509d4
Nov 17 23:02:04 cap-x6270-01 kernel: R13: ffff88094c0bd280 R14:
0000000000000246 R15: 0000000000000001
Nov 17 23:02:04 cap-x6270-01 kernel: FS:  00007f0deff696e0(0000)
GS:ffffc90001e00000(0000) knlGS:0000000000000000
Nov 17 23:02:04 cap-x6270-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Nov 17 23:02:04 cap-x6270-01 kernel: CR2: 0000000007ec2258 CR3:
00000009494b1000 CR4: 00000000000006e0
Nov 17 23:02:04 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Nov 17 23:02:04 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Nov 17 23:02:04 cap-x6270-01 kernel: Call Trace:
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff81285989>] ?
inet_csk_get_port+0x157/0x29e
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff812a1596>] ?
inet_bind+0x10c/0x1b7
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8124ef53>] ? sys_bind+0x6e/0x9e
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8106de14>] ?
audit_syscall_entry+0x1a4/0x1cf
Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8100b92b>] ?
system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: soft lockup in inet_csk_get_port
  2009-12-01  2:02 soft lockup in inet_csk_get_port kapil dakhane
@ 2009-12-01  6:10 ` Eric Dumazet
  2009-12-01 15:00 ` [PATCH] tcp: Fix a connect() race with timewait sockets Eric Dumazet
  1 sibling, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2009-12-01  6:10 UTC (permalink / raw)
  To: kapil dakhane; +Cc: netdev, netfilter

kapil dakhane a écrit :
> Hello,
> 
> I am trying to analyze the capacity of linux network stack on x6270
> which has 16 Hyper threads on two 8-core Intel(r) Xeon(r) CPU. I see
> that at around 150000 simultaneous connections, after around 1.6 gbps,
> a cpu get stuck in an infinite loop in inet_csk_bind_conflict, then
> other cpus get locked up doing spin_lock. Before the lockup cpu usage
> was around 25%. It appears to be a bug, unless I am hitting some kind
> of resource limit. It would be good if someone familiar with network
> code would confirm this, or point me in the right direction.
> 
> Important details are:
> 
> I am using kernel version 2.6.31.4 recompiled with TPROXY related
> options: NF_CONNTRACK, NETFILTER_TPROXY, NETFILTER_XT_MATCH_SOCKET,
> NETFILTER_XT_TARGET_TPROXY.
> 
> 
> I have enabled transparent capture and transparent forward using
> iptables and ip rules.  I have 10 instances of a single threaded user
> space bits-forwarding-proxy (fast), each bound to different
> hyper-threads (CPUs). Rest 6 CPUs are dedicated to interrupt
> processing, each handling interrupts from six different network cards.
> TCP flow from a 4-tuple always get handled by the same proxy process,
> interrupt thread, and network card. In this way, network traffic is
> segregated as much as possible to achieve high degree of parallelism.
> 
> First /var/log/message entry shows CPU#7 is stuck in inet_csk_bind_conflict
> 
> Nov 17 23:02:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#7 stuck
> for 61s! [fast:20701]
> Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: xt_TPROXY
> xt_MARK xt_socket nf_defrag_ipv4 nf_tproxy_core iptable_mangle ipv6
> autofs4 hidp rfcomm l2cap bluetooth rfkill sunrpc 8021q xt_state
> nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables
> cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video
> output sbs sbshc battery acpi_memhotplug ac parport_pc lp parport
> joydev sg rtc_cmos serio_raw rtc_core button igb rtc_lib niu i2c_i801
> i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log
> dm_mod usb_storage ahci libata shpchp aacraid sd_mod scsi_mod ext3 jbd
> uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
> Nov 17 23:02:04 cap-x6270-01 kernel: CPU 7:
> Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: ...
> Nov 17 23:02:04 cap-x6270-01 kernel: Pid: 20701, comm: fast Not
> tainted 2.6.31.4 #1 SUN BLADE X6270 SERVER MODULE
> Nov 17 23:02:04 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c53>]
> [<ffffffff81285c53>] inet_csk_bind_conflict+0x99/0xa6
> Nov 17 23:02:04 cap-x6270-01 kernel: RSP: 0018:ffff88095ac7fe30
> EFLAGS: 00000202
> Nov 17 23:02:04 cap-x6270-01 kernel: RAX: 000000003c0ba8c0 RBX:
> ffff88097b14bae0 RCX: ffff8804b3d57820
> Nov 17 23:02:04 cap-x6270-01 kernel: RDX: ffff8804b3d57800 RSI:
> 0000000000000000 RDI: ffff880940421840
> Nov 17 23:02:04 cap-x6270-01 kernel: RBP: ffffffff8100c42e R08:
> 000000002d01960c R09: ffff880909832ea0
> Nov 17 23:02:04 cap-x6270-01 kernel: R10: 00007fffc5d92700 R11:
> ffff880940421840 R12: 0000000000000001
> Nov 17 23:02:04 cap-x6270-01 kernel: R13: ffff88097c5e9400 R14:
> ffffffff810b7a92 R15: 0000000000000001
> Nov 17 23:02:04 cap-x6270-01 kernel: FS:  00007f20a08416e0(0000)
> GS:ffffc90000e00000(0000) knlGS:0000000000000000
> Nov 17 23:02:04 cap-x6270-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Nov 17 23:02:04 cap-x6270-01 kernel: CR2: 00000000081df408 CR3:
> 000000049ac01000 CR4: 00000000000006e0
> Nov 17 23:02:04 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> Nov 17 23:02:04 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
> 00000000ffff0ff0 DR7: 0000000000000400
> Nov 17 23:02:04 cap-x6270-01 kernel: Call Trace:
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff812859e4>] ?
> inet_csk_get_port+0x1b2/0x29e
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff812a1596>] ?
> inet_bind+0x10c/0x1b7
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8124ef53>] ? sys_bind+0x6e/0x9e
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8106de14>] ?
> audit_syscall_entry+0x1a4/0x1cf
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8100b92b>] ?
> system_call_fastpath+0x16/0x1b
> 
> While other CPUs get stuck doing _spin_lock:
> 
> 
> Nov 17 23:02:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#15 stuck
> for 61s! [fast:20702]
> Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: ...
> Nov 17 23:02:04 cap-x6270-01 kernel: CPU 15:
> Nov 17 23:02:04 cap-x6270-01 kernel: Modules linked in: ...
> Nov 17 23:02:04 cap-x6270-01 kernel: Pid: 20702, comm: fast Not
> tainted 2.6.31.4 #1 SUN BLADE X6270 SERVER MODULE
> Nov 17 23:02:04 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dedff>]
> [<ffffffff812dedff>] _spin_lock+0x10/0x15
> Nov 17 23:02:04 cap-x6270-01 kernel: RSP: 0018:ffff88090ecabe30
> EFLAGS: 00000297
> Nov 17 23:02:04 cap-x6270-01 kernel: RAX: 0000000000000504 RBX:
> 00000000ffffffea RCX: 0000000000000000
> Nov 17 23:02:04 cap-x6270-01 kernel: RDX: 00000000000000a2 RSI:
> 0000000000000fa2 RDI: ffffc90019f82a20
> Nov 17 23:02:04 cap-x6270-01 kernel: RBP: ffffffff8100c42e R08:
> ffff880905d89840 R09: 0000000000000000
> Nov 17 23:02:04 cap-x6270-01 kernel: R10: 00007fffd8db4574 R11:
> ffff880905d89840 R12: ffffffff812509d4
> Nov 17 23:02:04 cap-x6270-01 kernel: R13: ffff88094c0bd280 R14:
> 0000000000000246 R15: 0000000000000001
> Nov 17 23:02:04 cap-x6270-01 kernel: FS:  00007f0deff696e0(0000)
> GS:ffffc90001e00000(0000) knlGS:0000000000000000
> Nov 17 23:02:04 cap-x6270-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Nov 17 23:02:04 cap-x6270-01 kernel: CR2: 0000000007ec2258 CR3:
> 00000009494b1000 CR4: 00000000000006e0
> Nov 17 23:02:04 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> Nov 17 23:02:04 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
> 00000000ffff0ff0 DR7: 0000000000000400
> Nov 17 23:02:04 cap-x6270-01 kernel: Call Trace:
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff81285989>] ?
> inet_csk_get_port+0x157/0x29e
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff812a1596>] ?
> inet_bind+0x10c/0x1b7
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8124ef53>] ? sys_bind+0x6e/0x9e
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8106de14>] ?
> audit_syscall_entry+0x1a4/0x1cf
> Nov 17 23:02:04 cap-x6270-01 kernel:  [<ffffffff8100b92b>] ?
> system_call_fastpath+0x16/0x1b
> --

Hmm, I did an one hour audit and could not yet find the bug.

Is it a reproductible error, and any chance I can have a snapshot of
"netstat -atn" before the lockup ? (maybe privately, since it might be
too big for netdev)

What is the 'fast' program, is it freely available somewhere ?

Thanks


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-01  2:02 soft lockup in inet_csk_get_port kapil dakhane
  2009-12-01  6:10 ` Eric Dumazet
@ 2009-12-01 15:00 ` Eric Dumazet
  2009-12-02  8:59   ` David Miller
                     ` (3 more replies)
  1 sibling, 4 replies; 31+ messages in thread
From: Eric Dumazet @ 2009-12-01 15:00 UTC (permalink / raw)
  To: kapil dakhane; +Cc: netdev, netfilter, David S. Miller, Evgeniy Polyakov

kapil dakhane a écrit :
> Hello,
> 
> I am trying to analyze the capacity of linux network stack on x6270
> which has 16 Hyper threads on two 8-core Intel(r) Xeon(r) CPU. I see
> that at around 150000 simultaneous connections, after around 1.6 gbps,
> a cpu get stuck in an infinite loop in inet_csk_bind_conflict, then
> other cpus get locked up doing spin_lock. Before the lockup cpu usage
> was around 25%. It appears to be a bug, unless I am hitting some kind
> of resource limit. It would be good if someone familiar with network
> code would confirm this, or point me in the right direction.
> 
> Important details are:
> 
> I am using kernel version 2.6.31.4 recompiled with TPROXY related
> options: NF_CONNTRACK, NETFILTER_TPROXY, NETFILTER_XT_MATCH_SOCKET,
> NETFILTER_XT_TARGET_TPROXY.
> 
> 
> I have enabled transparent capture and transparent forward using
> iptables and ip rules.  I have 10 instances of a single threaded user
> space bits-forwarding-proxy (fast), each bound to different
> hyper-threads (CPUs). Rest 6 CPUs are dedicated to interrupt
> processing, each handling interrupts from six different network cards.
> TCP flow from a 4-tuple always get handled by the same proxy process,
> interrupt thread, and network card. In this way, network traffic is
> segregated as much as possible to achieve high degree of parallelism.
> 
> First /var/log/message entry shows CPU#7 is stuck in inet_csk_bind_conflict
> 
> Nov 17 23:02:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#7 stuck
> for 61s! [fast:20701]

After some more audit and coffee, I finally found one subtle bug in our
connect() code, that periodically triggers but never got tracked.

Here is a patch cooked on top of current linux-2.6 git tree, it should probably
apply on 2.6.31.6 as well...

Thanks

[PATCH] tcp: Fix a connect() race with timewait sockets

When we find a timewait connection in __inet_hash_connect() and reuse
it for a new connection request, we have a race window, releasing bind
list lock and reacquiring it in __inet_twsk_kill() to remove timewait
socket from list.

Another thread might find the timewait socket we already chose, leading to
list corruption and crashes.

Fix is to remove timewait socket from bind list before releasing the lock.

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inet_timewait_sock.h |    4 +++
 net/ipv4/inet_hashtables.c       |    4 +++
 net/ipv4/inet_timewait_sock.c    |   37 ++++++++++++++++++++---------
 3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f93ad90..e18e5df 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -206,6 +206,10 @@ extern void __inet_twsk_hashdance(struct inet_timewait_sock *tw,
 				  struct sock *sk,
 				  struct inet_hashinfo *hashinfo);
 
+extern void inet_twsk_unhash(struct inet_timewait_sock *tw,
+			     struct inet_hashinfo *hashinfo,
+			     bool mustlock);
+
 extern void inet_twsk_schedule(struct inet_timewait_sock *tw,
 			       struct inet_timewait_death_row *twdr,
 			       const int timeo, const int timewait_len);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 625cc5f..76d81e4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -488,6 +488,10 @@ ok:
 			inet_sk(sk)->sport = htons(port);
 			hash(sk);
 		}
+
+		if (tw)
+			inet_twsk_unhash(tw, hinfo, false);
+
 		spin_unlock(&head->lock);
 
 		if (tw) {
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 13f0781..2d6d543 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -14,12 +14,34 @@
 #include <net/inet_timewait_sock.h>
 #include <net/ip.h>
 
+
+void inet_twsk_unhash(struct inet_timewait_sock *tw,
+		      struct inet_hashinfo *hashinfo,
+		      bool mustlock)
+{
+	struct inet_bind_hashbucket *bhead;
+	struct inet_bind_bucket *tb = tw->tw_tb;
+
+	if (!tb)
+		return;
+
+	/* Disassociate with bind bucket. */
+	bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw),
+					      tw->tw_num,
+					      hashinfo->bhash_size)];
+	if (mustlock)
+		spin_lock(&bhead->lock);
+	__hlist_del(&tw->tw_bind_node);
+	tw->tw_tb = NULL;
+	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	if (mustlock)
+		spin_unlock(&bhead->lock);
+}
+
 /* Must be called with locally disabled BHs. */
 static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 			     struct inet_hashinfo *hashinfo)
 {
-	struct inet_bind_hashbucket *bhead;
-	struct inet_bind_bucket *tb;
 	/* Unlink from established hashes. */
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
 
@@ -32,15 +54,8 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 	sk_nulls_node_init(&tw->tw_node);
 	spin_unlock(lock);
 
-	/* Disassociate with bind bucket. */
-	bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), tw->tw_num,
-			hashinfo->bhash_size)];
-	spin_lock(&bhead->lock);
-	tb = tw->tw_tb;
-	__hlist_del(&tw->tw_bind_node);
-	tw->tw_tb = NULL;
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
-	spin_unlock(&bhead->lock);
+	inet_twsk_unhash(tw, hashinfo, true);
+
 #ifdef SOCK_REFCNT_DEBUG
 	if (atomic_read(&tw->tw_refcnt) != 1) {
 		printk(KERN_DEBUG "%s timewait_sock %p refcnt=%d\n",

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-01 15:00 ` [PATCH] tcp: Fix a connect() race with timewait sockets Eric Dumazet
@ 2009-12-02  8:59   ` David Miller
  2009-12-02  9:23     ` Eric Dumazet
  2009-12-02 16:05     ` [PATCH] tcp: Fix a connect() race with timewait sockets Ashwani Wason
  2009-12-04 13:45   ` [PATCH 0/2] tcp: Fix connect() races " Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 31+ messages in thread
From: David Miller @ 2009-12-02  8:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kdakhane, netdev, netfilter, zbr

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 01 Dec 2009 16:00:39 +0100

> [PATCH] tcp: Fix a connect() race with timewait sockets

This condition would only trigger if the timewait recycling sysctl is
enabled.

It is off by default, and I can't find any mention in this bug report
that it has been turned on.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-02  8:59   ` David Miller
@ 2009-12-02  9:23     ` Eric Dumazet
  2009-12-02 10:33       ` Eric Dumazet
  2009-12-02 16:05     ` [PATCH] tcp: Fix a connect() race with timewait sockets Ashwani Wason
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2009-12-02  9:23 UTC (permalink / raw)
  To: David Miller; +Cc: kdakhane, netdev, netfilter, zbr

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 01 Dec 2009 16:00:39 +0100
> 
>> [PATCH] tcp: Fix a connect() race with timewait sockets
> 
> This condition would only trigger if the timewait recycling sysctl is
> enabled.
> 
> It is off by default, and I can't find any mention in this bug report
> that it has been turned on.

Very true. I know nothing about context of the reporter, he didnt
answered to my queries.

Yes, if sysctl_tw_reuse is set, bug can triggers without any extra conditions.

But even if sysctl_tw_reuse is cleared, we might trigger the bug if
local port is bound to a value.

[User application called bind( port=XXX) before connect() ]


__inet_hash_connect() can indeed call check_established(... twp = NULL)

...
        head = &hinfo->bhash[inet_bhashfn(net, snum, hinfo->bhash_size)];
        tb  = inet_csk(sk)->icsk_bind_hash;
        spin_lock_bh(&head->lock);
        if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
                hash(sk);
                spin_unlock_bh(&head->lock);
                return 0;
        } else {
                spin_unlock(&head->lock);
                /* No definite answer... Walk to established hash table */
                ret = check_established(death_row, sk, snum, NULL);         <<< HERE >>>
out:
                local_bh_enable();
                return ret;
        }



In this case, we call tcp_twsk_unique() with twp = NULL,
this bypass the sysctl_tcp_tw_reuse test.


int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
{
        const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
        struct tcp_sock *tp = tcp_sk(sk);

        /* With PAWS, it is safe from the viewpoint
           of data integrity. Even without PAWS it is safe provided sequence
           spaces do not overlap i.e. at data rates <= 80Mbit/sec.

           Actually, the idea is close to VJ's one, only timestamp cache is
           held not per host, but per port pair and TW bucket is used as state
           holder.

           If TW bucket has been already destroyed we fall back to VJ's scheme
           and use initial timestamp retrieved from peer table.
         */
        if (tcptw->tw_ts_recent_stamp &&
<<HERE>>       (twp == NULL || (sysctl_tcp_tw_reuse &&
                             get_seconds() - tcptw->tw_ts_recent_stamp > 1))) {

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-02  9:23     ` Eric Dumazet
@ 2009-12-02 10:33       ` Eric Dumazet
  2009-12-02 11:32         ` Evgeniy Polyakov
  2009-12-02 15:08         ` [PATCH net-next-2.6] tcp: connect() race with timewait reuse Eric Dumazet
  0 siblings, 2 replies; 31+ messages in thread
From: Eric Dumazet @ 2009-12-02 10:33 UTC (permalink / raw)
  To: David Miller; +Cc: kdakhane, netdev, netfilter, zbr

Eric Dumazet a écrit :
> 
> But even if sysctl_tw_reuse is cleared, we might trigger the bug if
> local port is bound to a value.

Oh well, that's more subtle than that.

__inet_check_established() is called not only with bh disabled,
but also with a lock on bind list if twp != NULL.

However, if twp is NULL, lock is not held by caller.

[ Thats the final
  ret = check_established(death_row, sk, snum, NULL);
  in __inet_hash_connect()]

So triggering this bug with tw_reuse clear is tricky :

You need several threads, using sockets with REUSEADDR set,
and bind() to same address/port before connect() to same target.

We need another patch to correct this.

I wonder if always hold lock before calling check_established()
would be cleaner.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-02 10:33       ` Eric Dumazet
@ 2009-12-02 11:32         ` Evgeniy Polyakov
  2009-12-02 19:18           ` kapil dakhane
  2009-12-02 15:08         ` [PATCH net-next-2.6] tcp: connect() race with timewait reuse Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: Evgeniy Polyakov @ 2009-12-02 11:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kdakhane, netdev, netfilter

On Wed, Dec 02, 2009 at 11:33:55AM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> You need several threads, using sockets with REUSEADDR set,
> and bind() to same address/port before connect() to same target.
> 
> We need another patch to correct this.
> 
> I wonder if always hold lock before calling check_established()
> would be cleaner.

Isnt this a too big overhead?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-02 11:32         ` Evgeniy Polyakov
@ 2009-12-02 19:18           ` kapil dakhane
  2009-12-03  2:43             ` kapil dakhane
  0 siblings, 1 reply; 31+ messages in thread
From: kapil dakhane @ 2009-12-02 19:18 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, David Miller, netdev, netfilter

Here's the list of tuning parameters used:
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 180
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_max_tw_buckets = 360000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_syncookies = 0
net.core.netdev_max_backlog = 5000

Kapil

On Wed, Dec 2, 2009 at 3:32 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> On Wed, Dec 02, 2009 at 11:33:55AM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
>> You need several threads, using sockets with REUSEADDR set,
>> and bind() to same address/port before connect() to same target.
>>
>> We need another patch to correct this.
>>
>> I wonder if always hold lock before calling check_established()
>> would be cleaner.
>
> Isnt this a too big overhead?
>
> --
>        Evgeniy Polyakov
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-02 19:18           ` kapil dakhane
@ 2009-12-03  2:43             ` kapil dakhane
  2009-12-03 10:49               ` [PATCH] tcp: fix a timewait refcnt race Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: kapil dakhane @ 2009-12-03  2:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, netfilter, Evgeniy Polyakov

Eric,

I ran the test again after patching my kernel with your changes.
Unfortunately, the result appear to be the same.
Here's what I get from /var/log/messages:

Dec  2 14:42:17 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
---same message repeats every minute---
Dec  2 14:55:25 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 14:56:31 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 14:57:37 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 14:58:42 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 14:59:48 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:00:54 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:01:59 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:03:05 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:04:11 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:05:16 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:06:22 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c24>]
[<ffffffff81285c24>] inet_csk_bind_conflict+0x1e/0xa6
Dec  2 15:07:28 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:08:33 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c45>]
[<ffffffff81285c45>] inet_csk_bind_conflict+0x3f/0xa6
Dec  2 15:09:39 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:10:45 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:11:50 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:12:56 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:14:02 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c65>]
[<ffffffff81285c65>] inet_csk_bind_conflict+0x5f/0xa6
Dec  2 15:15:07 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:16:13 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:17:19 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:18:25 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:19:30 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:20:36 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:21:42 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:22:47 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:23:53 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:24:59 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:26:04 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:27:10 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
---same message repeats every minute---
Dec  2 15:43:35 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:44:41 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:45:47 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:46:52 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:47:58 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:49:04 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:50:09 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:51:15 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:52:21 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c9f>]
[<ffffffff81285c9f>] inet_csk_bind_conflict+0x99/0xa6
Dec  2 15:53:26 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 15:54:32 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6

Here's detailed stack from first minute:

Dec  2 14:42:17 cap-x6270-01 kernel: BUG: soft lockup - CPU#14 stuck
for 61s! [fast:14591]
Dec  2 14:42:17 cap-x6270-01 kernel: Modules linked in: xt_TPROXY
xt_MARK xt_socket nf_defrag_ipv4 nf_tproxy_core iptable_mangle ipv6
autofs4 hidp rfcomm l2cap bluetooth rfkill sunrpc 8021q xt_state
nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables
cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video
output sbs sbshc battery acpi_memhotplug ac parport_pc lp parport
joydev sg serio_raw rtc_cmos button rtc_core rtc_lib igb niu i2c_i801
i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log
dm_mod usb_storage ahci libata shpchp aacraid sd_mod scsi_mod ext3 jbd
uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Dec  2 14:42:17 cap-x6270-01 kernel: CPU 14:
Dec  2 14:42:17 cap-x6270-01 kernel: Modules linked in: ....
Dec  2 14:42:17 cap-x6270-01 kernel: Pid: 14591, comm: fast Tainted: G
       W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:17 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 14:42:17 cap-x6270-01 kernel: RSP: 0018:ffff8804e1471e30
EFLAGS: 00000282
Dec  2 14:42:17 cap-x6270-01 kernel: RAX: ffffffff815c4101 RBX:
ffff8804ea477da0 RCX: ffff8808c54a5820
Dec  2 14:42:17 cap-x6270-01 kernel: RDX: ffff8809041922c0 RSI:
0000000000000000 RDI: ffff8809071340c0
Dec  2 14:42:17 cap-x6270-01 kernel: RBP: ffffffff8100c42e R08:
00000000a900220e R09: ffff8804de0bc0a0
Dec  2 14:42:17 cap-x6270-01 kernel: R10: 00007fff281a1501 R11:
ffff8809071340c0 R12: ffffffff812509d4
Dec  2 14:42:17 cap-x6270-01 kernel: R13: ffff88097b9a38c0 R14:
0000000000000246 R15: 0000000000000001
Dec  2 14:42:17 cap-x6270-01 kernel: FS:  00007f2c0e2006e0(0000)
GS:ffffc90001c00000(0000) knlGS:0000000000000000
Dec  2 14:42:17 cap-x6270-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Dec  2 14:42:17 cap-x6270-01 kernel: CR2: 0000000020fe7000 CR3:
000000097b1b1000 CR4: 00000000000006e0
Dec  2 14:42:17 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:17 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:17 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:17 cap-x6270-01 kernel:  [<ffffffff81285a30>] ?
inet_csk_get_port+0x1b2/0x29e
Dec  2 14:42:17 cap-x6270-01 kernel:  [<ffffffff812a15e2>] ?
inet_bind+0x10c/0x1b7
Dec  2 14:42:17 cap-x6270-01 kernel:  [<ffffffff8124ef53>] ? sys_bind+0x6e/0x9e
Dec  2 14:42:17 cap-x6270-01 kernel:  [<ffffffff8106de14>] ?
audit_syscall_entry+0x1a4/0x1cf
Dec  2 14:42:17 cap-x6270-01 kernel:  [<ffffffff8100b92b>] ?
system_call_fastpath+0x16/0x1b

Dec  2 14:42:26 cap-x6270-01 kernel: BUG: soft lockup - CPU#4 stuck
for 61s! [swapper:0]
Dec  2 14:42:26 cap-x6270-01 kernel: Modules linked in: ...
Dec  2 14:42:26 cap-x6270-01 kernel: CPU 4:
Dec  2 14:42:26 cap-x6270-01 kernel: Modules linked in: ...
Dec  2 14:42:26 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:26 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee49>]
[<ffffffff812dee49>] _spin_lock+0xa/0x15
Dec  2 14:42:26 cap-x6270-01 kernel: RSP: 0018:ffffc90000803c98
EFLAGS: 00000297
Dec  2 14:42:26 cap-x6270-01 kernel: RAX: 000000000000e5e4 RBX:
ffff8808d94f1bc0 RCX: ffff8808d94f1bc0
Dec  2 14:42:26 cap-x6270-01 kernel: RDX: ffffffff81b4b340 RSI:
ffff8808c8570e80 RDI: ffffc90019f82a20
Dec  2 14:42:26 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
8000000000000000 R09: 0400000000000000
Dec  2 14:42:26 cap-x6270-01 kernel: R10: 0000000000000000 R11:
0000000000000000 R12: ffffc90000803c10
Dec  2 14:42:26 cap-x6270-01 kernel: R13: ffffc90019f82a20 R14:
ffffc90000803c10 R15: ffffffff8101da86
Dec  2 14:42:26 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90000800000(0000) knlGS:0000000000000000
Dec  2 14:42:26 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:42:26 cap-x6270-01 kernel: CR2: 00007f436db8f000 CR3:
0000000001001000 CR4: 00000000000006e0
Dec  2 14:42:26 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:26 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:26 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:26 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81284d1d>] ?
__inet_twsk_hashdance+0x54/0x127
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff812979d0>] ?
tcp_time_wait+0x13c/0x1c0
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8128bbc5>] ? tcp_fin+0x7e/0x178
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8128c782>] ?
tcp_data_queue+0x2b4/0xaf9
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8128fce1>] ?
tcp_rcv_state_process+0x8a7/0x909
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff812951cc>] ?
tcp_v4_do_rcv+0x181/0x1d5
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff81296be2>] ?
tcp_v4_rcv+0x4ac/0x706
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8127c950>] ?
ip_rcv_finish+0x0/0x366
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8127cdd6>] ?
ip_local_deliver_finish+0x120/0x1e3
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8127cc9c>] ?
ip_rcv_finish+0x34c/0x366
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8127d20e>] ? ip_rcv+0x289/0x2d0
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8125e156>] ?
process_backlog+0x6f/0x98
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8125df97>] ?
net_rx_action+0xa9/0x17d
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100d472>] ? do_IRQ+0xa0/0xb6
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100c2d3>] ?
ret_from_intr+0x0/0xa
Dec  2 14:42:26 cap-x6270-01 kernel:  <EOI>  [<ffffffff8100c42e>] ?
apic_timer_interrupt+0xe/0x20
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff811b3292>] ?
acpi_idle_enter_simple+0x120/0x14e
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff811b3288>] ?
acpi_idle_enter_simple+0x116/0x14e
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e

Dec  2 14:42:26 cap-x6270-01 kernel: BUG: soft lockup - CPU#12 stuck
for 61s! [swapper:0]
...
Dec  2 14:42:26 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:26 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee4f>]
[<ffffffff812dee4f>] _spin_lock+0x10/0x15
Dec  2 14:42:26 cap-x6270-01 kernel: RSP: 0018:ffffc90001803ec8
EFLAGS: 00000297
Dec  2 14:42:26 cap-x6270-01 kernel: RAX: 0000000000004a49 RBX:
ffff8808c8570e80 RCX: ffff8808c8571168
Dec  2 14:42:26 cap-x6270-01 kernel: RDX: ffffc90001803f00 RSI:
0000000000000100 RDI: ffff8808c8570ec8
Dec  2 14:42:26 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
0000000000000010 R09: 0000000000000000
Dec  2 14:42:26 cap-x6270-01 kernel: R10: 0000000000000000 R11:
0000000000000000 R12: ffffc90001803e40
Dec  2 14:42:26 cap-x6270-01 kernel: R13: ffff88097cdf0000 R14:
0000000000000082 R15: ffffffff8101da86
Dec  2 14:42:26 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90001800000(0000) knlGS:0000000000000000
Dec  2 14:42:26 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:42:26 cap-x6270-01 kernel: CR2: 0000000021601ff8 CR3:
0000000001001000 CR4: 00000000000006e0
Dec  2 14:42:26 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:26 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:26 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:26 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81293b71>] ?
tcp_write_timer+0x16/0x5c9
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff81293b5b>] ?
tcp_write_timer+0x0/0x5c9
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff81048575>] ?
run_timer_softirq+0x131/0x197
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8101da8b>] ?
smp_apic_timer_interrupt+0x88/0x95
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100c433>] ?
apic_timer_interrupt+0x13/0x20
Dec  2 14:42:26 cap-x6270-01 kernel:  <EOI>  [<ffffffff8100c42e>] ?
apic_timer_interrupt+0xe/0x20
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff811b3147>] ?
acpi_idle_enter_bm+0x249/0x274
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff811b313d>] ?
acpi_idle_enter_bm+0x23f/0x274
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:42:26 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e

Dec  2 14:42:28 cap-x6270-01 kernel: BUG: soft lockup - CPU#11 stuck
for 61s! [swapper:0]
...
Dec  2 14:42:28 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:28 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee4f>]
[<ffffffff812dee4f>] _spin_lock+0x10/0x15
Dec  2 14:42:28 cap-x6270-01 kernel: RSP: 0018:ffffc90001603eb8
EFLAGS: 00000297
Dec  2 14:42:28 cap-x6270-01 kernel: RAX: 0000000000004746 RBX:
ffff8804a798ad00 RCX: ffff8804a798afe8
Dec  2 14:42:28 cap-x6270-01 kernel: RDX: ffffc90001603ef0 RSI:
ffff8804f2858a40 RDI: ffff8804a798ad48
Dec  2 14:42:28 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
ffff88097cdea000 R09: 0000000000000003
Dec  2 14:42:28 cap-x6270-01 kernel: R10: ffffffff811a52f0 R11:
0000000000000000 R12: ffffc90001603e30
Dec  2 14:42:28 cap-x6270-01 kernel: R13: ffff8804fcdd4000 R14:
ffffc90001603e30 R15: ffffffff8101da86
Dec  2 14:42:28 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90001600000(0000) knlGS:0000000000000000
Dec  2 14:42:28 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:42:28 cap-x6270-01 kernel: CR2: 0000000007225000 CR3:
00000004e318d000 CR4: 00000000000006e0
Dec  2 14:42:28 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:28 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:28 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:28 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81293b71>] ?
tcp_write_timer+0x16/0x5c9
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff81293b5b>] ?
tcp_write_timer+0x0/0x5c9
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff81048575>] ?
run_timer_softirq+0x131/0x197
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100d472>] ? do_IRQ+0xa0/0xb6
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100c2d3>] ?
ret_from_intr+0x0/0xa
Dec  2 14:42:28 cap-x6270-01 kernel:  <EOI>  [<ffffffff811a52f0>] ?
acpi_hw_register_read+0x52/0xe5
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff811b3292>] ?
acpi_idle_enter_simple+0x120/0x14e
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff811b3288>] ?
acpi_idle_enter_simple+0x116/0x14e
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff811b2fd3>] ?
acpi_idle_enter_bm+0xd5/0x274
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e

Dec  2 14:42:28 cap-x6270-01 kernel: BUG: soft lockup - CPU#15 stuck
for 61s! [swapper:0]
...
Dec  2 14:42:28 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:28 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee51>]
[<ffffffff812dee51>] _spin_lock+0x12/0x15
Dec  2 14:42:28 cap-x6270-01 kernel: RSP: 0018:ffffc90001e03ce8
EFLAGS: 00000293
Dec  2 14:42:28 cap-x6270-01 kernel: RAX: 000000000000e6e4 RBX:
ffff8808df9e9b40 RCX: ffff8808df9e9b40
Dec  2 14:42:28 cap-x6270-01 kernel: RDX: ffffffff81b4b340 RSI:
ffff8804a798ad00 RDI: ffffc90019f82a20
Dec  2 14:42:28 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
000000001a8082c7 R09: 0000000003069b61
Dec  2 14:42:28 cap-x6270-01 kernel: R10: 0000001400c21ca1 R11:
ffff8804681763c0 R12: ffffc90001e03c60
Dec  2 14:42:28 cap-x6270-01 kernel: R13: ffffc90019f82a20 R14:
ffffc90001e03c60 R15: ffffffff8101da86
Dec  2 14:42:28 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90001e00000(0000) knlGS:0000000000000000
Dec  2 14:42:28 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:42:28 cap-x6270-01 kernel: CR2: 0000000007438398 CR3:
00000004f45a5000 CR4: 00000000000006e0
Dec  2 14:42:28 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:28 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:28 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:28 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81284d1d>] ?
__inet_twsk_hashdance+0x54/0x127
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff812979d0>] ?
tcp_time_wait+0x13c/0x1c0
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8128fc1a>] ?
tcp_rcv_state_process+0x7e0/0x909
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff812951cc>] ?
tcp_v4_do_rcv+0x181/0x1d5
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff81296be2>] ?
tcp_v4_rcv+0x4ac/0x706
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8127c950>] ?
ip_rcv_finish+0x0/0x366
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8127cdd6>] ?
ip_local_deliver_finish+0x120/0x1e3
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8127cc9c>] ?
ip_rcv_finish+0x34c/0x366
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8127d20e>] ? ip_rcv+0x289/0x2d0
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8125e156>] ?
process_backlog+0x6f/0x98
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8125df97>] ?
net_rx_action+0xa9/0x17d
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100d472>] ? do_IRQ+0xa0/0xb6
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100c2d3>] ?
ret_from_intr+0x0/0xa
Dec  2 14:42:28 cap-x6270-01 kernel:  <EOI>  [<ffffffff811a52f0>] ?
acpi_hw_register_read+0x52/0xe5
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff812def52>] ?
_spin_unlock_irqrestore+0x4/0x5
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff811b32ab>] ?
acpi_idle_enter_simple+0x139/0x14e
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff811b2fd3>] ?
acpi_idle_enter_bm+0xd5/0x274
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:42:28 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e

Dec  2 14:42:31 cap-x6270-01 kernel: BUG: soft lockup - CPU#13 stuck
for 61s! [fast:14590]
...
Dec  2 14:42:31 cap-x6270-01 kernel: Pid: 14590, comm: fast Tainted: G
       W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:31 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee4f>]
[<ffffffff812dee4f>] _spin_lock+0x10/0x15
Dec  2 14:42:31 cap-x6270-01 kernel: RSP: 0018:ffff8804dd95be30
EFLAGS: 00000293
Dec  2 14:42:31 cap-x6270-01 kernel: RAX: 000000000000e7e4 RBX:
00000000ffffffea RCX: 0000000000000000
Dec  2 14:42:31 cap-x6270-01 kernel: RDX: 00000000000000a2 RSI:
0000000000000fa2 RDI: ffffc90019f82a20
Dec  2 14:42:31 cap-x6270-01 kernel: RBP: ffffffff8100c42e R08:
ffff8809311a6840 R09: 000000004b16ed16
Dec  2 14:42:31 cap-x6270-01 kernel: R10: 00007ffff6230734 R11:
ffff8809311a6840 R12: ffffffff812509d4
Dec  2 14:42:31 cap-x6270-01 kernel: R13: ffff88097404b2c0 R14:
0000000000000246 R15: 0000000000000000
Dec  2 14:42:31 cap-x6270-01 kernel: FS:  00007f0b9511d6e0(0000)
GS:ffffc90001a00000(0000) knlGS:0000000000000000
Dec  2 14:42:31 cap-x6270-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Dec  2 14:42:31 cap-x6270-01 kernel: CR2: 00007f11263c7000 CR3:
00000004de426000 CR4: 00000000000006e0
Dec  2 14:42:31 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:31 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:31 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:31 cap-x6270-01 kernel:  [<ffffffff812859d5>] ?
inet_csk_get_port+0x157/0x29e
Dec  2 14:42:31 cap-x6270-01 kernel:  [<ffffffff812a15e2>] ?
inet_bind+0x10c/0x1b7
Dec  2 14:42:31 cap-x6270-01 kernel:  [<ffffffff8124ef53>] ? sys_bind+0x6e/0x9e
Dec  2 14:42:31 cap-x6270-01 kernel:  [<ffffffff8106de14>] ?
audit_syscall_entry+0x1a4/0x1cf
Dec  2 14:42:31 cap-x6270-01 kernel:  [<ffffffff8100b92b>] ?
system_call_fastpath+0x16/0x1b
...
Dec  2 14:42:45 cap-x6270-01 kernel: BUG: soft lockup - CPU#5 stuck
for 61s! [swapper:0]
...
Dec  2 14:42:45 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:45 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee4f>]
[<ffffffff812dee4f>] _spin_lock+0x10/0x15
Dec  2 14:42:45 cap-x6270-01 kernel: RSP: 0018:ffffc90000a03ce8
EFLAGS: 00000297
Dec  2 14:42:45 cap-x6270-01 kernel: RAX: 000000000000e8e4 RBX:
ffff8808be888200 RCX: ffff8808be888200
Dec  2 14:42:45 cap-x6270-01 kernel: RDX: ffffffff81b4b340 RSI:
ffff88092090ec40 RDI: ffffc90019f82a20
Dec  2 14:42:45 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
000000006d4b0f59 R09: 00000000a15a9061
Dec  2 14:42:45 cap-x6270-01 kernel: R10: 0000001400c25cd8 R11:
ffff880460516d80 R12: ffffc90000a03c60
Dec  2 14:42:45 cap-x6270-01 kernel: R13: ffffc90019f82a20 R14:
ffffc90000a03c60 R15: ffffffff8101da86
Dec  2 14:42:45 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90000a00000(0000) knlGS:0000000000000000
Dec  2 14:42:45 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:42:45 cap-x6270-01 kernel: CR2: 00007f436db8f000 CR3:
0000000001001000 CR4: 00000000000006e0
Dec  2 14:42:45 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:45 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:45 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:45 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81284d1d>] ?
__inet_twsk_hashdance+0x54/0x127
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff812979d0>] ?
tcp_time_wait+0x13c/0x1c0
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8128fc1a>] ?
tcp_rcv_state_process+0x7e0/0x909
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff812951cc>] ?
tcp_v4_do_rcv+0x181/0x1d5
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff81296be2>] ?
tcp_v4_rcv+0x4ac/0x706
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8127c950>] ?
ip_rcv_finish+0x0/0x366
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8127cdd6>] ?
ip_local_deliver_finish+0x120/0x1e3
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8127cc9c>] ?
ip_rcv_finish+0x34c/0x366
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8127d20e>] ? ip_rcv+0x289/0x2d0
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8125e156>] ?
process_backlog+0x6f/0x98
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8125df97>] ?
net_rx_action+0xa9/0x17d
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100d472>] ? do_IRQ+0xa0/0xb6
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100c2d3>] ?
ret_from_intr+0x0/0xa
Dec  2 14:42:45 cap-x6270-01 kernel:  <EOI>  [<ffffffff8100c2ce>] ?
common_interrupt+0xe/0x13
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff811b3147>] ?
acpi_idle_enter_bm+0x249/0x274
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff811b313d>] ?
acpi_idle_enter_bm+0x23f/0x274
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e
...
Dec  2 14:42:45 cap-x6270-01 kernel: BUG: soft lockup - CPU#7 stuck
for 61s! [swapper:0]
...
Dec  2 14:42:45 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:42:45 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee4f>]
[<ffffffff812dee4f>] _spin_lock+0x10/0x15
Dec  2 14:42:45 cap-x6270-01 kernel: RSP: 0018:ffffc90000e03ec8
EFLAGS: 00000297
Dec  2 14:42:45 cap-x6270-01 kernel: RAX: 0000000000005251 RBX:
ffff88092090ec40 RCX: ffff88092090ef28
Dec  2 14:42:45 cap-x6270-01 kernel: RDX: ffffc90000e03f00 RSI:
000000001a35d954 RDI: ffff88092090ec88
Dec  2 14:42:45 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
0000000000000010 R09: 0000000000000000
Dec  2 14:42:45 cap-x6270-01 kernel: R10: ffffffff811a52f0 R11:
0000000000000000 R12: ffffc90000e03e40
Dec  2 14:42:45 cap-x6270-01 kernel: R13: ffff88097cd9c000 R14:
ffffc90000e03e40 R15: ffffffff8101da86
Dec  2 14:42:45 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90000e00000(0000) knlGS:0000000000000000
Dec  2 14:42:45 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:42:45 cap-x6270-01 kernel: CR2: 000000000c0b25f8 CR3:
0000000973c37000 CR4: 00000000000006e0
Dec  2 14:42:45 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:42:45 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:42:45 cap-x6270-01 kernel: Call Trace:
Dec  2 14:42:45 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81054c63>] ?
hrtimer_run_queues+0xed/0x193
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff81293b71>] ?
tcp_write_timer+0x16/0x5c9
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff81293b5b>] ?
tcp_write_timer+0x0/0x5c9
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff81048575>] ?
run_timer_softirq+0x131/0x197
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8101da8b>] ?
smp_apic_timer_interrupt+0x88/0x95
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100c433>] ?
apic_timer_interrupt+0x13/0x20
Dec  2 14:42:45 cap-x6270-01 kernel:  <EOI>  [<ffffffff811a52f0>] ?
acpi_hw_register_read+0x52/0xe5
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff811b3292>] ?
acpi_idle_enter_simple+0x120/0x14e
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff811b3288>] ?
acpi_idle_enter_simple+0x116/0x14e
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff811b2fd3>] ?
acpi_idle_enter_bm+0xd5/0x274
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:42:45 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e
...
Dec  2 14:43:01 cap-x6270-01 kernel: BUG: soft lockup - CPU#1 stuck
for 61s! [swapper:0]
...
Dec  2 14:43:01 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:43:01 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee4f>]
[<ffffffff812dee4f>] _spin_lock+0x10/0x15
Dec  2 14:43:01 cap-x6270-01 kernel: RSP: 0018:ffffc90000203c98
EFLAGS: 00000293
Dec  2 14:43:01 cap-x6270-01 kernel: RAX: 000000000000e9e4 RBX:
ffff88045ad1a3c0 RCX: ffff88045ad1a3c0
Dec  2 14:43:01 cap-x6270-01 kernel: RDX: ffffffff81b4b340 RSI:
ffff880496117800 RDI: ffffc90019f82a20
Dec  2 14:43:01 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
8000000000000000 R09: 0400000000000000
Dec  2 14:43:01 cap-x6270-01 kernel: R10: 0000000000000000 R11:
0000000000000000 R12: ffffc90000203c10
Dec  2 14:43:01 cap-x6270-01 kernel: R13: ffffc90019f82a20 R14:
ffffc90000203c10 R15: ffffffff8101da86
Dec  2 14:43:01 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90000200000(0000) knlGS:0000000000000000
Dec  2 14:43:01 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:43:01 cap-x6270-01 kernel: CR2: 00007f3ff426a000 CR3:
0000000001001000 CR4: 00000000000006e0
Dec  2 14:43:01 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:43:01 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:43:01 cap-x6270-01 kernel: Call Trace:
Dec  2 14:43:01 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81284d1d>] ?
__inet_twsk_hashdance+0x54/0x127
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff812979d0>] ?
tcp_time_wait+0x13c/0x1c0
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8128bbc5>] ? tcp_fin+0x7e/0x178
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8128c782>] ?
tcp_data_queue+0x2b4/0xaf9
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8128fce1>] ?
tcp_rcv_state_process+0x8a7/0x909
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff812951cc>] ?
tcp_v4_do_rcv+0x181/0x1d5
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff81296be2>] ?
tcp_v4_rcv+0x4ac/0x706
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8127c950>] ?
ip_rcv_finish+0x0/0x366
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8127cdd6>] ?
ip_local_deliver_finish+0x120/0x1e3
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8127cc9c>] ?
ip_rcv_finish+0x34c/0x366
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8127d20e>] ? ip_rcv+0x289/0x2d0
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8125e156>] ?
process_backlog+0x6f/0x98
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8125df97>] ?
net_rx_action+0xa9/0x17d
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8100d472>] ? do_IRQ+0xa0/0xb6
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8100c2d3>] ?
ret_from_intr+0x0/0xa
Dec  2 14:43:01 cap-x6270-01 kernel:  <EOI>  [<ffffffff8100c2ce>] ?
common_interrupt+0xe/0x13
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff811b3147>] ?
acpi_idle_enter_bm+0x249/0x274
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff811b313d>] ?
acpi_idle_enter_bm+0x23f/0x274
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:43:01 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e
...
Dec  2 14:43:04 cap-x6270-01 kernel: BUG: soft lockup - CPU#9 stuck
for 61s! [swapper:0]
...
Dec  2 14:43:04 cap-x6270-01 kernel: Pid: 0, comm: swapper Tainted: G
      W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:43:04 cap-x6270-01 kernel: RIP: 0010:[<ffffffff812dee4f>]
[<ffffffff812dee4f>] _spin_lock+0x10/0x15
Dec  2 14:43:04 cap-x6270-01 kernel: RSP: 0018:ffffc90001203ec8
EFLAGS: 00000297
Dec  2 14:43:04 cap-x6270-01 kernel: RAX: 0000000000008382 RBX:
ffff880496117800 RCX: ffff880496117ae8
Dec  2 14:43:04 cap-x6270-01 kernel: RDX: ffff88048a8fe4e8 RSI:
0000000039b18490 RDI: ffff880496117848
Dec  2 14:43:04 cap-x6270-01 kernel: RBP: ffffffff8100c433 R08:
0000000000000010 R09: 0000000000000000
Dec  2 14:43:04 cap-x6270-01 kernel: R10: 0000000000000009 R11:
0000000000000000 R12: ffffc90001203e40
Dec  2 14:43:04 cap-x6270-01 kernel: R13: ffff8804fcd9c000 R14:
ffffc90001203e40 R15: ffffffff8101da86
Dec  2 14:43:04 cap-x6270-01 kernel: FS:  0000000000000000(0000)
GS:ffffc90001200000(0000) knlGS:0000000000000000
Dec  2 14:43:04 cap-x6270-01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Dec  2 14:43:04 cap-x6270-01 kernel: CR2: 0000000003dd0000 CR3:
0000000001001000 CR4: 00000000000006e0
Dec  2 14:43:04 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:43:04 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:43:04 cap-x6270-01 kernel: Call Trace:
Dec  2 14:43:04 cap-x6270-01 kernel:  <IRQ>  [<ffffffff81054c63>] ?
hrtimer_run_queues+0xed/0x193
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff81293b71>] ?
tcp_write_timer+0x16/0x5c9
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff81293b5b>] ?
tcp_write_timer+0x0/0x5c9
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff81048575>] ?
run_timer_softirq+0x131/0x197
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff810444ca>] ?
__do_softirq+0xc5/0x183
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff8100ca5c>] ?
call_softirq+0x1c/0x28
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff8100ddf2>] ?
do_softirq+0x2c/0x68
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff8101da8b>] ?
smp_apic_timer_interrupt+0x88/0x95
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff8100c433>] ?
apic_timer_interrupt+0x13/0x20
Dec  2 14:43:04 cap-x6270-01 kernel:  <EOI>  [<ffffffff8100c42e>] ?
apic_timer_interrupt+0xe/0x20
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff811b3147>] ?
acpi_idle_enter_bm+0x249/0x274
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff811b313d>] ?
acpi_idle_enter_bm+0x23f/0x274
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff8123bbb4>] ?
cpuidle_idle_call+0x7f/0xbb
Dec  2 14:43:04 cap-x6270-01 kernel:  [<ffffffff8100aa1d>] ? cpu_idle+0x40/0x5e
...
Dec  2 14:43:23 cap-x6270-01 kernel: BUG: soft lockup - CPU#14 stuck
for 61s! [fast:14591]
...
Dec  2 14:43:23 cap-x6270-01 kernel: Pid: 14591, comm: fast Tainted: G
       W  2.6.31.4 #3 SUN BLADE X6270 SERVER MODULE
Dec  2 14:43:23 cap-x6270-01 kernel: RIP: 0010:[<ffffffff81285c94>]
[<ffffffff81285c94>] inet_csk_bind_conflict+0x8e/0xa6
Dec  2 14:43:23 cap-x6270-01 kernel: RSP: 0018:ffff8804e1471e30
EFLAGS: 00000286
Dec  2 14:43:23 cap-x6270-01 kernel: RAX: ffffffff815c4101 RBX:
ffff8804ea477da0 RCX: ffff8804e1028ea0
Dec  2 14:43:23 cap-x6270-01 kernel: RDX: ffff88048a4842c0 RSI:
0000000000000000 RDI: ffff8809071340c0
Dec  2 14:43:23 cap-x6270-01 kernel: RBP: ffffffff8100c42e R08:
00000000a900220e R09: ffff8804d2cbcca0
Dec  2 14:43:23 cap-x6270-01 kernel: R10: 00007fff281a1501 R11:
ffff8809071340c0 R12: ffffffff812509d4
Dec  2 14:43:23 cap-x6270-01 kernel: R13: ffff88097b9a38c0 R14:
0000000000000246 R15: 0000000000000001
Dec  2 14:43:23 cap-x6270-01 kernel: FS:  00007f2c0e2006e0(0000)
GS:ffffc90001c00000(0000) knlGS:0000000000000000
Dec  2 14:43:23 cap-x6270-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Dec  2 14:43:23 cap-x6270-01 kernel: CR2: 0000000020fe7000 CR3:
000000097b1b1000 CR4: 00000000000006e0
Dec  2 14:43:23 cap-x6270-01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec  2 14:43:23 cap-x6270-01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Dec  2 14:43:23 cap-x6270-01 kernel: Call Trace:
Dec  2 14:43:23 cap-x6270-01 kernel:  [<ffffffff81285a30>] ?
inet_csk_get_port+0x1b2/0x29e
Dec  2 14:43:23 cap-x6270-01 kernel:  [<ffffffff812a15e2>] ?
inet_bind+0x10c/0x1b7
Dec  2 14:43:23 cap-x6270-01 kernel:  [<ffffffff8124ef53>] ? sys_bind+0x6e/0x9e
Dec  2 14:43:23 cap-x6270-01 kernel:  [<ffffffff8106de14>] ?
audit_syscall_entry+0x1a4/0x1cf
Dec  2 14:43:23 cap-x6270-01 kernel:  [<ffffffff8100b92b>] ?
system_call_fastpath+0x16/0x1b

Either there are more places for race condition, or the fix didn't
address the issue effectively.

Regards,
Kapil

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] tcp: fix a timewait refcnt race
  2009-12-03  2:43             ` kapil dakhane
@ 2009-12-03 10:49               ` Eric Dumazet
  2009-12-04  0:19                 ` David Miller
  2009-12-04  3:20                 ` kapil dakhane
  0 siblings, 2 replies; 31+ messages in thread
From: Eric Dumazet @ 2009-12-03 10:49 UTC (permalink / raw)
  To: kapil dakhane; +Cc: David Miller, netdev, netfilter, Evgeniy Polyakov

kapil dakhane a écrit :

> Either there are more places for race condition, or the fix didn't
> address the issue effectively.

Thanks a lot for all these details ! It definitly is very usefull to
localize problems.

I believe I found another timewait problem, I am not sure
it is what makes your test fail, but we make progress :)

I cooked a patch against last net-next-2.6 + my previous patch.

(2nd take of [PATCH net-next-2.6] tcp: connect() race with timewait reuse)

[PATCH net-next-2.6] tcp: fix a timewait refcnt race

After TCP RCU conversion, tw->tw_refcnt should not be set to 1 in
inet_twsk_alloc(). It allows a RCU reader to get this timewait
socket, while we not yet stabilized it.

Only choice we have is to set tw_refcnt to 0 in inet_twsk_alloc(),
then atomic_add() it later, once everything is done.

Location of this atomic_add() is tricky, because we dont want another
writer to find this timewait in ehash, while tw_refcnt is still zero !

Thanks to Kapil Dakhane tests and reports.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/inet_timewait_sock.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)


diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 11380e6..91680ec 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -109,7 +109,6 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	tw->tw_tb = icsk->icsk_bind_hash;
 	WARN_ON(!icsk->icsk_bind_hash);
 	inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
-	atomic_inc(&tw->tw_refcnt);
 	spin_unlock(&bhead->lock);
 
 	spin_lock(lock);
@@ -119,13 +118,22 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	 * Should be done before removing sk from established chain
 	 * because readers are lockless and search established first.
 	 */
-	atomic_inc(&tw->tw_refcnt);
 	inet_twsk_add_node_rcu(tw, &ehead->twchain);
 
 	/* Step 3: Remove SK from established hash. */
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 
+	/*
+	 * Notes :
+ 	 * - We initially set tw_refcnt to 0 in inet_twsk_alloc()
+	 * - We add one reference for the bhash link
+	 * - We add one reference for the ehash link
+	 * - We want this refcnt update done before allowing other
+	 *   threads to find this tw in ehash chain.
+	 */
+	atomic_add(1 + 1 + 1, &tw->tw_refcnt);
+
 	spin_unlock(lock);
 }
 
@@ -157,7 +165,12 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 		tw->tw_transparent  = inet->transparent;
 		tw->tw_prot	    = sk->sk_prot_creator;
 		twsk_net_set(tw, hold_net(sock_net(sk)));
-		atomic_set(&tw->tw_refcnt, 1);
+		/*
+		 * Because we use RCU lookups, we should not set tw_refcnt
+		 * to a non null value before everything is setup for this
+		 * timewait socket.
+		 */
+		atomic_set(&tw->tw_refcnt, 0);
 		inet_twsk_dead_node_init(tw);
 		__module_get(tw->tw_prot->owner);
 	}

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: fix a timewait refcnt race
  2009-12-03 10:49               ` [PATCH] tcp: fix a timewait refcnt race Eric Dumazet
@ 2009-12-04  0:19                 ` David Miller
  2009-12-04  3:20                 ` kapil dakhane
  1 sibling, 0 replies; 31+ messages in thread
From: David Miller @ 2009-12-04  0:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kdakhane, netdev, netfilter, zbr

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Dec 2009 11:49:01 +0100

> [PATCH net-next-2.6] tcp: fix a timewait refcnt race

Applied.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: fix a timewait refcnt race
  2009-12-03 10:49               ` [PATCH] tcp: fix a timewait refcnt race Eric Dumazet
  2009-12-04  0:19                 ` David Miller
@ 2009-12-04  3:20                 ` kapil dakhane
  2009-12-04  6:29                   ` Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: kapil dakhane @ 2009-12-04  3:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, netfilter, Evgeniy Polyakov

Eric,

On Thu, Dec 3, 2009 at 2:49 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I believe I found another timewait problem, I am not sure
> it is what makes your test fail, but we make progress :)
>
> I cooked a patch against last net-next-2.6 + my previous patch.
>
> (2nd take of [PATCH net-next-2.6] tcp: connect() race with timewait reuse)
>
> [PATCH net-next-2.6] tcp: fix a timewait refcnt race
>

I applied your changes are suggested above. It appears that the patch
successfully resolves the race condition it addressed. I managed to
push the system to 1.9 gbps, previously I could not push it beyond 1.6
gbps. Unfortunately, there appear to be more race conditions, as the
fault happened again when I attempted to push it to 2.3 gbps. This
time I did not get any error message in /var/log/messages, although
they appear on the console the same way as before.

Kapil

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: fix a timewait refcnt race
  2009-12-04  3:20                 ` kapil dakhane
@ 2009-12-04  6:29                   ` Eric Dumazet
  2009-12-04  6:39                     ` David Miller
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2009-12-04  6:29 UTC (permalink / raw)
  To: kapil dakhane; +Cc: David Miller, netdev, netfilter, Evgeniy Polyakov

kapil dakhane a écrit :
> 
> I applied your changes are suggested above. It appears that the patch
> successfully resolves the race condition it addressed. I managed to
> push the system to 1.9 gbps, previously I could not push it beyond 1.6
> gbps. Unfortunately, there appear to be more race conditions, as the
> fault happened again when I attempted to push it to 2.3 gbps. This
> time I did not get any error message in /var/log/messages, although
> they appear on the console the same way as before.
> 

Thanks for testing Kapil

I am not sure of exact set of changes you have in your kernel.

David, could you push your net-next-2.6 tree with pending changes so that
I can work on bind() side (interaction with timewait) and be sure Kapil
can work on latest net-next-2.6 ?

Thanks

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: fix a timewait refcnt race
  2009-12-04  6:29                   ` Eric Dumazet
@ 2009-12-04  6:39                     ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2009-12-04  6:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kdakhane, netdev, netfilter, zbr

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 Dec 2009 07:29:05 +0100

> kapil dakhane a écrit :
>> 
>> I applied your changes are suggested above. It appears that the patch
>> successfully resolves the race condition it addressed. I managed to
>> push the system to 1.9 gbps, previously I could not push it beyond 1.6
>> gbps. Unfortunately, there appear to be more race conditions, as the
>> fault happened again when I attempted to push it to 2.3 gbps. This
>> time I did not get any error message in /var/log/messages, although
>> they appear on the console the same way as before.
>> 
> 
> Thanks for testing Kapil
> 
> I am not sure of exact set of changes you have in your kernel.
> 
> David, could you push your net-next-2.6 tree with pending changes so that
> I can work on bind() side (interaction with timewait) and be sure Kapil
> can work on latest net-next-2.6 ?

Done.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH net-next-2.6] tcp: connect() race with timewait reuse
  2009-12-02 10:33       ` Eric Dumazet
  2009-12-02 11:32         ` Evgeniy Polyakov
@ 2009-12-02 15:08         ` Eric Dumazet
  2009-12-02 22:15           ` Evgeniy Polyakov
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2009-12-02 15:08 UTC (permalink / raw)
  To: David Miller; +Cc: kdakhane, netdev, netfilter, zbr, Evgeniy Polyakov

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> But even if sysctl_tw_reuse is cleared, we might trigger the bug if
>> local port is bound to a value.
> 
> Oh well, that's more subtle than that.
> 
> __inet_check_established() is called not only with bh disabled,
> but also with a lock on bind list if twp != NULL.
> 
> However, if twp is NULL, lock is not held by caller.
> 
> [ Thats the final
>   ret = check_established(death_row, sk, snum, NULL);
>   in __inet_hash_connect()]
> 
> So triggering this bug with tw_reuse clear is tricky :
> 
> You need several threads, using sockets with REUSEADDR set,
> and bind() to same address/port before connect() to same target.
> 
> We need another patch to correct this.
> 

Here is a separate patch for this issue, cooked on top of net-next-2.6
for testing purposes, and public discussion.

Thanks

[PATCH net-next-2.6] tcp: connect() race with timewait reuse

Its currently possible that several threads issuing a connect() find the same
timewait socket and try to reuse it, leading to list corruptions.

Condition for bug is that these threads bound their socket on same address/port
of to be found timewait socket, and connected to same target. (SO_REUSEADDR needed)

To fix this problem, we could unhash timewait socket while holding ehash lock,
to make sure lookups/changes will be serialized. Only first one find the timewait
socket, other ones find the established socket and return an EADDRNOTAVAIL error.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inet_timewait_sock.h |    2 +
 net/ipv4/inet_hashtables.c       |    7 +++--
 net/ipv4/inet_timewait_sock.c    |   36 ++++++++++++++++++++---------
 net/ipv6/inet6_hashtables.c      |   12 +++++----
 4 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 773b10f..59c80a0 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -199,6 +199,8 @@ static inline __be32 inet_rcv_saddr(const struct sock *sk)
 
 extern void inet_twsk_put(struct inet_timewait_sock *tw);
 
+extern void inet_twsk_unhash(struct inet_timewait_sock *tw);
+
 extern struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 						  const int state);
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 94ef51a..143ddb4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -318,20 +318,21 @@ unique:
 	sk->sk_hash = hash;
 	WARN_ON(!sk_unhashed(sk));
 	__sk_nulls_add_node_rcu(sk, &head->chain);
+	if (tw) {
+		inet_twsk_unhash(tw);
+		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
+	}
 	spin_unlock(lock);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 
 	if (twp) {
 		*twp = tw;
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
 	} else if (tw) {
 		/* Silly. Should hash-dance instead... */
 		inet_twsk_deschedule(tw, death_row);
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
 
 		inet_twsk_put(tw);
 	}
-
 	return 0;
 
 not_unique:
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 1f5d508..680d09b 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -14,6 +14,21 @@
 #include <net/inet_timewait_sock.h>
 #include <net/ip.h>
 
+
+/*
+ * unhash a timewait socket from established hash
+ * lock must be hold by caller
+ */
+void inet_twsk_unhash(struct inet_timewait_sock *tw)
+{
+	if (hlist_nulls_unhashed(&tw->tw_node))
+		return;
+
+	hlist_nulls_del_rcu(&tw->tw_node);
+	sk_nulls_node_init(&tw->tw_node);
+	inet_twsk_put(tw);
+}
+
 /* Must be called with locally disabled BHs. */
 static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 			     struct inet_hashinfo *hashinfo)
@@ -24,12 +39,9 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
 
 	spin_lock(lock);
-	if (hlist_nulls_unhashed(&tw->tw_node)) {
-		spin_unlock(lock);
-		return;
-	}
-	hlist_nulls_del_rcu(&tw->tw_node);
-	sk_nulls_node_init(&tw->tw_node);
+
+	inet_twsk_unhash(tw);
+
 	spin_unlock(lock);
 
 	/* Disassociate with bind bucket. */
@@ -37,9 +49,11 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 			hashinfo->bhash_size)];
 	spin_lock(&bhead->lock);
 	tb = tw->tw_tb;
-	__hlist_del(&tw->tw_bind_node);
-	tw->tw_tb = NULL;
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	if (tb) {
+		__hlist_del(&tw->tw_bind_node);
+		tw->tw_tb = NULL;
+		inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	}
 	spin_unlock(&bhead->lock);
 #ifdef SOCK_REFCNT_DEBUG
 	if (atomic_read(&tw->tw_refcnt) != 1) {
@@ -47,7 +61,8 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 		       tw->tw_prot->name, tw, atomic_read(&tw->tw_refcnt));
 	}
 #endif
-	inet_twsk_put(tw);
+	if (tb)
+		inet_twsk_put(tw);
 }
 
 static noinline void inet_twsk_free(struct inet_timewait_sock *tw)
@@ -92,6 +107,7 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	tw->tw_tb = icsk->icsk_bind_hash;
 	WARN_ON(!icsk->icsk_bind_hash);
 	inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
+	atomic_inc(&tw->tw_refcnt);
 	spin_unlock(&bhead->lock);
 
 	spin_lock(lock);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 00c6a3e..3681c00 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -250,19 +250,21 @@ unique:
 	 * in hash table socket with a funny identity. */
 	inet->inet_num = lport;
 	inet->inet_sport = htons(lport);
+	sk->sk_hash = hash;
 	WARN_ON(!sk_unhashed(sk));
 	__sk_nulls_add_node_rcu(sk, &head->chain);
-	sk->sk_hash = hash;
+	if (tw) {
+		inet_twsk_unhash(tw);
+		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
+	}
 	spin_unlock(lock);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 
-	if (twp != NULL) {
+	if (twp) {
 		*twp = tw;
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
-	} else if (tw != NULL) {
+	} else if (tw) {
 		/* Silly. Should hash-dance instead... */
 		inet_twsk_deschedule(tw, death_row);
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
 
 		inet_twsk_put(tw);
 	}

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next-2.6] tcp: connect() race with timewait reuse
  2009-12-02 15:08         ` [PATCH net-next-2.6] tcp: connect() race with timewait reuse Eric Dumazet
@ 2009-12-02 22:15           ` Evgeniy Polyakov
  2009-12-03  6:44             ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Evgeniy Polyakov @ 2009-12-02 22:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kdakhane, netdev, netfilter

Hi.

Looks very good, thanks Eric, I have one question.

On Wed, Dec 02, 2009 at 04:08:59PM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> +
> +/*
> + * unhash a timewait socket from established hash
> + * lock must be hold by caller
> + */
> +void inet_twsk_unhash(struct inet_timewait_sock *tw)
> +{
> +	if (hlist_nulls_unhashed(&tw->tw_node))
> +		return;
> +
> +	hlist_nulls_del_rcu(&tw->tw_node);
> +	sk_nulls_node_init(&tw->tw_node);
> +	inet_twsk_put(tw);

Is it safe to call in locked context? inet_twsk_put() schedules
preemption, also I did not check what tw destructor does.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next-2.6] tcp: connect() race with timewait reuse
  2009-12-02 22:15           ` Evgeniy Polyakov
@ 2009-12-03  6:44             ` Eric Dumazet
  2009-12-03  8:31               ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2009-12-03  6:44 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, kdakhane, netdev, netfilter

Evgeniy Polyakov a écrit :
> Hi.
> 
> Looks very good, thanks Eric, I have one question.
> 
> On Wed, Dec 02, 2009 at 04:08:59PM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
>> +
>> +/*
>> + * unhash a timewait socket from established hash
>> + * lock must be hold by caller
>> + */
>> +void inet_twsk_unhash(struct inet_timewait_sock *tw)
>> +{
>> +	if (hlist_nulls_unhashed(&tw->tw_node))
>> +		return;
>> +
>> +	hlist_nulls_del_rcu(&tw->tw_node);
>> +	sk_nulls_node_init(&tw->tw_node);
>> +	inet_twsk_put(tw);
> 
> Is it safe to call in locked context? inet_twsk_put() schedules
> preemption, also I did not check what tw destructor does.
> 

You are probably right, we could defer the inet_twsk_put(tw) out of locked
section, you or I will submit another patch to correct this.

Thanks !

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next-2.6] tcp: connect() race with timewait reuse
  2009-12-03  6:44             ` Eric Dumazet
@ 2009-12-03  8:31               ` Eric Dumazet
  2009-12-03 23:22                 ` Evgeniy Polyakov
  2009-12-04  0:18                 ` David Miller
  0 siblings, 2 replies; 31+ messages in thread
From: Eric Dumazet @ 2009-12-03  8:31 UTC (permalink / raw)
  To: Evgeniy Polyakov, David Miller; +Cc: kdakhane, netdev, netfilter

Eric Dumazet a écrit :
>
> 
> You are probably right, we could defer the inet_twsk_put(tw) out of locked
> section, you or I will submit another patch to correct this.
> 

Here is an updated patch, tested on my dev machine.

I found another problem about tw refcnt I am going to address ASAP.

Thanks !

[PATCH net-next-2.6] tcp: connect() race with timewait reuse

Its currently possible that several threads issuing a connect() find the same
timewait socket and try to reuse it, leading to list corruptions.

Condition for bug is that these threads bound their socket on same address/port
of to-be-find timewait socket, and connected to same target. (SO_REUSEADDR needed)

To fix this problem, we could unhash timewait socket while holding ehash lock,
to make sure lookups/changes will be serialized. Only first thread finds the timewait
socket, other ones find the established socket and return an EADDRNOTAVAIL error.

This second version takes into account Evgeniy's review and makes sure
inet_twsk_put() is called outside of locked sections.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inet_timewait_sock.h |    2 +
 net/ipv4/inet_hashtables.c       |   10 +++++--
 net/ipv4/inet_timewait_sock.c    |   38 +++++++++++++++++++++--------
 net/ipv6/inet6_hashtables.c      |   15 +++++++----
 4 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 773b10f..cb7d93b 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -199,6 +199,8 @@ static inline __be32 inet_rcv_saddr(const struct sock *sk)
 
 extern void inet_twsk_put(struct inet_timewait_sock *tw);
 
+extern int inet_twsk_unhash(struct inet_timewait_sock *tw);
+
 extern struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 						  const int state);
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 94ef51a..30e73c5 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -286,6 +286,7 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 	struct sock *sk2;
 	const struct hlist_nulls_node *node;
 	struct inet_timewait_sock *tw;
+	int twrefcnt = 0;
 
 	spin_lock(lock);
 
@@ -318,20 +319,23 @@ unique:
 	sk->sk_hash = hash;
 	WARN_ON(!sk_unhashed(sk));
 	__sk_nulls_add_node_rcu(sk, &head->chain);
+	if (tw) {
+		twrefcnt = inet_twsk_unhash(tw);
+		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
+	}
 	spin_unlock(lock);
+	if (twrefcnt)
+		inet_twsk_put(tw);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 
 	if (twp) {
 		*twp = tw;
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
 	} else if (tw) {
 		/* Silly. Should hash-dance instead... */
 		inet_twsk_deschedule(tw, death_row);
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
 
 		inet_twsk_put(tw);
 	}
-
 	return 0;
 
 not_unique:
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 1f5d508..11380e6 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -14,22 +14,33 @@
 #include <net/inet_timewait_sock.h>
 #include <net/ip.h>
 
+
+/*
+ * unhash a timewait socket from established hash
+ * lock must be hold by caller
+ */
+int inet_twsk_unhash(struct inet_timewait_sock *tw)
+{
+	if (hlist_nulls_unhashed(&tw->tw_node))
+		return 0;
+
+	hlist_nulls_del_rcu(&tw->tw_node);
+	sk_nulls_node_init(&tw->tw_node);
+	return 1;
+}
+
 /* Must be called with locally disabled BHs. */
 static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 			     struct inet_hashinfo *hashinfo)
 {
 	struct inet_bind_hashbucket *bhead;
 	struct inet_bind_bucket *tb;
+	int refcnt;
 	/* Unlink from established hashes. */
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
 
 	spin_lock(lock);
-	if (hlist_nulls_unhashed(&tw->tw_node)) {
-		spin_unlock(lock);
-		return;
-	}
-	hlist_nulls_del_rcu(&tw->tw_node);
-	sk_nulls_node_init(&tw->tw_node);
+	refcnt = inet_twsk_unhash(tw);
 	spin_unlock(lock);
 
 	/* Disassociate with bind bucket. */
@@ -37,9 +48,12 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 			hashinfo->bhash_size)];
 	spin_lock(&bhead->lock);
 	tb = tw->tw_tb;
-	__hlist_del(&tw->tw_bind_node);
-	tw->tw_tb = NULL;
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	if (tb) {
+		__hlist_del(&tw->tw_bind_node);
+		tw->tw_tb = NULL;
+		inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+		refcnt++;
+	}
 	spin_unlock(&bhead->lock);
 #ifdef SOCK_REFCNT_DEBUG
 	if (atomic_read(&tw->tw_refcnt) != 1) {
@@ -47,7 +61,10 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 		       tw->tw_prot->name, tw, atomic_read(&tw->tw_refcnt));
 	}
 #endif
-	inet_twsk_put(tw);
+	while (refcnt) {
+		inet_twsk_put(tw);
+		refcnt--;
+	}
 }
 
 static noinline void inet_twsk_free(struct inet_timewait_sock *tw)
@@ -92,6 +109,7 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	tw->tw_tb = icsk->icsk_bind_hash;
 	WARN_ON(!icsk->icsk_bind_hash);
 	inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
+	atomic_inc(&tw->tw_refcnt);
 	spin_unlock(&bhead->lock);
 
 	spin_lock(lock);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 00c6a3e..7207801 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -223,6 +223,7 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 	struct sock *sk2;
 	const struct hlist_nulls_node *node;
 	struct inet_timewait_sock *tw;
+	int twrefcnt = 0;
 
 	spin_lock(lock);
 
@@ -250,19 +251,23 @@ unique:
 	 * in hash table socket with a funny identity. */
 	inet->inet_num = lport;
 	inet->inet_sport = htons(lport);
+	sk->sk_hash = hash;
 	WARN_ON(!sk_unhashed(sk));
 	__sk_nulls_add_node_rcu(sk, &head->chain);
-	sk->sk_hash = hash;
+	if (tw) {
+		twrefcnt = inet_twsk_unhash(tw);
+		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
+	}
 	spin_unlock(lock);
+	if (twrefcnt)
+		inet_twsk_put(tw);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 
-	if (twp != NULL) {
+	if (twp) {
 		*twp = tw;
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
-	} else if (tw != NULL) {
+	} else if (tw) {
 		/* Silly. Should hash-dance instead... */
 		inet_twsk_deschedule(tw, death_row);
-		NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);
 
 		inet_twsk_put(tw);
 	}

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next-2.6] tcp: connect() race with timewait reuse
  2009-12-03  8:31               ` Eric Dumazet
@ 2009-12-03 23:22                 ` Evgeniy Polyakov
  2009-12-04  0:18                 ` David Miller
  1 sibling, 0 replies; 31+ messages in thread
From: Evgeniy Polyakov @ 2009-12-03 23:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kdakhane, netdev, netfilter

Hi Eric.

Patch looks good, thank you!

On Thu, Dec 03, 2009 at 09:31:19AM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> +/*
> + * unhash a timewait socket from established hash
> + * lock must be hold by caller
> + */
> +int inet_twsk_unhash(struct inet_timewait_sock *tw)
> +{
> +	if (hlist_nulls_unhashed(&tw->tw_node))
> +		return 0;
> +
> +	hlist_nulls_del_rcu(&tw->tw_node);
> +	sk_nulls_node_init(&tw->tw_node);
> +	return 1;
> +}
> +
>  /* Must be called with locally disabled BHs. */
>  static void __inet_twsk_kill(struct inet_timewait_sock *tw,
>  			     struct inet_hashinfo *hashinfo)
>  {
>  	struct inet_bind_hashbucket *bhead;
>  	struct inet_bind_bucket *tb;
> +	int refcnt;
>  	/* Unlink from established hashes. */
>  	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
>  
>  	spin_lock(lock);
> -	if (hlist_nulls_unhashed(&tw->tw_node)) {
> -		spin_unlock(lock);
> -		return;
> -	}
> -	hlist_nulls_del_rcu(&tw->tw_node);
> -	sk_nulls_node_init(&tw->tw_node);
> +	refcnt = inet_twsk_unhash(tw);

Tricky :)

>  	spin_unlock(lock);
>  
>  	/* Disassociate with bind bucket. */
> @@ -37,9 +48,12 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
>  			hashinfo->bhash_size)];
>  	spin_lock(&bhead->lock);
>  	tb = tw->tw_tb;
> -	__hlist_del(&tw->tw_bind_node);
> -	tw->tw_tb = NULL;
> -	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
> +	if (tb) {
> +		__hlist_del(&tw->tw_bind_node);
> +		tw->tw_tb = NULL;
> +		inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
> +		refcnt++;
> +	}
>  	spin_unlock(&bhead->lock);
>  #ifdef SOCK_REFCNT_DEBUG
>  	if (atomic_read(&tw->tw_refcnt) != 1) {
> @@ -47,7 +61,10 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
>  		       tw->tw_prot->name, tw, atomic_read(&tw->tw_refcnt));
>  	}
>  #endif
> -	inet_twsk_put(tw);
> +	while (refcnt) {
> +		inet_twsk_put(tw);
> +		refcnt--;
> +	}
>  }

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next-2.6] tcp: connect() race with timewait reuse
  2009-12-03  8:31               ` Eric Dumazet
  2009-12-03 23:22                 ` Evgeniy Polyakov
@ 2009-12-04  0:18                 ` David Miller
  1 sibling, 0 replies; 31+ messages in thread
From: David Miller @ 2009-12-04  0:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: zbr, kdakhane, netdev, netfilter

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Dec 2009 09:31:19 +0100

> [PATCH net-next-2.6] tcp: connect() race with timewait reuse

Applied.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-02  8:59   ` David Miller
  2009-12-02  9:23     ` Eric Dumazet
@ 2009-12-02 16:05     ` Ashwani Wason
  2009-12-03  6:38       ` David Miller
  1 sibling, 1 reply; 31+ messages in thread
From: Ashwani Wason @ 2009-12-02 16:05 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, kdakhane, netdev, netfilter, zbr

Both reuse and recycle were enabled for this test. (I know because we,
Kapil and I are working together on different aspects of this.)

- Ashwani



On Wed, Dec 2, 2009 at 12:59 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 01 Dec 2009 16:00:39 +0100
>
>> [PATCH] tcp: Fix a connect() race with timewait sockets
>
> This condition would only trigger if the timewait recycling sysctl is
> enabled.
>
> It is off by default, and I can't find any mention in this bug report
> that it has been turned on.
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: Fix a connect() race with timewait sockets
  2009-12-02 16:05     ` [PATCH] tcp: Fix a connect() race with timewait sockets Ashwani Wason
@ 2009-12-03  6:38       ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2009-12-03  6:38 UTC (permalink / raw)
  To: ashwas; +Cc: eric.dumazet, kdakhane, netdev, netfilter, zbr

From: Ashwani Wason <ashwas@gmail.com>
Date: Wed, 2 Dec 2009 08:05:51 -0800

> Both reuse and recycle were enabled for this test. (I know because we,
> Kapil and I are working together on different aspects of this.)

Thanks, so the timewait recycling code paths really are relevant.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 0/2] tcp: Fix connect() races with timewait sockets
  2009-12-01 15:00 ` [PATCH] tcp: Fix a connect() race with timewait sockets Eric Dumazet
  2009-12-02  8:59   ` David Miller
@ 2009-12-04 13:45   ` Eric Dumazet
  2009-12-04 13:46   ` [PATCH 1/2] tcp: Fix a connect() race " Eric Dumazet
  2009-12-04 13:47   ` [PATCH 2/2] " Eric Dumazet
  3 siblings, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2009-12-04 13:45 UTC (permalink / raw)
  To: kapil dakhane, David S. Miller; +Cc: netdev, netfilter, Evgeniy Polyakov

Eric Dumazet a écrit :
> [PATCH] tcp: Fix a connect() race with timewait sockets
> 
> When we find a timewait connection in __inet_hash_connect() and reuse
> it for a new connection request, we have a race window, releasing bind
> list lock and reacquiring it in __inet_twsk_kill() to remove timewait
> socket from list.
> 
> Another thread might find the timewait socket we already chose, leading to
> list corruption and crashes.
> 
> Fix is to remove timewait socket from bind list before releasing the lock.

I cooked two patches on top of net-next-2.6 to solve the two last
race problems I am aware of.

Kapil, if you want to test them, make sure you take last net-next-2.6 snapshot.

First patch changes __inet_hash_nolisten() and __inet6_hash()
to get a timewait parameter to be able to unhash it from ehash
at same time the new socket is inserted into ehash.

Second patch is a respin of the first patch I sent :
It makes sure __inet_has_connect() cannot give same timewait socket
to different threads.

Thanks !

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/2] tcp: Fix a connect() race with timewait sockets
  2009-12-01 15:00 ` [PATCH] tcp: Fix a connect() race with timewait sockets Eric Dumazet
  2009-12-02  8:59   ` David Miller
  2009-12-04 13:45   ` [PATCH 0/2] tcp: Fix connect() races " Eric Dumazet
@ 2009-12-04 13:46   ` Eric Dumazet
  2009-12-05 21:21     ` Evgeniy Polyakov
  2009-12-09  4:18     ` [PATCH 1/2] tcp: Fix a connect() race with timewait sockets David Miller
  2009-12-04 13:47   ` [PATCH 2/2] " Eric Dumazet
  3 siblings, 2 replies; 31+ messages in thread
From: Eric Dumazet @ 2009-12-04 13:46 UTC (permalink / raw)
  To: kapil dakhane, David S. Miller; +Cc: netdev, netfilter, Evgeniy Polyakov

First patch changes __inet_hash_nolisten() and __inet6_hash()
to get a timewait parameter to be able to unhash it from ehash
at same time the new socket is inserted in hash.

This makes sure timewait socket wont be found by a concurrent
writer in __inet_check_established()

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inet6_hashtables.h |    2 +-
 include/net/inet_hashtables.h  |    8 +++++---
 net/dccp/ipv4.c                |    2 +-
 net/dccp/ipv6.c                |    4 ++--
 net/ipv4/inet_hashtables.c     |   22 ++++++++++++++++------
 net/ipv4/tcp_ipv4.c            |    2 +-
 net/ipv6/inet6_hashtables.c    |    8 +++++++-
 net/ipv6/tcp_ipv6.c            |    4 ++--
 8 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 92838d3..e46674d 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -53,7 +53,7 @@ static inline int inet6_sk_ehashfn(const struct sock *sk)
 	return inet6_ehashfn(net, laddr, lport, faddr, fport);
 }
 
-extern void __inet6_hash(struct sock *sk);
+extern int __inet6_hash(struct sock *sk, struct inet_timewait_sock *twp);
 
 /*
  * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 41cbddd..74358d1 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -251,7 +251,7 @@ extern void inet_put_port(struct sock *sk);
 
 void inet_hashinfo_init(struct inet_hashinfo *h);
 
-extern void __inet_hash_nolisten(struct sock *sk);
+extern int __inet_hash_nolisten(struct sock *sk, struct inet_timewait_sock *tw);
 extern void inet_hash(struct sock *sk);
 extern void inet_unhash(struct sock *sk);
 
@@ -391,10 +391,12 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
 }
 
 extern int __inet_hash_connect(struct inet_timewait_death_row *death_row,
-		struct sock *sk, u32 port_offset,
+		struct sock *sk,
+		u32 port_offset,
 		int (*check_established)(struct inet_timewait_death_row *,
 			struct sock *, __u16, struct inet_timewait_sock **),
-			       void (*hash)(struct sock *sk));
+		int (*hash)(struct sock *sk, struct inet_timewait_sock *twp));
+
 extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
 			     struct sock *sk);
 #endif /* _INET_HASHTABLES_H */
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index efbcfdc..dad7bc4 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -408,7 +408,7 @@ struct sock *dccp_v4_request_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	dccp_sync_mss(newsk, dst_mtu(dst));
 
-	__inet_hash_nolisten(newsk);
+	__inet_hash_nolisten(newsk, NULL);
 	__inet_inherit_port(sk, newsk);
 
 	return newsk;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 6574215..baf05cf 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -46,7 +46,7 @@ static void dccp_v6_hash(struct sock *sk)
 			return;
 		}
 		local_bh_disable();
-		__inet6_hash(sk);
+		__inet6_hash(sk, NULL);
 		local_bh_enable();
 	}
 }
@@ -644,7 +644,7 @@ static struct sock *dccp_v6_request_recv_sock(struct sock *sk,
 	newinet->inet_daddr = newinet->inet_saddr = LOOPBACK4_IPV6;
 	newinet->inet_rcv_saddr = LOOPBACK4_IPV6;
 
-	__inet6_hash(newsk);
+	__inet6_hash(newsk, NULL);
 	__inet_inherit_port(sk, newsk);
 
 	return newsk;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 21e5e32..c4201b7 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -351,12 +351,13 @@ static inline u32 inet_sk_port_offset(const struct sock *sk)
 					  inet->inet_dport);
 }
 
-void __inet_hash_nolisten(struct sock *sk)
+int __inet_hash_nolisten(struct sock *sk, struct inet_timewait_sock *tw)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
 	struct hlist_nulls_head *list;
 	spinlock_t *lock;
 	struct inet_ehash_bucket *head;
+	int twrefcnt = 0;
 
 	WARN_ON(!sk_unhashed(sk));
 
@@ -367,8 +368,13 @@ void __inet_hash_nolisten(struct sock *sk)
 
 	spin_lock(lock);
 	__sk_nulls_add_node_rcu(sk, list);
+	if (tw) {
+		WARN_ON(sk->sk_hash != tw->tw_hash);
+		twrefcnt = inet_twsk_unhash(tw);
+	}
 	spin_unlock(lock);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+	return twrefcnt;
 }
 EXPORT_SYMBOL_GPL(__inet_hash_nolisten);
 
@@ -378,7 +384,7 @@ static void __inet_hash(struct sock *sk)
 	struct inet_listen_hashbucket *ilb;
 
 	if (sk->sk_state != TCP_LISTEN) {
-		__inet_hash_nolisten(sk);
+		__inet_hash_nolisten(sk, NULL);
 		return;
 	}
 
@@ -427,7 +433,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		struct sock *sk, u32 port_offset,
 		int (*check_established)(struct inet_timewait_death_row *,
 			struct sock *, __u16, struct inet_timewait_sock **),
-		void (*hash)(struct sock *sk))
+		int (*hash)(struct sock *sk, struct inet_timewait_sock *twp))
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	const unsigned short snum = inet_sk(sk)->inet_num;
@@ -435,6 +441,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	struct inet_bind_bucket *tb;
 	int ret;
 	struct net *net = sock_net(sk);
+	int twrefcnt = 1;
 
 	if (!snum) {
 		int i, remaining, low, high, port;
@@ -493,13 +500,16 @@ ok:
 		inet_bind_hash(sk, tb, port);
 		if (sk_unhashed(sk)) {
 			inet_sk(sk)->inet_sport = htons(port);
-			hash(sk);
+			twrefcnt += hash(sk, tw);
 		}
 		spin_unlock(&head->lock);
 
 		if (tw) {
 			inet_twsk_deschedule(tw, death_row);
-			inet_twsk_put(tw);
+			while (twrefcnt) {
+				twrefcnt--;
+				inet_twsk_put(tw);
+			}
 		}
 
 		ret = 0;
@@ -510,7 +520,7 @@ ok:
 	tb  = inet_csk(sk)->icsk_bind_hash;
 	spin_lock_bh(&head->lock);
 	if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
-		hash(sk);
+		hash(sk, NULL);
 		spin_unlock_bh(&head->lock);
 		return 0;
 	} else {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 29002ab..15e9603 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1464,7 +1464,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 #endif
 
-	__inet_hash_nolisten(newsk);
+	__inet_hash_nolisten(newsk, NULL);
 	__inet_inherit_port(sk, newsk);
 
 	return newsk;
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index c813e29..633a6c2 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -22,9 +22,10 @@
 #include <net/inet6_hashtables.h>
 #include <net/ip.h>
 
-void __inet6_hash(struct sock *sk)
+int __inet6_hash(struct sock *sk, struct inet_timewait_sock *tw)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	int twrefcnt = 0;
 
 	WARN_ON(!sk_unhashed(sk));
 
@@ -45,10 +46,15 @@ void __inet6_hash(struct sock *sk)
 		lock = inet_ehash_lockp(hashinfo, hash);
 		spin_lock(lock);
 		__sk_nulls_add_node_rcu(sk, list);
+		if (tw) {
+			WARN_ON(sk->sk_hash != tw->tw_hash);
+			twrefcnt = inet_twsk_unhash(tw);
+		}
 		spin_unlock(lock);
 	}
 
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+	return twrefcnt;
 }
 EXPORT_SYMBOL(__inet6_hash);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index aadd7ce..ee9cf62 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -96,7 +96,7 @@ static void tcp_v6_hash(struct sock *sk)
 			return;
 		}
 		local_bh_disable();
-		__inet6_hash(sk);
+		__inet6_hash(sk, NULL);
 		local_bh_enable();
 	}
 }
@@ -1496,7 +1496,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 #endif
 
-	__inet6_hash(newsk);
+	__inet6_hash(newsk, NULL);
 	__inet_inherit_port(sk, newsk);
 
 	return newsk;


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] tcp: Fix a connect() race with timewait sockets
  2009-12-04 13:46   ` [PATCH 1/2] tcp: Fix a connect() race " Eric Dumazet
@ 2009-12-05 21:21     ` Evgeniy Polyakov
  2009-12-07  9:59       ` [PATCH] tcp: documents timewait refcnt tricks Eric Dumazet
  2009-12-09  4:18     ` [PATCH 1/2] tcp: Fix a connect() race with timewait sockets David Miller
  1 sibling, 1 reply; 31+ messages in thread
From: Evgeniy Polyakov @ 2009-12-05 21:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kapil dakhane, David S. Miller, netdev, netfilter

Hi Eric.

On Fri, Dec 04, 2009 at 02:46:54PM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> First patch changes __inet_hash_nolisten() and __inet6_hash()
> to get a timewait parameter to be able to unhash it from ehash
> at same time the new socket is inserted in hash.
> 
> This makes sure timewait socket wont be found by a concurrent
> writer in __inet_check_established()

Both patches look good, although trick with returning reference counter
may look like a hack especially when only viewing into ip code and not
hashtable itself. Can you please cook up a documentation update for hash
function that it is supposed to return refcnt when socket was in hash
table.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] tcp: documents timewait refcnt tricks
  2009-12-05 21:21     ` Evgeniy Polyakov
@ 2009-12-07  9:59       ` Eric Dumazet
  2009-12-07 16:06         ` Randy Dunlap
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2009-12-07  9:59 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: kapil dakhane, David S. Miller, netdev, netfilter

Evgeniy Polyakov a écrit :
> Hi Eric.
> 
> On Fri, Dec 04, 2009 at 02:46:54PM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
>> First patch changes __inet_hash_nolisten() and __inet6_hash()
>> to get a timewait parameter to be able to unhash it from ehash
>> at same time the new socket is inserted in hash.
>>
>> This makes sure timewait socket wont be found by a concurrent
>> writer in __inet_check_established()
> 
> Both patches look good, although trick with returning reference counter
> may look like a hack especially when only viewing into ip code and not
> hashtable itself. Can you please cook up a documentation update for hash
> function that it is supposed to return refcnt when socket was in hash
> table.
> 

Sure, here it is :

Thanks !

[PATCH] tcp: documents timewait refcnt tricks 

Adds kerneldoc for inet_twsk_unhash() & inet_twsk_bind_unhash().

Suggested-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 1958cf5..cf719c2 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -15,9 +15,13 @@
 #include <net/ip.h>
 
 
-/*
- * unhash a timewait socket from established hash
- * lock must be hold by caller
+/**
+ *	inet_twsk_unhash - unhash a timewait socket from established hash
+ *	@tw: timewait socket
+ *
+ *	unhash a timewait socket from established hash, if hashed.
+ *	ehash lock must be hold by caller.
+ *	Returns 1 if caller should call inet_twsk_put() after lock release.
  */
 int inet_twsk_unhash(struct inet_timewait_sock *tw)
 {
@@ -26,12 +30,21 @@ int inet_twsk_unhash(struct inet_timewait_sock *tw)
 
 	hlist_nulls_del_rcu(&tw->tw_node);
 	sk_nulls_node_init(&tw->tw_node);
+	/*
+	 * We cannot call inet_twsk_put() ourself under lock,
+	 * caller must call it for us.
+	 */
 	return 1;
 }
 
-/*
- * unhash a timewait socket from bind hash
- * lock must be hold by caller
+/**
+ *	inet_twsk_bind_unhash - unhash a timewait socket from bind hash
+ *	@tw: timewait socket
+ *	@hashinfo: hashinfo pointer
+ *
+ *	unhash a timewait socket from bind hash, if hashed.
+ *	bind hash lock must be hold by caller.
+ *	Returns 1 if caller should call inet_twsk_put() after lock release.
  */
 int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 			  struct inet_hashinfo *hashinfo)
@@ -44,6 +57,10 @@ int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 	__hlist_del(&tw->tw_bind_node);
 	tw->tw_tb = NULL;
 	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	/*
+	 * We cannot call inet_twsk_put() ourself under lock,
+	 * caller must call it for us.
+	 */
 	return 1;
 }
 
@@ -140,7 +157,7 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	/*
 	 * Notes :
- 	 * - We initially set tw_refcnt to 0 in inet_twsk_alloc()
+	 * - We initially set tw_refcnt to 0 in inet_twsk_alloc()
 	 * - We add one reference for the bhash link
 	 * - We add one reference for the ehash link
 	 * - We want this refcnt update done before allowing other
@@ -150,7 +167,6 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	spin_unlock(lock);
 }
-
 EXPORT_SYMBOL_GPL(__inet_twsk_hashdance);
 
 struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int state)
@@ -191,7 +207,6 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 
 	return tw;
 }
-
 EXPORT_SYMBOL_GPL(inet_twsk_alloc);
 
 /* Returns non-zero if quota exceeded.  */
@@ -270,7 +285,6 @@ void inet_twdr_hangman(unsigned long data)
 out:
 	spin_unlock(&twdr->death_lock);
 }
-
 EXPORT_SYMBOL_GPL(inet_twdr_hangman);
 
 void inet_twdr_twkill_work(struct work_struct *work)
@@ -301,7 +315,6 @@ void inet_twdr_twkill_work(struct work_struct *work)
 		spin_unlock_bh(&twdr->death_lock);
 	}
 }
-
 EXPORT_SYMBOL_GPL(inet_twdr_twkill_work);
 
 /* These are always called from BH context.  See callers in
@@ -321,7 +334,6 @@ void inet_twsk_deschedule(struct inet_timewait_sock *tw,
 	spin_unlock(&twdr->death_lock);
 	__inet_twsk_kill(tw, twdr->hashinfo);
 }
-
 EXPORT_SYMBOL(inet_twsk_deschedule);
 
 void inet_twsk_schedule(struct inet_timewait_sock *tw,
@@ -402,7 +414,6 @@ void inet_twsk_schedule(struct inet_timewait_sock *tw,
 		mod_timer(&twdr->tw_timer, jiffies + twdr->period);
 	spin_unlock(&twdr->death_lock);
 }
-
 EXPORT_SYMBOL_GPL(inet_twsk_schedule);
 
 void inet_twdr_twcal_tick(unsigned long data)
@@ -463,7 +474,6 @@ out:
 #endif
 	spin_unlock(&twdr->death_lock);
 }
-
 EXPORT_SYMBOL_GPL(inet_twdr_twcal_tick);
 
 void inet_twsk_purge(struct inet_hashinfo *hashinfo,

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: documents timewait refcnt tricks
  2009-12-07  9:59       ` [PATCH] tcp: documents timewait refcnt tricks Eric Dumazet
@ 2009-12-07 16:06         ` Randy Dunlap
  2009-12-09  4:20           ` David Miller
  0 siblings, 1 reply; 31+ messages in thread
From: Randy Dunlap @ 2009-12-07 16:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Evgeniy Polyakov, kapil dakhane, David S. Miller, netdev,
	netfilter

Eric Dumazet wrote:
> Evgeniy Polyakov a écrit :
>> Hi Eric.
>>
>> On Fri, Dec 04, 2009 at 02:46:54PM +0100, Eric Dumazet (eric.dumazet@gmail.com) wrote:
>>> First patch changes __inet_hash_nolisten() and __inet6_hash()
>>> to get a timewait parameter to be able to unhash it from ehash
>>> at same time the new socket is inserted in hash.
>>>
>>> This makes sure timewait socket wont be found by a concurrent
>>> writer in __inet_check_established()
>> Both patches look good, although trick with returning reference counter
>> may look like a hack especially when only viewing into ip code and not
>> hashtable itself. Can you please cook up a documentation update for hash
>> function that it is supposed to return refcnt when socket was in hash
>> table.
>>
> 
> Sure, here it is :
> 
> Thanks !
> 
> [PATCH] tcp: documents timewait refcnt tricks 
> 
> Adds kerneldoc for inet_twsk_unhash() & inet_twsk_bind_unhash().
> 
> Suggested-by: Evgeniy Polyakov <zbr@ioremap.net>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> 
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 1958cf5..cf719c2 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -15,9 +15,13 @@
>  #include <net/ip.h>
>  
>  
> -/*
> - * unhash a timewait socket from established hash
> - * lock must be hold by caller
> +/**
> + *	inet_twsk_unhash - unhash a timewait socket from established hash
> + *	@tw: timewait socket
> + *
> + *	unhash a timewait socket from established hash, if hashed.
> + *	ehash lock must be hold by caller.

	                   held

> + *	Returns 1 if caller should call inet_twsk_put() after lock release.
>   */
>  int inet_twsk_unhash(struct inet_timewait_sock *tw)
>  {
> @@ -26,12 +30,21 @@ int inet_twsk_unhash(struct inet_timewait_sock *tw)
>  
>  	hlist_nulls_del_rcu(&tw->tw_node);
>  	sk_nulls_node_init(&tw->tw_node);
> +	/*
> +	 * We cannot call inet_twsk_put() ourself under lock,
> +	 * caller must call it for us.
> +	 */
>  	return 1;
>  }
>  
> -/*
> - * unhash a timewait socket from bind hash
> - * lock must be hold by caller
> +/**
> + *	inet_twsk_bind_unhash - unhash a timewait socket from bind hash
> + *	@tw: timewait socket
> + *	@hashinfo: hashinfo pointer
> + *
> + *	unhash a timewait socket from bind hash, if hashed.
> + *	bind hash lock must be hold by caller.

	                       held

> + *	Returns 1 if caller should call inet_twsk_put() after lock release.
>   */
>  int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
>  			  struct inet_hashinfo *hashinfo)


thanks.
-- 
~Randy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] tcp: documents timewait refcnt tricks
  2009-12-07 16:06         ` Randy Dunlap
@ 2009-12-09  4:20           ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2009-12-09  4:20 UTC (permalink / raw)
  To: rdunlap; +Cc: eric.dumazet, zbr, kdakhane, netdev, netfilter

From: Randy Dunlap <rdunlap@xenotime.net>
Date: Mon, 07 Dec 2009 08:06:58 -0800

> Eric Dumazet wrote:
>> Evgeniy Polyakov a écrit :
>> [PATCH] tcp: documents timewait refcnt tricks 
>> 
>> Adds kerneldoc for inet_twsk_unhash() & inet_twsk_bind_unhash().
>> 
>> Suggested-by: Evgeniy Polyakov <zbr@ioremap.net>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
 ...
>> + *	unhash a timewait socket from established hash, if hashed.
>> + *	ehash lock must be hold by caller.
> 
> 	                   held
> 
 ...
>> + *	unhash a timewait socket from bind hash, if hashed.
>> + *	bind hash lock must be hold by caller.
> 
> 	                       held

I've applied the patch with Randy's corrections added.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] tcp: Fix a connect() race with timewait sockets
  2009-12-04 13:46   ` [PATCH 1/2] tcp: Fix a connect() race " Eric Dumazet
  2009-12-05 21:21     ` Evgeniy Polyakov
@ 2009-12-09  4:18     ` David Miller
  1 sibling, 0 replies; 31+ messages in thread
From: David Miller @ 2009-12-09  4:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kdakhane, netdev, netfilter, zbr

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 Dec 2009 14:46:54 +0100

> First patch changes __inet_hash_nolisten() and __inet6_hash()
> to get a timewait parameter to be able to unhash it from ehash
> at same time the new socket is inserted in hash.
> 
> This makes sure timewait socket wont be found by a concurrent
> writer in __inet_check_established()
> 
> Reported-by: kapil dakhane <kdakhane@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/2] tcp: Fix a connect() race with timewait sockets
  2009-12-01 15:00 ` [PATCH] tcp: Fix a connect() race with timewait sockets Eric Dumazet
                     ` (2 preceding siblings ...)
  2009-12-04 13:46   ` [PATCH 1/2] tcp: Fix a connect() race " Eric Dumazet
@ 2009-12-04 13:47   ` Eric Dumazet
  2009-12-09  4:19     ` David Miller
  3 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2009-12-04 13:47 UTC (permalink / raw)
  To: kapil dakhane, David S. Miller; +Cc: netdev, netfilter, Evgeniy Polyakov

When we find a timewait connection in __inet_hash_connect() and reuse
it for a new connection request, we have a race window, releasing bind
list lock and reacquiring it in __inet_twsk_kill() to remove timewait
socket from list.

Another thread might find the timewait socket we already chose, leading to
list corruption and crashes.

Fix is to remove timewait socket from bind list before releasing the bind lock.

Note: This problem happens if sysctl_tcp_tw_reuse is set.

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inet_timewait_sock.h |    3 +++
 net/ipv4/inet_hashtables.c       |    2 ++
 net/ipv4/inet_timewait_sock.c    |   29 +++++++++++++++++++++--------
 3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index b801ade..79f67ea 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -201,6 +201,9 @@ extern void inet_twsk_put(struct inet_timewait_sock *tw);
 
 extern int inet_twsk_unhash(struct inet_timewait_sock *tw);
 
+extern int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
+				 struct inet_hashinfo *hashinfo);
+
 extern struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 						  const int state);
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index c4201b7..2b79377 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -502,6 +502,8 @@ ok:
 			inet_sk(sk)->inet_sport = htons(port);
 			twrefcnt += hash(sk, tw);
 		}
+		if (tw)
+			twrefcnt += inet_twsk_bind_unhash(tw, hinfo);
 		spin_unlock(&head->lock);
 
 		if (tw) {
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 0a1c62b..1958cf5 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -29,12 +29,29 @@ int inet_twsk_unhash(struct inet_timewait_sock *tw)
 	return 1;
 }
 
+/*
+ * unhash a timewait socket from bind hash
+ * lock must be hold by caller
+ */
+int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
+			  struct inet_hashinfo *hashinfo)
+{
+	struct inet_bind_bucket *tb = tw->tw_tb;
+
+	if (!tb)
+		return 0;
+
+	__hlist_del(&tw->tw_bind_node);
+	tw->tw_tb = NULL;
+	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	return 1;
+}
+
 /* Must be called with locally disabled BHs. */
 static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 			     struct inet_hashinfo *hashinfo)
 {
 	struct inet_bind_hashbucket *bhead;
-	struct inet_bind_bucket *tb;
 	int refcnt;
 	/* Unlink from established hashes. */
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
@@ -46,15 +63,11 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 	/* Disassociate with bind bucket. */
 	bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), tw->tw_num,
 			hashinfo->bhash_size)];
+
 	spin_lock(&bhead->lock);
-	tb = tw->tw_tb;
-	if (tb) {
-		__hlist_del(&tw->tw_bind_node);
-		tw->tw_tb = NULL;
-		inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
-		refcnt++;
-	}
+	refcnt += inet_twsk_bind_unhash(tw, hashinfo);
 	spin_unlock(&bhead->lock);
+
 #ifdef SOCK_REFCNT_DEBUG
 	if (atomic_read(&tw->tw_refcnt) != 1) {
 		printk(KERN_DEBUG "%s timewait_sock %p refcnt=%d\n",

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] tcp: Fix a connect() race with timewait sockets
  2009-12-04 13:47   ` [PATCH 2/2] " Eric Dumazet
@ 2009-12-09  4:19     ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2009-12-09  4:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kdakhane, netdev, netfilter, zbr

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 Dec 2009 14:47:42 +0100

> When we find a timewait connection in __inet_hash_connect() and reuse
> it for a new connection request, we have a race window, releasing bind
> list lock and reacquiring it in __inet_twsk_kill() to remove timewait
> socket from list.
> 
> Another thread might find the timewait socket we already chose, leading to
> list corruption and crashes.
> 
> Fix is to remove timewait socket from bind list before releasing the bind lock.
> 
> Note: This problem happens if sysctl_tcp_tw_reuse is set.
> 
> Reported-by: kapil dakhane <kdakhane@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued up for -stable, thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2009-12-09  4:20 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-12-01  2:02 soft lockup in inet_csk_get_port kapil dakhane
2009-12-01  6:10 ` Eric Dumazet
2009-12-01 15:00 ` [PATCH] tcp: Fix a connect() race with timewait sockets Eric Dumazet
2009-12-02  8:59   ` David Miller
2009-12-02  9:23     ` Eric Dumazet
2009-12-02 10:33       ` Eric Dumazet
2009-12-02 11:32         ` Evgeniy Polyakov
2009-12-02 19:18           ` kapil dakhane
2009-12-03  2:43             ` kapil dakhane
2009-12-03 10:49               ` [PATCH] tcp: fix a timewait refcnt race Eric Dumazet
2009-12-04  0:19                 ` David Miller
2009-12-04  3:20                 ` kapil dakhane
2009-12-04  6:29                   ` Eric Dumazet
2009-12-04  6:39                     ` David Miller
2009-12-02 15:08         ` [PATCH net-next-2.6] tcp: connect() race with timewait reuse Eric Dumazet
2009-12-02 22:15           ` Evgeniy Polyakov
2009-12-03  6:44             ` Eric Dumazet
2009-12-03  8:31               ` Eric Dumazet
2009-12-03 23:22                 ` Evgeniy Polyakov
2009-12-04  0:18                 ` David Miller
2009-12-02 16:05     ` [PATCH] tcp: Fix a connect() race with timewait sockets Ashwani Wason
2009-12-03  6:38       ` David Miller
2009-12-04 13:45   ` [PATCH 0/2] tcp: Fix connect() races " Eric Dumazet
2009-12-04 13:46   ` [PATCH 1/2] tcp: Fix a connect() race " Eric Dumazet
2009-12-05 21:21     ` Evgeniy Polyakov
2009-12-07  9:59       ` [PATCH] tcp: documents timewait refcnt tricks Eric Dumazet
2009-12-07 16:06         ` Randy Dunlap
2009-12-09  4:20           ` David Miller
2009-12-09  4:18     ` [PATCH 1/2] tcp: Fix a connect() race with timewait sockets David Miller
2009-12-04 13:47   ` [PATCH 2/2] " Eric Dumazet
2009-12-09  4:19     ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).