From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephen Buck Subject: Re: tproxy related crash in inet_hashtables Date: Fri, 13 Aug 2010 23:05:51 +1000 Message-ID: <4C65432F.4030809@exinda.com> References: <4C64FF2E.6060509@exinda.com> <1281696852.4470.20.camel@edumazet-laptop> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: "netdev@vger.kernel.org" , Netfilter Development Mailinglist , Patrick McHardy , KOVACS Krisztian To: Eric Dumazet Return-path: Received: from bld-mail12.adl6.internode.on.net ([150.101.137.97]:52155 "EHLO mail.internode.on.net" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S934263Ab0HMNXt (ORCPT ); Fri, 13 Aug 2010 09:23:49 -0400 In-Reply-To: <1281696852.4470.20.camel@edumazet-laptop> Sender: netfilter-devel-owner@vger.kernel.org List-ID: On 13/08/10 20:54, Eric Dumazet wrote: > Le vendredi 13 ao=C3=BBt 2010 =C3=A0 18:15 +1000, Stephen Buck a =C3=A9= crit : >> Recently I encountered a number of crashes related to tproxy on the >> 2.6.34.1 (x86_64 SMP) kernel. These usually manifested as a bug like= the >> following (Although the bug was confirmed to be present on a vanilla >> kernel, this particular trace is from a kernel with some customisati= ons): >> >> [ 1504.765077] BUG: unable to handle kernel NULL pointer dereference= at >> (null) >> [ 1504.848183] IP: [] inet_bind_bucket_destroy+0x1= b/0x30 >> [ 1504.927126] PGD 1a9933067 PUD 1ad909067 PMD 0 >> [ 1504.980125] Thread overran stack, or stack corrupted >> [ 1505.039325] Oops: 0002 #1 SMP >> [ 1505.077775] last sysfs file: >> /sys/devices/system/cpu/cpu15/topology/thread_siblings >> [ 1505.169166] CPU 0 >> [ 1505.193070] Modules linked in: sch_sfq cls_fw sch_htb xt_physdev >> 8021q bridge stp llc >> [ 1505.923769] Pid: 0, comm: swapper Not tainted 2.6.31-12EXINDAsmp = #0 >> PowerEdge R710 >> [ 1506.014118] RIP: 0010:[] [] >> inet_bind_bucket_destroy+0x1b/0x30 >> [ 1506.122242] RSP: 0018:ffffc90000003e10 EFLAGS: 00010246 >> [ 1506.185655] RAX: 0000000000000000 RBX: ffffc900164a02a0 RCX: >> ffffea00098e24b0 >> [ 1506.270863] RDX: 0000000000000000 RSI: ffff8802e1186280 RDI: >> ffffffff815b4600 >> [ 1506.356077] RBP: ffffc90000003e10 R08: 0000000000000016 R09: >> 0000000000000001 >> [ 1506.441284] R10: 0000000000000000 R11: 0000000000000000 R12: >> ffff88011b894fc0 >> [ 1506.526500] R13: ffffffff81744c80 R14: ffffffff815fe6c0 R15: >> 0000000000000003 >> [ 1506.611722] FS: 0000000000000000(0000) GS:ffffc90000000000(0000) >> knlGS:0000000000000000 >> [ 1506.708410] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >> [ 1506.777011] CR2: 0000000000000000 CR3: 00000001ae17a000 CR4: >> 00000000000006f0 >> [ 1506.862225] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >> 0000000000000000 >> [ 1506.947442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: >> 0000000000000400 >> [ 1507.032650] Process swapper (pid: 0, threadinfo ffffffff815b0000, >> task ffffffff815b6bc0) >> [ 1507.129345] Stack: >> [ 1507.153299] ffffc90000003e40 ffffffff8135b834 0000000000000001 >> ffff88011b894fc0 >> [ 1507.239581]<0> ffffffff815fe560 0000000000000002 ffffc90000003e8= 0 >> ffffffff8135bbf6 >> [ 1507.331133]<0> ffffffff815fe560 ffffffff815fe560 ffffffff815fe6c= 0 >> ffffc90000003eb0 >> [ 1507.424833] Call Trace: >> [ 1507.453976] >> [ 1507.478971] [] __inet_twsk_kill+0xb4/0xf0 >> [ 1507.546538] [] inet_twdr_do_twkill_work+0x66/0x= d0 >> [ 1507.622408] [] ? inet_twdr_hangman+0x0/0xd0 >> [ 1507.692041] [] inet_twdr_hangman+0x55/0xd0 >> [ 1507.760650] [] run_timer_softirq+0x18c/0x220 >> [ 1507.831330] [] __do_softirq+0xc8/0x1f0 >> [ 1507.895799] [] call_softirq+0x1c/0x30 >> [ 1507.959209] [] do_softirq+0x45/0x80 >> [ 1508.020538] [] irq_exit+0x87/0x90 >> [ 1508.079797] [] smp_apic_timer_interrupt+0x71/0x= 9d >> [ 1508.155667] [] apic_timer_interrupt+0x13/0x20 >> [ 1508.227385] >> [ 1508.252383] [] ? mwait_idle+0x7e/0x110 >> [ 1508.316836] [] ? __atomic_notifier_call_chain+0= xd/0x10 >> [ 1508.397903] [] ? atomic_notifier_call_chain+0x1= 1/0x20 >> [ 1508.477928] [] ? cpu_idle+0x51/0x90 >> [ 1508.539265] [] ? rest_init+0x6b/0x80 >> [ 1508.601639] [] ? start_kernel+0x2c5/0x370 >> [ 1508.669200] [] ? x86_64_start_reservations+0x81= /0xc0 >> [ 1508.748192] [] ? x86_64_start_kernel+0xd6/0x100 >> [ 1508.821979] Code: 64 10 40 eb 94 0f 1f 44 00 00 66 0f 1f 44 00 00= 55 >> 48 89 e5 0f 1f 44 00 00 48 83 7e 18 00 75 19 48 8b 46 08 48 8b 56 10 >> [ 1509.048123] RIP [] inet_bind_bucket_destroy+0x1= b/0x30 >> [ 1509.128157] RSP >> [ 1509.169759] CR2: 0000000000000000 >> >> >> After spending a while tracking it down, I discovered that the wrong >> locks get held when operating on the bind hash table's chains. >> >> This is due to the listen socket and the child socket having differe= nt >> local ports when __inet_inherit_port() is called. The lock is held b= ased >> on the child socket's port, but the list operated on is the one the >> listen socket belongs to. >> >> e.g. >> There is a transparent proxy listening on port 9999. >> A new http connection (with port 80) is redirected to the proxy. >> >> The inet_bind_hashbucket locked in this case is table->bhash[80].loc= k, >> but the inet_bind_bucket the child socket is added to is in the chai= n of >> the table->bhash[9999] inet_bind_hashbucket. This means that if anot= her >> connection with a different local port arrived and was redirected to= the >> proxy, they could both be operating on the list at the same time. >> >> >> Attached is a patch that should fix this by looking up the correct >> inet_bind_bucket based on the child's local port when the >> inet_bind_bucket from the listen socket has a different port to the >> child's inet_num. It was built against 2.6.34.1, but should apply to= any >> mainline kernel. >> >> It is also possible the same bug exists in the IPv6 code as well. As= I >> have not had to deal with IPv6 yet, I have not had a look. > Hi Stephen > > CC netfilter-devel& Patrick& Krisztian > > I cannot convince myself this patch is a right fix. > > This probably should be fixed in netfilter tree, not in > net/ipv4/inet_hashtables.c ? > > Once tproxy is involved, the original port (80) should be changed to > 9999 by tproxy (for SYN packet) and conntrack for following ones. > > So listening socket and its children all use source port 9999 ? > > (inet_sk(child)->inet_num =3D=3D inet_sk(parent)->inet_num) I think you are referring to the older method of transparent proxying.=20 Tproxy4 does not rely on conntrack. The socket is created with the=20 connection's original ports and IPs. > > You claim wrong lock is taken at insert time, but are you sure the ri= ght > lock is taken at deletion time ? > > Hmm... > Without the patch, you have the same problem in __inet_put_port(). The=20 lock is taken based on the child's inet_num, but the icsk_bind_hash of=20 the socket was inherited from the parent, so it belongs to a different=20 inet_bind_hashbucket. With the patch, inet_bind_hash refers to the inet_bind_bucket that was=20 found by searching the hash table, rather than directly inherited from=20 the parent. This means that the correct lock is chosen for the list=20 being manipulated. -- To unsubscribe from this list: send the line "unsubscribe netfilter-dev= el" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html