From mboxrd@z Thu Jan 1 00:00:00 1970
From: Phil Oester
Subject: Re: Deadlocks
Date: Mon, 14 Jun 2004 21:47:23 -0700
Sender: netfilter-devel-admin@lists.netfilter.org
Message-ID: <20040615044723.GA16891@linuxace.com>
References: <20040609180909.GA11445@linuxace.com> <1087156709.11287.8.camel@ws>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: netfilter-devel@lists.netfilter.org
Return-path:
To: Patrick McHardy
Content-Disposition: inline
In-Reply-To: <1087156709.11287.8.camel@ws>
Errors-To: netfilter-devel-admin@lists.netfilter.org
List-Help:
List-Post:
List-Subscribe: ,
List-Unsubscribe: ,
List-Archive:
List-Id: netfilter-devel.vger.kernel.org

I agree I am likely experiencing the deadlock you refer to:

CPU1:
conntrack-helper:help: lock(private_lock)
ip_conntrack_expect_related: write_lock(ip_conntrack_lock)

CPU2:
nat-core:do_bindings: read_lock(ip_conntrack_lock)
nat-helper:help: lock(private_lock)

However, it's unclear to me that the ip_ftp_lock can be trivially eliminated.
This code path looks particularly prickly in ip_nat_ftp.c:

help
  ftp_data_fixup
    ip_conntrack_change_expect

so the nat helper is changing the expectation -- potentially at the same
time the conntrack helper is calling ip_conntrack_expect.

If the private lock were removed, could this not cause a race condition if
the expectation got created just after the nat-helper changed the expectation?

It seems the ip_ftp_lock is needed, but perhaps needs to be reworked to avoid
the deadlock condition illustrated above.

Thoughts?

Phil

On Sun, Jun 13, 2004 at 09:58:29PM +0200, Patrick McHardy wrote:
> On Wed, 2004-06-09 at 20:09, Phil Oester wrote:
> > For the past 3 months I've been experiencing deadlocks on some heavily
> > used gateway/firewall boxes which started after upgrading from 2.4.20.
> >
> > I can confirm that moving back to 2.4.20 stops the hangs, moving to 2.4.21
> > (or any kernel after that) makes them return.
> > I am in the process of testing out each individual 2.4.21-pre to find out
> > where exactly the problem is.
> >
> > In the interim, I've collected some SysRq output which may help in the
> > analysis. Below are two separate lockups on a 2.6.6 kernel. Anyone have
> > any bright ideas?
>
> This looks like the problem I described a couple of month ago:
> http://lists.netfilter.org/pipermail/netfilter-devel/2003-November/013130.html
> I went through the 2.4.21 patch, but couldn't find anything that looks
> related to this. The patch attached to the email above should apply to
> something around 2.4.23. Please also enable CONFIG_NETFILTER_DEBUG, so
> we can see where exactly the problem occurs.
>
> Regards
> Patrick
>
> >
> > Phil Oester
> >
> >
> > Lockup #1:
> > Pid: 0, comm: swapper
> > EIP: 0060:[] CPU: 1
> > EIP is at __write_lock_failed+0xf/0x20
> > EFLAGS: 00000287 Not tainted (2.6.6)
> > EAX: c0283360 EBX: ffffffff ECX: 7d9d14aa EDX: ee83c1e0
> > ESI: f454b910 EDI: ffffffff EBP: 0000007d DS: 007b ES: 007b
> > CR0: 8005003b CR2: 08076ac4 CR3: 37b34000 CR4: 00000690
> > Call Trace:
> > [] .text.lock.ip_conntrack_core+0x7d/0xd5
> > [] do_bindings+0x8d/0x260
> > [] try_rfc959+0x25/0x30
> > [] help+0x2f7/0x430
> > [] try_rfc959+0x0/0x30
> > [] tcp_packet+0xd1/0x160
> > [] ip_conntrack_in+0x100/0x220
> > [] nf_iterate+0x72/0xb0
> > [] ip_rcv_finish+0x0/0x245
> > [] nf_hook_slow+0x78/0x110
> > [] ip_rcv_finish+0x0/0x245
> > [] ip_rcv+0x3c1/0x480
> > [] ip_rcv_finish+0x0/0x245
> > [] alloc_skb+0x32/0xd0
> > [] netif_receive_skb+0x162/0x190
> > [] e1000_clean_rx_irq+0x399/0x410
> > [] e1000_clean+0x34/0xb0
> > [] net_rx_action+0x7f/0x110
> > [] __do_softirq+0xb4/0xc0
> > [] do_softirq+0x4c/0x60
> > =======================
> > [] do_IRQ+0x145/0x180
> > [] common_interrupt+0x18/0x20
> > [] default_idle+0x0/0x40
> > [] default_idle+0x2c/0x40
> > [] cpu_idle+0x3b/0x50
> > [] __call_console_drivers+0x57/0x60
> > [] call_console_drivers+0x7f/0x100
> >
> >
> > Lockup #2:
> > Pid: 0, comm: swapper
> > EIP: 0060:[] CPU: 0
> > EIP is at .text.lock.ip_nat_ftp+0x19/0x29
> > EFLAGS: 00000286 Not tainted (2.6.6)
> > EAX: 00000001 EBX: c0306000 ECX: d31c3034 EDX: eaeb8ac0
> > ESI: 00000019 EDI: eaeb8a48 EBP: c0306d24 DS: 007b ES: 007b
> > CR0: 8005003b CR2: 4024f0ec CR3: 31515000 CR4: 00000690
> > Call Trace:
> > [] tcp_exp_matches_pkt+0x32/0x79
> > [] do_bindings+0x34f/0x570
> > [] ip_nat_fn+0x77/0x310
> > [] nf_iterate+0x6e/0xc0
> > [] ip_finish_output2+0x0/0x1cb
> > [] nf_hook_slow+0x86/0x150
> > [] ip_finish_output2+0x0/0x1cb
> > [] ip_finish_output+0x43/0x50
> > [] ip_finish_output2+0x0/0x1cb
> > [] ip_forward_finish+0x2c/0x50
> > [] nf_hook_slow+0xda/0x150
> > [] ip_forward_finish+0x0/0x50
> > [] ip_forward+0x137/0x1d0
> > [] ip_forward_finish+0x0/0x50
> > [] ip_rcv_finish+0x1e8/0x25d
> > [] nf_iterate+0x6e/0xc0
> > [] ip_rcv_finish+0x0/0x25d
> > [] nf_hook_slow+0xda/0x150
> > [] ip_rcv_finish+0x0/0x25d
> > [] ip_rcv+0x18d/0x240
> > [] ip_rcv_finish+0x0/0x25d
> > [] netif_receive_skb+0x174/0x1a0
> > [] e1000_clean_rx_irq+0x3d8/0x490
> > [] e1000_clean+0x3c/0xb0
> > [] net_rx_action+0x90/0x130
> > [] __do_softirq+0xb4/0xc0
> > [] do_softirq+0x4f/0x60
> > =======================
> > [] do_IRQ+0x1a9/0x260
> > [] smp_apic_timer_interrupt+0xcc/0x130
> > [] common_interrupt+0x18/0x20
> > [] default_idle+0x0/0x40
> > [] default_idle+0x2f/0x40
> > [] cpu_idle+0x3b/0x50
> > [] unknown_bootoption+0x0/0x120
> > [] start_kernel+0x173/0x1c0
> > [] unknown_bootoption+0x0/0x120
> >
> >
> >
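
For reference, the CPU1/CPU2 interleaving at the top of this message is a
classic lock-order inversion, and it can be modelled in userspace. The
following is only a rough sketch under stated assumptions, not kernel code:
conntrack_lock and private_lock are hypothetical stand-ins for
ip_conntrack_lock and the helper's private lock, both are plain mutexes
(the kernel's read/write distinction and softirq context are ignored), and
the function names are made up for illustration. Timeouts and a barrier
replace the real hang so the script terminates. It does not answer whether
ip_ftp_lock can be dropped; it only shows that taking both locks in one
fixed order avoids this particular stall.

```python
import threading

# Hypothetical stand-ins for the two kernel locks named in the message.
conntrack_lock = threading.Lock()  # plays the role of ip_conntrack_lock
private_lock = threading.Lock()    # plays the role of the helper's private lock


def inverted_order_stalls(timeout=0.2):
    """Replay the CPU1/CPU2 interleaving; return True if both paths stall."""
    barrier = threading.Barrier(2)
    got_second = {}

    def path(name, first, second):
        with first:
            barrier.wait()  # both threads now hold their first lock
            ok = second.acquire(timeout=timeout)  # would spin forever in the kernel
            if ok:
                second.release()
            got_second[name] = ok
            barrier.wait()  # keep holding `first` until both attempts are over

    # CPU1: conntrack helper path, private lock first, then conntrack lock.
    cpu1 = threading.Thread(target=path, args=("cpu1", private_lock, conntrack_lock))
    # CPU2: NAT path, conntrack lock first, then private lock.
    cpu2 = threading.Thread(target=path, args=("cpu2", conntrack_lock, private_lock))
    cpu1.start()
    cpu2.start()
    cpu1.join()
    cpu2.join()
    return not got_second["cpu1"] and not got_second["cpu2"]


def consistent_order_completes(rounds=1000):
    """Both paths take conntrack_lock before private_lock; return total work done."""
    count = 0

    def path():
        nonlocal count
        for _ in range(rounds):
            with conntrack_lock:
                with private_lock:
                    count += 1  # stands in for touching the expectation

    threads = [threading.Thread(target=path) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count


if __name__ == "__main__":
    print("inverted order stalls:", inverted_order_stalls())
    print("consistent order, updates done:", consistent_order_completes())
```

In this model the inverted ordering always leaves each thread waiting on the
lock the other one holds, while the fixed ordering lets both threads finish
every round. Whether the kernel paths can actually be reordered this way is
a separate question from what the sketch shows.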