From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: [BUG] e1000: possible deadlock scenario caught by lockdep
From: Steven Rostedt
To: Jesse Brandeburg
Cc: LKML, Thomas Gleixner, Tushar Dave, Aaron Brown, Jeff Kirsher
Date: Thu, 17 Nov 2011 20:27:00 -0500
Message-ID: <1321579620.3533.29.camel@frodo>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
X-Mailer: Evolution 2.32.3 (2.32.3-1.fc14)
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

I hit the following lockdep splat:

======================================================
[ INFO: possible circular locking dependency detected ]
3.2.0-rc2-test+ #14
-------------------------------------------------------
reboot/2316 is trying to acquire lock:
 ((&(&adapter->watchdog_task)->work)){+.+...}, at: [] wait_on_work+0x0/0xac

but task is already holding lock:
 (&adapter->mutex){+.+...}, at: [] __e1000_shutdown+0x56/0x1f5

which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:

-> #1 (&adapter->mutex){+.+...}:
       [] lock_acquire+0x103/0x158
       [] __mutex_lock_common+0x6a/0x441
       [] mutex_lock_nested+0x1b/0x1d
       [] e1000_watchdog+0x56/0x4a4
       [] process_one_work+0x1ef/0x3e0
       [] worker_thread+0xda/0x15e
       [] kthread+0x9f/0xa7
       [] kernel_thread_helper+0x4/0x10

-> #0 ((&(&adapter->watchdog_task)->work)){+.+...}:
       [] __lock_acquire+0xa29/0xd06
       [] lock_acquire+0x103/0x158
       [] wait_on_work+0x3d/0xac
       [] __cancel_work_timer+0xb9/0xff
       [] cancel_delayed_work_sync+0x12/0x14
       [] e1000_down_and_stop+0x2e/0x4a
       [] e1000_down+0x116/0x176
       [] __e1000_shutdown+0x83/0x1f5
       [] e1000_shutdown+0x1a/0x43
       [] pci_device_shutdown+0x29/0x3d
       [] device_shutdown+0xbe/0xf9
       [] kernel_restart_prepare+0x31/0x38
       [] kernel_restart+0x14/0x51
       [] sys_reboot+0x157/0x1b0
       [] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&adapter->mutex);
                               lock((&(&adapter->watchdog_task)->work));
                               lock(&adapter->mutex);
  lock((&(&adapter->watchdog_task)->work));

 *** DEADLOCK ***

2 locks held by reboot/2316:
 #0:  (reboot_mutex){+.+.+.}, at: [] sys_reboot+0x9f/0x1b0
 #1:  (&adapter->mutex){+.+...}, at: [] __e1000_shutdown+0x56/0x1f5

stack backtrace:
Pid: 2316, comm: reboot Not tainted 3.2.0-rc2-test+ #14
Call Trace:
 [] print_circular_bug+0x1f8/0x209
 [] __lock_acquire+0xa29/0xd06
 [] ? wait_on_cpu_work+0x94/0x94
 [] lock_acquire+0x103/0x158
 [] ? wait_on_cpu_work+0x94/0x94
 [] ? trace_preempt_on+0x2a/0x2f
 [] wait_on_work+0x3d/0xac
 [] ? wait_on_cpu_work+0x94/0x94
 [] __cancel_work_timer+0xb9/0xff
 [] cancel_delayed_work_sync+0x12/0x14
 [] e1000_down_and_stop+0x2e/0x4a
 [] e1000_down+0x116/0x176
 [] __e1000_shutdown+0x83/0x1f5
 [] ? _raw_spin_unlock+0x33/0x56
 [] ? device_shutdown+0x40/0xf9
 [] e1000_shutdown+0x1a/0x43
 [] ? sub_preempt_count+0xa1/0xb4
 [] pci_device_shutdown+0x29/0x3d
 [] device_shutdown+0xbe/0xf9
 [] kernel_restart_prepare+0x31/0x38
 [] kernel_restart+0x14/0x51
 [] sys_reboot+0x157/0x1b0
 [] ? hrtimer_cancel+0x17/0x24
 [] ? do_nanosleep+0x74/0xac
 [] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [] ? error_sti+0x5/0x6
 [] ? time_hardirqs_off+0x2a/0x2f
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] ? retint_swapgs+0x13/0x1b
 [] ? retint_swapgs+0x13/0x1b
 [] ? trace_hardirqs_on_caller+0x12d/0x164
 [] ? audit_syscall_entry+0x11c/0x148
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] system_call_fastpath+0x16/0x1b

The issue comes from two recent commits:

commit a4010afef585b7142eb605e3a6e4210c0e1b2957
Author: Jesse Brandeburg
Date:   Wed Oct 5 07:24:41 2011 +0000

    e1000: convert hardware management from timers to threads

and

commit 0ef4eedc2e98edd51cd106e1f6a27178622b7e57
Author: Jesse Brandeburg
Date:   Wed Oct 5 07:24:51 2011 +0000

    e1000: convert to private mutex from rtnl

What we have in __e1000_shutdown() is:

	mutex_lock(&adapter->mutex);

	if (netif_running(netdev)) {
		WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags));
		e1000_down(adapter);
	}

but e1000_down() calls e1000_down_and_stop():

static void e1000_down_and_stop(struct e1000_adapter *adapter)
{
	set_bit(__E1000_DOWN, &adapter->flags);
	cancel_work_sync(&adapter->reset_task);
	cancel_delayed_work_sync(&adapter->watchdog_task);
	cancel_delayed_work_sync(&adapter->phy_info_task);
	cancel_delayed_work_sync(&adapter->fifo_stall_task);
}

Here you see that we are calling:

	cancel_delayed_work_sync(&adapter->watchdog_task);

The problem is that the adapter->watchdog_task work function grabs
&adapter->mutex. If the work has already started and is blocked on
that mutex, cancel_delayed_work_sync() will block indefinitely and we
have a deadlock.

Not sure what's the best way around this. Can we call e1000_down()
without grabbing adapter->mutex?

-- Steve