From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752332AbbIOVmv (ORCPT );
	Tue, 15 Sep 2015 17:42:51 -0400
Received: from e35.co.us.ibm.com ([32.97.110.153]:54441 "EHLO e35.co.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751465AbbIOVmt (ORCPT ); Tue, 15 Sep 2015 17:42:49 -0400
X-Helo: d03dlp01.boulder.ibm.com
X-MailFrom: paulmck@linux.vnet.ibm.com
X-RcptTo: linux-kernel@vger.kernel.org
Date: Tue, 15 Sep 2015 14:42:44 -0700
From: "Paul E. McKenney"
To: Tim Spriggs
Cc: peterz@infradead.org, davej@codemonkey.org.uk, sasha.levin@oracle.com,
	linux-kernel@vger.kernel.org
Subject: Re: unpinning an unpinned lock
Message-ID: <20150915214244.GS4029@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 15091521-0013-0000-0000-0000186259AA
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Sep 14, 2015 at 04:22:53PM -0700, Tim Spriggs wrote:
> Hello!

Adding Sasha and LKML on CC.

> I noticed some traffic at
> http://www.gossamer-threads.com/lists/linux/kernel/2229593 that you
> three generated.  I’ve run into a similar issue and I’m wondering if I
> can contribute anything with the info I have.  Initially I was seeing
> deadlocks (giant load averages and processes that stop responding) and
> I was able to get stuck-app stack traces.  They (cron/sshd/…) looked
> like this:
> 
> [71458.731530] Call Trace:
> [71458.734130] [] schedule+0x3e/0x90
> [71458.739250] [] schedule_preempt_disabled+0x18/0x30
> [71458.745857] [] __mutex_lock_slowpath+0x95/0x130
> [71458.752193] [] mutex_lock+0x1b/0x30
> [71458.757490] [] audit_receive+0x21/0xb0
> [71458.763049] [] netlink_unicast+0x185/0x210
> [71458.768954] [] netlink_sendmsg+0x32a/0x3b0
> [71458.774861] [] sock_sendmsg+0x4d/0x60
> [71458.780336] [] SYSC_sendto+0xef/0x130
> [71458.785806] [] ? __handle_mm_fault+0xf9/0x190
> [71458.791970] [] ? syscall_trace_enter_phase1+0x13b/0x170
> [71458.799006] [] SyS_sendto+0xe/0x10
> [71458.804214] [] system_call_fastpath+0x12/0x6a
> 
> I enabled lock debugging in the kernel, ran it across a bunch of nodes,
> and waited for call traces.  After the weekend, I found 3 distinct
> traces on 16 nodes.
> 
> All nodes report from: kernel/locking/lockdep.c:3497 lock_unpin_lock+0xed/0xf0()

Dave Jones was seeing lock_unpin_lock(), IIRC.  Me, I was seeing
infrequent silent hangs.

							Thanx, Paul

> All nodes state: unpinning an unpinned lock
> 
> All nodes have these modules linked in: ahci(E) dca(E) dm_log(E)
> dm_mirror(E) dm_mod(E) dm_region_hash(E) hwmon(E) i2c_algo_bit(E)
> i2c_core(E) i2c_i801(E) iTCO_vendor_support(E) iTCO_wdt(E) igb(E)
> ioatdma(E) ipmi_msghandler(E) ipmi_si(E) ipmi_ssif(E) ixgbe(E)
> libahci(E) lpc_ich(E) mdio(E) mfd_core(E) microcode(E) overlay(E)
> pps_core(E) ptp(E) sb_edac(E) sd_mod(E) sg(E) vxlan(E) wmi(E)
> xhci_hcd(E) xhci_pci(E)
> 
> There was one variation on Workqueue:
> 15 Workqueue: events key_garbage_collector
>  1 Workqueue: kdmflush dm_wq_work [dm_mod]
> 
> I’m hoping that the extra info will help in your debugging/tracking
> this error down.
> 
> Cheers,
> -Tim
> 
> —
> 
> [] dump_stack+0x48/0x5a
> [] warn_slowpath_common+0x95/0xe0
> [] warn_slowpath_fmt+0x46/0x50
> [] lock_unpin_lock+0xed/0xf0
> [] pick_next_task_fair+0x51/0x220
> [] __schedule+0x12c/0x990
> [] ? __lock_acquire+0x3ab/0x560
> [] schedule+0x3e/0x80
> [] schedule_timeout+0x1c6/0x220
> [] ? wait_for_completion+0x91/0x110
> [] ? mark_held_locks+0x70/0x90
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] ? trace_hardirqs_on_caller+0xff/0x1c0
> [] wait_for_completion+0x99/0x110
> [] ? try_to_wake_up+0x2e0/0x2e0
> [] ? __call_rcu.clone.0+0x1a0/0x1a0
> [] wait_rcu_gp+0x50/0x60
> [] ? wait_rcu_gp+0x60/0x60
> [] synchronize_sched+0x47/0xb0
> [] key_garbage_collector+0x19d/0x3e0
> [] process_one_work+0x1df/0x570
> [] ? process_one_work+0x160/0x570
> [] ? worker_thread+0x191/0x420
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] worker_thread+0x11d/0x420
> [] ? trace_hardirqs_on+0xd/0x10
> [] ? process_one_work+0x570/0x570
> [] ? process_one_work+0x570/0x570
> [] kthread+0xde/0x100
> [] ? schedule_tail+0x1e/0xd0
> [] ? __init_kthread_worker+0x70/0x70
> [] ret_from_fork+0x3f/0x70
> [] ? __init_kthread_worker+0x70/0x70
> 
> (… and a diff -u variation on this)
> 
> @@ -5,6 +5,7 @@
> [] lock_unpin_lock+0xed/0xf0
> [] pick_next_task_fair+0x51/0x220
> [] __schedule+0x12c/0x990
> + [] ? __lock_release+0xc7/0x390
> [] ? __lock_acquire+0x3ab/0x560
> [] schedule+0x3e/0x80
> [] schedule_timeout+0x1c6/0x220
> 
> —
> 
> [] dump_stack+0x48/0x5a
> [] warn_slowpath_common+0x95/0xe0
> [] warn_slowpath_fmt+0x46/0x50
> [] lock_unpin_lock+0xed/0xf0
> [] __schedule+0x3b1/0x990
> [] ? mark_held_locks+0x70/0x90
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] schedule+0x3e/0x80
> [] get_active_stripe+0x295/0x430 [raid456]
> [] ? woken_wake_function+0x20/0x20
> [] make_request+0x1ee/0x670 [raid456]
> [] ? md_make_request+0x1d6/0x480
> [] ? woken_wake_function+0x20/0x20
> [] ? generic_make_request_checks+0x15a/0x3e0
> [] md_make_request+0x1d6/0x480
> [] ? md_make_request+0x53/0x480
> [] ? mempool_alloc_slab+0x15/0x20
> [] ? mempool_alloc+0x59/0x160
> [] generic_make_request+0xb1/0xe0
> [] __map_bio+0xbe/0x130 [dm_mod]
> [] __clone_and_map_data_bio+0x112/0x130 [dm_mod]
> [] __split_and_process_non_flush+0x6c/0xb0 [dm_mod]
> [] __split_and_process_bio+0x207/0x2c0 [dm_mod]
> [] ? __split_and_process_bio+0x25/0x2c0 [dm_mod]
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] dm_wq_work+0xbf/0xe0 [dm_mod]
> [] process_one_work+0x1df/0x570
> [] ? process_one_work+0x160/0x570
> [] ? worker_thread+0x191/0x420
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] worker_thread+0x11d/0x420
> [] ? trace_hardirqs_on+0xd/0x10
> [] ? process_one_work+0x570/0x570
> [] ? process_one_work+0x570/0x570
> [] kthread+0xde/0x100
> [] ? schedule_tail+0x1e/0xd0
> [] ? __init_kthread_worker+0x70/0x70
> [] ret_from_fork+0x3f/0x70
> [] ? __init_kthread_worker+0x70/0x70
> 
> —
> 
> [] dump_stack+0x48/0x5a
> [] warn_slowpath_common+0x95/0xe0
> [] warn_slowpath_fmt+0x46/0x50
> [] lock_unpin_lock+0xed/0xf0
> [] pick_next_task_fair+0x51/0x220
> [] __schedule+0x12c/0x990
> [] ? __lock_release+0xc7/0x390
> [] ? __lock_acquire+0x3ab/0x560
> [] schedule+0x3e/0x80
> [] schedule_timeout+0x1c6/0x220
> [] ? wait_for_completion+0x91/0x110
> [] ? mark_held_locks+0x70/0x90
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] ? trace_hardirqs_on_caller+0xff/0x1c0
> [] wait_for_completion+0x99/0x110
> [] ? try_to_wake_up+0x2e0/0x2e0
> [] ? __call_rcu.clone.0+0x1a0/0x1a0
> [] wait_rcu_gp+0x50/0x60
> [] ? wait_rcu_gp+0x60/0x60
> [] synchronize_sched+0x47/0xb0
> [] key_garbage_collector+0x19d/0x3e0
> [] process_one_work+0x1df/0x570
> [] ? process_one_work+0x160/0x570
> [] ? worker_thread+0x191/0x420
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] worker_thread+0x208/0x420
> [] ? trace_hardirqs_on+0xd/0x10
> [] ? process_one_work+0x570/0x570
> [] ? process_one_work+0x570/0x570
> [] kthread+0xde/0x100
> [] ? schedule_tail+0x1e/0xd0
> [] ? __init_kthread_worker+0x70/0x70
> [] ret_from_fork+0x3f/0x70
> [] ? __init_kthread_worker+0x70/0x70
> 
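
For context on what the warning means: as I understand it, it comes from
lockdep's lock-pinning checks.  Since roughly 4.2, __schedule() pins
rq->lock before calling pick_next_task() so that nothing in between may
silently release it, and the matching unpin complains ("unpinning an
unpinned lock") when the pin has gone missing, for example because some
path dropped and re-acquired the lock without re-pinning it.  Below is a
minimal stand-alone sketch of that accounting; it is a simplified model,
not the actual kernel/locking/lockdep.c code, and the helpers
pin_lock()/unpin_lock() are made-up names for illustration.

/*
 * Simplified model of lockdep's lock pinning, for illustration only.
 * The real code keys the count off the held_lock entry for the pinned
 * lock; a bare counter stands in for that here.
 */
#include <stdio.h>

struct held_lock {
	unsigned int pin_count;	/* times this held lock is pinned */
};

/* Assert that nobody may release the lock while it is pinned. */
static void pin_lock(struct held_lock *hlock)
{
	hlock->pin_count++;
}

/* Must pair with pin_lock(); complain if the pin went missing. */
static void unpin_lock(struct held_lock *hlock)
{
	if (hlock->pin_count == 0) {
		fprintf(stderr, "WARNING: unpinning an unpinned lock\n");
		return;
	}
	hlock->pin_count--;
}

int main(void)
{
	struct held_lock rq_lock = { .pin_count = 0 };

	pin_lock(&rq_lock);	/* __schedule() pins rq->lock           */
	rq_lock.pin_count = 0;	/* lock dropped and re-acquired
				 * somewhere, losing the pin            */
	unpin_lock(&rq_lock);	/* fires the warning reported above     */
	return 0;
}

If something in the pick_next_task_fair()/idle-balance path drops
rq->lock without restoring the pin, the unpin trips in exactly this way,
which would be consistent with the traces above.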