From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Tim Spriggs <tspriggs@apple.com>
Cc: peterz@infradead.org, davej@codemonkey.org.uk,
sasha.levin@oracle.com, linux-kernel@vger.kernel.org
Subject: Re: unpinning an unpinned lock
Date: Tue, 15 Sep 2015 14:42:44 -0700 [thread overview]
Message-ID: <20150915214244.GS4029@linux.vnet.ibm.com> (raw)
In-Reply-To: <C137BC9E-19DB-4D41-A9BE-DFE2AEE7CF42@apple.com>
On Mon, Sep 14, 2015 at 04:22:53PM -0700, Tim Spriggs wrote:
> Hello!
Adding Sasha and LKML on CC.
> I noticed some traffic at http://www.gossamer-threads.com/lists/linux/kernel/2229593 that you three generated. I’ve run into a similar issue and I’m wondering if I can contribute anything with the info I have. Initially I was seeing deadlocks (giant load averages and processes stop responding) and I was able to get a stuck-app stack-traces. They (cron/sshd/…) looked like this:
>
> [71458.731530] Call Trace:
> [71458.734130] [<ffffffff815614ce>] schedule+0x3e/0x90
> [71458.739250] [<ffffffff81561578>] schedule_preempt_disabled+0x18/0x30
> [71458.745857] [<ffffffff81562ee5>] __mutex_lock_slowpath+0x95/0x130
> [71458.752193] [<ffffffff81562f9b>] mutex_lock+0x1b/0x30
> [71458.757490] [<ffffffff810f2bc1>] audit_receive+0x21/0xb0
> [71458.763049] [<ffffffff814cbf05>] netlink_unicast+0x185/0x210
> [71458.768954] [<ffffffff814cccca>] netlink_sendmsg+0x32a/0x3b0
> [71458.774861] [<ffffffff8148695d>] sock_sendmsg+0x4d/0x60
> [71458.780336] [<ffffffff8148861f>] SYSC_sendto+0xef/0x130
> [71458.785806] [<ffffffff8117e679>] ? __handle_mm_fault+0xf9/0x190
> [71458.791970] [<ffffffff8101234b>] ? syscall_trace_enter_phase1+0x13b/0x170
> [71458.799006] [<ffffffff8148866e>] SyS_sendto+0xe/0x10
> [71458.804214] [<ffffffff81565517>] system_call_fastpath+0x12/0x6a
>
> I enabled lock debugging in the kernel and ran it across a bunch of nodes and waited for call traces. After the weekend, I found 3 distinct traces on 16 nodes.
>
> All nodes report from: kernel/locking/lockdep.c:3497 lock_unpin_lock+0xed/0xf0()
Dave Jones was seeing lock_unpin_lock(), IIRC. Me, I was seeing
infrequent silent hangs.
Thanx, Paul
> All nodes state: unpinning an unpinned lock
>
> All nodes have these modules linked in: ahci(E) dca(E) dm_log(E) dm_mirror(E) dm_mod(E) dm_region_hash(E) hwmon(E) i2c_algo_bit(E) i2c_core(E) i2c_i801(E) iTCO_vendor_support(E) iTCO_wdt(E) igb(E) ioatdma(E) ipmi_msghandler(E) ipmi_si(E) ipmi_ssif(E) ixgbe(E) libahci(E) lpc_ich(E) mdio(E) mfd_core(E) microcode(E) overlay(E) pps_core(E) ptp(E) sb_edac(E) sd_mod(E) sg(E) vxlan(E) wmi(E) xhci_hcd(E) xhci_pci(E)
>
> There was one variation on Workqueue:
> 15 Workqueue: events key_garbage_collector
> 1 Workqueue: kdmflush dm_wq_work [dm_mod]
>
> I’m hoping that the extra info will help in your debugging/tracing this error down.
>
> Cheers,
> -Tim
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
> [<ffffffff816b4711>] ? wait_for_completion+0x91/0x110
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff810a8bef>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [<ffffffff816b4719>] wait_for_completion+0x99/0x110
> [<ffffffff810875e0>] ? try_to_wake_up+0x2e0/0x2e0
> [<ffffffff810bed00>] ? __call_rcu.clone.0+0x1a0/0x1a0
> [<ffffffff810bb760>] wait_rcu_gp+0x50/0x60
> [<ffffffff810bb770>] ? wait_rcu_gp+0x60/0x60
> [<ffffffff810c02f7>] synchronize_sched+0x47/0xb0
> [<ffffffff813412dd>] key_garbage_collector+0x19d/0x3e0
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fe0d>] worker_thread+0x11d/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
> (… and a diff -u variation on this)
>
> @@ -5,6 +5,7 @@
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> + [<ffffffff810aa247>] ? __lock_release+0xc7/0x390
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff816b2f21>] __schedule+0x3b1/0x990
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffffa0269a45>] get_active_stripe+0x295/0x430 [raid456]
> [<ffffffff8109d1a0>] ? woken_wake_function+0x20/0x20
> [<ffffffffa0271e5e>] make_request+0x1ee/0x670 [raid456]
> [<ffffffff81547446>] ? md_make_request+0x1d6/0x480
> [<ffffffff8109d1a0>] ? woken_wake_function+0x20/0x20
> [<ffffffff81370aea>] ? generic_make_request_checks+0x15a/0x3e0
> [<ffffffff81547446>] md_make_request+0x1d6/0x480
> [<ffffffff815472c3>] ? md_make_request+0x53/0x480
> [<ffffffff81163345>] ? mempool_alloc_slab+0x15/0x20
> [<ffffffff81163499>] ? mempool_alloc+0x59/0x160
> [<ffffffff81370e21>] generic_make_request+0xb1/0xe0
> [<ffffffffa000301e>] __map_bio+0xbe/0x130 [dm_mod]
> [<ffffffffa0003352>] __clone_and_map_data_bio+0x112/0x130 [dm_mod]
> [<ffffffffa00033dc>] __split_and_process_non_flush+0x6c/0xb0 [dm_mod]
> [<ffffffffa0004c47>] __split_and_process_bio+0x207/0x2c0 [dm_mod]
> [<ffffffffa0004a65>] ? __split_and_process_bio+0x25/0x2c0 [dm_mod]
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffffa0004e8f>] dm_wq_work+0xbf/0xe0 [dm_mod]
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fe0d>] worker_thread+0x11d/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> [<ffffffff810aa247>] ? __lock_release+0xc7/0x390
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
> [<ffffffff816b4711>] ? wait_for_completion+0x91/0x110
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff810a8bef>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [<ffffffff816b4719>] wait_for_completion+0x99/0x110
> [<ffffffff810875e0>] ? try_to_wake_up+0x2e0/0x2e0
> [<ffffffff810bed00>] ? __call_rcu.clone.0+0x1a0/0x1a0
> [<ffffffff810bb760>] wait_rcu_gp+0x50/0x60
> [<ffffffff810bb770>] ? wait_rcu_gp+0x60/0x60
> [<ffffffff810c02f7>] synchronize_sched+0x47/0xb0
> [<ffffffff813412dd>] key_garbage_collector+0x19d/0x3e0
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fef8>] worker_thread+0x208/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
next parent reply other threads:[~2015-09-15 21:42 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <C137BC9E-19DB-4D41-A9BE-DFE2AEE7CF42@apple.com>
2015-09-15 21:42 ` Paul E. McKenney [this message]
2015-09-16 10:55 ` unpinning an unpinned lock Peter Zijlstra
2015-09-16 14:10 ` Peter Zijlstra
2015-09-18 13:35 ` Sasha Levin
2015-09-23 8:46 ` [tip:locking/core] locking/lockdep: Fix hlock-> pin_count reset on lock stack rebuilds tip-bot for Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150915214244.GS4029@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=davej@codemonkey.org.uk \
--cc=linux-kernel@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=sasha.levin@oracle.com \
--cc=tspriggs@apple.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.