* Re: unpinning an unpinned lock
[not found] <C137BC9E-19DB-4D41-A9BE-DFE2AEE7CF42@apple.com>
@ 2015-09-15 21:42 ` Paul E. McKenney
2015-09-16 10:55 ` Peter Zijlstra
0 siblings, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2015-09-15 21:42 UTC (permalink / raw)
To: Tim Spriggs; +Cc: peterz, davej, sasha.levin, linux-kernel
On Mon, Sep 14, 2015 at 04:22:53PM -0700, Tim Spriggs wrote:
> Hello!
Adding Sasha and LKML on CC.
> I noticed some traffic at http://www.gossamer-threads.com/lists/linux/kernel/2229593 that you three generated. I’ve run into a similar issue and I’m wondering if I can contribute anything with the info I have. Initially I was seeing deadlocks (giant load averages, processes stop responding) and I was able to capture stack traces of the stuck applications. They (cron/sshd/…) looked like this:
>
> [71458.731530] Call Trace:
> [71458.734130] [<ffffffff815614ce>] schedule+0x3e/0x90
> [71458.739250] [<ffffffff81561578>] schedule_preempt_disabled+0x18/0x30
> [71458.745857] [<ffffffff81562ee5>] __mutex_lock_slowpath+0x95/0x130
> [71458.752193] [<ffffffff81562f9b>] mutex_lock+0x1b/0x30
> [71458.757490] [<ffffffff810f2bc1>] audit_receive+0x21/0xb0
> [71458.763049] [<ffffffff814cbf05>] netlink_unicast+0x185/0x210
> [71458.768954] [<ffffffff814cccca>] netlink_sendmsg+0x32a/0x3b0
> [71458.774861] [<ffffffff8148695d>] sock_sendmsg+0x4d/0x60
> [71458.780336] [<ffffffff8148861f>] SYSC_sendto+0xef/0x130
> [71458.785806] [<ffffffff8117e679>] ? __handle_mm_fault+0xf9/0x190
> [71458.791970] [<ffffffff8101234b>] ? syscall_trace_enter_phase1+0x13b/0x170
> [71458.799006] [<ffffffff8148866e>] SyS_sendto+0xe/0x10
> [71458.804214] [<ffffffff81565517>] system_call_fastpath+0x12/0x6a
>
> I enabled lock debugging in the kernel and ran it across a bunch of nodes and waited for call traces. After the weekend, I found 3 distinct traces on 16 nodes.
>
> All nodes report from: kernel/locking/lockdep.c:3497 lock_unpin_lock+0xed/0xf0()
Dave Jones was seeing lock_unpin_lock(), IIRC. Me, I was seeing
infrequent silent hangs.
Thanx, Paul
> All nodes state: unpinning an unpinned lock
>
> All nodes have these modules linked in: ahci(E) dca(E) dm_log(E) dm_mirror(E) dm_mod(E) dm_region_hash(E) hwmon(E) i2c_algo_bit(E) i2c_core(E) i2c_i801(E) iTCO_vendor_support(E) iTCO_wdt(E) igb(E) ioatdma(E) ipmi_msghandler(E) ipmi_si(E) ipmi_ssif(E) ixgbe(E) libahci(E) lpc_ich(E) mdio(E) mfd_core(E) microcode(E) overlay(E) pps_core(E) ptp(E) sb_edac(E) sd_mod(E) sg(E) vxlan(E) wmi(E) xhci_hcd(E) xhci_pci(E)
>
> There was one variation on Workqueue:
> 15 Workqueue: events key_garbage_collector
> 1 Workqueue: kdmflush dm_wq_work [dm_mod]
>
> I’m hoping that the extra info will help you in debugging/tracking this error down.
>
> Cheers,
> -Tim
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
> [<ffffffff816b4711>] ? wait_for_completion+0x91/0x110
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff810a8bef>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [<ffffffff816b4719>] wait_for_completion+0x99/0x110
> [<ffffffff810875e0>] ? try_to_wake_up+0x2e0/0x2e0
> [<ffffffff810bed00>] ? __call_rcu.clone.0+0x1a0/0x1a0
> [<ffffffff810bb760>] wait_rcu_gp+0x50/0x60
> [<ffffffff810bb770>] ? wait_rcu_gp+0x60/0x60
> [<ffffffff810c02f7>] synchronize_sched+0x47/0xb0
> [<ffffffff813412dd>] key_garbage_collector+0x19d/0x3e0
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fe0d>] worker_thread+0x11d/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
> (… and a diff -u variation on this)
>
> @@ -5,6 +5,7 @@
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> + [<ffffffff810aa247>] ? __lock_release+0xc7/0x390
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff816b2f21>] __schedule+0x3b1/0x990
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffffa0269a45>] get_active_stripe+0x295/0x430 [raid456]
> [<ffffffff8109d1a0>] ? woken_wake_function+0x20/0x20
> [<ffffffffa0271e5e>] make_request+0x1ee/0x670 [raid456]
> [<ffffffff81547446>] ? md_make_request+0x1d6/0x480
> [<ffffffff8109d1a0>] ? woken_wake_function+0x20/0x20
> [<ffffffff81370aea>] ? generic_make_request_checks+0x15a/0x3e0
> [<ffffffff81547446>] md_make_request+0x1d6/0x480
> [<ffffffff815472c3>] ? md_make_request+0x53/0x480
> [<ffffffff81163345>] ? mempool_alloc_slab+0x15/0x20
> [<ffffffff81163499>] ? mempool_alloc+0x59/0x160
> [<ffffffff81370e21>] generic_make_request+0xb1/0xe0
> [<ffffffffa000301e>] __map_bio+0xbe/0x130 [dm_mod]
> [<ffffffffa0003352>] __clone_and_map_data_bio+0x112/0x130 [dm_mod]
> [<ffffffffa00033dc>] __split_and_process_non_flush+0x6c/0xb0 [dm_mod]
> [<ffffffffa0004c47>] __split_and_process_bio+0x207/0x2c0 [dm_mod]
> [<ffffffffa0004a65>] ? __split_and_process_bio+0x25/0x2c0 [dm_mod]
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffffa0004e8f>] dm_wq_work+0xbf/0xe0 [dm_mod]
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fe0d>] worker_thread+0x11d/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> [<ffffffff810aa247>] ? __lock_release+0xc7/0x390
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
> [<ffffffff816b4711>] ? wait_for_completion+0x91/0x110
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff810a8bef>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [<ffffffff816b4719>] wait_for_completion+0x99/0x110
> [<ffffffff810875e0>] ? try_to_wake_up+0x2e0/0x2e0
> [<ffffffff810bed00>] ? __call_rcu.clone.0+0x1a0/0x1a0
> [<ffffffff810bb760>] wait_rcu_gp+0x50/0x60
> [<ffffffff810bb770>] ? wait_rcu_gp+0x60/0x60
> [<ffffffff810c02f7>] synchronize_sched+0x47/0xb0
> [<ffffffff813412dd>] key_garbage_collector+0x19d/0x3e0
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fef8>] worker_thread+0x208/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
* Re: unpinning an unpinned lock
From: Peter Zijlstra @ 2015-09-16 10:55 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Tim Spriggs, davej, sasha.levin, linux-kernel
On Tue, Sep 15, 2015 at 02:42:44PM -0700, Paul E. McKenney wrote:
> > All nodes report from: kernel/locking/lockdep.c:3497 lock_unpin_lock+0xed/0xf0()
>
> > [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> > [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> > [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> > [<ffffffff816b367e>] schedule+0x3e/0x80
> > [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> > [<ffffffff816b2f21>] __schedule+0x3b1/0x990
> > [<ffffffff816b367e>] schedule+0x3e/0x80
Right, Sasha was reporting these too. I've not been able to reproduce it,
and neither have I been able to make sense of the extra debug bits Sasha
generated :/
I've even completely reimplemented the feature from scratch and compared
to see if I missed something -- either I'm being very consistent with my
mistakes or something else is happening.
In any case, I'll have another go at tackling this, otherwise I'll have
to disable this warning for now.
Note that this lockdep 'feature' is pure annotation, no actual logic was
changed.
* Re: unpinning an unpinned lock
From: Peter Zijlstra @ 2015-09-16 14:10 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Tim Spriggs, davej, sasha.levin, linux-kernel
On Wed, Sep 16, 2015 at 12:55:35PM +0200, Peter Zijlstra wrote:
> In any case, I'll have another go at tackling this, otherwise I'll have
> to disable this warning for now.
>
> Note that this lockdep 'feature' is pure annotation, no actual logic was
> changed.
Sasha, can you give this a spin?
So there are two places where we take a lock out of the middle of the
stack, and in those cases the rebuild would fail to preserve the pin_count.
---
kernel/locking/lockdep.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 8acfbf773e06..4e49cc4c9952 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3068,7 +3068,7 @@ static int __lock_is_held(struct lockdep_map *lock);
static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
int trylock, int read, int check, int hardirqs_off,
struct lockdep_map *nest_lock, unsigned long ip,
- int references)
+ int references, int pin_count)
{
struct task_struct *curr = current;
struct lock_class *class = NULL;
@@ -3157,7 +3157,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
hlock->waittime_stamp = 0;
hlock->holdtime_stamp = lockstat_clock();
#endif
- hlock->pin_count = 0;
+ hlock->pin_count = pin_count;
if (check && !mark_irqflags(curr, hlock))
return 0;
@@ -3343,7 +3343,7 @@ __lock_set_class(struct lockdep_map *lock, const char *name,
hlock_class(hlock)->subclass, hlock->trylock,
hlock->read, hlock->check, hlock->hardirqs_off,
hlock->nest_lock, hlock->acquire_ip,
- hlock->references))
+ hlock->references, hlock->pin_count))
return 0;
}
@@ -3433,7 +3433,7 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
hlock_class(hlock)->subclass, hlock->trylock,
hlock->read, hlock->check, hlock->hardirqs_off,
hlock->nest_lock, hlock->acquire_ip,
- hlock->references))
+ hlock->references, hlock->pin_count))
return 0;
}
@@ -3583,7 +3583,7 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
current->lockdep_recursion = 1;
trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
__lock_acquire(lock, subclass, trylock, read, check,
- irqs_disabled_flags(flags), nest_lock, ip, 0);
+ irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
current->lockdep_recursion = 0;
raw_local_irq_restore(flags);
}
* Re: unpinning an unpinned lock
From: Sasha Levin @ 2015-09-18 13:35 UTC (permalink / raw)
To: Peter Zijlstra, Paul E. McKenney; +Cc: Tim Spriggs, davej, linux-kernel
On 09/16/2015 10:10 AM, Peter Zijlstra wrote:
> On Wed, Sep 16, 2015 at 12:55:35PM +0200, Peter Zijlstra wrote:
>
>> > In any case, I'll have another go at tackling this, otherwise I'll have
>> > to disable this warning for now.
>> >
>> > Note that this lockdep 'feature' is pure annotation, no actual logic was
>> > changed.
> Sasha, can you give this a spin?
>
> So there's 2 places where we take a lock out of the middle of a stack,
> and in those cases it would fail to preserve the pin_count.
The patch fixes the issue for me:
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Thanks,
Sasha
* [tip:locking/core] locking/lockdep: Fix hlock->pin_count reset on lock stack rebuilds
From: tip-bot for Peter Zijlstra @ 2015-09-23 8:46 UTC (permalink / raw)
To: linux-tip-commits
Cc: akpm, sasha.levin, tglx, paulmck, peterz, torvalds, mingo,
linux-kernel, tspriggs, hpa
Commit-ID: 21199f27b430576552b26210b3194a363d7f05cd
Gitweb: http://git.kernel.org/tip/21199f27b430576552b26210b3194a363d7f05cd
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Sep 2015 16:10:40 +0200
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 23 Sep 2015 09:48:53 +0200
locking/lockdep: Fix hlock->pin_count reset on lock stack rebuilds
Various people reported hitting the "unpinning an unpinned lock"
warning. As it turns out there are 2 places where we take a lock out
of the middle of a stack, and in those cases it would fail to preserve
the pin_count when rebuilding the lock stack.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Tim Spriggs <tspriggs@apple.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davej@codemonkey.org.uk
Link: http://lkml.kernel.org/r/20150916141040.GA11639@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/locking/lockdep.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 8acfbf7..4e49cc4 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3068,7 +3068,7 @@ static int __lock_is_held(struct lockdep_map *lock);
static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
int trylock, int read, int check, int hardirqs_off,
struct lockdep_map *nest_lock, unsigned long ip,
- int references)
+ int references, int pin_count)
{
struct task_struct *curr = current;
struct lock_class *class = NULL;
@@ -3157,7 +3157,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
hlock->waittime_stamp = 0;
hlock->holdtime_stamp = lockstat_clock();
#endif
- hlock->pin_count = 0;
+ hlock->pin_count = pin_count;
if (check && !mark_irqflags(curr, hlock))
return 0;
@@ -3343,7 +3343,7 @@ found_it:
hlock_class(hlock)->subclass, hlock->trylock,
hlock->read, hlock->check, hlock->hardirqs_off,
hlock->nest_lock, hlock->acquire_ip,
- hlock->references))
+ hlock->references, hlock->pin_count))
return 0;
}
@@ -3433,7 +3433,7 @@ found_it:
hlock_class(hlock)->subclass, hlock->trylock,
hlock->read, hlock->check, hlock->hardirqs_off,
hlock->nest_lock, hlock->acquire_ip,
- hlock->references))
+ hlock->references, hlock->pin_count))
return 0;
}
@@ -3583,7 +3583,7 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
current->lockdep_recursion = 1;
trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
__lock_acquire(lock, subclass, trylock, read, check,
- irqs_disabled_flags(flags), nest_lock, ip, 0);
+ irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
current->lockdep_recursion = 0;
raw_local_irq_restore(flags);
}