From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com, mingo@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
Date: Wed, 29 May 2019 11:19:41 -0700
Message-ID: <20190529181941.GZ28207@linux.ibm.com>
In-Reply-To: <alpine.DEB.2.21.1905281300340.1859@nanos.tec.linutronix.de>
On Tue, May 28, 2019 at 01:07:29PM -0700, Thomas Gleixner wrote:
> On Mon, 27 May 2019, Paul E. McKenney wrote:
>
> > The TASKS03 and TREE04 rcutorture scenarios produce the following
> > lockdep complaint:
> >
> > ================================
> > WARNING: inconsistent lock state
> > 5.2.0-rc1+ #513 Not tainted
> > --------------------------------
> > inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> > migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
> > {IN-HARDIRQ-W} state was registered at:
> > lock_acquire+0xb0/0x1c0
> > _raw_spin_lock_irqsave+0x3c/0x50
> > tick_broadcast_switch_to_oneshot+0xd/0x40
> > tick_switch_to_oneshot+0x4f/0xd0
> > hrtimer_run_queues+0xf3/0x130
> > run_local_timers+0x1c/0x50
> > update_process_times+0x1c/0x50
> > tick_periodic+0x26/0xc0
> > tick_handle_periodic+0x1a/0x60
> > smp_apic_timer_interrupt+0x80/0x2a0
> > apic_timer_interrupt+0xf/0x20
> > _raw_spin_unlock_irqrestore+0x4e/0x60
> > rcu_nocb_gp_kthread+0x15d/0x590
> > kthread+0xf3/0x130
> > ret_from_fork+0x3a/0x50
> > irq event stamp: 171
> > hardirqs last enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
> > hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
> > softirqs last enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
> > softirqs last disabled at (0): [<0000000000000000>] 0x0
> >
> > other info that might help us debug this:
> > Possible unsafe locking scenario:
> >
> > CPU0
> > ----
> > lock(tick_broadcast_lock);
> > <Interrupt>
> > lock(tick_broadcast_lock);
> >
> > *** DEADLOCK ***
> >
> > 1 lock held by migration/1/14:
> > #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30
> >
> > stack backtrace:
> > CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
> > Call Trace:
> > dump_stack+0x5e/0x8b
> > print_usage_bug+0x1fc/0x216
> > ? print_shortest_lock_dependencies+0x1b0/0x1b0
> > mark_lock+0x1f2/0x280
> > __lock_acquire+0x1e0/0x18f0
> > ? __lock_acquire+0x21b/0x18f0
> > ? _raw_spin_unlock_irqrestore+0x4e/0x60
> > lock_acquire+0xb0/0x1c0
> > ? tick_broadcast_offline+0xf/0x70
> > _raw_spin_lock+0x33/0x40
> > ? tick_broadcast_offline+0xf/0x70
> > tick_broadcast_offline+0xf/0x70
> > tick_offline_cpu+0x16/0x30
> > take_cpu_down+0x7d/0xa0
> > multi_cpu_stop+0xa2/0xe0
> > ? cpu_stop_queue_work+0xc0/0xc0
> > cpu_stopper_thread+0x6d/0x100
> > smpboot_thread_fn+0x169/0x240
> > kthread+0xf3/0x130
> > ? sort_range+0x20/0x20
> > ? kthread_cancel_delayed_work_sync+0x10/0x10
> > ret_from_fork+0x3a/0x50
> >
> > It turns out that tick_broadcast_offline() can be invoked with interrupts
> > enabled, so this commit fixes this issue by replacing the raw_spin_lock()
> > with raw_spin_lock_irqsave().
>
> What?
>
> take_cpu_down() is called from multi_cpu_stop() with interrupts disabled.
>
> So this is just papering over the fact that something called from
> take_cpu_down() enabled interrupts. That needs to be found and fixed.
For posterity, here is the information we covered on IRC.
A bisection located commit a0e928ed7c60
("Merge branch 'timers-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip").
Yes, this is a merge commit. Both commits feeding into it are fine,
yet the merge itself fails, even though there were no merge conflicts.
It turns out that tick_broadcast_offline() was an innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline().
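To make the two usage states in the quoted splat concrete, here is a minimal userspace sketch in C of the kind of bookkeeping that produces an "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W}" report. It is only a model under my own assumptions: the model_* names and the two-bit mask are hypothetical and are not lockdep's real data structures or API. The point it illustrates is that a lock class once taken from hardirq context must never later be taken with hardirqs enabled, which is why tick_broadcast_offline() got the blame for a state it merely inherited.

```c
/* Userspace model only; not lockdep's real data structures or API. */
#include <assert.h>
#include <stdbool.h>

enum model_usage {
	MODEL_USED_IN_HARDIRQ = 1 << 0,	/* acquired from hardirq context */
	MODEL_USED_HARDIRQ_ON = 1 << 1,	/* acquired with hardirqs enabled */
};

struct model_lock_class {
	unsigned int usage;	/* accumulated over every acquisition */
};

/* Record one acquisition; return false once the history is inconsistent. */
static bool model_mark_usage(struct model_lock_class *lc,
			     bool in_hardirq, bool hardirqs_enabled)
{
	const unsigned int both = MODEL_USED_IN_HARDIRQ | MODEL_USED_HARDIRQ_ON;

	if (in_hardirq)
		lc->usage |= MODEL_USED_IN_HARDIRQ;
	if (hardirqs_enabled)
		lc->usage |= MODEL_USED_HARDIRQ_ON;
	return (lc->usage & both) != both;
}

/* Replay the two acquisitions of tick_broadcast_lock seen in the splat. */
static bool model_replay_splat(void)
{
	struct model_lock_class lc = { 0 };
	bool ok;

	/* Timer-interrupt path: hardirq context, irqs off. Still consistent. */
	ok = model_mark_usage(&lc, true, false);
	assert(ok);
	/* CPU-offline path as observed: task context, irqs unexpectedly on. */
	return model_mark_usage(&lc, false, true);
}
```

In this model, model_replay_splat() returns false, mirroring the complaint. Had the offline path run with irqs off as intended, the second call would have left the history consistent.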
The function returning with irqs enabled was sched_cpu_dying(). Within
it, irqs were enabled after return from sched_tick_stop(), and within
that, after return from cancel_delayed_work_sync(), which is a wrapper
around __cancel_work_timer(). That function can sleep in the case where
something else is concurrently trying to cancel the same delayed work,
and sleeping is a decidedly bad idea when you are invoked from
take_cpu_down().
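The narrowing-down described above can be sketched as a small userspace model in C. Everything here is an assumption made for illustration: the model_* functions are hypothetical stand-ins for the call chain named in this message, and "sleeping" is modeled simply as returning with irqs enabled. In the kernel itself the analogous probe would presumably be something along the lines of WARN_ON_ONCE(!irqs_disabled()) placed after each call.

```c
/* Userspace model only; these are not the kernel's functions. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool model_irqs_off;		/* stand-in for the CPU's irq-disable state */
static const char *model_offender;	/* innermost call seen re-enabling irqs */

/* Run a call and remember the first (innermost) one that returns with irqs on. */
#define MODEL_CHECKED_CALL(call) do {					\
	bool was_off = model_irqs_off;					\
	call;								\
	if (was_off && !model_irqs_off && !model_offender)		\
		model_offender = #call;					\
} while (0)

/* May sleep when someone else is canceling the same work; sleeping is
 * modeled here simply as coming back with irqs enabled. */
static void model_cancel_work_sync(bool contended)
{
	if (contended)
		model_irqs_off = false;
}

static void model_tick_stop(bool contended)
{
	MODEL_CHECKED_CALL(model_cancel_work_sync(contended));
}

static void model_cpu_dying(bool contended)
{
	MODEL_CHECKED_CALL(model_tick_stop(contended));
}

/* Innocent bystander: never touches the irq state itself. */
static void model_tick_offline(void)
{
}

/* Returns the innermost offender, or NULL if irqs stayed off throughout. */
static const char *model_take_cpu_down(bool contended)
{
	model_irqs_off = true;
	model_offender = NULL;
	MODEL_CHECKED_CALL(model_cpu_dying(contended));
	MODEL_CHECKED_CALL(model_tick_offline());
	assert(contended || model_irqs_off);	/* uncontended: irqs stay off */
	return model_offender;
}

/* Convenience for callers: does a run blame the expected call site? */
static bool model_blames(bool contended, const char *expected)
{
	const char *who = model_take_cpu_down(contended);

	return who && strcmp(who, expected) == 0;
}
```

With contention, the model blames the innermost call, model_cancel_work_sync(contended), and never the bystander model_tick_offline(). Without contention it reports nothing, which matches the race being intermittent.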
None of these functions have been changed (at all!) in the past year,
so my guess is that some other code was introduced that can race on
__cancel_work_timer(). Except that I am not seeing any other call
to cancel_delayed_work_sync().
Thoughts?
Thanx, Paul