From: "Paul E. McKenney" <paulmck@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org, tglx@linutronix.de,
linux-kernel@vger.kernel.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
frederic@kernel.org, Will Deacon <will@kernel.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
npiggin@gmail.com
Subject: Re: [PATCH 0/6] sched: TTWU, IPI, and assorted stuff
Date: Sat, 20 Jun 2020 11:46:22 -0700 [thread overview]
Message-ID: <20200620184622.GA19696@paulmck-ThinkPad-P72> (raw)
In-Reply-To: <20200619134423.GB577403@hirez.programming.kicks-ass.net>
On Fri, Jun 19, 2020 at 03:44:23PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 16, 2020 at 07:17:21PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 16, 2020 at 07:04:10PM +0200, Peter Zijlstra wrote:
> > > [19324.795303] ------------[ cut here ]------------
> > > [19324.795304] WARNING: CPU: 10 PID: 76 at kernel/smp.c:138 __smp_call_single_queue+0x40/0x50
> > > [19324.795305] Modules linked in:
> > > [19324.795306] CPU: 10 PID: 76 Comm: ksoftirqd/10 Not tainted 5.8.0-rc1+ #8
> > > [19324.795307] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
> > > [19324.795307] RIP: 0010:__smp_call_single_queue+0x40/0x50
> > > [19324.795308] Code: c2 40 91 02 00 4c 89 e6 4c 89 e7 48 03 14 c5 e0 56 2d b4 e8 b2 3a 2f 00 84 c0 75 04 5d 41 5c c3 89 ef 5d 41 5c e9 40 af f9 ff <0f> 0b eb cd 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 54 49 89 f4 55
> > > [19324.795309] RSP: 0000:ffffb3cb4030bd18 EFLAGS: 00010046
> > > [19324.795310] RAX: 000000000000000a RBX: 0000000000000000 RCX: 00000000ffffffff
> > > [19324.795310] RDX: 00000000000090aa RSI: ffffffffb420bc3f RDI: ffffffffb4232e3e
> > > [19324.795311] RBP: 000000000000000a R08: 00001193646cd91c R09: ffff93c1df49c008
> > > [19324.795312] R10: ffffb3cb4030bdf8 R11: 000000000000032e R12: ffff93c1dbed5b30
> > > [19324.795312] R13: ffff93c1df4a8340 R14: 000000000000000a R15: ffff93c1df2e8340
> > > [19324.795313] FS: 0000000000000000(0000) GS:ffff93c1df480000(0000) knlGS:0000000000000000
> > > [19324.795313] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [19324.795314] CR2: 00000000ffffffff CR3: 000000001e40a000 CR4: 00000000000006e0
> > > [19324.795315] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [19324.795315] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [19324.795316] Call Trace:
> > > [19324.795316] ttwu_queue_wakelist+0xa4/0xc0
> > > [19324.795316] try_to_wake_up+0x432/0x530
> >
> > This is indeed WF_ON_CPU... it had to be, but how ?!
>
> So my latest theory is that we have a memory ordering problem. It would
> fully explain the thing, but it would also render my patch #1
> insufficient.
>
> If we suppose the: task_cpu(p) load at the beginning of try_to_wake_up()
> returns an old value, and this old value happens to be this_cpu. Further
> assume that the p->on_cpu load accurately returns 1, it really is still
> running, just not here.
>
> Then, when we issue a local wakeup, we can crash in exactly the observed
> manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq is from
> the wrong CPU, therefore we'll iterate into the non-existant parents and
> NULL deref.
>
> The scenario is somewhat elaborate:
>
>
> X->cpu = 1
> rq(1)->curr = X
>
>
> CPU0 CPU1 CPU2
>
> // switch away from X
> LOCK rq(1)->lock
> smp_mb__after_spinlock
> dequeue_task(X)
> X->on_rq = 9
> switch_to(Z)
> X->on_cpu = 0
> UNLOCK rq(1)->lock
>
>
> // migrate X to cpu 0
> LOCK rq(1)->lock
> dequeue_task(X)
> set_task_cpu(X, 0)
> X->cpu = 0
> UNLOCK rq(1)->lock
>
> LOCK rq(0)->lock
> enqueue_task(X)
> X->on_rq = 1
> UNLOCK rq(0)->lock
>
> // switch to X
> LOCK rq(0)->lock
> smp_mb__after_spinlock
> switch_to(X)
> X->on_cpu = 1
> UNLOCK rq(0)->lock
>
> // X goes sleep
> X->state = TASK_UNINTERRUPTIBLE
> smp_mb(); // wake X
> ttwu()
> LOCK X->pi_lock
> smp_mb__after_spinlock
>
> if (p->state)
>
> cpu = X->cpu; // =? 1
>
> smp_rmb()
>
> // X calls schedule()
> LOCK rq(0)->lock
> smp_mb__after_spinlock
> dequeue_task(X)
> X->on_rq = 0
>
> if (p->on_rq)
>
> smp_rmb();
>
> if (p->on_cpu && ttwu_queue_wakelist(..)) [*]
>
> smp_cond_load_acquire(&p->on_cpu, !VAL)
>
> cpu = select_task_rq(X, X->wake_cpu, ...)
> if (X->cpu != cpu)
> switch_to(Y)
> X->on_cpu = 0
> UNLOCK rq(0)->lock
>
>
> Furthermore, without the fancy new path [*] we would have hit
> smp_cond_load_acquire(), and if we _really_ would have had ->on_cpu==1
> and cpu==this_cpu there, that'd have been a deadlock, but no such
> deadlocks have ever been observed.
>
> Also, note how the rest of the code never actually uses the @cpu value
> loaded earlier, all that is re-loaded after the load_aquire of
> X->on_cpu.
>
> I'm having trouble convincing myself that's actually possible on
> x86_64 -- after all, every LOCK implies an smp_mb there, so if ttwu
> observes ->state != RUNNING, it must also observe ->cpu != 1.
>
> Most of the previous ttwu() races were found on very large PowerPC
> machines which are far more 'interesting'. I suppose I should go write
> me litmus tests...
>
> Anyway, IFF any of this holds true; then I suppose a patch like the below
> ought to cure things.
>
> If not, I'm, once again, defeated by this...
700 hours of TREE03 with no drama whatsoever, so it does appear that
defeat has finally been deterred! ;-)
Tested-by: Paul E. McKenney <paulmck@kernel.org>
> ---
> kernel/sched/core.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8298b2c240ce..5534eb1ab79a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2378,6 +2378,9 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
> static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> {
> if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
> + if (WARN_ON(cpu == smp_processor_id()))
> + return false;
> +
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> __ttwu_queue_wakelist(p, cpu, wake_flags);
> return true;
> @@ -2550,7 +2553,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>
> /* We're going to change ->state: */
> success = 1;
> - cpu = task_cpu(p);
>
> /*
> * Ensure we load p->on_rq _after_ p->state, otherwise it would
> @@ -2615,7 +2617,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> * let the waker make forward progress. This is safe because IRQs are
> * disabled and the IPI will deliver after on_cpu is cleared.
> */
> - if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
> + if (smp_load_acquire(&p->on_cpu) &&
> + ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_RQ))
> goto unlock;
>
> /*
> @@ -2635,6 +2638,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> psi_ttwu_dequeue(p);
> set_task_cpu(p, cpu);
> }
> +#else
> + cpu = task_cpu(p);
> #endif /* CONFIG_SMP */
>
> ttwu_queue(p, cpu, wake_flags);
>
next prev parent reply other threads:[~2020-06-20 18:46 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-15 12:56 [PATCH 0/6] sched: TTWU, IPI, and assorted stuff Peter Zijlstra
2020-06-15 12:56 ` [PATCH 1/6] sched: Fix ttwu_queue_cond() Peter Zijlstra
2020-06-15 13:34 ` Peter Zijlstra
2020-06-15 16:45 ` Paul E. McKenney
2020-06-15 22:58 ` Paul E. McKenney
2020-06-22 9:11 ` Mel Gorman
2020-06-22 9:41 ` Peter Zijlstra
2020-06-15 12:56 ` [PATCH 2/6] sched: Verify some SMP assumptions Peter Zijlstra
2020-06-15 12:56 ` [PATCH 3/6] sched: s/WF_ON_RQ/WQ_ON_CPU/ Peter Zijlstra
2020-06-22 9:13 ` Mel Gorman
2020-06-15 12:56 ` [PATCH 4/6] smp, irq_work: Continue smp_call_function*() and irq_work*() integration Peter Zijlstra
2020-06-15 12:56 ` [PATCH 5/6] irq_work: Cleanup Peter Zijlstra
2020-06-16 15:16 ` Petr Mladek
2020-06-15 12:57 ` [PATCH 6/6] smp: Cleanup smp_call_function*() Peter Zijlstra
2020-06-15 14:34 ` Jens Axboe
2020-06-15 16:04 ` Daniel Thompson
2020-06-17 8:23 ` Christoph Hellwig
2020-06-17 9:00 ` Peter Zijlstra
2020-06-17 11:04 ` Peter Zijlstra
2020-06-18 6:51 ` Christoph Hellwig
2020-06-18 16:25 ` Peter Zijlstra
2020-06-15 16:23 ` [PATCH 0/6] sched: TTWU, IPI, and assorted stuff Paul E. McKenney
2020-06-15 16:40 ` Peter Zijlstra
2020-06-15 17:21 ` Paul E. McKenney
2020-06-15 19:11 ` Peter Zijlstra
2020-06-15 19:55 ` Paul E. McKenney
2020-06-16 16:31 ` Paul E. McKenney
2020-06-16 17:04 ` Peter Zijlstra
2020-06-16 17:17 ` Peter Zijlstra
2020-06-16 17:53 ` Paul E. McKenney
2020-06-19 13:44 ` Peter Zijlstra
2020-06-19 17:20 ` Paul E. McKenney
2020-06-19 17:48 ` Paul E. McKenney
2020-06-19 18:11 ` Peter Zijlstra
2020-06-19 18:46 ` Paul E. McKenney
2020-06-20 18:46 ` Paul E. McKenney [this message]
2020-06-16 17:51 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200620184622.GA19696@paulmck-ThinkPad-P72 \
--to=paulmck@kernel.org \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=frederic@kernel.org \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mgorman@suse.de \
--cc=mingo@kernel.org \
--cc=npiggin@gmail.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
--cc=vincent.guittot@linaro.org \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.