From: Peter Zijlstra <peterz@infradead.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
Date: Wed, 1 Jul 2020 10:02:13 +0200 [thread overview]
Message-ID: <20200701080213.GF4817@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20200701022646.GO2005@dread.disaster.area>
On Wed, Jul 01, 2020 at 12:26:46PM +1000, Dave Chinner wrote:
> There's nothing like this in the scheduler code that I can find that
> explains the expected overall ordering/serialisation mechanisms and
> relationships used in the subsystem. Hence when I comes across
> something that doesn't appear to make sense, there's nothign I can
> refer to that would explain whether it is a bug or whether the
> comment is wrong or whether I've just completely misunderstood what
> the comment is referring to.
>
> Put simply: the little details are great, but they aren't sufficient
> by themselves to understand the relationships being maintained
> between the various objects.
You're absolutely right, we lack that. As a very first draft / brain
dump, I wrote the below, I'll try and polish it up and add to it over
the next few days.
---
kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 75 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d3d2d67f398..568f7ade9a09 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -77,6 +77,74 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;
+
+/*
+ * Serialization rules:
+ *
+ * Lock order:
+ *
+ * p->pi_lock
+ * rq->lock
+ * hrtimer_clock_base::cpu_base->lock
+ *
+ * rq1->lock
+ * rq2->lock where: &rq1 < &rq2
+ *
+ * Regular state:
+ *
+ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
+ * local CPU's rq->lock, it optionally removes the task from the runqueue and
+ * always looks at the local rq data structures to find the most elegible task
+ * to run next.
+ *
+ * Task enqueue is also under rq->lock, possibly taken from another CPU.
+ * Wakeups from another LLC domain might use an IPI to transfer the enqueue
+ * to the local CPU to avoid bouncing the runqueue state around.
+ *
+ * Task wakeup, specifically wakeups that involve migration, are horribly
+ * complicated to avoid having to take two rq->locks.
+ *
+ * Special state:
+ *
+ * p->state <- TASK_*
+ * p->on_cpu <- { 0, 1 }
+ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }
+ * task_cpu(p) <- set_task_cpu()
+ *
+ * System-calls and anything external will use task_rq_lock() which acquires
+ * both p->lock and rq->lock. As a consequence things like p->cpus_ptr are
+ * stable while holding either lock.
+ *
+ * p->state is changed locklessly using set_current_state(),
+ * __set_current_state() or set_special_state(), see their respective comments.
+ *
+ * p->on_cpu is set by prepare_task() and cleared by finish_task() such that it
+ * will be set before p is scheduled-in and cleared after p is scheduled-out,
+ * both under rq->lock. Non-zero indicates the task is 'current' on it's CPU.
+ *
+ * p->on_rq is set by activate_task() and cleared by deactivate_task(), under
+ * rq->lock. Non-zero indicates the task is runnable, the special
+ * ON_RQ_MIGRATING state is used for migration without holding both rq->locks.
+ * It indicates task_cpu() is not stable, see *task_rq_lock().
+ *
+ * task_cpu(p) is changed by set_task_cpu(), the rules are intricate but basically:
+ *
+ * - don't call set_task_cpu() on a blocked task
+ *
+ * - for try_to_wake_up(), called under p->pi_lock
+ *
+ * - for migration called under rq->lock:
+ * o move_queued_task()
+ * o __migrate_swap_task()
+ * o detach_task()
+ *
+ * - for migration called under double_rq_lock():
+ * o push_rt_task() / pull_rt_task()
+ * o push_dl_task() / pull_dl_task()
+ * o dl_task_offline_migration()
+ *
+ */
+
/*
* __task_rq_lock - lock the rq @p resides on.
*/
@@ -1466,8 +1534,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
{
lockdep_assert_held(&rq->lock);
- WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
- dequeue_task(rq, p, DEQUEUE_NOCLOCK);
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, new_cpu);
rq_unlock(rq, rf);
@@ -1475,8 +1542,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
rq_lock(rq, rf);
BUG_ON(task_cpu(p) != new_cpu);
- enqueue_task(rq, p, 0);
- p->on_rq = TASK_ON_RQ_QUEUED;
+ activate_task(rq, p, 0);
check_preempt_curr(rq, p, 0);
return rq;
@@ -3134,8 +3200,12 @@ static inline void prepare_task(struct task_struct *next)
/*
* Claim the task as running, we do this before switching to it
* such that any running task will have this set.
+ *
+ * __schedule()'s rq->lock and smp_mb__after_spin_lock() orders this
+ * store against prior state change of p, also see try_to_wake_up(),
+ * specifically smp_load_acquire(&p->on_cpu).
*/
- next->on_cpu = 1;
+ WRITE_ONCE(next->on_cpu, 1);
#endif
}
next prev parent reply other threads:[~2020-07-01 8:02 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-26 0:47 [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS Dave Chinner
2020-06-26 5:57 ` Dave Chinner
2020-06-26 7:33 ` Peter Zijlstra
2020-06-26 22:32 ` Dave Chinner
2020-06-27 18:30 ` Peter Zijlstra
2020-06-29 23:55 ` Dave Chinner
2020-06-30 8:57 ` Peter Zijlstra
2020-07-01 2:26 ` Dave Chinner
2020-07-01 8:02 ` Peter Zijlstra [this message]
2020-07-01 11:06 ` Dave Chinner
2020-07-01 13:42 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200701080213.GF4817@hirez.programming.kicks-ass.net \
--to=peterz@infradead.org \
--cc=david@fromorbit.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mingo@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.