From: Frederic Weisbecker <frederic@kernel.org>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Oleg Nesterov <oleg@redhat.com>,
Russell King - ARM Linux admin <linux@armlinux.org.uk>,
Chris Metcalf <cmetcalf@ezchip.com>,
Christoph Lameter <cl@linux.com>, Kirill Tkhai <tkhai@yandex.ru>,
Mike Galbraith <efault@gmx.de>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@kernel.org>,
Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
Davidlohr Bueso <dave@stgolabs.net>,
"Paul E. McKenney" <paulmck@kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
Date: Sat, 21 Sep 2019 01:02:49 +0200 [thread overview]
Message-ID: <20190920230247.GA6449@lenoir> (raw)
In-Reply-To: <87ftkzdpjd.fsf_-_@x220.int.ebiederm.org>
On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
>
> The current task on the runqueue is currently read with rcu_dereference().
>
> To obtain ordinary rcu semantics for an rcu_dereference of rq->curr it needs
> to be paird with rcu_assign_pointer of rq->curr. Which provides the
> memory barrier necessary to order assignments to the task_struct
> and the assignment to rq->curr.
>
> Unfortunately the assignment of rq->curr in __schedule is a hot path,
> and it has already been show that additional barriers in that code
> will reduce the performance of the scheduler. So I will attempt to
> describe below why you can effectively have ordinary rcu semantics
> without any additional barriers.
>
> The assignment of rq->curr in init_idle is a slow path called once
> per cpu and that can use rcu_assign_pointer() without any concerns.
>
> As I write this there are effectively two users of rcu_dereference on
> rq->curr. There is the membarrier code in kernel/sched/membarrier.c
> that only looks at "->mm" after the rcu_dereference. Then there is
> task_numa_compare() in kernel/sched/fair.c. My best reading of the
> code shows that task_numa_compare only access: "->flags",
> "->cpus_ptr", "->numa_group", "->numa_faults[]",
> "->total_numa_faults", and "->se.cfs_rq".
>
> The code in __schedule() essentially does:
> rq_lock(...);
> smp_mb__after_spinlock();
>
> next = pick_next_task(...);
> rq->curr = next;
>
> context_switch(prev, next);
>
> At the start of the function the rq_lock/smp_mb__after_spinlock
> pair provides a full memory barrier. Further there is a full memory barrier
> in context_switch().
>
> This means that any task that has already run and modified itself (the
> common case) has already seen two memory barriers before __schedule()
> runs and begins executing. A task that modifies itself then sees a
> third full memory barrier pair with the rq_lock();
>
> For a brand new task that is enqueued with wake_up_new_task() there
> are the memory barriers present from the taking and release the
> pi_lock and the rq_lock as the processes is enqueued as well as the
> full memory barrier at the start of __schedule() assuming __schedule()
> happens on the same cpu.
>
> This means that by the time we reach the assignment of rq->curr
> except for values on the task struct modified in pick_next_task
> the code has the same guarantees as if it used rcu_assign_pointer.
>
> Reading through all of the implementations of pick_next_task it
> appears pick_next_task is limited to modifying the task_struct fields
> "->se", "->rt", "->dl". These fields are the sched_entity structures
> of the varies schedulers.
>
> Further "->se.cfs_rq" is only changed in cgroup attach/move operations
> initialized by userspace.
>
> Unless I have missed something this means that in practice that the
> users of "rcu_dereerence(rq->curr)" get normal rcu semantics of
> rcu_dereference() for the fields the care about, despite the
> assignment of rq->curr in __schedule() ot using rcu_assign_pointer.
>
> Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> kernel/sched/core.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 69015b7c28da..668262806942 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
>
> if (likely(prev != next)) {
> rq->nr_switches++;
> - rq->curr = next;
> + /*
> + * RCU users of rcu_dereference(rq->curr) may not see
> + * changes to task_struct made by pick_next_task().
> + */
> + RCU_INIT_POINTER(rq->curr, next);
It would be nice to have more explanations in the comments as to why we
don't use rcu_assign_pointer() here (the very fast-path issue) and why
it is expected to be fine (the rq_lock() + post spinlock barrier) under
which condition. Some short summary of the changelog. Because that line
implies way too many subtleties.
Thanks.
next prev parent reply other threads:[~2019-09-20 23:02 UTC|newest]
Thread overview: 75+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-30 14:08 [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value Russell King - ARM Linux admin
2019-08-30 15:24 ` Oleg Nesterov
2019-08-30 15:30 ` Linus Torvalds
2019-08-30 15:40 ` Russell King - ARM Linux admin
2019-08-30 15:43 ` Linus Torvalds
2019-08-30 15:41 ` Linus Torvalds
2019-08-30 16:09 ` Oleg Nesterov
2019-08-30 16:21 ` Linus Torvalds
2019-08-30 16:44 ` Oleg Nesterov
2019-08-30 16:58 ` Linus Torvalds
2019-08-30 19:36 ` Eric W. Biederman
2019-09-02 13:40 ` Oleg Nesterov
2019-09-02 13:53 ` Peter Zijlstra
2019-09-02 14:44 ` Oleg Nesterov
2019-09-02 16:20 ` Peter Zijlstra
2019-09-02 17:04 ` Eric W. Biederman
2019-09-02 17:34 ` Linus Torvalds
2019-09-03 4:50 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
2019-09-03 4:51 ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
2019-09-04 14:36 ` Oleg Nesterov
2019-09-04 14:44 ` Frederic Weisbecker
2019-09-04 15:32 ` Oleg Nesterov
2019-09-04 16:33 ` Frederic Weisbecker
2019-09-04 18:20 ` Linus Torvalds
2019-09-05 14:59 ` Frederic Weisbecker
2019-09-03 4:52 ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
2019-09-03 7:41 ` Peter Zijlstra
2019-09-03 7:47 ` Peter Zijlstra
2019-09-03 16:44 ` Eric W. Biederman
2019-09-03 17:08 ` Linus Torvalds
2019-09-03 18:13 ` Eric W. Biederman
2019-09-03 19:18 ` Linus Torvalds
2019-09-03 20:06 ` Peter Zijlstra
2019-09-03 21:32 ` Paul E. McKenney
2019-09-05 20:02 ` Eric W. Biederman
2019-09-05 20:55 ` Paul E. McKenney
2019-09-06 7:07 ` Peter Zijlstra
2019-09-09 12:22 ` Eric W. Biederman
2019-09-25 7:36 ` Peter Zijlstra
2019-09-27 8:10 ` [tip: sched/urgent] tasks, sched/core: RCUify the assignment of rq->curr tip-bot2 for Eric W. Biederman
2019-09-03 19:42 ` [PATCH 2/3] task: RCU protect tasks on the runqueue Peter Zijlstra
2019-09-14 12:31 ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-14 12:31 ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-14 12:32 ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-04 14:22 ` [PATCH 2/3] task: RCU protect tasks on the runqueue Frederic Weisbecker
2019-09-03 4:52 ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
2019-09-03 9:45 ` kbuild test robot
2019-09-03 13:06 ` Oleg Nesterov
2019-09-03 13:58 ` [PATCH 0/3] task: Making tasks on the runqueue " Oleg Nesterov
2019-09-03 15:44 ` Linus Torvalds
2019-09-03 19:46 ` Peter Zijlstra
[not found] ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
2019-09-14 12:33 ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-15 13:54 ` Paul E. McKenney
2019-09-27 8:10 ` [tip: sched/urgent] tasks: Add a count of task RCU users tip-bot2 for Eric W. Biederman
2019-09-14 12:33 ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-15 14:07 ` Paul E. McKenney
2019-09-15 14:09 ` Paul E. McKenney
2019-09-27 8:10 ` [tip: sched/urgent] tasks, sched/core: " tip-bot2 for Eric W. Biederman
2019-09-14 12:34 ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-15 14:32 ` Paul E. McKenney
2019-09-15 17:07 ` Linus Torvalds
2019-09-15 18:47 ` Paul E. McKenney
2019-09-27 8:10 ` [tip: sched/urgent] tasks, sched/core: With a grace period after finish_task_switch(), " tip-bot2 for Eric W. Biederman
2019-09-14 12:35 ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
2019-09-15 14:41 ` Paul E. McKenney
2019-09-15 17:59 ` Eric W. Biederman
2019-09-15 18:25 ` Eric W. Biederman
2019-09-15 18:48 ` Paul E. McKenney
2019-09-20 23:02 ` Frederic Weisbecker [this message]
2019-09-26 1:49 ` Eric W. Biederman
2019-09-26 12:42 ` Frederic Weisbecker
2019-09-14 17:43 ` [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected Linus Torvalds
2019-09-17 17:38 ` Eric W. Biederman
2019-09-25 7:51 ` Peter Zijlstra
2019-09-26 1:11 ` Eric W. Biederman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190920230247.GA6449@lenoir \
--to=frederic@kernel.org \
--cc=cl@linux.com \
--cc=cmetcalf@ezchip.com \
--cc=dave@stgolabs.net \
--cc=ebiederm@xmission.com \
--cc=efault@gmx.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@armlinux.org.uk \
--cc=mingo@kernel.org \
--cc=oleg@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=tkhai@yandex.ru \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.