From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <jstultz@google.com>,
Joel Fernandes <joelaf@google.com>,
Qais Yousef <qyousef@google.com>, Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Valentin Schneider <vschneid@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>,
Zimuzo Ezeozue <zezeozue@google.com>,
Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
"Paul E . McKenney" <paulmck@kernel.org>
Subject: [PATCH v3 00/14] Generalized Priority Inheritance via Proxy Execution v3
Date: Tue, 11 Apr 2023 04:24:57 +0000
Message-ID: <20230411042511.1606592-1-jstultz@google.com>
As mentioned last time, this Proxy Execution series has a long history:
it was first described in a paper[1] by Watkins, Straub, and Niehaus,
then developed in patches from Peter Zijlstra, and extended with lots of
work by Juri Lelli, Valentin Schneider, and Connor O'Brien. (And thank
you to Steven Rostedt for providing additional details here!)
So again, many thanks to those above, as all the credit for this series
really is due to them - while the mistakes are likely mine.
Overview:
----------
Proxy Execution is a generalized form of priority inheritance. Classic
priority inheritance works well for real-time tasks where there is a
straightforward priority order to how things are run. But it breaks
down when used between CFS tasks, as there are lots of parameters
involved outside of just the task’s nice value when selecting the next
task to run (via pick_next_task()). So ideally we want to imbue the
mutex holder with all the scheduler attributes of the blocked waiting
task.
Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked on, and
which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so that we
instead run the mutex holder.
The first of these is the most difficult to grasp (I do get the mental
friction here: blocked tasks on the *run*queue sounds like nonsense!
Personally I like to think of the runqueue in this model more like a
“task-selection queue”).
By leaving blocked tasks on the runqueue, we allow pick_next_task() to
choose the task that should run next (even if it’s blocked waiting on a
mutex). If we do select a blocked task, we look at the task’s blocked_on
mutex and from there look at the mutex’s owner task. And in the simple
case, the task which owns the mutex is what we then choose to run,
allowing it to release the mutex.
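
The simple case described above can be pictured with a small userspace
model. This is only a sketch under stated assumptions: struct task_model,
struct mutex_model, find_exec_task() and the max_depth guard are
hypothetical names invented for illustration, not code from this series,
and the real patches handle many more cases (sleeping owners, owners on
other CPUs, migration, locking).

```c
#include <assert.h>
#include <stddef.h>

struct mutex_model;

/* Hypothetical, simplified stand-in for task_struct. */
struct task_model {
	const char *name;
	struct mutex_model *blocked_on;	/* mutex this task waits on, or NULL */
};

/* Hypothetical, simplified stand-in for struct mutex. */
struct mutex_model {
	struct task_model *owner;	/* task holding the mutex, or NULL */
};

/*
 * Given the task chosen by the pick step, return the task that should
 * actually execute. Follows blocked_on -> owner links until it reaches a
 * task that is not blocked. Returns NULL if a mutex in the chain has no
 * owner, or if the chain is longer than max_depth (a crude guard against
 * cycles in this model).
 */
static struct task_model *find_exec_task(struct task_model *picked,
					 int max_depth)
{
	struct task_model *p = picked;

	assert(picked != NULL);
	while (p->blocked_on) {
		if (max_depth-- <= 0)
			return NULL;
		p = p->blocked_on->owner;
		if (!p)
			return NULL;
	}
	return p;
}
```

In this model, if task A is blocked on a mutex owned by B, and B is
blocked on a mutex owned by C, then picking A yields C as the task to
run.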
This means that instead of just tracking “curr”, the scheduler needs to
track both the scheduler context (what was picked and all the state used
for scheduling decisions), and the execution context (what we’re
running).
In this way, the mutex owner is run “on behalf” of the blocked task
that was picked to run, essentially inheriting the scheduler context of
the blocked task.
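
As a rough illustration of the split between scheduler context and
execution context, here is a minimal userspace sketch. The names
struct ctx_task, struct rq_model, rq_set_contexts() and rq_is_proxying()
are hypothetical and exist only for this example; the series itself
uses its own accessors on the real runqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a task. */
struct ctx_task {
	const char *name;
};

/* Hypothetical model of a runqueue that tracks both contexts. */
struct rq_model {
	struct ctx_task *selected;	/* scheduler context: what was picked */
	struct ctx_task *curr;		/* execution context: what actually runs */
};

/* Record which task was picked and which task will execute. */
static void rq_set_contexts(struct rq_model *rq, struct ctx_task *selected,
			    struct ctx_task *exec)
{
	assert(rq && selected && exec);
	rq->selected = selected;
	rq->curr = exec;
}

/* Nonzero when the running task acts on behalf of a different picked task. */
static int rq_is_proxying(const struct rq_model *rq)
{
	return rq->selected != rq->curr;
}
```

When the picked task is not blocked, both pointers refer to the same
task and rq_is_proxying() returns 0; they only diverge while a mutex
owner runs on behalf of a blocked waiter.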
As Connor outlined in a previous submission of this patch series, this
raises a number of complicated situations: The mutex owner might itself
be blocked on another mutex, or it could be sleeping, running on a
different CPU, in the process of migrating between CPUs, etc.
But the functionality provided is useful, as in Android we have a number
of cases where we are seeing priority inversion (not unbounded, but
longer than we’d like) between “foreground” and “background”
SCHED_NORMAL applications, so having a generalized solution would be
very useful.
New in v3:
-------
* While not a functional change, the biggest rework in this version is
probably my renaming of the rq->proxy (or rq_proxy() in v2) pointer to
rq_selected() as I think it helps clarify the patch. Previously it was
using “proxy” as the name for the scheduler context, which is sort of
inverted from how the idea is explained - the proxy in proxy execution
should be the task running on behalf of the selected blocked task.
* Fix for the CPU runtime accounting issue that Joel Fernandes
demonstrated in Connor’s earlier submission[2]. We now charge the
running task for cputime, but charge vruntime to the selected task on
whose behalf we’re running.
* As Dietmar noticed earlier[3], rq_pin_lock() was complaining with
SCHED_WARN when calls from pick_next_task() would queue callbacks,
which would not be handled before the next call to rq_pin_lock().
I’ve added extra calls to __balance_callbacks() to address this and
resolve the warnings.
* Re-added “locking/mutex: make mutex::wait_lock irq safe”. An earlier
review questioned whether it was necessary, so I had dropped it in v2,
but further testing found lockdep tripping pretty quickly without
it.
* Fixed null pointer crashes in the deadline load balancing rework that
additional testing uncovered.
* Build fixups Reported-by: kernel test robot <lkp@intel.com>
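
The accounting split mentioned in the list above (cputime charged to
the running task, vruntime charged to the selected task) can be pictured
with the following sketch. It is a deliberately simplified model:
struct acct_task and charge_runtime() are hypothetical names, and the
vruntime update here ignores the load-weight scaling that CFS applies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified per-task accounting state. */
struct acct_task {
	uint64_t sum_exec_runtime;	/* ns of CPU time actually consumed */
	uint64_t vruntime;		/* virtual runtime, unweighted in this model */
};

/*
 * Charge delta_ns of execution. The task that physically ran pays the
 * CPU time; the task whose scheduler context was used pays the vruntime.
 * If no proxying is happening, running and selected are the same task.
 */
static void charge_runtime(struct acct_task *running,
			   struct acct_task *selected, uint64_t delta_ns)
{
	assert(running && selected);
	running->sum_exec_runtime += delta_ns;
	selected->vruntime += delta_ns;
}
```

The intent of splitting the charge this way is that the blocked waiter
keeps paying, in fairness terms, for the time the owner runs on its
behalf, while CPU time statistics still reflect which task actually
executed.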
Issues still to address:
----------
In preparation for OSPM next week, I wanted to go ahead and share the
patch series now, but there is still more to work on:
* Recently I’ve been tripping over a deadlock caused by what looks like
a circular blocked_on relationship, which appears to be due to
misaccounting the blocked_on pointer somewhere. I’m still digging into
this.
* RT/DL load balancing. There is a scheduling invariant that we always
need to run the top N highest priority RT tasks across the N cpus.
However keeping blocked tasks on the runqueue greatly complicates the
load balancing for this. Connor took an initial stab at this with
“chain level balancing” included in this series. Feedback on this
would be appreciated!
* CFS load balancing. Blocked tasks may carry forward load (PELT) to
the lock owner's CPU, so that CPU may look overloaded.
* The cfs_rq->curr gets set in pick_next_task_fair() which means it
points to the selected task, not the task to be run. This muddies
things as cfs_rq->curr and rq_curr() may point to different tasks.
I suspect further renaming or pushing down the split context awareness
will be needed for this to be cleaner.
* Resolving open questions in comments: I’ve left these in for now, but
I hope to review them and make some choices where there are open
questions. If folks have specific feedback or suggestions here, that
would be great!
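
For the circular blocked_on relationship mentioned in the list above,
one possible debugging aid is a cycle check over the blocked_on/owner
chain. The sketch below is a userspace model using Floyd's
tortoise-and-hare technique; struct dbg_task, struct dbg_mutex,
chain_next() and blocked_on_chain_has_cycle() are hypothetical names
for illustration, and this is not something the series is known to
include. A real in-kernel check would also need appropriate locking.

```c
#include <stddef.h>

struct dbg_mutex;

/* Hypothetical, simplified task: only the blocked_on link is modeled. */
struct dbg_task {
	struct dbg_mutex *blocked_on;	/* mutex this task waits on, or NULL */
};

/* Hypothetical, simplified mutex: only the owner link is modeled. */
struct dbg_mutex {
	struct dbg_task *owner;		/* task holding the mutex, or NULL */
};

/* One step along the chain: owner of the mutex t is blocked on, or NULL. */
static struct dbg_task *chain_next(struct dbg_task *t)
{
	if (!t || !t->blocked_on)
		return NULL;
	return t->blocked_on->owner;
}

/*
 * Floyd's cycle detection: advance one pointer by one step and another
 * by two steps. They can only meet again if the chain loops back on
 * itself. Returns 1 if a cycle is reachable from start, 0 otherwise.
 */
static int blocked_on_chain_has_cycle(struct dbg_task *start)
{
	struct dbg_task *slow = start, *fast = start;

	while (fast) {
		slow = chain_next(slow);
		fast = chain_next(chain_next(fast));
		if (fast && slow == fast)
			return 1;
	}
	return 0;
}
```

The two-pointer approach needs no extra per-task state, which keeps the
check cheap enough to drop into a debug path in a model like this.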
Performance:
----------
This patch series switches mutexes to use handoff mode rather than
optimistic spinning. This is a potential concern where locks are under
high contention. However, so far in our initial performance analysis (on
both x86 and mobile devices) we’ve not seen any major regressions. That
said, Chenyu did report a regression[4], which we’ll need to look
further into.
Review and feedback would be greatly appreciated!
If folks find it easier to test/tinker with, this patch series can also
be found here:
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v3-6.3-rc6
Thanks so much!
-john
[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[2] https://lore.kernel.org/lkml/Y0y8iURTSAv7ZspC@google.com/
[3] https://lore.kernel.org/lkml/dab347c1-3724-8ac6-c051-9d2caea20101@arm.com/
[4] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Connor O'Brien (1):
sched: Attempt to fix rt/dl load balancing via chain level balance
John Stultz (3):
sched: Replace rq->curr access w/ rq_curr(rq)
sched: Unnest ttwu_runnable in prep for proxy-execution
sched: Fix runtime accounting w/ proxy-execution
Juri Lelli (2):
locking/mutex: make mutex::wait_lock irq safe
locking/mutex: Expose mutex_owner()
Peter Zijlstra (6):
locking/ww_mutex: Remove wakeups from under mutex::wait_lock
locking/mutex: Rework task_struct::blocked_on
locking/mutex: Add task_struct::blocked_lock to serialize changes to
the blocked_on state
sched: Unify runtime accounting across classes
sched: Split scheduler execution context
sched: Add proxy execution
Valentin Schneider (2):
locking/mutex: Add p->blocked_on wrappers
sched/rt: Fix proxy/current (push,pull)ability
include/linux/mutex.h | 2 +
include/linux/sched.h | 24 +-
include/linux/ww_mutex.h | 3 +
init/Kconfig | 7 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 2 +-
kernel/fork.c | 6 +-
kernel/locking/mutex-debug.c | 9 +-
kernel/locking/mutex.c | 117 ++++-
kernel/locking/ww_mutex.h | 32 +-
kernel/sched/core.c | 802 ++++++++++++++++++++++++++++++++---
kernel/sched/core_sched.c | 2 +-
kernel/sched/cpudeadline.c | 12 +-
kernel/sched/cpudeadline.h | 3 +-
kernel/sched/cpupri.c | 29 +-
kernel/sched/cpupri.h | 6 +-
kernel/sched/cputime.c | 4 +-
kernel/sched/deadline.c | 220 ++++++----
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 127 ++++--
kernel/sched/idle.c | 4 +-
kernel/sched/membarrier.c | 22 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 301 +++++++++----
kernel/sched/sched.h | 282 +++++++++++-
kernel/sched/stop_task.c | 13 +-
26 files changed, 1664 insertions(+), 370 deletions(-)
--
2.40.0.577.gac1e443424-goog