From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Joel Fernandes <joelaf@google.com>,
Qais Yousef <qyousef@google.com>, Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Valentin Schneider <vschneid@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>,
Zimuzo Ezeozue <zezeozue@google.com>,
Youssef Esmat <youssefesmat@google.com>,
Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
"Paul E . McKenney" <paulmck@kernel.org>,
kernel-team@android.com,
Valentin Schneider <valentin.schneider@arm.com>,
"Connor O'Brien" <connoro@google.com>,
John Stultz <jstultz@google.com>
Subject: [PATCH v6 20/20] sched: Add deactivated (sleeping) owner handling to proxy()
Date: Mon, 6 Nov 2023 19:35:03 +0000
Message-ID: <20231106193524.866104-21-jstultz@google.com>
In-Reply-To: <20231106193524.866104-1-jstultz@google.com>
From: Peter Zijlstra <peterz@infradead.org>
Add an implementation of deactivated (sleeping) owner handling,
where we queue the selected task on the deactivated owner task
and deactivate it as well, re-activating it later when the owner
is woken up.
NOTE: This has been particularly challenging to get working
properly, and some of the locking is quite awkward. I'd very
much appreciate review and feedback on ways to simplify this.
Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This was broken out from the larger proxy() patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v6:
* Major rework, replacing the single list head per task with
per-task list head and nodes, creating a tree structure so
we only wake up descendants of the task woken.
* Reworked the locking to take the task->pi_lock, so we can
avoid mid-chain wakeup races from try_to_wake_up() called by
the ww_mutex logic.
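
For reviewers who want a picture of the tree described above, here is
a minimal user-space sketch. It is not the kernel code: it has no
locking, no runqueues, and no pi_lock handling, and the helper names
(task_init, enqueue_on_owner, wake_task) are hypothetical. Only the
field names blocked_head, blocked_node, sleeping_owner and on_rq are
borrowed from the patch. It only illustrates that waking an owner
re-activates that owner's descendants and leaves unrelated sleepers
alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tiny circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical stand-in for the task_struct fields the patch adds. */
struct task {
	bool on_rq;
	struct list_head blocked_head;	/* tasks blocked on us */
	struct list_head blocked_node;	/* our entry on an owner's blocked_head */
	struct task *sleeping_owner;	/* owner our blocked_node is queued on */
};

static void task_init(struct task *t, bool on_rq)
{
	t->on_rq = on_rq;
	list_init(&t->blocked_head);
	list_init(&t->blocked_node);
	t->sleeping_owner = NULL;
}

/* Deactivate @next and park it on the sleeping @owner. */
static void enqueue_on_owner(struct task *owner, struct task *next)
{
	assert(!owner->on_rq && next->on_rq && !next->sleeping_owner);
	next->on_rq = false;
	next->sleeping_owner = owner;
	list_add(&next->blocked_node, &owner->blocked_head);
}

/* Activate @t, then recursively activate everything parked below it. */
static int wake_task(struct task *t)
{
	int woken = 1;

	t->on_rq = true;
	while (!list_empty(&t->blocked_head)) {
		struct list_head *n = t->blocked_head.next;
		struct task *pp = (struct task *)
			((char *)n - offsetof(struct task, blocked_node));

		list_del_init(n);
		pp->sleeping_owner = NULL;
		woken += wake_task(pp);
	}
	return woken;
}
```

In this sketch, if B and C are parked on sleeping owner A, D is parked
on B, and F is parked on an unrelated sleeping owner E, then
wake_task(&A) activates A, B, C and D and leaves F queued on E.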
---
include/linux/sched.h | 3 +
kernel/fork.c | 4 +-
kernel/sched/core.c | 198 ++++++++++++++++++++++++++++++++++++++++--
3 files changed, 196 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9bff2f123207..c5aa0208104f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1148,6 +1148,9 @@ struct task_struct {
struct task_struct *blocked_donor; /* task that is boosting us */
struct mutex *blocked_on; /* lock we're blocked on */
bool blocked_on_waking; /* blocked on, but waking */
+ struct list_head blocked_head; /* tasks blocked on us */
+ struct list_head blocked_node; /* our entry on someone else's blocked_head */
+ struct task_struct *sleeping_owner; /* task our blocked_node is enqueued on */
raw_spinlock_t blocked_lock;
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
diff --git a/kernel/fork.c b/kernel/fork.c
index 6604e0472da0..bbcf2697652f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2459,7 +2459,9 @@ __latent_entropy struct task_struct *copy_process(
p->blocked_donor = NULL; /* nobody is boosting us yet */
p->blocked_on = NULL; /* not blocked yet */
p->blocked_on_waking = false; /* not blocked yet */
-
+ INIT_LIST_HEAD(&p->blocked_head);
+ INIT_LIST_HEAD(&p->blocked_node);
+ p->sleeping_owner = NULL;
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ac7a241dacc..8f87318784d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3804,6 +3804,119 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
trace_sched_wakeup(p);
}
+#ifdef CONFIG_PROXY_EXEC
+static void do_activate_task(struct rq *rq, struct task_struct *p, int en_flags)
+{
+ lockdep_assert_rq_held(rq);
+
+ if (!sched_proxy_exec()) {
+ activate_task(rq, p, en_flags);
+ return;
+ }
+
+ if (p->sleeping_owner) {
+ struct task_struct *owner = p->sleeping_owner;
+
+ raw_spin_lock(&owner->blocked_lock);
+ list_del_init(&p->blocked_node);
+ p->sleeping_owner = NULL;
+ raw_spin_unlock(&owner->blocked_lock);
+ }
+
+ /*
+ * By calling activate_task with blocked_lock held, we order against
+ * the proxy() blocked_task case such that no more blocked tasks will
+ * be enqueued on p once we release p->blocked_lock.
+ */
+ raw_spin_lock(&p->blocked_lock);
+ WARN_ON(task_cpu(p) != cpu_of(rq));
+ activate_task(rq, p, en_flags);
+ raw_spin_unlock(&p->blocked_lock);
+}
+
+static void activate_blocked_ents(struct rq *target_rq,
+ struct task_struct *owner,
+ int wake_flags)
+{
+ unsigned long flags;
+ struct rq_flags rf;
+ int target_cpu = cpu_of(target_rq);
+ int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
+
+ if (wake_flags & WF_MIGRATED)
+ en_flags |= ENQUEUE_MIGRATED;
+ /*
+ * A whole bunch of 'proxy' tasks back this blocked task, wake
+ * them all up to give this task its 'fair' share.
+ */
+ raw_spin_lock(&owner->blocked_lock);
+ while (!list_empty(&owner->blocked_head)) {
+ struct task_struct *pp;
+ unsigned int state;
+
+ pp = list_first_entry(&owner->blocked_head,
+ struct task_struct,
+ blocked_node);
+ BUG_ON(pp == owner);
+ list_del_init(&pp->blocked_node);
+ WARN_ON(!pp->sleeping_owner);
+ pp->sleeping_owner = NULL;
+ raw_spin_unlock(&owner->blocked_lock);
+
+ /* Nested as ttwu holds the owner's pi_lock */
+ /* XXX But how do we enforce ordering to avoid ABBA? */
+ raw_spin_lock_irqsave_nested(&pp->pi_lock, flags, SINGLE_DEPTH_NESTING);
+ smp_rmb();
+ state = READ_ONCE(pp->__state);
+ /* Avoid racing with ttwu */
+ if (state == TASK_WAKING) {
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+ raw_spin_lock(&owner->blocked_lock);
+ continue;
+ }
+ if (READ_ONCE(pp->on_rq)) {
+ /*
+ * We raced with a non mutex handoff activation of pp.
+ * That activation will also take care of activating
+ * all of the tasks after pp in the blocked_entry list,
+ * so we're done here.
+ */
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+ raw_spin_lock(&owner->blocked_lock);
+ continue;
+ }
+
+ __set_task_cpu(pp, target_cpu);
+
+ rq_lock_irqsave(target_rq, &rf);
+ update_rq_clock(target_rq);
+ do_activate_task(target_rq, pp, en_flags);
+ resched_curr(target_rq);
+ rq_unlock_irqrestore(target_rq, &rf);
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+
+ /* recurse */
+ activate_blocked_ents(target_rq, pp, wake_flags);
+
+ raw_spin_lock(&owner->blocked_lock);
+ }
+ raw_spin_unlock(&owner->blocked_lock);
+}
+
+#else
+static inline void do_activate_task(struct rq *rq, struct task_struct *p,
+ int en_flags)
+{
+ activate_task(rq, p, en_flags);
+}
+
+static inline void activate_blocked_ents(struct rq *target_rq,
+ struct task_struct *owner,
+ int wake_flags)
+{
+}
+#endif
+
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
@@ -3825,7 +3938,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
atomic_dec(&task_rq(p)->nr_iowait);
}
- activate_task(rq, p, en_flags);
+ do_activate_task(rq, p, en_flags);
+
check_preempt_curr(rq, p, wake_flags);
ttwu_do_wakeup(p);
@@ -3922,13 +4036,19 @@ void sched_ttwu_pending(void *arg)
update_rq_clock(rq);
llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
+ int wake_flags;
if (WARN_ON_ONCE(p->on_cpu))
smp_cond_load_acquire(&p->on_cpu, !VAL);
if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
set_task_cpu(p, cpu_of(rq));
- ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+ wake_flags = p->sched_remote_wakeup ? WF_MIGRATED : 0;
+ ttwu_do_activate(rq, p, wake_flags, &rf);
+ rq_unlock(rq, &rf);
+ activate_blocked_ents(rq, p, wake_flags);
+ rq_lock(rq, &rf);
+ update_rq_clock(rq);
}
/*
@@ -4069,6 +4189,15 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
update_rq_clock(rq);
ttwu_do_activate(rq, p, wake_flags, &rf);
rq_unlock(rq, &rf);
+
+ /*
+ * When activating blocked ents, we will take each entity's
+ * pi_lock, so drop the owner's. Would love suggestions for
+ * a better approach.
+ */
+ raw_spin_unlock(&p->pi_lock);
+ activate_blocked_ents(rq, p, wake_flags);
+ raw_spin_lock(&p->pi_lock);
}
/*
@@ -6778,6 +6907,31 @@ static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
return false;
}
+static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct *owner,
+ struct task_struct *next)
+{
+ /*
+ * ttwu_activate() will pick them up and place them on whatever rq
+ * @owner will run next.
+ */
+ if (!owner->on_rq) {
+ BUG_ON(!next->on_rq);
+ deactivate_task(rq, next, DEQUEUE_SLEEP);
+ if (task_current_selected(rq, next)) {
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ }
+ /*
+ * ttwu_do_activate must not have a chance to activate @next
+ * elsewhere before it's fully extricated from its old rq.
+ */
+ WARN_ON(next->sleeping_owner);
+ next->sleeping_owner = owner;
+ smp_mb();
+ list_add(&next->blocked_node, &owner->blocked_head);
+ }
+}
+
/*
* Find who @next (currently blocked on a mutex) can proxy for.
*
@@ -6807,7 +6961,6 @@ static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
static struct task_struct *
proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
{
- struct task_struct *ret = NULL;
struct task_struct *p = next;
struct task_struct *owner = NULL;
bool curr_in_chain = false;
@@ -6886,12 +7039,41 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
}
if (!owner->on_rq) {
- /* XXX Don't handle blocked owners yet */
- if (!proxy_deactivate(rq, next))
- ret = next;
- raw_spin_unlock(&p->blocked_lock);
+ /*
+ * rq->curr must not be added to the blocked_head list or else
+ * ttwu_do_activate could enqueue it elsewhere before it switches
+ * out here. The approach to avoiding this is the same as in the
+ * migrate_task case.
+ */
+ if (curr_in_chain) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return proxy_resched_idle(rq, next);
+ }
+
+ /*
+ * If !@owner->on_rq, holding @rq->lock will not pin the task,
+ * so we cannot drop @mutex->wait_lock until we're sure it's a blocked
+ * task on this rq.
+ *
+ * We use @owner->blocked_lock to serialize against ttwu_activate().
+ * Either we see its new owner->on_rq or it will see our list_add().
+ */
+ if (owner != p) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_lock(&owner->blocked_lock);
+ }
+
+ proxy_enqueue_on_owner(rq, owner, next);
+
+ if (task_current_selected(rq, next)) {
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ }
+ raw_spin_unlock(&owner->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
- return ret;
+
+ return NULL; /* retry task selection */
}
if (owner == p) {
--
2.42.0.869.gea05f2083d-goog