* [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
@ 2024-07-11 12:59 Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header Valentin Schneider
` (9 more replies)
0 siblings, 10 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 12:59 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Hi folks,
The patches are based on top of tip/sched/core at:
db43a609d01e ("sched: Update MAINTAINERS and CREDITS")
They are also available at:
https://gitlab.com/vschneid/linux.git -b mainline/sched/fair/throttle-on-exit-all-v2
Problem statement
=================
CFS tasks can end up throttled while holding locks that other, non-throttled
tasks are blocking on.
For !PREEMPT_RT, this is a source of latency: tasks blocking on the lock
cannot acquire it until the throttled holder is replenished and runs again.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter the
slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued on
the same CPU as one where ktimers/ksoftirqd is blocked on read_lock(&lock),
this creates a circular dependency.
This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
The linux-rt tree has had
1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
Proposed approach
=================
Peter mentioned [1] that there have been discussions on changing /when/ the
throttling happens: rather than have it be done immediately upon updating
the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
for the task to be about to return to userspace: if it's in userspace, it can't
hold any in-kernel lock.
Ben Segall proposed a different approach [2] which I worked on at [3], but I
eventually lost faith in all the .*nr_running hackery that it required.
Furthermore, talking about this at OSPM24, I felt like the first approach had
more appeal, so guess who's back (back again).
See the last patch for implementation considerations.
I've tried splitting this into as many bite-size patches as possible, but it's
tricky to further split that one last patch switching from cfs_rq-wide throttling
to per-task throttling, apologies for the big diff.
Testing
=======
Tested on QEMU via the below with the usual DEBUG options enabled.
mount -t cgroup -o cpu none /root/cpu
mkdir /root/cpu/cg0
echo 10000 > /root/cpu/cg0/cpu.cfs_period_us
echo 1000 > /root/cpu/cg0/cpu.cfs_quota_us
mkdir /root/cpu/cg0/cg00
mkdir /root/cpu/cg0/cg01
mkdir /root/cpu/cg0/cg00/cg000
mkdir /root/cpu/cg0/cg00/cg001
spawn() {
    while true; do cat /sys/devices/system/cpu/smt/active &>/dev/null; done &
    PID=$!
    echo "Starting PID${PID}"
    echo $PID > $1
}
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/cg01/tasks
spawn cpu/cg0/cg00/cg000/tasks
spawn cpu/cg0/cg00/cg001/tasks
sleep 120
kill $(jobs -p)
I've also looked at the PELT side of things using rt-app + LISA. The PELT behaviour
itself looks mostly the same, but the execution patterns change, again see last
patch for details.
Links
=====
[1]: https://lore.kernel.org/lkml/20231031160120.GE15024@noisy.programming.kicks-ass.net/
[2]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com
[3]: https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/
Valentin Schneider (10):
rcuwait: Split type definition to its own header
irq_work: Split type definition to its own header
task_work, sched: Add a _locked variant to task_work_cancel()
sched/fair: Introduce sched_throttle_work
sched/fair: Introduce an irq_work for cancelling throttle task_work
sched/fair: Prepare switched_from & switched_to for per-task
throttling
sched/fair: Prepare task_change_group_fair() for per-task throttling
sched/fair: Prepare migrate_task_rq_fair() for per-task throttling
sched/fair: Add a class->task_woken callback in preparation for
per-task throttling
sched/fair: Throttle CFS tasks on return to userspace
include/linux/irq_work.h | 8 +-
include/linux/irq_work_types.h | 14 +
include/linux/rcuwait.h | 9 +-
include/linux/rcuwait_types.h | 16 ++
include/linux/sched.h | 6 +
kernel/sched/core.c | 4 +
kernel/sched/fair.c | 497 +++++++++++++++++++++++++--------
kernel/sched/sched.h | 6 +
kernel/sched/task_work_sched.h | 14 +
kernel/task_work.c | 67 ++++-
10 files changed, 499 insertions(+), 142 deletions(-)
create mode 100644 include/linux/irq_work_types.h
create mode 100644 include/linux/rcuwait_types.h
create mode 100644 kernel/sched/task_work_sched.h
--
2.43.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
@ 2024-07-11 12:59 ` Valentin Schneider
2024-07-12 15:15 ` Peter Zijlstra
2024-07-11 12:59 ` [RFC PATCH v3 02/10] irq_work: " Valentin Schneider
` (8 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 12:59 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
A later commit will add a struct irq_work member to struct task_struct, which
would require <linux/sched.h> to include the definition of struct
irq_work.
Thanks to header dependency hell, including <linux/irq_work.h> in <linux/sched.h>
results in defining inline helpers using not-yet-defined fields (mm_struct,
task_struct, various task states...).
Break off the definition of struct rcuwait into its own header file.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/rcuwait.h | 9 ++-------
include/linux/rcuwait_types.h | 16 ++++++++++++++++
2 files changed, 18 insertions(+), 7 deletions(-)
create mode 100644 include/linux/rcuwait_types.h
diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 27343424225cf..1f1ca7d38cdf8 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -5,17 +5,12 @@
#include <linux/rcupdate.h>
#include <linux/sched/signal.h>
+#include <linux/rcuwait_types.h>
+
/*
* rcuwait provides a way of blocking and waking up a single
* task in an rcu-safe manner.
- *
- * The only time @task is non-nil is when a user is blocked (or
- * checking if it needs to) on a condition, and reset as soon as we
- * know that the condition has succeeded and are awoken.
*/
-struct rcuwait {
- struct task_struct __rcu *task;
-};
#define __RCUWAIT_INITIALIZER(name) \
{ .task = NULL, }
diff --git a/include/linux/rcuwait_types.h b/include/linux/rcuwait_types.h
new file mode 100644
index 0000000000000..60a4385a2c368
--- /dev/null
+++ b/include/linux/rcuwait_types.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RCUWAIT_TYPES_H_
+#define _LINUX_RCUWAIT_TYPES_H_
+
+#include <linux/sched.h>
+
+/*
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
+ */
+struct rcuwait {
+ struct task_struct __rcu *task;
+};
+
+#endif
--
2.43.0
* [RFC PATCH v3 02/10] irq_work: Split type definition to its own header
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header Valentin Schneider
@ 2024-07-11 12:59 ` Valentin Schneider
2024-07-12 15:17 ` Peter Zijlstra
2024-07-11 12:59 ` [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel() Valentin Schneider
` (7 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 12:59 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
A later commit will add a struct irq_work member to struct task_struct, which
would require <linux/sched.h> to include the definition of struct
irq_work.
Thanks to header dependency hell, including <linux/irq_work.h> in <linux/sched.h>
results in defining inline helpers using not-yet-defined fields (mm_struct,
task_struct, various task states...).
Break off the definition of struct irq_work into its own header file.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/irq_work.h | 8 ++------
include/linux/irq_work_types.h | 14 ++++++++++++++
2 files changed, 16 insertions(+), 6 deletions(-)
create mode 100644 include/linux/irq_work_types.h
diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 136f2980cba30..7f6d2af360d91 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -5,6 +5,8 @@
#include <linux/smp_types.h>
#include <linux/rcuwait.h>
+#include <linux/irq_work_types.h>
+
/*
* An entry can be in one of four states:
*
@@ -14,12 +16,6 @@
* busy NULL, 2 -> {free, claimed} : callback in progress, can be claimed
*/
-struct irq_work {
- struct __call_single_node node;
- void (*func)(struct irq_work *);
- struct rcuwait irqwait;
-};
-
#define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){ \
.node = { .u_flags = (_flags), }, \
.func = (_func), \
diff --git a/include/linux/irq_work_types.h b/include/linux/irq_work_types.h
new file mode 100644
index 0000000000000..108cbc514733b
--- /dev/null
+++ b/include/linux/irq_work_types.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IRQ_WORK_TYPES_H
+#define _LINUX_IRQ_WORK_TYPES_H
+
+#include <linux/smp_types.h>
+#include <linux/rcuwait_types.h>
+
+struct irq_work {
+ struct __call_single_node node;
+ void (*func)(struct irq_work *);
+ struct rcuwait irqwait;
+};
+
+#endif
--
2.43.0
* [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel()
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 02/10] irq_work: " Valentin Schneider
@ 2024-07-11 12:59 ` Valentin Schneider
2024-07-12 10:35 ` Oleg Nesterov
2024-07-12 15:20 ` Peter Zijlstra
2024-07-11 12:59 ` [RFC PATCH v3 04/10] sched/fair: Introduce sched_throttle_work Valentin Schneider
` (6 subsequent siblings)
9 siblings, 2 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 12:59 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Later commits will need to issue a task_work_cancel() from within the
scheduler with the task's ->pi_lock held.
Add a _locked variant that expects p->pi_lock to be held. Expose it in a
separate scheduler header file, as this really is a scheduler-only
interface.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/task_work_sched.h | 14 +++++++
kernel/task_work.c | 67 ++++++++++++++++++++++++++--------
2 files changed, 66 insertions(+), 15 deletions(-)
create mode 100644 kernel/sched/task_work_sched.h
diff --git a/kernel/sched/task_work_sched.h b/kernel/sched/task_work_sched.h
new file mode 100644
index 0000000000000..e235da456427f
--- /dev/null
+++ b/kernel/sched/task_work_sched.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Scheduler internal task_work methods
+ */
+#ifndef _KERNEL_TASK_WORK_SCHED_H
+#define _KERNEL_TASK_WORK_SCHED_H
+
+#include <linux/task_work.h>
+#include <linux/sched.h>
+
+struct callback_head *
+task_work_cancel_locked(struct task_struct *task, task_work_func_t func);
+
+#endif
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1dab..81092bc2e7371 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,6 +3,8 @@
#include <linux/task_work.h>
#include <linux/resume_user_mode.h>
+#include "sched/task_work_sched.h"
+
static struct callback_head work_exited; /* all we need is ->next == NULL */
/**
@@ -74,33 +76,20 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
return 0;
}
-/**
- * task_work_cancel_match - cancel a pending work added by task_work_add()
- * @task: the task which should execute the work
- * @match: match function to call
- * @data: data to be passed in to match function
- *
- * RETURNS:
- * The found work or NULL if not found.
- */
-struct callback_head *
-task_work_cancel_match(struct task_struct *task,
+static struct callback_head *
+task_work_cancel_match_locked(struct task_struct *task,
bool (*match)(struct callback_head *, void *data),
void *data)
{
struct callback_head **pprev = &task->task_works;
struct callback_head *work;
- unsigned long flags;
- if (likely(!task_work_pending(task)))
- return NULL;
/*
* If cmpxchg() fails we continue without updating pprev.
* Either we raced with task_work_add() which added the
* new entry before this work, we will find it again. Or
* we raced with task_work_run(), *pprev == NULL/exited.
*/
- raw_spin_lock_irqsave(&task->pi_lock, flags);
work = READ_ONCE(*pprev);
while (work) {
if (!match(work, data)) {
@@ -109,6 +98,32 @@ task_work_cancel_match(struct task_struct *task,
} else if (try_cmpxchg(pprev, &work, work->next))
break;
}
+
+ return work;
+}
+
+/**
+ * task_work_cancel_match - cancel a pending work added by task_work_add()
+ * @task: the task which should execute the work
+ * @match: match function to call
+ * @data: data to be passed in to match function
+ *
+ * RETURNS:
+ * The found work or NULL if not found.
+ */
+struct callback_head *
+task_work_cancel_match(struct task_struct *task,
+ bool (*match)(struct callback_head *, void *data),
+ void *data)
+{
+ unsigned long flags;
+ struct callback_head *work;
+
+ if (likely(!task_work_pending(task)))
+ return NULL;
+
+ raw_spin_lock_irqsave(&task->pi_lock, flags);
+ work = task_work_cancel_match_locked(task, match, data);
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
return work;
@@ -136,6 +151,28 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
return task_work_cancel_match(task, task_work_func_match, func);
}
+/**
+ * task_work_cancel_locked - cancel a pending work added by task_work_add()
+ * @task: the task which should execute the work
+ * @func: identifies the work to remove
+ *
+ * Find the last queued pending work with ->func == @func and remove it
+ * from queue. Unlike task_work_cancel(), expects @task->pi_lock to be held.
+ *
+ * RETURNS:
+ * The found work or NULL if not found.
+ */
+struct callback_head *
+task_work_cancel_locked(struct task_struct *task, task_work_func_t func)
+{
+ lockdep_assert_held(&task->pi_lock);
+
+ if (likely(!task_work_pending(task)))
+ return NULL;
+
+ return task_work_cancel_match_locked(task, task_work_func_match, func);
+}
+
/**
* task_work_run - execute the works added by task_work_add()
*
--
2.43.0
* [RFC PATCH v3 04/10] sched/fair: Introduce sched_throttle_work
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (2 preceding siblings ...)
2024-07-11 12:59 ` [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel() Valentin Schneider
@ 2024-07-11 12:59 ` Valentin Schneider
2024-07-12 15:21 ` Peter Zijlstra
2024-07-11 12:59 ` [RFC PATCH v3 05/10] sched/fair: Introduce an irq_work for cancelling throttle task_work Valentin Schneider
` (5 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 12:59 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Later commits will change CFS bandwidth control throttling from a
per-cfs_rq basis to a per-task basis. Actual throttling of a task will
happen in the return to user path, which will be implemented via a
task_work callback.
To ease reviewing, the infrastructure and helpers are added first, the
actual behaviour will be implemented when switching to per-task
throttling.
Add a task_work node to struct task_struct, and have it initialised at
sched_fork().
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 4 ++++
kernel/sched/fair.c | 12 ++++++++++++
kernel/sched/sched.h | 2 ++
4 files changed, 19 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e3..a4976eb5065fc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -812,6 +812,7 @@ struct task_struct {
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
+ struct callback_head sched_throttle_work;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d35c48239be0..b811670d2c362 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4329,6 +4329,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.cfs_rq = NULL;
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_throttle_work(p);
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->stats, 0, sizeof(p->stats));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9057584ec06de..775547cdd3ce0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5759,6 +5759,18 @@ static int tg_throttle_down(struct task_group *tg, void *data)
return 0;
}
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+ /* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+ init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+}
+
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c36cc6803617..943bca8263ffe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2510,6 +2510,8 @@ extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);
+extern void init_cfs_throttle_work(struct task_struct *p);
+
extern void reweight_task(struct task_struct *p, const struct load_weight *lw);
extern void resched_curr(struct rq *rq);
--
2.43.0
* [RFC PATCH v3 05/10] sched/fair: Introduce an irq_work for cancelling throttle task_work
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (3 preceding siblings ...)
2024-07-11 12:59 ` [RFC PATCH v3 04/10] sched/fair: Introduce sched_throttle_work Valentin Schneider
@ 2024-07-11 12:59 ` Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 06/10] sched/fair: Prepare switched_from & switched_to for per-task throttling Valentin Schneider
` (4 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 12:59 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Moving towards per-task throttling, the soon-to-be task_work used for the
actual throttling will need to be cancelled when a task is moving out of a
throttled cfs_rq and into a non-throttled cfs_rq (or out of CFS
altogether).
Such code paths will have at least the rq lock held, sometimes both the rq
and the p->pi_lock locks held. Functions such as migrate_task_rq_fair()
don't have guarantees as to which of the two is held, as such the
cancellation will need to happen in a separate context.
It will be punted to irq_work context; the groundwork is added here, and
the irq_work callback will be implemented when switching to per-task
throttling.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/sched.h | 4 ++++
kernel/sched/fair.c | 6 ++++++
2 files changed, 10 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4976eb5065fc..99a1e77d769db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -46,6 +46,7 @@
#include <linux/livepatch_sched.h>
#include <linux/uidgid_types.h>
#include <asm/kmap_size.h>
+#include <linux/irq_work_types.h>
/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -813,6 +814,9 @@ struct task_struct {
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
struct callback_head sched_throttle_work;
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct irq_work unthrottle_irq_work;
+#endif
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 775547cdd3ce0..095357bd17f0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5764,11 +5764,17 @@ static void throttle_cfs_rq_work(struct callback_head *work)
}
+static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
+{
+ /* Write me */
+}
+
void init_cfs_throttle_work(struct task_struct *p)
{
/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
p->sched_throttle_work.next = &p->sched_throttle_work;
init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+ p->unthrottle_irq_work = IRQ_WORK_INIT_HARD(task_throttle_cancel_irq_work_fn);
}
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
--
2.43.0
* [RFC PATCH v3 06/10] sched/fair: Prepare switched_from & switched_to for per-task throttling
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (4 preceding siblings ...)
2024-07-11 12:59 ` [RFC PATCH v3 05/10] sched/fair: Introduce an irq_work for cancelling throttle task_work Valentin Schneider
@ 2024-07-11 13:00 ` Valentin Schneider
2024-07-12 15:26 ` Peter Zijlstra
2024-07-11 13:00 ` [RFC PATCH v3 07/10] sched/fair: Prepare task_change_group_fair() " Valentin Schneider
` (3 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 13:00 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Later commits will change CFS bandwidth control throttling from a
per-cfs_rq basis to a per-task basis. This means special care needs to be
taken around any transition a task can have into and out of a cfs_rq.
To ease reviewing, the transitions are patched with dummy-helpers that are
implemented later on.
Add helpers to switched_from_fair() and switched_to_fair() to cover class
changes. If switching from CFS, a task may need to be unthrottled. If
switching to CFS, a task may need to be throttled.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095357bd17f0e..acac0829c71f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5694,6 +5694,10 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttle_count;
}
+static inline bool task_needs_throttling(struct task_struct *p) { return false; }
+static inline void task_throttle_setup(struct task_struct *p) { }
+static inline void task_throttle_cancel(struct task_struct *p) { }
+
/*
* Ensure that neither of the group entities corresponding to src_cpu or
* dest_cpu are members of a throttled hierarchy when performing group
@@ -6622,6 +6626,10 @@ static inline int throttled_lb_pair(struct task_group *tg,
return 0;
}
+static inline bool task_needs_throttling(struct task_struct *p) { return false; }
+static inline void task_throttle_setup(struct task_struct *p) { }
+static inline void task_throttle_cancel(struct task_struct *p) { }
+
#ifdef CONFIG_FAIR_GROUP_SCHED
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -12847,11 +12855,15 @@ static void attach_task_cfs_rq(struct task_struct *p)
static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
detach_task_cfs_rq(p);
+ if (cfs_bandwidth_used())
+ task_throttle_cancel(p);
}
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
attach_task_cfs_rq(p);
+ if (cfs_bandwidth_used() && task_needs_throttling(p))
+ task_throttle_setup(p);
set_task_max_allowed_capacity(p);
--
2.43.0
* [RFC PATCH v3 07/10] sched/fair: Prepare task_change_group_fair() for per-task throttling
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (5 preceding siblings ...)
2024-07-11 13:00 ` [RFC PATCH v3 06/10] sched/fair: Prepare switched_from & switched_to for per-task throttling Valentin Schneider
@ 2024-07-11 13:00 ` Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 08/10] sched/fair: Prepare migrate_task_rq_fair() " Valentin Schneider
` (2 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 13:00 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Later commits will change CFS bandwidth control throttling from a
per-cfs_rq basis to a per-task basis. This means special care needs to be
taken around any transition a task can have into and out of a cfs_rq.
To ease reviewing, the transitions are patched with dummy-helpers that are
implemented later on.
Add helpers to task_change_group_fair() to cover CPU cgroup
migration. If changing to a throttled taskgroup/cfs_rq, the task needs to
be throttled. Conversely, if the task is already throttled but changing to
a taskgroup/cfs_rq that still has some runtime, the task must be unthrottled.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acac0829c71f3..ec4cf7308a586 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12935,6 +12935,14 @@ static void task_change_group_fair(struct task_struct *p)
#endif
set_task_rq(p, task_cpu(p));
attach_task_cfs_rq(p);
+
+ if (!cfs_bandwidth_used())
+ return;
+
+ if (task_needs_throttling(p))
+ task_throttle_setup(p);
+ else
+ task_throttle_cancel(p);
}
void free_fair_sched_group(struct task_group *tg)
--
2.43.0
* [RFC PATCH v3 08/10] sched/fair: Prepare migrate_task_rq_fair() for per-task throttling
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (6 preceding siblings ...)
2024-07-11 13:00 ` [RFC PATCH v3 07/10] sched/fair: Prepare task_change_group_fair() " Valentin Schneider
@ 2024-07-11 13:00 ` Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 09/10] sched/fair: Add a class->task_woken callback in preparation " Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace Valentin Schneider
9 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 13:00 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Later commits will change CFS bandwidth control throttling from a
per-cfs_rq basis to a per-task basis. This means special care needs to be
taken around any transition a task can have into and out of a cfs_rq.
To ease reviewing, the transitions are patched with dummy-helpers that are
implemented later on.
Add helpers to migrate_task_rq_fair() to cover CPU migration. Even if the
task stays within the same taskgroup, each cfs_rq has its own runtime
accounting, thus the task needs to be throttled or unthrottled
accordingly.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec4cf7308a586..b2242307677ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5694,8 +5694,11 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttle_count;
}
+static inline bool task_has_throttle_work(struct task_struct *p) { return false; }
static inline bool task_needs_throttling(struct task_struct *p) { return false; }
+static inline bool task_needs_migrate_throttling(struct task_struct *p, unsigned int dst_cpu) { return false; }
static inline void task_throttle_setup(struct task_struct *p) { }
+static inline void task_throttle_cancel_migrate(struct task_struct *p, int dst_cpu) { }
static inline void task_throttle_cancel(struct task_struct *p) { }
/*
@@ -6626,8 +6629,11 @@ static inline int throttled_lb_pair(struct task_group *tg,
return 0;
}
+static inline bool task_has_throttle_work(struct task_struct *p) { return false; }
static inline bool task_needs_throttling(struct task_struct *p) { return false; }
+static inline bool task_needs_migrate_throttling(struct task_struct *p, unsigned int dst_cpu) { return false; }
static inline void task_throttle_setup(struct task_struct *p) { }
+static inline void task_throttle_cancel_migrate(struct task_struct *p, int dst_cpu) { }
static inline void task_throttle_cancel(struct task_struct *p) { }
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8308,6 +8314,24 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
se->avg.last_update_time = 0;
update_scan_period(p, new_cpu);
+
+ if (!cfs_bandwidth_used())
+ return;
+ /*
+ * When the runtime within a cfs_bandwidth is depleted, all underlying
+ * cfs_rq's can have (approximately) sched_cfs_bandwidth_slice() runtime
+ * remaining.
+ *
+ * This means all tg->cfs_rq[]'s do not get throttled at the exact same
+ * time: some may still have a bit of runtime left. Thus, even if the
+ * task is staying within the same cgroup, and under the same
+ * cfs_bandwidth, the cfs_rq it migrates to might have a different
+ * throttle status - resync is needed.
+ */
+ if (task_needs_migrate_throttling(p, new_cpu))
+ task_throttle_setup(p);
+ else if (task_has_throttle_work(p))
+ task_throttle_cancel_migrate(p, new_cpu);
}
static void task_dead_fair(struct task_struct *p)
--
2.43.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [RFC PATCH v3 09/10] sched/fair: Add a class->task_woken callback in preparation for per-task throttling
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (7 preceding siblings ...)
2024-07-11 13:00 ` [RFC PATCH v3 08/10] sched/fair: Prepare migrate_task_rq_fair() " Valentin Schneider
@ 2024-07-11 13:00 ` Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace Valentin Schneider
9 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 13:00 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Later commits will change CFS bandwidth control throttling from a
per-cfs_rq basis to a per-task basis. This means special care needs to be
taken around any transition a task can have into and out of a cfs_rq.
To ease review, the transitions are patched with dummy helpers that are
implemented later on.
Add a class->task_woken callback to handle tasks being woken into
potentially throttled cfs_rq's. Conversely, a task flagged for
throttle-at-kernel-exit may block; if runtime is replenished by the time
it wakes up, its pending throttle must be removed.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b2242307677ca..0cec3e70f1277 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5776,6 +5776,17 @@ static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
/* Write me */
}
+static void task_woken_fair(struct rq *rq, struct task_struct *p)
+{
+ if (!cfs_bandwidth_used())
+ return;
+
+ if (task_needs_throttling(p))
+ task_throttle_setup(p);
+ else
+ task_throttle_cancel(p);
+}
+
void init_cfs_throttle_work(struct task_struct *p)
{
/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
@@ -13288,6 +13299,10 @@ DEFINE_SCHED_CLASS(fair) = {
.task_change_group = task_change_group_fair,
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ .task_woken = task_woken_fair,
+#endif
+
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_fair,
#endif
--
2.43.0
* [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (8 preceding siblings ...)
2024-07-11 13:00 ` [RFC PATCH v3 09/10] sched/fair: Add a class->task_woken callback in preparation " Valentin Schneider
@ 2024-07-11 13:00 ` Valentin Schneider
2024-07-12 17:05 ` Peter Zijlstra
` (3 more replies)
9 siblings, 4 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-11 13:00 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
As reported in [1], CFS bandwidth throttling is a source of headaches in
PREEMPT_RT: generally speaking, a throttled CFS task can hold locks that
prevent ksoftirqd from running, which in turn prevents replenishing and
unthrottling the cfs_rq of said CFS task.
Peter mentioned that there have been discussions on changing /when/ the
throttling happens: rather than doing it immediately upon updating the
runtime statistics and noticing the cfs_rq has depleted its quota, we wait
until the task is about to return to userspace.
This approach also benefits !PREEMPT_RT, as it reduces latency caused by
throttled tasks owning contended (kernel) resources.
Concept
=======
Upon throttling a cfs_rq, all tasks already enqueued get a task_work added,
which lets the actual throttling happen in exit_to_user_mode().
Any new task migrated to or enqueued on such a cfs_rq similarly gets the
task_work added.
Previous patches have added helpers for all the relevant locations where
the task_work may need to be either added or removed depending on the state
of the cfs_rq the task is (to be) enqueued on:
o sched_class change
o cgroup migration
o CPU migration
o task wakeup
Upon unthrottling, tasks are enqueued back onto their respective
cfs_rq. Unlike the previous throttling implementation, cfs_rq's can be
unthrottled while in a half-throttled state (i.e. some tasks have been
removed from them, while others are still enqueued and runnable as they
haven't reached exit_to_user_mode() yet), so the unthrottling process is a
bit more involved, especially when it comes to maintaining *nr_running fields.
Clocks
======
Correctly handling the different cfs_rq->throttled_clock* fields is
tricky: unlike the current upstream approach, where all tasks of a cfs_rq
are throttled at the exact same time, here each task gets throttled at its
own, not-known-beforehand time.
For instance, for ->throttled_clock_pelt we would ideally want a per-task
snapshot of when the task really gets throttled in exit_to_user_mode(),
rather than a single snapshot of when the cfs_rq runs out of runtime.
This isn't implemented here; ->throttled_clock_pelt is set when the
cfs_rq runs out of runtime, which means the "grace period" given to the
cfs_rq's tasks on their way to exit_to_user_mode() isn't accounted for.
Notable behaviour changes
=========================
Once a cfs_rq is ->throttled, its tasks can continue running until they hit
exit_to_user_mode(). This means they can keep draining further runtime
from their cfs_rq, which can end up draining more than one period's worth
of runtime.
I've tested a 10ms runtime / 100ms period cgroup with an always-running
task: upstream gets a "clean" periodic pattern of 10ms of runtime every
100ms, whereas this gets something more like 40ms of runtime every 400ms.
[1]: https://lore.kernel.org/all/20231031160120.GE15024@noisy.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 438 ++++++++++++++++++++++++++++++------------
kernel/sched/sched.h | 4 +
3 files changed, 320 insertions(+), 123 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 99a1e77d769db..29b9334738af1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ struct task_struct {
struct task_group *sched_task_group;
struct callback_head sched_throttle_work;
#ifdef CONFIG_CFS_BANDWIDTH
+ struct list_head throttle_node;
struct irq_work unthrottle_irq_work;
#endif
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0cec3e70f1277..08cf7343aedb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -54,6 +54,7 @@
#include "sched.h"
#include "stats.h"
#include "autogroup.h"
+#include "task_work_sched.h"
/*
* The initial- and re-scaling of tunables is configurable
@@ -5694,12 +5695,114 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttle_count;
}
-static inline bool task_has_throttle_work(struct task_struct *p) { return false; }
-static inline bool task_needs_throttling(struct task_struct *p) { return false; }
-static inline bool task_needs_migrate_throttling(struct task_struct *p, unsigned int dst_cpu) { return false; }
-static inline void task_throttle_setup(struct task_struct *p) { }
-static inline void task_throttle_cancel_migrate(struct task_struct *p, int dst_cpu) { }
-static inline void task_throttle_cancel(struct task_struct *p) { }
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+ return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline bool task_needs_throttling(struct task_struct *p)
+{
+ return throttled_hierarchy(cfs_rq_of(&p->se));
+}
+
+static inline bool task_needs_migrate_throttling(struct task_struct *p, unsigned int dst_cpu)
+{
+ return throttled_hierarchy(task_group(p)->cfs_rq[dst_cpu]);
+}
+
+static inline bool task_is_throttled(struct task_struct *p)
+{
+ return !list_empty(&p->throttle_node);
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+ /*
+ * Kthreads and exiting tasks don't return to userspace, so adding the
+ * work is pointless
+ */
+ if (!(p->flags & (PF_EXITING | PF_KTHREAD)))
+ task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
+static void throttle_cfs_rq_work(struct callback_head *work);
+static inline void task_throttle_do_cancel_work(struct task_struct *p)
+{
+ /*
+ * If this returns NULL, it means the work got run, which per
+ * this being called is a bug: the task_work throttled the
+ * task when it didn't need to be.
+ */
+ WARN_ON_ONCE(!task_work_cancel_locked(p, throttle_cfs_rq_work));
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+}
+
+static inline void task_throttle_cancel_work(struct task_struct *p, int dst_cpu)
+{
+ /*
+ * The calling context may be holding p->pi_lock, which is also acquired
+ * by task_work_cancel_match().
+ *
+ * Lock recursion is prevented by punting the work cancellation to the
+ * next IRQ enable. This is sent to the destination CPU rather than
+ * >this< CPU to prevent the task from resuming execution and getting
+ * throttled in its return to userspace.
+ */
+ irq_work_queue_on(&p->unthrottle_irq_work, dst_cpu);
+}
+
+static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
+{
+ struct task_struct *p = container_of(work, struct task_struct, unthrottle_irq_work);
+ int cpu = raw_smp_processor_id();
+
+ CLASS(task_rq_lock, rq_guard)(p);
+ WARN_ON_ONCE(task_cpu(p) != cpu);
+
+ if (task_has_throttle_work(p) && !task_needs_throttling(p))
+ task_throttle_do_cancel_work(p);
+}
+
+static inline void task_throttle_setup(struct task_struct *p)
+{
+ /*
+ * If already throttled-in-userspace, just transfer the throttle_node
+ * link to the new cfs_rq
+ *
+ * Else, if not yet throttled, set up the work. Also, the task may be
+ * running in userspace (e.g. this is called from sched_move_task()),
+ * so make sure it is running in kernelspace to get the kernel-exit
+ * throttle.
+ */
+ if (task_is_throttled(p))
+ list_move(&p->throttle_node, &cfs_rq_of(&p->se)->throttled_limbo_list);
+ else if (!task_has_throttle_work(p))
+ task_throttle_setup_work(p);
+}
+
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
+static inline void __task_throttle_cancel(struct task_struct *p, unsigned int cpu)
+{
+ /*
+ * Task mustn't be throttled; either:
+ * o it's already throttled-in-userspace, unthrottle it
+ * o it has the task_work installed, remove it
+ */
+ if (task_is_throttled(p)) {
+ list_del_init(&p->throttle_node);
+ enqueue_task_fair(cpu_rq(cpu), p, ENQUEUE_WAKEUP);
+ } else if (task_has_throttle_work(p)) {
+ task_throttle_cancel_work(p, cpu);
+ }
+}
+static inline void task_throttle_cancel(struct task_struct *p)
+{
+ __task_throttle_cancel(p, task_cpu(p));
+}
+static inline void task_throttle_cancel_migrate(struct task_struct *p, unsigned int dst_cpu)
+{
+ __task_throttle_cancel(p, dst_cpu);
+}
/*
* Ensure that neither of the group entities corresponding to src_cpu or
@@ -5722,35 +5825,107 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct sched_entity *se = tg->se[cpu_of(rq)];
+ struct cfs_rq *pcfs_rq = cfs_rq_of(se);
+ long task_delta = 0, idle_task_delta = 0;
+ struct task_struct *p, *tmp;
cfs_rq->throttle_count--;
- if (!cfs_rq->throttle_count) {
- cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
- cfs_rq->throttled_clock_pelt;
+ if (cfs_rq->throttle_count)
+ return 0;
- /* Add cfs_rq with load or one or more already running entities to the list */
- if (!cfs_rq_is_decayed(cfs_rq))
- list_add_leaf_cfs_rq(cfs_rq);
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+
+ /* Add cfs_rq with load or one or more already running entities to the list */
+ if (!cfs_rq_is_decayed(cfs_rq))
+ list_add_leaf_cfs_rq(cfs_rq);
- if (cfs_rq->throttled_clock_self) {
- u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+ if (cfs_rq->throttled_clock_self) {
+ u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
- cfs_rq->throttled_clock_self = 0;
+ cfs_rq->throttled_clock_self = 0;
- if (SCHED_WARN_ON((s64)delta < 0))
- delta = 0;
+ if (SCHED_WARN_ON((s64)delta < 0))
+ delta = 0;
- cfs_rq->throttled_clock_self_time += delta;
- }
+ cfs_rq->throttled_clock_self_time += delta;
+ }
+
+ /*
+ * Re-enqueue the tasks that have been throttled at this level.
+ *
+ * The task count is up-propagated via ->unthrottled_*h_nr_running,
+ * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
+ * might happen when a cfs_rq still has some tasks enqueued, either still
+ * making their way to userspace, or freshly migrated to it.
+ */
+ list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+ struct sched_entity *pse = &p->se;
+
+ list_del_init(&p->throttle_node);
+
+ enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
+ task_delta++;
+ idle_task_delta += task_has_idle_policy(p);
+ }
+
+ /*
+ * Account tasks woken up in children; by this point all direct children
+ * have been visited.
+ */
+ task_delta += cfs_rq->unthrottled_h_nr_running;
+ idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running;
+
+ cfs_rq->h_nr_running += task_delta;
+ cfs_rq->idle_h_nr_running += idle_task_delta;
+
+ /*
+ * unthrottle_cfs_rq() needs a value to up-propagate above the
+ * freshly unthrottled cfs_rq.
+ */
+ cfs_rq->unthrottled_h_nr_running = task_delta;
+ cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta;
+
+ /*
+ * Accumulate the delta in the parent's stash. Once all its children
+ * (i.e. all of this cfs_rq's siblings) have been visited, this value
+ * will be stable and used for its own count update.
+ */
+ pcfs_rq->unthrottled_h_nr_running += task_delta;
+ pcfs_rq->unthrottled_idle_h_nr_running += idle_task_delta;
+
+ /*
+ * If the cfs_rq became empty during throttling, then we dequeued
+ * it. It needs to be put back in the hierarchy if it or any of
+ * its children have now-unthrottled tasks.
+ */
+ if (!se->on_rq && (cfs_rq->h_nr_running || cfs_rq->idle_h_nr_running)) {
+ enqueue_entity(pcfs_rq, se, ENQUEUE_WAKEUP);
+ } else {
+ update_load_avg(pcfs_rq, se, UPDATE_TG);
+ se_update_runnable(se);
}
return 0;
}
+static int tg_unthrottle_clear_up(struct task_group *tg, void *data)
+{
+ struct rq *rq = data;
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+ cfs_rq->unthrottled_h_nr_running = 0;
+ cfs_rq->unthrottled_idle_h_nr_running = 0;
+
+ return 0;
+}
+
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct rb_node *node;
+ struct task_struct *p;
/* group is entering throttled state, stop time */
if (!cfs_rq->throttle_count) {
@@ -5763,17 +5938,118 @@ static int tg_throttle_down(struct task_group *tg, void *data)
}
cfs_rq->throttle_count++;
+ /*
+ * If we've already visited this cfs_rq (e.g. it ran out of its own
+ * runtime sometime earlier and hasn't had a replenish yet), then
+ * there's nothing more to do.
+ */
+ if (cfs_rq->throttle_count > 1)
+ return 0;
+
+ WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
+ /*
+ * rq_lock is held, current is (obviously) executing this in kernelspace.
+ *
+ * All other tasks enqueued on this rq have their saved PC at the
+ * context switch, so they will go through the kernel before returning
+ * to userspace. Thus, there are no tasks-in-userspace to handle, just
+ * install the task_work on all of them.
+ */
+ node = rb_first(&cfs_rq->tasks_timeline.rb_root);
+ while (node) {
+ struct sched_entity *se = __node_2_se(node);
+
+ if (!entity_is_task(se))
+ goto next;
+
+ p = task_of(se);
+
+ if (!task_has_throttle_work(p))
+ task_throttle_setup_work(p);
+next:
+ node = rb_next(node);
+ }
+
return 0;
}
-static void throttle_cfs_rq_work(struct callback_head *work)
+static void throttle_one_task(struct cfs_rq *cfs_rq, struct task_struct *p)
{
+ long task_delta, idle_task_delta;
+ struct sched_entity *se = &p->se;
+
+ list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ task_delta = 1;
+ idle_task_delta = cfs_rq_is_idle(cfs_rq) ? 1 : 0;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+
+ if (!se->on_rq)
+ return;
+
+ dequeue_entity(cfs_rq, se, DEQUEUE_SLEEP);
+ cfs_rq->h_nr_running -= task_delta;
+ cfs_rq->idle_h_nr_running -= idle_task_delta;
+
+ if (cfs_rq->load.weight) {
+ /* Avoid re-evaluating load for this entity: */
+ se = parent_entity(se);
+ break;
+ }
+ }
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ /* throttled entity or throttle-on-deactivate */
+ if (!se->on_rq)
+ goto throttle_done;
+
+ update_load_avg(cfs_rq, se, 0);
+ se_update_runnable(se);
+ cfs_rq->h_nr_running -= task_delta;
+ cfs_rq->idle_h_nr_running -= idle_task_delta;
+ }
+
+throttle_done:
+ /* At this point se is NULL and we are at root level */
+ sub_nr_running(rq_of(cfs_rq), 1);
}
-static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
+static void throttle_cfs_rq_work(struct callback_head *work)
{
- /* Write me */
+ struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+ struct sched_entity *se;
+ struct rq *rq;
+ struct cfs_rq *cfs_rq;
+
+ WARN_ON_ONCE(p != current);
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+ /*
+ * If task is exiting, then there won't be a return to userspace, so we
+ * don't have to bother with any of this.
+ */
+ if ((p->flags & PF_EXITING))
+ return;
+
+ CLASS(task_rq_lock, rq_guard)(p);
+ rq = rq_guard.rq;
+ se = &p->se;
+ cfs_rq = cfs_rq_of(se);
+
+ /*
+ * If not in limbo, then either replenish has happened or this task got
+ * migrated out of the throttled cfs_rq, move along
+ */
+ if (!cfs_rq->throttle_count)
+ return;
+
+ update_rq_clock(rq);
+
+ throttle_one_task(cfs_rq, p);
+
+ resched_curr(rq);
}
static void task_woken_fair(struct rq *rq, struct task_struct *p)
@@ -5792,6 +6068,7 @@ void init_cfs_throttle_work(struct task_struct *p)
/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
p->sched_throttle_work.next = &p->sched_throttle_work;
init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+ INIT_LIST_HEAD(&p->throttle_node);
p->unthrottle_irq_work = IRQ_WORK_INIT_HARD(task_throttle_cancel_irq_work_fn);
}
@@ -5799,8 +6076,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
- long task_delta, idle_task_delta, dequeue = 1;
+ long dequeue = 1;
raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5818,70 +6094,24 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
list_add_tail_rcu(&cfs_rq->throttled_list,
&cfs_b->throttled_cfs_rq);
}
+
raw_spin_unlock(&cfs_b->lock);
if (!dequeue)
return false; /* Throttle no longer required. */
- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
- /* freeze hierarchy runnable averages while throttled */
+ /* Flag the hierarchy for throttle-at-user-entry */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();
- task_delta = cfs_rq->h_nr_running;
- idle_task_delta = cfs_rq->idle_h_nr_running;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
-
- qcfs_rq->h_nr_running -= task_delta;
- qcfs_rq->idle_h_nr_running -= idle_task_delta;
-
- if (qcfs_rq->load.weight) {
- /* Avoid re-evaluating load for this entity: */
- se = parent_entity(se);
- break;
- }
- }
-
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- update_load_avg(qcfs_rq, se, 0);
- se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
-
- qcfs_rq->h_nr_running -= task_delta;
- qcfs_rq->idle_h_nr_running -= idle_task_delta;
- }
-
- /* At this point se is NULL and we are at root level*/
- sub_nr_running(rq, task_delta);
-
-done:
- /*
- * Note: distribution will already see us throttled via the
- * throttled-list. rq->lock protects completion.
- */
cfs_rq->throttled = 1;
+
SCHED_WARN_ON(cfs_rq->throttled_clock);
if (cfs_rq->nr_running)
cfs_rq->throttled_clock = rq_clock(rq);
- return true;
+
+ return false;
}
void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -5922,25 +6152,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
goto unthrottle_throttle;
}
- task_delta = cfs_rq->h_nr_running;
- idle_task_delta = cfs_rq->idle_h_nr_running;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
- if (se->on_rq)
- break;
- enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
+ if (cfs_rq->throttle_count)
+ return;
- qcfs_rq->h_nr_running += task_delta;
- qcfs_rq->idle_h_nr_running += idle_task_delta;
+ /*
+ * cfs_rq's below us may not have been fully emptied out, so we can't rely
+ * directly on ->h_nr_running. Use the stash instead.
+ */
+ task_delta = cfs_rq->unthrottled_h_nr_running;
+ idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
- }
+ walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq);
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
@@ -5948,15 +6170,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
update_load_avg(qcfs_rq, se, UPDATE_TG);
se_update_runnable(se);
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
-
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
}
/* At this point se is NULL and we are at root level*/
@@ -6455,6 +6670,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+ INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
}
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@ -6822,10 +7038,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
-
flags = ENQUEUE_WAKEUP;
}
@@ -6841,10 +7053,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
}
/* At this point se is NULL and we are at root level*/
@@ -6867,7 +7075,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
check_update_overutilized_status(rq);
-enqueue_throttle:
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
@@ -6900,10 +7107,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto dequeue_throttle;
-
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -6932,10 +7135,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto dequeue_throttle;
-
}
/* At this point se is NULL and we are at root level*/
@@ -6945,7 +7144,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
rq->next_balance = jiffies;
-dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}
@@ -12815,9 +13013,6 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- if (cfs_rq_throttled(cfs_rq))
- return;
-
if (!throttled_hierarchy(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
@@ -12829,9 +13024,6 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);
- if (cfs_rq_throttled(cfs_rq))
- break;
-
if (!throttled_hierarchy(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 943bca8263ffe..f4a00b1dd9505 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -690,8 +690,12 @@ struct cfs_rq {
u64 throttled_clock_self_time;
int throttled;
int throttle_count;
+ /* Temp storage for updating the counts during unthrottling */
+ unsigned int unthrottled_h_nr_running;
+ unsigned int unthrottled_idle_h_nr_running;
struct list_head throttled_list;
struct list_head throttled_csd_list;
+ struct list_head throttled_limbo_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
--
2.43.0
* Re: [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel()
2024-07-11 12:59 ` [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel() Valentin Schneider
@ 2024-07-12 10:35 ` Oleg Nesterov
2024-07-12 15:20 ` Peter Zijlstra
1 sibling, 0 replies; 28+ messages in thread
From: Oleg Nesterov @ 2024-07-12 10:35 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Phil Auld, Clark Williams, Tomas Glozar,
Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Alexander Gordeev, Catalin Marinas,
Arnd Bergmann, Guo Ren, Palmer Dabbelt, Andrew Morton, Jens Axboe
On 07/11, Valentin Schneider wrote:
>
> Later commits will need to issue a task_work_cancel() from within the
> scheduler with the task's ->pi_lock held.
>
> Add a _locked variant that expects p->pi_lock to be held. Expose it in a
> separate scheduler header file, as this really is a scheduler-only
> interface.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> kernel/sched/task_work_sched.h | 14 +++++++
> kernel/task_work.c | 67 ++++++++++++++++++++++++++--------
> 2 files changed, 66 insertions(+), 15 deletions(-)
> create mode 100644 kernel/sched/task_work_sched.h
I am not sure the new task_work_sched.h makes sense, but I won't argue.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
* Re: [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header
2024-07-11 12:59 ` [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header Valentin Schneider
@ 2024-07-12 15:15 ` Peter Zijlstra
2024-07-15 17:57 ` Valentin Schneider
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 15:15 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 02:59:55PM +0200, Valentin Schneider wrote:
> A later commit will add a struct irq_work member to struct task_struct, which
> would require <linux/sched.h> to include the definition of struct
> irq_work.
>
> Thanks to header dependency hell, including <linux/irq_work.h> in <linux/sched.h>
> results in defining inline helpers using not-yet-defined fields (mm_struct,
> task_struct, various task states...).
>
> Break off the definition of struct rcuwait into its own header file.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> include/linux/rcuwait.h | 9 ++-------
> include/linux/rcuwait_types.h | 16 ++++++++++++++++
> 2 files changed, 18 insertions(+), 7 deletions(-)
> create mode 100644 include/linux/rcuwait_types.h
>
> diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
> index 27343424225cf..1f1ca7d38cdf8 100644
> --- a/include/linux/rcuwait.h
> +++ b/include/linux/rcuwait.h
> @@ -5,17 +5,12 @@
> #include <linux/rcupdate.h>
> #include <linux/sched/signal.h>
>
> +#include <linux/rcuwait_types.h>
> +
> /*
> * rcuwait provides a way of blocking and waking up a single
> * task in an rcu-safe manner.
> - *
> - * The only time @task is non-nil is when a user is blocked (or
> - * checking if it needs to) on a condition, and reset as soon as we
> - * know that the condition has succeeded and are awoken.
> */
> -struct rcuwait {
> - struct task_struct __rcu *task;
> -};
>
> #define __RCUWAIT_INITIALIZER(name) \
> { .task = NULL, }
> diff --git a/include/linux/rcuwait_types.h b/include/linux/rcuwait_types.h
> new file mode 100644
> index 0000000000000..60a4385a2c368
> --- /dev/null
> +++ b/include/linux/rcuwait_types.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_RCUWAIT_TYPES_H_
> +#define _LINUX_RCUWAIT_TYPES_H_
> +
> +#include <linux/sched.h>
> +
> +/*
> + * The only time @task is non-nil is when a user is blocked (or
> + * checking if it needs to) on a condition, and reset as soon as we
> + * know that the condition has succeeded and are awoken.
> + */
> +struct rcuwait {
> + struct task_struct __rcu *task;
> +};
> +
> +#endif
Can't we simply stick this in include/linux/types.h ?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH v3 02/10] irq_work: Split type definition to its own header
2024-07-11 12:59 ` [RFC PATCH v3 02/10] irq_work: " Valentin Schneider
@ 2024-07-12 15:17 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 15:17 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 02:59:56PM +0200, Valentin Schneider wrote:
> A later commit will add a struct irq_work member to struct task_struct, which
> would require <linux/sched.h> to include the definition of struct
> irq_work.
>
> Thanks to header dependency hell, including <linux/irq_work.h> in <linux/sched.h>
> results in defining inline helpers using not-yet-defined fields (mm_struct,
> task_struct, various task states...).
>
> Break off the definition of struct irq_work into its own header file.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> include/linux/irq_work.h | 8 ++------
> include/linux/irq_work_types.h | 14 ++++++++++++++
> 2 files changed, 16 insertions(+), 6 deletions(-)
> create mode 100644 include/linux/irq_work_types.h
>
> diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
> index 136f2980cba30..7f6d2af360d91 100644
> --- a/include/linux/irq_work.h
> +++ b/include/linux/irq_work.h
> @@ -5,6 +5,8 @@
> #include <linux/smp_types.h>
> #include <linux/rcuwait.h>
>
> +#include <linux/irq_work_types.h>
> +
> /*
> * An entry can be in one of four states:
> *
> @@ -14,12 +16,6 @@
> * busy NULL, 2 -> {free, claimed} : callback in progress, can be claimed
> */
>
> -struct irq_work {
> - struct __call_single_node node;
> - void (*func)(struct irq_work *);
> - struct rcuwait irqwait;
> -};
> -
> #define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){ \
> .node = { .u_flags = (_flags), }, \
> .func = (_func), \
> diff --git a/include/linux/irq_work_types.h b/include/linux/irq_work_types.h
> new file mode 100644
> index 0000000000000..108cbc514733b
> --- /dev/null
> +++ b/include/linux/irq_work_types.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_IRQ_WORK_TYPES_H
> +#define _LINUX_IRQ_WORK_TYPES_H
> +
> +#include <linux/smp_types.h>
> +#include <linux/rcuwait_types.h>
> +
> +struct irq_work {
> + struct __call_single_node node;
> + void (*func)(struct irq_work *);
> + struct rcuwait irqwait;
> +};
> +
> +#endif
Since smp_types.h already mentions IRQ_WORK, given how I tangled the
whole thing together, should we just put this struct there as well?
* Re: [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel()
2024-07-11 12:59 ` [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel() Valentin Schneider
2024-07-12 10:35 ` Oleg Nesterov
@ 2024-07-12 15:20 ` Peter Zijlstra
2024-07-15 17:57 ` Valentin Schneider
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 15:20 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 02:59:57PM +0200, Valentin Schneider wrote:
> Later commits will need to issue a task_work_cancel() from within the
> scheduler with the task's ->pi_lock held.
>
> Add a _locked variant that expects p->pi_lock to be held. Expose it in a
> separate scheduler header file, as this really is a scheduler-only
> interface.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> kernel/sched/task_work_sched.h | 14 +++++++
> kernel/task_work.c | 67 ++++++++++++++++++++++++++--------
> 2 files changed, 66 insertions(+), 15 deletions(-)
> create mode 100644 kernel/sched/task_work_sched.h
>
> diff --git a/kernel/sched/task_work_sched.h b/kernel/sched/task_work_sched.h
> new file mode 100644
> index 0000000000000..e235da456427f
> --- /dev/null
> +++ b/kernel/sched/task_work_sched.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Scheduler internal task_work methods
> + */
> +#ifndef _KERNEL_TASK_WORK_SCHED_H
> +#define _KERNEL_TASK_WORK_SCHED_H
> +
> +#include <linux/task_work.h>
> +#include <linux/sched.h>
> +
> +struct callback_head *
> +task_work_cancel_locked(struct task_struct *task, task_work_func_t func);
> +
> +#endif
Do we really need that exposed? Can't we squirrel that away in
kernel/sched/sched.h and forget about it?
> @@ -74,33 +76,20 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
> return 0;
> }
>
> -/**
> - * task_work_cancel_match - cancel a pending work added by task_work_add()
> - * @task: the task which should execute the work
> - * @match: match function to call
> - * @data: data to be passed in to match function
> - *
> - * RETURNS:
> - * The found work or NULL if not found.
> - */
> -struct callback_head *
> -task_work_cancel_match(struct task_struct *task,
> +static struct callback_head *
> +task_work_cancel_match_locked(struct task_struct *task,
> bool (*match)(struct callback_head *, void *data),
> void *data)
> {
> struct callback_head **pprev = &task->task_works;
> struct callback_head *work;
> - unsigned long flags;
>
> - if (likely(!task_work_pending(task)))
> - return NULL;
> /*
> * If cmpxchg() fails we continue without updating pprev.
> * Either we raced with task_work_add() which added the
> * new entry before this work, we will find it again. Or
> * we raced with task_work_run(), *pprev == NULL/exited.
> */
> - raw_spin_lock_irqsave(&task->pi_lock, flags);
> work = READ_ONCE(*pprev);
> while (work) {
> if (!match(work, data)) {
> @@ -109,6 +98,32 @@ task_work_cancel_match(struct task_struct *task,
> } else if (try_cmpxchg(pprev, &work, work->next))
> break;
> }
> +
> + return work;
> +}
> @@ -136,6 +151,28 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
> return task_work_cancel_match(task, task_work_func_match, func);
> }
>
> +/**
> + * task_work_cancel_locked - cancel a pending work added by task_work_add()
> + * @task: the task which should execute the work
> + * @func: identifies the work to remove
> + *
> + * Find the last queued pending work with ->func == @func and remove
> + * it from queue.
> + *
> + * RETURNS:
> + * The found work or NULL if not found.
> + */
> +struct callback_head *
> +task_work_cancel_locked(struct task_struct *task, task_work_func_t func)
> +{
> + lockdep_assert_held(&task->pi_lock);
I'm thinking that lockdep_assert wants to live in your _locked function
above.
> + if (likely(!task_work_pending(task)))
> + return NULL;
> +
> + return task_work_cancel_match_locked(task, task_work_func_match, func);
> +}
> +
> /**
> * task_work_run - execute the works added by task_work_add()
> *
> --
> 2.43.0
>
* Re: [RFC PATCH v3 04/10] sched/fair: Introduce sched_throttle_work
2024-07-11 12:59 ` [RFC PATCH v3 04/10] sched/fair: Introduce sched_throttle_work Valentin Schneider
@ 2024-07-12 15:21 ` Peter Zijlstra
2024-07-15 17:58 ` Valentin Schneider
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 15:21 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 02:59:58PM +0200, Valentin Schneider wrote:
> +void init_cfs_throttle_work(struct task_struct *p)
> +{
> + /* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
> + p->sched_throttle_work.next = &p->sched_throttle_work;
> + init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
Yes, init_task_work() does not write .next, but can we please flip these
two statements and avoid me having to double check that every time I
see them? :-)
> +}
* Re: [RFC PATCH v3 06/10] sched/fair: Prepare switched_from & switched_to for per-task throttling
2024-07-11 13:00 ` [RFC PATCH v3 06/10] sched/fair: Prepare switched_from & switched_to for per-task throttling Valentin Schneider
@ 2024-07-12 15:26 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 15:26 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 03:00:00PM +0200, Valentin Schneider wrote:
> Later commits will change CFS bandwidth control throttling from a
> per-cfs_rq basis to a per-task basis. This means special care needs to be
> taken around any transition a task can have into and out of a cfs_rq.
>
> To ease reviewing, the transitions are patched with dummy-helpers that are
> implemented later on.
>
> Add helpers to switched_from_fair() and switched_to_fair() to cover class
> changes. If switching from CFS, a task may need to be unthrottled. If
> switching to CFS, a task may need to be throttled.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> kernel/sched/fair.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095357bd17f0e..acac0829c71f3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5694,6 +5694,10 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> return cfs_bandwidth_used() && cfs_rq->throttle_count;
> }
>
> +static inline bool task_needs_throttling(struct task_struct *p) { return false; }
> +static inline void task_throttle_setup(struct task_struct *p) { }
> +static inline void task_throttle_cancel(struct task_struct *p) { }
> +
> /*
> * Ensure that neither of the group entities corresponding to src_cpu or
> * dest_cpu are members of a throttled hierarchy when performing group
> @@ -6622,6 +6626,10 @@ static inline int throttled_lb_pair(struct task_group *tg,
> return 0;
> }
>
> +static inline bool task_needs_throttling(struct task_struct *p) { return false; }
> +static inline void task_throttle_setup(struct task_struct *p) { }
> +static inline void task_throttle_cancel(struct task_struct *p) { }
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
> static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> @@ -12847,11 +12855,15 @@ static void attach_task_cfs_rq(struct task_struct *p)
> static void switched_from_fair(struct rq *rq, struct task_struct *p)
> {
> detach_task_cfs_rq(p);
> + if (cfs_bandwidth_used())
> + task_throttle_cancel(p);
> }
>
> static void switched_to_fair(struct rq *rq, struct task_struct *p)
> {
> attach_task_cfs_rq(p);
> + if (cfs_bandwidth_used() && task_needs_throttling(p))
> + task_throttle_setup(p);
>
> set_task_max_allowed_capacity(p);
Other functions seem to have cfs_bandwidth_used() inside them, and not
bother the caller with this detail.
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-11 13:00 ` [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace Valentin Schneider
@ 2024-07-12 17:05 ` Peter Zijlstra
2024-07-12 17:10 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 17:05 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 03:00:04PM +0200, Valentin Schneider wrote:
> +static inline void task_throttle_cancel_work(struct task_struct *p, int dst_cpu)
> +{
> + /*
> + * The calling context may be holding p->pi_lock, which is also acquired
> + * by task_work_cancel_match().
> + *
> + * Lock recursion is prevented by punting the work cancellation to the
> + * next IRQ enable. This is sent to the destination CPU rather than
> + * >this< CPU to prevent the task from resuming execution and getting
> + * throttled in its return to userspace.
> + */
You're having whitespace trouble there... :-)
> + irq_work_queue_on(&p->unthrottle_irq_work, dst_cpu);
> +}
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-11 13:00 ` [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace Valentin Schneider
2024-07-12 17:05 ` Peter Zijlstra
@ 2024-07-12 17:10 ` Peter Zijlstra
2024-07-12 17:43 ` Peter Zijlstra
2024-07-19 0:25 ` Benjamin Segall
3 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 17:10 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 03:00:04PM +0200, Valentin Schneider wrote:
> +static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
> +{
> + struct task_struct *p = container_of(work, struct task_struct, unthrottle_irq_work);
> + int cpu = raw_smp_processor_id();
> +
> + CLASS(task_rq_lock, rq_guard)(p);
guard(task_rq_lock)(p);
> + WARN_ON_ONCE(task_cpu(p) != cpu);
> +
> + if (task_has_throttle_work(p) && !task_needs_throttling(p))
> + task_throttle_do_cancel_work(p);
> +}
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-11 13:00 ` [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace Valentin Schneider
2024-07-12 17:05 ` Peter Zijlstra
2024-07-12 17:10 ` Peter Zijlstra
@ 2024-07-12 17:43 ` Peter Zijlstra
2024-07-16 12:46 ` Valentin Schneider
2024-07-19 0:25 ` Benjamin Segall
3 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-07-12 17:43 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On Thu, Jul 11, 2024 at 03:00:04PM +0200, Valentin Schneider wrote:
> +static void throttle_one_task(struct cfs_rq *cfs_rq, struct task_struct *p)
> {
> + long task_delta, idle_task_delta;
> + struct sched_entity *se = &p->se;
> +
> + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>
> + task_delta = 1;
> + idle_task_delta = cfs_rq_is_idle(cfs_rq) ? 1 : 0;
> +
> + for_each_sched_entity(se) {
> + cfs_rq = cfs_rq_of(se);
> +
> + if (!se->on_rq)
> + return;
> +
> + dequeue_entity(cfs_rq, se, DEQUEUE_SLEEP);
> + cfs_rq->h_nr_running -= task_delta;
> + cfs_rq->idle_h_nr_running -= idle_task_delta;
> +
> + if (cfs_rq->load.weight) {
> + /* Avoid re-evaluating load for this entity: */
> + se = parent_entity(se);
> + break;
> + }
> + }
> +
> + for_each_sched_entity(se) {
> + cfs_rq = cfs_rq_of(se);
> + /* throttled entity or throttle-on-deactivate */
> + if (!se->on_rq)
> + goto throttle_done;
> +
> + update_load_avg(cfs_rq, se, 0);
> + se_update_runnable(se);
> + cfs_rq->h_nr_running -= task_delta;
> + cfs_rq->idle_h_nr_running -= idle_task_delta;
> + }
> +
> +throttle_done:
> + /* At this point se is NULL and we are at root level */
> + sub_nr_running(rq_of(cfs_rq), 1);
> }
I know you're just moving code around, but we should look if we can
share code between this and dequeue_task_fair().
I have patches around this in that eevdf series I should send out again,
I'll try and have a stab.
> -static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
> +static void throttle_cfs_rq_work(struct callback_head *work)
> {
> - /* Write me */
> + struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> + struct sched_entity *se;
> + struct rq *rq;
> + struct cfs_rq *cfs_rq;
> +
> + WARN_ON_ONCE(p != current);
> + p->sched_throttle_work.next = &p->sched_throttle_work;
> + /*
> + * If task is exiting, then there won't be a return to userspace, so we
> + * don't have to bother with any of this.
> + */
> + if ((p->flags & PF_EXITING))
> + return;
> +
> + CLASS(task_rq_lock, rq_guard)(p);
> + rq = rq_guard.rq;
The other way to write this is:
scoped_guard (task_rq_lock, p) {
struct rq *rq = scope.rq;
> + se = &p->se;
> + cfs_rq = cfs_rq_of(se);
> +
> + /*
> + * If not in limbo, then either replenish has happened or this task got
> + * migrated out of the throttled cfs_rq, move along
> + */
> + if (!cfs_rq->throttle_count)
> + return;
> +
> + update_rq_clock(rq);
> +
> + throttle_one_task(cfs_rq, p);
> +
> + resched_curr(rq);
}
> }
* Re: [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header
2024-07-12 15:15 ` Peter Zijlstra
@ 2024-07-15 17:57 ` Valentin Schneider
0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-15 17:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On 12/07/24 17:15, Peter Zijlstra wrote:
> On Thu, Jul 11, 2024 at 02:59:55PM +0200, Valentin Schneider wrote:
>> diff --git a/include/linux/rcuwait_types.h b/include/linux/rcuwait_types.h
>> new file mode 100644
>> index 0000000000000..60a4385a2c368
>> --- /dev/null
>> +++ b/include/linux/rcuwait_types.h
>> @@ -0,0 +1,16 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_RCUWAIT_TYPES_H_
>> +#define _LINUX_RCUWAIT_TYPES_H_
>> +
>> +#include <linux/sched.h>
>> +
>> +/*
>> + * The only time @task is non-nil is when a user is blocked (or
>> + * checking if it needs to) on a condition, and reset as soon as we
>> + * know that the condition has succeeded and are awoken.
>> + */
>> +struct rcuwait {
>> + struct task_struct __rcu *task;
>> +};
>> +
>> +#endif
>
> Can't we simplu stick this in include/linux/types.h ?
Huh, we can indeed. Silly me didn't realize that since all that struct
contains is a pointer, we don't need the embedded type definition for it to
be a complete type.
* Re: [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel()
2024-07-12 15:20 ` Peter Zijlstra
@ 2024-07-15 17:57 ` Valentin Schneider
0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-15 17:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On 12/07/24 17:20, Peter Zijlstra wrote:
> On Thu, Jul 11, 2024 at 02:59:57PM +0200, Valentin Schneider wrote:
>> Later commits will need to issue a task_work_cancel() from within the
>> scheduler with the task's ->pi_lock held.
>>
>> Add a _locked variant that expects p->pi_lock to be held. Expose it in a
>> separate scheduler header file, as this really is a scheduler-only
>> interface.
>>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>> kernel/sched/task_work_sched.h | 14 +++++++
>> kernel/task_work.c | 67 ++++++++++++++++++++++++++--------
>> 2 files changed, 66 insertions(+), 15 deletions(-)
>> create mode 100644 kernel/sched/task_work_sched.h
>>
>> diff --git a/kernel/sched/task_work_sched.h b/kernel/sched/task_work_sched.h
>> new file mode 100644
>> index 0000000000000..e235da456427f
>> --- /dev/null
>> +++ b/kernel/sched/task_work_sched.h
>> @@ -0,0 +1,14 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Scheduler internal task_work methods
>> + */
>> +#ifndef _KERNEL_TASK_WORK_SCHED_H
>> +#define _KERNEL_TASK_WORK_SCHED_H
>> +
>> +#include <linux/task_work.h>
>> +#include <linux/sched.h>
>> +
>> +struct callback_head *
>> +task_work_cancel_locked(struct task_struct *task, task_work_func_t func);
>> +
>> +#endif
>
>
> Do we really need that exposed? Can't we squirrel that away in
> kernel/sched/sched.h and forget about it?
>
Nah, that's not required. I thought a clean-cut header would be neater, but
given its single user, tossing that into sched.h looks better.
>> +struct callback_head *
>> +task_work_cancel_locked(struct task_struct *task, task_work_func_t func)
>> +{
>> + lockdep_assert_held(&task->pi_lock);
>
> I'm thinking that lockdep_assert wants to live in your _locked function
> above.
>
Quite so!
* Re: [RFC PATCH v3 04/10] sched/fair: Introduce sched_throttle_work
2024-07-12 15:21 ` Peter Zijlstra
@ 2024-07-15 17:58 ` Valentin Schneider
0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-15 17:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On 12/07/24 17:21, Peter Zijlstra wrote:
> On Thu, Jul 11, 2024 at 02:59:58PM +0200, Valentin Schneider wrote:
>
>> +void init_cfs_throttle_work(struct task_struct *p)
>> +{
>> + /* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
>> + p->sched_throttle_work.next = &p->sched_throttle_work;
>> + init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
>
> Yes, init_task_work() does not write .next, but can we please flip these
> two statements and avoid me having to double check that every time I
> see them? :-)
>
Easy enough :)
>> +}
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-12 17:43 ` Peter Zijlstra
@ 2024-07-16 12:46 ` Valentin Schneider
0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-16 12:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, rcu, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On 12/07/24 19:43, Peter Zijlstra wrote:
> On Thu, Jul 11, 2024 at 03:00:04PM +0200, Valentin Schneider wrote:
>
>> +static void throttle_one_task(struct cfs_rq *cfs_rq, struct task_struct *p)
>> {
>> + long task_delta, idle_task_delta;
>> + struct sched_entity *se = &p->se;
>> +
>> + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>>
>> + task_delta = 1;
>> + idle_task_delta = cfs_rq_is_idle(cfs_rq) ? 1 : 0;
>> +
>> + for_each_sched_entity(se) {
>> + cfs_rq = cfs_rq_of(se);
>> +
>> + if (!se->on_rq)
>> + return;
>> +
>> + dequeue_entity(cfs_rq, se, DEQUEUE_SLEEP);
>> + cfs_rq->h_nr_running -= task_delta;
>> + cfs_rq->idle_h_nr_running -= idle_task_delta;
>> +
>> + if (cfs_rq->load.weight) {
>> + /* Avoid re-evaluating load for this entity: */
>> + se = parent_entity(se);
>> + break;
>> + }
>> + }
>> +
>> + for_each_sched_entity(se) {
>> + cfs_rq = cfs_rq_of(se);
>> + /* throttled entity or throttle-on-deactivate */
>> + if (!se->on_rq)
>> + goto throttle_done;
>> +
>> + update_load_avg(cfs_rq, se, 0);
>> + se_update_runnable(se);
>> + cfs_rq->h_nr_running -= task_delta;
>> + cfs_rq->idle_h_nr_running -= idle_task_delta;
>> + }
>> +
>> +throttle_done:
>> + /* At this point se is NULL and we are at root level */
>> + sub_nr_running(rq_of(cfs_rq), 1);
>> }
>
> I know you're just moving code around, but we should look if we can
> share code between this and dequeue_task_fair().
>
> I have patches around this in that eevdf series I should send out again,
> I'll try and have a stab.
>
Looking at this again, couldn't this actually just be:
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
dequeue_task_fair(rq, p, DEQUEUE_SLEEP);
?
Because the main deltas between throttle_one_task() and dequeue_task_fair()
are (other than the list_add()):
o Caring about throttled entities (!se->on_rq) - which AFAICT can't happen
after this patch, since non-empty cfs_rq's are kept enqueued
o Load tracking faff (util_est, an extra update_cfs_group())
o The set_next_buddy() thing, which will always be a nop because of
throttled_hierarchy()
o A rq->next_balance update
o hrtick_update()
I think some of these are omitted from throttle_cfs_rq() because the whole
cfs_rq is being taken out, but here we are genuinely taking just one task
out, so I think this does want to be pretty much dequeue_task_fair().
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-11 13:00 ` [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace Valentin Schneider
` (2 preceding siblings ...)
2024-07-12 17:43 ` Peter Zijlstra
@ 2024-07-19 0:25 ` Benjamin Segall
2024-07-23 15:16 ` Valentin Schneider
3 siblings, 1 reply; 28+ messages in thread
From: Benjamin Segall @ 2024-07-19 0:25 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Peter Zijlstra, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Valentin Schneider <vschneid@redhat.com> writes:
> As reported in [1], CFS bandwidth throttling is a source of headaches in
> PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
> prevent ksoftirqd from running, which prevents replenishing & unthrottling
> the cfs_rq of said CFS task.
>
> Peter mentioned that there have been discussions on changing /when/ the
> throttling happens: rather than have it be done immediately upon updating
> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
> for the task to be about to return to userspace.
Sorry for taking a while to get to replying to this.
> Clocks
> ======
>
> Correctly handling the different cfs_rq->throttled_clock* is tricky, as
> unlike the current upstream approach where all tasks of a cfs_rq are
> throttled at the exact same time, here they each get throttled at a
> per-task, not-known-beforehand time.
>
> For instance, for the ->throttled_clock_pelt, ideally we would need a
> per-task snapshot of when the task gets really throttled in
> exit_to_user_mode(), rather than a single snapshot of when the cfs_rq runs
> out of runtime. This isn't implemented here. The ->throttled_clock_pelt is
> set when the cfs_rq runs out of runtime, which means the "grace period"
> given to the cfs_rq's tasks on their way to exit_to_user_mode() isn't
> accounted.
And I haven't checked it yet because I never remember how the whole
throttled clock thing works. :P
>
> Notable behaviour changes
> =========================
>
> Once a cfs_rq is ->throttled, its tasks can continue running until they hit
> exit_to_user_mode(). This means they can keep draining further runtime
> from their cfs_rq, which can end up draining more than one period's worth
> of runtime.
>
> I've tested a 10ms runtime / 100ms period cgroup with an always running
> task: upstream gets a "clean" periodic pattern of 10ms runtime every 100ms,
> whereas this gets something more like 40ms runtime every 400ms.
Hmm, this seems a little odd since TWA_RESUME does a kick_process.
> +static inline void task_throttle_setup_work(struct task_struct *p)
> +{
> + /*
> + * Kthreads and exiting tasks don't return to userspace, so adding the
> + * work is pointless
> + */
> + if (!(p->flags & (PF_EXITING | PF_KTHREAD)))
> + task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
> +}
> +
> +static void throttle_cfs_rq_work(struct callback_head *work);
> +static inline void task_throttle_do_cancel_work(struct task_struct *p)
> +{
> + /*
> + * If this returns NULL, it means the work got run, which per
> + * this being called is a bug: the task_work throttled the
> + * task when it didn't need to be.
> + */
> + WARN_ON_ONCE(!task_work_cancel_locked(p, throttle_cfs_rq_work));
> + p->sched_throttle_work.next = &p->sched_throttle_work;
> +}
> +
> +static inline void task_throttle_cancel_work(struct task_struct *p, int dst_cpu)
> +{
> + /*
> + * The calling context may be holding p->pi_lock, which is also acquired
> + * by task_work_cancel_match().
> + *
> + * Lock recursion is prevented by punting the work cancellation to the
> + * next IRQ enable. This is sent to the destination CPU rather than
> + * >this< CPU to prevent the task from resuming execution and getting
> + * throttled in its return to userspace.
> + */
> + irq_work_queue_on(&p->unthrottle_irq_work, dst_cpu);
> +}
> +
> +static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
> +{
> + struct task_struct *p = container_of(work, struct task_struct, unthrottle_irq_work);
> + int cpu = raw_smp_processor_id();
> +
> + CLASS(task_rq_lock, rq_guard)(p);
> + WARN_ON_ONCE(task_cpu(p) != cpu);
> +
> + if (task_has_throttle_work(p) && !task_needs_throttling(p))
> + task_throttle_do_cancel_work(p);
> +}
I think you can wind up triggering this WARN and in general have some
weird state, whether or not it matters, basically any time that you
__task_throttle_cancel(p, a_remote_cpu).
It queues an irq_work and sends an IPI, but doesn't wait for it. If
handling that IPI is delayed, then we can wind up doing more
migration/class switch/etc (on this cpu or some third cpu) before that
irq_work runs.
I think this may be ok and it's just the WARN that's wrong, but I'm not
sure.
> @@ -5722,35 +5825,107 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> {
> struct rq *rq = data;
> struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> + struct sched_entity *se = tg->se[cpu_of(rq)];
> + struct cfs_rq *pcfs_rq = cfs_rq_of(se);
> + long task_delta = 0, idle_task_delta = 0;
> + struct task_struct *p, *tmp;
>
> cfs_rq->throttle_count--;
> - if (!cfs_rq->throttle_count) {
> - cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> - cfs_rq->throttled_clock_pelt;
> + if (cfs_rq->throttle_count)
> + return 0;
>
> - /* Add cfs_rq with load or one or more already running entities to the list */
> - if (!cfs_rq_is_decayed(cfs_rq))
> - list_add_leaf_cfs_rq(cfs_rq);
> + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> + cfs_rq->throttled_clock_pelt;
> +
> + /* Add cfs_rq with load or one or more already running entities to the list */
> + if (!cfs_rq_is_decayed(cfs_rq))
> + list_add_leaf_cfs_rq(cfs_rq);
>
> - if (cfs_rq->throttled_clock_self) {
> - u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> + if (cfs_rq->throttled_clock_self) {
> + u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
>
> - cfs_rq->throttled_clock_self = 0;
> + cfs_rq->throttled_clock_self = 0;
>
> - if (SCHED_WARN_ON((s64)delta < 0))
> - delta = 0;
> + if (SCHED_WARN_ON((s64)delta < 0))
> + delta = 0;
>
> - cfs_rq->throttled_clock_self_time += delta;
> - }
> + cfs_rq->throttled_clock_self_time += delta;
> + }
> +
> + /*
> + * Re-enqueue the tasks that have been throttled at this level.
> + *
> + * The task count is up-propagated via ->unthrottled_*h_nr_running,
> + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
> + * might happen when a cfs_rq still has some tasks enqueued, either still
> + * making their way to userspace, or freshly migrated to it.
> + */
> + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> + struct sched_entity *pse = &p->se;
> +
> + list_del_init(&p->throttle_node);
> +
> + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
> + task_delta++;
> + idle_task_delta += task_has_idle_policy(p);
> + }
You know, on our too-large machines with too-many cgroups, we're already
running into these walk_tg_trees being worryingly slow for holding a spinlock :P
> +
> + /*
> + * Account tasks woken up in children; by this point all direct children
> + * have been visited.
> + */
> + task_delta += cfs_rq->unthrottled_h_nr_running;
> + idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running;
> +
> + cfs_rq->h_nr_running += task_delta;
> + cfs_rq->idle_h_nr_running += idle_task_delta;
> +
> + /*
> + * unthrottle_cfs_rq() needs a value to up-propagate above the
> + * freshly unthrottled cfs_rq.
> + */
> + cfs_rq->unthrottled_h_nr_running = task_delta;
> + cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta;
I think this should have no effect, right?
> +
> + /* Accumulate the delta in the parent's stash. Once all its children
> + * (i.e. all of this cfs_rq's siblings) have been visited, this value
> + * will be stable and used for its own count update.
> + */
> + pcfs_rq->unthrottled_h_nr_running += task_delta;
> + pcfs_rq->unthrottled_idle_h_nr_running += idle_task_delta;
> +
> + /*
> + * If the cfs_rq became empty during throttling, then we dequeued
> + * it. It needs to be put back in the hierarchy if it or any of
> + * its children have now-unthrottled tasks.
> + */
> + if (!se->on_rq && (cfs_rq->h_nr_running || cfs_rq->idle_h_nr_running)) {
> + enqueue_entity(pcfs_rq, se, ENQUEUE_WAKEUP);
> + } else {
> + update_load_avg(pcfs_rq, se, UPDATE_TG);
> + se_update_runnable(se);
> }
So I think this is trying to combine all the updates, and it's logically
just the same as if the loop was enqueue_task_fair, like you mentioned
in a followup for the throttle_one_task and dequeue_task_fair?
> void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> @@ -5922,25 +6152,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> goto unthrottle_throttle;
> }
>
> - task_delta = cfs_rq->h_nr_running;
> - idle_task_delta = cfs_rq->idle_h_nr_running;
> - for_each_sched_entity(se) {
> - struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> -
> - if (se->on_rq)
> - break;
> - enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
> -
> - if (cfs_rq_is_idle(group_cfs_rq(se)))
> - idle_task_delta = cfs_rq->h_nr_running;
> + if (cfs_rq->throttle_count)
> + return;
>
> - qcfs_rq->h_nr_running += task_delta;
> - qcfs_rq->idle_h_nr_running += idle_task_delta;
> + /*
> + * cfs_rq's below us may not have been fully emptied out, so we can't rely
> + * directly on ->h_nr_running. Use the stash instead.
> + */
> + task_delta = cfs_rq->unthrottled_h_nr_running;
> + idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running;
>
> - /* end evaluation on encountering a throttled cfs_rq */
> - if (cfs_rq_throttled(qcfs_rq))
> - goto unthrottle_throttle;
> - }
> + walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq);
>
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
I think this would be simpler by making the first/original
walk_tg_tree_from be
walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_clear_down, tg_unthrottle_up, (void *)rq);
(With the clear function being the same, just renamed)
We never have any unthrottled* saved between calls to unthrottle_cfs_rq,
because if throttled_count is set for us, it's set for all of our
descendants too, so we're doing nothing but throttle_count stuff in that
case. Sadly that doesn't let us remove the cfs_rq fields, that would
need a walk_tg_tree_by_level.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-19 0:25 ` Benjamin Segall
@ 2024-07-23 15:16 ` Valentin Schneider
2024-07-24 1:34 ` Benjamin Segall
0 siblings, 1 reply; 28+ messages in thread
From: Valentin Schneider @ 2024-07-23 15:16 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, rcu, Peter Zijlstra, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On 18/07/24 17:25, Benjamin Segall wrote:
> Valentin Schneider <vschneid@redhat.com> writes:
>
>> As reported in [1], CFS bandwidth throttling is a source of headaches in
>> PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
>> prevent ksoftirqd from running, which prevents replenishing & unthrottling
>> the cfs_rq of said CFS task.
>>
>> Peter mentioned that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
>> for the task to be about to return to userspace.
>
> Sorry for taking a while to get to replying to this.
Well it's a pretty big diff, so thanks for taking a look!
>
>> Clocks
>> ======
>>
>> Correctly handling the different cfs_rq->throttled_clock* is tricky, as
>> unlike the current upstream approach where all tasks of a cfs_rq are
>> throttled at the exact same time, here they each get throttled at a
>> per-task, not-known-beforehand time.
>>
>> For instance, for the ->throttled_clock_pelt, ideally we would need a
>> per-task snapshot of when the task gets really throttled in
>> exit_to_user_mode(), rather than a single snapshot of when the cfs_rq runs
>> out of runtime. This isn't implemented here. The ->throttled_clock_pelt is
>> set when the cfs_rq runs out of runtime, which means the "grace period"
>> given to the cfs_rq's tasks on their way to exit_to_user_mode() isn't
>> accounted.
>
> And I haven't checked it yet because I never remember how the whole
> throttled clock thing works. :P
>
>>
>> Notable behaviour changes
>> =========================
>>
>> Once a cfs_rq is ->throttled, its tasks can continue running until they hit
>> exit_to_user_mode(). This means they can keep draining further runtime
>> from their cfs_rq, which can end up draining more than one period's worth
>> of runtime.
>>
>> I've tested a 10ms runtime / 100ms period cgroup with an always running
>> task: upstream gets a "clean" periodic pattern of 10ms runtime every 100ms,
>> whereas this gets something more like 40ms runtime every 400ms.
>
> Hmm, this seems a little odd since TWA_RESUME does a kick_process.
I didn't ponder too much on the workload used here, but the point I wanted
to bring up is: if you give a cgroup X amount of runtime, it may still
consume more than that within a single period because execution in
kernelspace isn't immediately stopped/throttled.
It means the "standard" bandwidth control behaviour becomes a bit more
bursty.
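To make that concrete, here's a toy millisecond-granularity model of the
deferred throttle (this is my own sketch of roughly what happens, not the
patch's actual accounting; in particular the debt-carrying replenish is an
assumption):

```python
def per_period_usage(quota, period, chunk, nperiods):
    """CPU time (ms) an always-running task gets in each period when the
    throttle is only applied at return-to-user, i.e. at the end of each
    'chunk' ms stretch of kernel execution."""
    usage = [0] * nperiods
    end = nperiods * period
    t = 0
    runtime = quota
    next_refill = period
    while t < end:
        while t >= next_refill:                    # period-boundary replenish;
            runtime = min(runtime + quota, quota)  # overrun carried as debt
            next_refill += period
        if runtime <= 0:
            t = next_refill                        # throttled until replenish
            continue
        run_end = min(t + chunk, end)              # finish the kernel stretch,
        for u in range(t, run_end):                # even past quota depletion
            usage[u // period] += 1
        runtime -= run_end - t
        t = run_end
    return usage
```

With 40ms kernel stretches this model reproduces the 40ms-every-400ms pattern
above; with 1ms stretches it degenerates to the upstream-like 10ms every 100ms.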
>> +static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
>> +{
>> + struct task_struct *p = container_of(work, struct task_struct, unthrottle_irq_work);
>> + int cpu = raw_smp_processor_id();
>> +
>> + CLASS(task_rq_lock, rq_guard)(p);
>> + WARN_ON_ONCE(task_cpu(p) != cpu);
>> +
>> + if (task_has_throttle_work(p) && !task_needs_throttling(p))
>> + task_throttle_do_cancel_work(p);
>> +}
>
> I think you can wind up triggering this WARN and in general have some
> weird state, whether or not it matters, basically any time that you
> __task_throttle_cancel(p, a_remote_cpu).
>
> It queues an irq_work and sends an IPI, but doesn't wait for it. If
> handling that IPI is delayed, then we can wind up doing more
> migration/class switch/etc (on this cpu or some third cpu) before that
> irq_work runs.
>
> I think this may be ok and it's just the WARN that's wrong, but I'm not
> sure.
>
That whole thing is indeed nasty, /me goes back through my notes
Right, so you can have (task_cpu(p) == CPU0):

    CPU0                                    CPU1

    rq_lock();
                                            rq_lock();
    switched_from_fair()
      task_throttle_cancel()
    ...
    rq_unlock();
                                            pull_rt_task()
                                              set_task_cpu(p, CPU1);
                                            rq_unlock();
    task_throttle_cancel_irq_work_fn()
      task_throttle_do_cancel()
        WARN_ON_ONCE(CPU1 != CPU0);
That does mean the task p could start executing on CPU1 before the
task_work is cleared, which is something I want to avoid - while very
unlikely, the worst case is it reaches exit_to_user_mode() with the stale
work still pending and gets throttled when it shouldn't be, which is a no-no.
I could add a sched_class check in throttle_cfs_rq_work(), but this is making
me wonder if irq_work is the right mechanism here.
The one important thing for me is: if a task doesn't need throttling
anymore, then the dequeue_task_fair() in exit_to_user_mode() mustn't happen.
One option I'm seeing is forget about the irq_work, always re-check sched_class
in throttle_cfs_rq_work() (we already check cfs_rq->throttle_count aka
throttled_hierarchy()), and just do a kick_process() to ensure a kernel entry.
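A rough toy sketch of that alternative (all names and structure are mine, not
the patch's): the work itself becomes the single place that decides, under the
rq lock, whether throttling is still warranted, so a stale queuing is harmless:

```python
import threading

class Task:
    def __init__(self):
        self.lock = threading.Lock()      # stand-in for the task's rq lock
        self.sched_class = "fair"
        self.hierarchy_throttled = True
        self.dequeued = False

def throttle_cfs_rq_work(task):
    """Runs at exit_to_user_mode(). State may have changed between queueing
    the task_work and running it, so re-validate everything under the lock
    and bail out if throttling is no longer needed."""
    with task.lock:
        if task.sched_class != "fair":    # e.g. switched to RT meanwhile
            return
        if not task.hierarchy_throttled:  # quota replenished meanwhile
            return
        task.dequeued = True              # the actual throttle dequeue
```

With this shape, cancelling remotely reduces to a kick_process() to force a
kernel entry; the work then sees the fresh state and does nothing.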
>> + /*
>> + * Re-enqueue the tasks that have been throttled at this level.
>> + *
>> + * The task count is up-propagated via ->unthrottled_*h_nr_running,
>> + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>> + * might happen when a cfs_rq still has some tasks enqueued, either still
>> + * making their way to userspace, or freshly migrated to it.
>> + */
>> + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>> + struct sched_entity *pse = &p->se;
>> +
>> + list_del_init(&p->throttle_node);
>> +
>> + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>> + task_delta++;
>> + idle_task_delta += task_has_idle_policy(p);
>> + }
>
> You know, on our too-large machines with too-many cgroups, we're already
> running into these walk_tg_trees being worryingly slow for holding a spinlock :P
>
Yeah, for N throttled cgroups in a hierarchy, this is pretty much squashing
N rq-lock-held segments into one. Not great.
What I'm thinking is we could have N walks, each one re-queueing the tasks of
a single cfs_rq and updating the hierarchy, releasing the rq-lock between
walks - instead of a single walk re-queueing up to N lists of tasks. More
operations, but smaller rq-lock-held segments.
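Sketched shape of that idea (pure toy, hypothetical helpers):

```python
import threading

def unthrottle_in_chunks(throttled_cfs_rqs, rq_lock, requeue_one):
    """One short rq-lock-held section per throttled cfs_rq, instead of a
    single section spanning the whole hierarchy."""
    for cfs_rq in throttled_cfs_rqs:
        with rq_lock:            # lock dropped between iterations, letting
            requeue_one(cfs_rq)  # other rq users in; requeue_one() would
                                 # re-enqueue this cfs_rq's limbo tasks and
                                 # update the counts up the hierarchy
```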
>> +
>> + /*
>> + * Account tasks woken up in children; by this point all direct children
>> + * have been visited.
>> + */
>> + task_delta += cfs_rq->unthrottled_h_nr_running;
>> + idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running;
>> +
>> + cfs_rq->h_nr_running += task_delta;
>> + cfs_rq->idle_h_nr_running += idle_task_delta;
>> +
>> + /*
>> + * unthrottle_cfs_rq() needs a value to up-propagate above the
>> + * freshly unthrottled cfs_rq.
>> + */
>> + cfs_rq->unthrottled_h_nr_running = task_delta;
>> + cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta;
>
> I think this should have no effect, right?
Hm so my thoughts here are:
The walk_tg_tree_from(tg_unthrottle_up) will update *nr_running counts up
to the cfs_rq->tg->se[cpu_of(rq)]. However if that cfs_rq isn't the root
one, we'll need the for_each_sched_entity() loop further down
unthrottle_cfs_rq() to update the upper part of the hierarchy. The values
that will be up-propagated there are the ones being saved here.
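A toy model of that bottom-up propagation (names simplified by me;
`unthrottled` stands in for the ->unthrottled_*h_nr_running stash):

```python
class CfsRq:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.h_nr_running = 0
        self.limbo_tasks = 0   # tasks parked on the throttled_limbo_list
        self.unthrottled = 0   # stash: tasks woken below, not yet counted above
        if parent:
            parent.children.append(self)

def tg_unthrottle_up(cfs_rq):
    # all children have already been visited (post-order walk), so their
    # contributions are sitting in our stash
    delta = cfs_rq.limbo_tasks + cfs_rq.unthrottled
    cfs_rq.limbo_tasks = 0
    cfs_rq.h_nr_running += delta
    cfs_rq.unthrottled = delta          # value to up-propagate above us
    if cfs_rq.parent:
        cfs_rq.parent.unthrottled += delta

def walk_up(cfs_rq):
    # mirrors walk_tg_tree_from(tg, tg_nop, tg_unthrottle_up, rq)
    for child in cfs_rq.children:
        walk_up(child)
    tg_unthrottle_up(cfs_rq)
```

After the walk, the walk root's `unthrottled` is exactly the delta the
for_each_sched_entity() loop would then add to the remaining ancestors.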
>
>> +
>> + /* Accumulate the delta in the parent's stash. Once all its children
>> + * (i.e. all of this cfs_rq's siblings) have been visited, this value
>> + * will be stable and used for its own count update.
>> + */
>> + pcfs_rq->unthrottled_h_nr_running += task_delta;
>> + pcfs_rq->unthrottled_idle_h_nr_running += idle_task_delta;
>> +
>> + /*
>> + * If the cfs_rq became empty during throttling, then we dequeued
>> + * it. It needs to be put back in the hierarchy if it or any of
>> + * its children have now-unthrottled tasks.
>> + */
>> + if (!se->on_rq && (cfs_rq->h_nr_running || cfs_rq->idle_h_nr_running)) {
>> + enqueue_entity(pcfs_rq, se, ENQUEUE_WAKEUP);
>> + } else {
>> + update_load_avg(pcfs_rq, se, UPDATE_TG);
>> + se_update_runnable(se);
>> }
>
> So I think this is trying to combine all the updates, and it's logically
> just the same as if the loop was enqueue_task_fair, like you mentioned
> in a followup for the throttle_one_task and dequeue_task_fair?
>
Good point, that should be a mirror of throttle_one_task() wrt the enqueueing.
>> void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> @@ -5922,25 +6152,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> goto unthrottle_throttle;
>> }
>>
>> - task_delta = cfs_rq->h_nr_running;
>> - idle_task_delta = cfs_rq->idle_h_nr_running;
>> - for_each_sched_entity(se) {
>> - struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>> -
>> - if (se->on_rq)
>> - break;
>> - enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
>> -
>> - if (cfs_rq_is_idle(group_cfs_rq(se)))
>> - idle_task_delta = cfs_rq->h_nr_running;
>> + if (cfs_rq->throttle_count)
>> + return;
>>
>> - qcfs_rq->h_nr_running += task_delta;
>> - qcfs_rq->idle_h_nr_running += idle_task_delta;
>> + /*
>> + * cfs_rq's below us may not have been fully emptied out, so we can't rely
>> + * directly on ->h_nr_running. Use the stash instead.
>> + */
>> + task_delta = cfs_rq->unthrottled_h_nr_running;
>> + idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running;
>>
>> - /* end evaluation on encountering a throttled cfs_rq */
>> - if (cfs_rq_throttled(qcfs_rq))
>> - goto unthrottle_throttle;
>> - }
>> + walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq);
>>
>> for_each_sched_entity(se) {
>> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> I think this would be simpler by making the first/original
> walk_tg_tree_from be
>
> walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_clear_down, tg_unthrottle_up, (void *)rq);
>
> (With the clear function being the same, just renamed)
>
> We never have any unthrottled* saved between calls to unthrottle_cfs_rq,
> because if throttled_count is set for us, it's set for all of our
> descendants too, so we're doing nothing but throttle_count stuff in that
> case. Sadly that doesn't let us remove the cfs_rq fields, that would
> need a walk_tg_tree_by_level.
Good point, I think that makes sense. Thanks!
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-23 15:16 ` Valentin Schneider
@ 2024-07-24 1:34 ` Benjamin Segall
2024-07-24 7:20 ` Valentin Schneider
0 siblings, 1 reply; 28+ messages in thread
From: Benjamin Segall @ 2024-07-24 1:34 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, rcu, Peter Zijlstra, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
Valentin Schneider <vschneid@redhat.com> writes:
> On 18/07/24 17:25, Benjamin Segall wrote:
>> Valentin Schneider <vschneid@redhat.com> writes:
>>
>>> I've tested a 10ms runtime / 100ms period cgroup with an always running
>>> task: upstream gets a "clean" periodic pattern of 10ms runtime every 100ms,
>>> whereas this gets something more like 40ms runtime every 400ms.
>>
>> Hmm, this seems a little odd since TWA_RESUME does a kick_process.
>
> I didn't ponder too much on the workload used here, but the point I wanted
> to bring up is: if you give a cgroup X amount of runtime, it may still
> consume more than that within a single period because execution in
> kernelspace isn't immediately stopped/throttled.
>
> It means the "standard" bandwidth control behaviour becomes a bit more
> bursty.
Yeah, more bursty behavior when doing cpu-burning syscalls is expected.
With the check on exit to user I wouldn't expect anything worse than the
duration of the syscall though, so it depends on what your test was.
>>> +
>>> + /*
>>> + * Account tasks woken up in children; by this point all direct children
>>> + * have been visited.
>>> + */
>>> + task_delta += cfs_rq->unthrottled_h_nr_running;
>>> + idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running;
>>> +
>>> + cfs_rq->h_nr_running += task_delta;
>>> + cfs_rq->idle_h_nr_running += idle_task_delta;
>>> +
>>> + /*
>>> + * unthrottle_cfs_rq() needs a value to up-propagate above the
>>> + * freshly unthrottled cfs_rq.
>>> + */
>>> + cfs_rq->unthrottled_h_nr_running = task_delta;
>>> + cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta;
>>
>> I think this should have no effect, right?
>
> Hm so my thoughts here are:
> The walk_tg_tree_from(tg_unthrottle_up) will update *nr_running counts up
> to the cfs_rq->tg->se[cpu_of(rq)]. However if that cfs_rq isn't the root
> one, we'll need the for_each_sched_entity() loop further down
> unthrottle_cfs_rq() to update the upper part of the hierarchy. The values
> that will be up-propagated there are the ones being saved here.
I'm pretty sure this comment was left over from when I didn't understand
what they were being used for. I'm pretty sure I remember intending to
remove it.
* Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace
2024-07-24 1:34 ` Benjamin Segall
@ 2024-07-24 7:20 ` Valentin Schneider
0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2024-07-24 7:20 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, rcu, Peter Zijlstra, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe
On 23/07/24 18:34, Benjamin Segall wrote:
> Valentin Schneider <vschneid@redhat.com> writes:
>
>> On 18/07/24 17:25, Benjamin Segall wrote:
>>> Valentin Schneider <vschneid@redhat.com> writes:
>>>
>>>> I've tested a 10ms runtime / 100ms period cgroup with an always running
>>>> task: upstream gets a "clean" periodic pattern of 10ms runtime every 100ms,
>>>> whereas this gets something more like 40ms runtime every 400ms.
>>>
>>> Hmm, this seems a little odd since TWA_RESUME does a kick_process.
>>
>> I didn't ponder too much on the workload used here, but the point I wanted
>> to bring up is: if you give a cgroup X amount of runtime, it may still
>> consume more than that within a single period because execution in
>> kernelspace isn't immediately stopped/throttled.
>>
>> It means the "standard" bandwidth control behaviour becomes a bit more
>> bursty.
>
> Yeah, more bursty behavior when doing cpu-burning syscalls is expected.
> With the check on exit to user I wouldn't expect anything worse than the
> duration of the syscall though, so it depends on what your test was.
>
That would also be my expectations. That was rt-app, so the userspace part
is mostly floating point operations IIRC. I'll do a bit of tracing to make
sure nothing funny is happening in the kernel.
end of thread [newest: ~2024-07-24 7:22 UTC]
Thread overview: 28+ messages
2024-07-11 12:59 [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 01/10] rcuwait: Split type definition to its own header Valentin Schneider
2024-07-12 15:15 ` Peter Zijlstra
2024-07-15 17:57 ` Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 02/10] irq_work: " Valentin Schneider
2024-07-12 15:17 ` Peter Zijlstra
2024-07-11 12:59 ` [RFC PATCH v3 03/10] task_work, sched: Add a _locked variant to task_work_cancel() Valentin Schneider
2024-07-12 10:35 ` Oleg Nesterov
2024-07-12 15:20 ` Peter Zijlstra
2024-07-15 17:57 ` Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 04/10] sched/fair: Introduce sched_throttle_work Valentin Schneider
2024-07-12 15:21 ` Peter Zijlstra
2024-07-15 17:58 ` Valentin Schneider
2024-07-11 12:59 ` [RFC PATCH v3 05/10] sched/fair: Introduce an irq_work for cancelling throttle task_work Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 06/10] sched/fair: Prepare switched_from & switched_to for per-task throttling Valentin Schneider
2024-07-12 15:26 ` Peter Zijlstra
2024-07-11 13:00 ` [RFC PATCH v3 07/10] sched/fair: Prepare task_change_group_fair() " Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 08/10] sched/fair: Prepare migrate_task_rq_fair() " Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 09/10] sched/fair: Add a class->task_woken callback in preparation " Valentin Schneider
2024-07-11 13:00 ` [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return to userspace Valentin Schneider
2024-07-12 17:05 ` Peter Zijlstra
2024-07-12 17:10 ` Peter Zijlstra
2024-07-12 17:43 ` Peter Zijlstra
2024-07-16 12:46 ` Valentin Schneider
2024-07-19 0:25 ` Benjamin Segall
2024-07-23 15:16 ` Valentin Schneider
2024-07-24 1:34 ` Benjamin Segall
2024-07-24 7:20 ` Valentin Schneider