* [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry
@ 2024-02-02 8:09 Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
` (5 more replies)
0 siblings, 6 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02 8:09 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
Hi folks,
Problem statement
=================
CFS tasks can end up throttled while holding locks that other, non-throttled
tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling causing a
resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter the
slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued on
the same CPU as one where ktimers/ksoftirqd is blocked on read_lock(&lock),
this creates a circular dependency.
This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
The linux-rt tree has had
1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
Proposed approach
=================
Peter mentioned [1] that there have been discussions on changing /when/ the
throttling happens: rather than have it be done immediately upon updating
the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
for the task to be about to return to userspace: if it's in userspace, it can't
hold any in-kernel lock.
I submitted an initial jab at this [2] and Ben Segall added his own version to
the conversation [3]. This series contains Ben's patch plus my additions. The
main change here is updating the .h_nr_running counts throughout the cfs_rq
hierachies to improve the picture given to load_balance().
The main thing that remains doing for this series is making the second cfs_rq
tree an actual RB tree (it's just a plain list ATM).
This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
to expose this outside of CFS, but it is another field that's used by load balance.
Testing
=======
Tested on QEMU via:
mount -t cgroup -o cpu none /root/cpu
mkdir /root/cpu/cg0
echo 10000 > /root/cpu/cg0/cpu.cfs_period_us
echo 1000 > /root/cpu/cg0/cpu.cfs_quota_us
mkdir /root/cpu/cg0/cg00
mkdir /root/cpu/cg0/cg01
mkdir /root/cpu/cg0/cg00/cg000
mkdir /root/cpu/cg0/cg00/cg001
spawn() {
while true; do cat /sys/devices/system/cpu/smt/active &>/dev/null; done &
PID=$!
echo "Starting PID${PID}"
echo $PID > $1
}
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/cg01/tasks
spawn cpu/cg0/cg00/cg000/tasks
spawn cpu/cg0/cg00/cg001/tasks
sleep 120
kill $(jobs -p)
Links
=====
[1]: https://lore.kernel.org/all/20231031160120.GE15024@noisy.programming.kicks-ass.net/
[2]: http://lore.kernel.org/r/20231130161245.3894682-1-vschneid@redhat.com
[3]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com
Benjamin Segall (1):
sched/fair: Only throttle CFS tasks on return to userspace
Valentin Schneider (4):
sched: Note schedule() invocations at return-to-user with SM_USER
sched/fair: Delete cfs_rq_throttled_loose(), use
cfs_rq->throttle_pending instead
sched/fair: Track count of tasks running in userspace
sched/fair: Assert user/kernel/total nr invariants
include/linux/sched.h | 7 +
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 2 +-
kernel/sched/core.c | 45 ++++-
kernel/sched/debug.c | 28 +++
kernel/sched/fair.c | 399 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 5 +
7 files changed, 466 insertions(+), 22 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
2024-02-02 8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
@ 2024-02-02 8:09 ` Valentin Schneider
2024-02-03 19:12 ` kernel test robot
2024-02-03 19:45 ` kernel test robot
2024-02-02 8:09 ` [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Valentin Schneider
` (4 subsequent siblings)
5 siblings, 2 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02 8:09 UTC (permalink / raw)
To: linux-kernel
Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
From: Benjamin Segall <bsegall@google.com>
The basic idea of this implementation is to maintain duplicate runqueues
in each cfs_rq that contain duplicate pointers to sched_entitys which
should bypass throttling. Then we can skip throttling cfs_rqs that have
any such children, and when we pick inside any not-actually-throttled
cfs_rq, we only look at this duplicated list.
"Which tasks should bypass throttling" here is "all schedule() calls
that don't set a special flag", but could instead involve the lockdep
markers (except for the problem of percpu-rwsem and similar) or explicit
flags around syscalls and faults, or something else.
This approach avoids any O(tasks) loops, but leaves partially-throttled
cfs_rqs still contributing their full h_nr_running to their parents,
which might result in worse balancing. Also it adds more (generally
still small) overhead to the common enqueue/dequeue/pick paths.
The very basic debug test added is to run a cpusoaker and "cat
/sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
cgroup with a quota < 1 cpu.
Not-signed-off-by: Benjamin Segall <bsegall@google.com>
[Slight comment / naming changes]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/sched.h | 7 ++
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 2 +-
kernel/sched/core.c | 20 ++++
kernel/sched/debug.c | 28 +++++
kernel/sched/fair.c | 232 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 3 +
7 files changed, 281 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab29511..4a0105d1eaa21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,6 +303,7 @@ extern long schedule_timeout_killable(long timeout);
extern long schedule_timeout_uninterruptible(long timeout);
extern long schedule_timeout_idle(long timeout);
asmlinkage void schedule(void);
+asmlinkage void schedule_usermode(void);
extern void schedule_preempt_disabled(void);
asmlinkage void preempt_schedule_irq(void);
#ifdef CONFIG_PREEMPT_RT
@@ -553,6 +554,9 @@ struct sched_entity {
struct cfs_rq *my_q;
/* cached value of my_q->h_nr_running */
unsigned long runnable_weight;
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct list_head kernel_node;
+#endif
#endif
#ifdef CONFIG_SMP
@@ -1539,6 +1543,9 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ atomic_t in_return_to_user;
+#endif
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d7ee4bc3f2ba3..16b5432a62c6f 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -156,7 +156,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
local_irq_enable_exit_to_user(ti_work);
if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ schedule_usermode(); /* TODO: also all of the arch/ loops that don't use this yet */
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..fc4b73de07539 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -14,7 +14,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
}
if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ schedule_usermode();
if (ti_work & _TIF_NOTIFY_RESUME)
resume_user_mode_work(NULL);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f0..a7c028fad5a89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4529,6 +4529,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&p->se.kernel_node);
+ atomic_set(&p->in_return_to_user, 0);
+#endif
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
@@ -6818,6 +6822,22 @@ asmlinkage __visible void __sched schedule(void)
}
EXPORT_SYMBOL(schedule);
+asmlinkage __visible void __sched schedule_usermode(void)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+ /*
+ * This is only atomic because of this simple implementation. We could
+ * do something with an SM_USER to avoid other-cpu scheduler operations
+ * racing against these writes.
+ */
+ atomic_set(¤t->in_return_to_user, true);
+ schedule();
+ atomic_set(¤t->in_return_to_user, false);
+#else
+ schedule();
+#endif
+}
+
/*
* synchronize_rcu_tasks() makes sure that no task is stuck in preempted
* state (have scheduled out non-voluntarily) by making sure that all
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834df..4a89dbc3ddfcd 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -319,6 +319,32 @@ static const struct file_operations sched_verbose_fops = {
.llseek = default_llseek,
};
+static DEFINE_MUTEX(sched_debug_spin_mutex);
+static int sched_debug_spin_show(struct seq_file *m, void *v) {
+ int count;
+ mutex_lock(&sched_debug_spin_mutex);
+ for (count = 0; count < 1000; count++) {
+ u64 start2;
+ start2 = jiffies;
+ while (jiffies == start2)
+ cpu_relax();
+ schedule();
+ }
+ mutex_unlock(&sched_debug_spin_mutex);
+ return 0;
+}
+static int sched_debug_spin_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_debug_spin_show, NULL);
+}
+
+static const struct file_operations sched_debug_spin_fops = {
+ .open = sched_debug_spin_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static const struct seq_operations sched_debug_sops;
static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -374,6 +400,8 @@ static __init int sched_init_debug(void)
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
+ debugfs_create_file("sched_locked_spin", 0444, NULL, NULL,
+ &sched_debug_spin_fops);
return 0;
}
late_initcall(sched_init_debug);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a037..a1808459a5acc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -128,6 +128,7 @@ int __weak arch_asym_cpu_priority(int cpu)
* (default: 5 msec, units: microseconds)
*/
static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+static unsigned int sysctl_sched_cfs_bandwidth_kernel_bypass = 1;
#endif
#ifdef CONFIG_NUMA_BALANCING
@@ -146,6 +147,15 @@ static struct ctl_table sched_fair_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
+ {
+ .procname = "sched_cfs_bandwidth_kernel_bypass",
+ .data = &sysctl_sched_cfs_bandwidth_kernel_bypass,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
#endif
#ifdef CONFIG_NUMA_BALANCING
{
@@ -5445,14 +5455,34 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
+ * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
+ * 2) keep things fair between processes/task groups
+ * 3) pick the "next" process, since someone really wants that to run
+ * 4) pick the "last" process, for cache locality
+ * 5) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq)
+pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
{
+#ifdef CONFIG_CFS_BANDWIDTH
+ /*
+ * TODO: This might trigger, I'm not sure/don't remember. Regardless,
+ * while we do not explicitly handle the case where h_kernel_running
+ * goes to 0, we will call account/check_cfs_rq_runtime at worst in
+ * entity_tick and notice that we can now properly do the full
+ * throttle_cfs_rq.
+ */
+ WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
+ if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+ /*
+ * TODO: you'd want to factor out pick_eevdf to just take
+ * tasks_timeline, and replace this list with a second rbtree
+ * and a call to pick_eevdf.
+ */
+ return list_first_entry(&cfs_rq->kernel_children,
+ struct sched_entity, kernel_node);
+ }
+#endif
/*
* Enabling NEXT_BUDDY will affect latency but not fairness.
*/
@@ -5651,8 +5681,14 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
/*
* if we're unable to extend our runtime we resched so that the active
* hierarchy can be throttled
+ *
+ * Don't resched_curr() if curr is in the kernel. We won't throttle the
+ * cfs_rq if any task is in the kernel, and if curr in particular is we
+ * don't need to preempt it in favor of whatever other task is in the
+ * kernel.
*/
- if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+ if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr) &&
+ list_empty(&rq_of(cfs_rq)->curr->se.kernel_node))
resched_curr(rq_of(cfs_rq));
}
@@ -5741,12 +5777,22 @@ static int tg_throttle_down(struct task_group *tg, void *data)
return 0;
}
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, dequeue = 1;
+ long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+
+ /*
+ * We don't actually throttle, though account() will have made sure to
+ * resched us so that we pick into a kernel task.
+ */
+ if (cfs_rq->h_kernel_running)
+ return false;
raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5778,6 +5824,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ kernel_delta = cfs_rq->h_kernel_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
@@ -5791,6 +5838,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ dequeue_kernel(qcfs_rq, se, kernel_delta);
if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5813,6 +5861,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ dequeue_kernel(qcfs_rq, se, kernel_delta);
}
/* At this point se is NULL and we are at root level*/
@@ -5835,7 +5884,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta;
+ long task_delta, idle_task_delta, kernel_delta;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -5870,6 +5919,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ kernel_delta = cfs_rq->h_kernel_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
@@ -5882,6 +5932,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ enqueue_kernel(qcfs_rq, se, kernel_delta);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -5899,6 +5950,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ enqueue_kernel(qcfs_rq, se, kernel_delta);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6557,6 +6609,86 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
}
#endif
+/*
+ * We keep track of all children that are runnable in the kernel with a count of
+ * all descendants. The state is checked on enqueue and put_prev (and hard
+ * cleared on dequeue), and is stored just as the filled/empty state of the
+ * kernel_node list entry.
+ *
+ * These are simple helpers that do both parts, and should be called bottom-up
+ * until hitting a throttled cfs_rq whenever a task changes state (or a cfs_rq
+ * is (un)throttled).
+ */
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+ if (count == 0)
+ return;
+
+ if (list_empty(&se->kernel_node))
+ list_add(&se->kernel_node, &cfs_rq->kernel_children);
+ cfs_rq->h_kernel_running += count;
+}
+
+static bool is_kernel_task(struct task_struct *p)
+{
+ return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+}
+
+/*
+ * When called on a task this always transitions it to a !kernel state.
+ *
+ * When called on a group it is just synchronizing the state with the new
+ * h_kernel_waiters, unless this it has been throttled and is !on_rq
+ */
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+ if (count == 0)
+ return;
+
+ if (!se->on_rq || entity_is_task(se) ||
+ !group_cfs_rq(se)->h_kernel_running)
+ list_del_init(&se->kernel_node);
+ cfs_rq->h_kernel_running -= count;
+}
+
+/*
+ * Returns if the cfs_rq "should" be throttled but might not be because of
+ * kernel threads bypassing throttle.
+ */
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+ if (!cfs_bandwidth_used())
+ return false;
+
+ if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+ return false;
+ return true;
+}
+
+static void unthrottle_on_enqueue(struct task_struct *p)
+{
+ struct sched_entity *se = &p->se;
+
+ if (!cfs_bandwidth_used() || !sysctl_sched_cfs_bandwidth_kernel_bypass)
+ return;
+ if (!cfs_rq_of(&p->se)->throttle_count)
+ return;
+
+ /*
+ * MAYBE TODO: doing it this simple way is O(throttle_count *
+ * cgroup_depth). We could optimize that into a single pass, but making
+ * a mostly-copy of unthrottle_cfs_rq that does that is a pain and easy
+ * to get wrong. (And even without unthrottle_on_enqueue it's O(nm),
+ * just not while holding rq->lock the whole time)
+ */
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ if (cfs_rq->throttled)
+ unthrottle_cfs_rq(cfs_rq);
+ }
+}
+
#else /* CONFIG_CFS_BANDWIDTH */
static inline bool cfs_bandwidth_used(void)
@@ -6604,6 +6736,16 @@ bool cfs_task_bw_constrained(struct task_struct *p)
return false;
}
#endif
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static inline bool is_kernel_task(struct task_struct *p)
+{
+ return false;
+}
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+ return false;
+}
#endif /* CONFIG_CFS_BANDWIDTH */
#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
@@ -6707,6 +6849,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ bool kernel_task = is_kernel_task(p);
/*
* The code below (indirectly) updates schedutil which looks at
@@ -6735,6 +6878,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ enqueue_kernel(cfs_rq, se, 1);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6755,6 +6900,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ enqueue_kernel(cfs_rq, se, 1);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6785,6 +6932,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
+
+ if (kernel_task)
+ unthrottle_on_enqueue(p);
}
static void set_next_buddy(struct sched_entity *se);
@@ -6801,6 +6951,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
+ bool kernel_task = !list_empty(&p->se.kernel_node);
util_est_dequeue(&rq->cfs, p);
@@ -6813,6 +6964,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ dequeue_kernel(cfs_rq, se, 1);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6845,6 +6998,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ dequeue_kernel(cfs_rq, se, 1);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -8343,11 +8498,40 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
resched_curr(rq);
}
+static void handle_kernel_task_prev(struct task_struct *prev)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct sched_entity *se = &prev->se;
+ bool p_in_kernel = is_kernel_task(prev);
+ bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+ /*
+ * These extra loops are bad and against the whole point of the merged
+ * PNT, but it's a pain to merge, particularly since we want it to occur
+ * before check_cfs_runtime().
+ */
+ if (p_in_kernel_tree && !p_in_kernel) {
+ WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+ for_each_sched_entity(se) {
+ dequeue_kernel(cfs_rq_of(se), se, 1);
+ if (cfs_rq_throttled(cfs_rq_of(se)))
+ break;
+ }
+ } else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+ for_each_sched_entity(se) {
+ enqueue_kernel(cfs_rq_of(se), se, 1);
+ if (cfs_rq_throttled(cfs_rq_of(se)))
+ break;
+ }
+ }
+#endif
+}
+
#ifdef CONFIG_SMP
static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ bool throttled = false;
again:
cfs_rq = &rq->cfs;
@@ -8368,7 +8552,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
goto again;
}
- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+
+ se = pick_next_entity(cfs_rq, throttled);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
@@ -8383,6 +8570,14 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
+ bool throttled;
+
+ /*
+ * We want to handle this before check_cfs_runtime(prev). We'll
+ * duplicate a little work in the goto simple case, but that's fine
+ */
+ if (prev)
+ handle_kernel_task_prev(prev);
again:
if (!sched_fair_runnable(rq))
@@ -8400,6 +8595,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
* hierarchy, only change the part that actually changes.
*/
+ throttled = false;
do {
struct sched_entity *curr = cfs_rq->curr;
@@ -8431,7 +8627,10 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
}
}
- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+
+ se = pick_next_entity(cfs_rq, throttled);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
@@ -8469,8 +8668,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
if (prev)
put_prev_task(rq, prev);
+ throttled = false;
do {
- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+ se = pick_next_entity(cfs_rq, throttled);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
@@ -8534,6 +8736,8 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
+ handle_kernel_task_prev(prev);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
@@ -12818,6 +13022,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
raw_spin_lock_init(&cfs_rq->removed.lock);
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&cfs_rq->kernel_children);
+#endif
}
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -12970,6 +13177,9 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&se->kernel_node);
+#endif
}
static DEFINE_MUTEX(shares_mutex);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77de..0b33ce2e60555 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -580,6 +580,7 @@ struct cfs_rq {
struct rb_root_cached tasks_timeline;
+
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
@@ -658,8 +659,10 @@ struct cfs_rq {
u64 throttled_clock_self_time;
int throttled;
int throttle_count;
+ int h_kernel_running;
struct list_head throttled_list;
struct list_head throttled_csd_list;
+ struct list_head kernel_children;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER
2024-02-02 8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
@ 2024-02-02 8:09 ` Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
` (3 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02 8:09 UTC (permalink / raw)
To: linux-kernel
Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
task_struct.in_return_to_user is currently updated via atomic operations in
schedule_usermode().
However, one can note:
o .in_return_to_user is only updated for the current task
o There are no remote (smp_processor_id() != task_cpu(p)) accesses to
.in_return_to_user
Add schedule_with_mode() to factorize schedule() with different flags to
pass down to __schedule_loop().
Add SM_USER to denote schedule() calls from return-to-userspace points.
Update .in_return_to_user from within the preemption-disabled, rq_lock-held
part of __schedule().
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
include/linux/sched.h | 2 +-
kernel/sched/core.c | 43 ++++++++++++++++++++++++++++++++-----------
kernel/sched/fair.c | 17 ++++++++++++++++-
3 files changed, 49 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a0105d1eaa21..1b6f17b2150a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1544,7 +1544,7 @@ struct task_struct {
#endif
#ifdef CONFIG_CFS_BANDWIDTH
- atomic_t in_return_to_user;
+ int in_return_to_user;
#endif
/*
* New fields for task_struct should be added above here, so that
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7c028fad5a89..54e6690626b13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4531,7 +4531,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#ifdef CONFIG_CFS_BANDWIDTH
INIT_LIST_HEAD(&p->se.kernel_node);
- atomic_set(&p->in_return_to_user, 0);
+ p->in_return_to_user = false;
#endif
#ifdef CONFIG_SCHEDSTATS
@@ -5147,6 +5147,9 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
static inline void finish_lock_switch(struct rq *rq)
{
+#ifdef CONFIG_CFS_BANDWIDTH
+ current->in_return_to_user = false;
+#endif
/*
* If we are tracking spinlock dependencies then we have to
* fix up the runqueue lock - which gets 'carried over' from
@@ -6562,6 +6565,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
#define SM_PREEMPT 0x1
#define SM_RTLOCK_WAIT 0x2
+/*
+ * Special case for CFS_BANDWIDTH where we need to know if the call to
+ * __schedule() is directely preceding an entry into userspace.
+ * It is removed from the mode argument as soon as it is used to not go against
+ * the SM_MASK_PREEMPT optimisation below.
+ */
+#ifdef CONFIG_CFS_BANDWIDTH
+# define SM_USER 0x4
+#else
+# define SM_USER SM_NONE
+#endif
+
#ifndef CONFIG_PREEMPT_RT
# define SM_MASK_PREEMPT (~0U)
#else
@@ -6646,6 +6661,14 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq_lock(rq, &rf);
smp_mb__after_spinlock();
+#ifdef CONFIG_CFS_BANDWIDTH
+ if (sched_mode & SM_USER) {
+ prev->in_return_to_user = true;
+ sched_mode &= ~SM_USER;
+ }
+#endif
+ SCHED_WARN_ON(sched_mode & SM_USER);
+
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
@@ -6807,7 +6830,7 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
} while (need_resched());
}
-asmlinkage __visible void __sched schedule(void)
+static __always_inline void schedule_with_mode(unsigned int sched_mode)
{
struct task_struct *tsk = current;
@@ -6817,22 +6840,20 @@ asmlinkage __visible void __sched schedule(void)
if (!task_is_running(tsk))
sched_submit_work(tsk);
- __schedule_loop(SM_NONE);
+ __schedule_loop(sched_mode);
sched_update_worker(tsk);
}
+
+asmlinkage __visible void __sched schedule(void)
+{
+ schedule_with_mode(SM_NONE);
+}
EXPORT_SYMBOL(schedule);
asmlinkage __visible void __sched schedule_usermode(void)
{
#ifdef CONFIG_CFS_BANDWIDTH
- /*
- * This is only atomic because of this simple implementation. We could
- * do something with an SM_USER to avoid other-cpu scheduler operations
- * racing against these writes.
- */
- atomic_set(¤t->in_return_to_user, true);
- schedule();
- atomic_set(¤t->in_return_to_user, false);
+ schedule_with_mode(SM_USER);
#else
schedule();
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1808459a5acc..96504be6ee14a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6631,7 +6631,22 @@ static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c
static bool is_kernel_task(struct task_struct *p)
{
- return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+ /*
+ * The flag is updated within __schedule() with preemption disabled,
+ * under the rq lock, and only when the task is current.
+ *
+ * Holding the rq lock for that task's CPU is thus sufficient for the
+ * value to be stable, if the task is enqueued.
+ *
+ * If the task is dequeued, then task_cpu(p) *can* change, but this
+ * so far only happens in enqueue_task_fair() which means either:
+ * - the task is being activated, its CPU has been set previously in ttwu()
+ * - the task is going through a "change" cycle (e.g. sched_move_task()),
+ * the pi_lock is also held so the CPU is stable.
+ */
+ lockdep_assert_rq_held(cpu_rq(task_cpu(p)));
+
+ return sysctl_sched_cfs_bandwidth_kernel_bypass && !p->in_return_to_user;
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
2024-02-02 8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Valentin Schneider
@ 2024-02-02 8:09 ` Valentin Schneider
2024-02-03 23:01 ` kernel test robot
2024-02-06 21:36 ` Benjamin Segall
2024-02-02 8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
` (2 subsequent siblings)
5 siblings, 2 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02 8:09 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
cfs_rq_throttled_loose() does not check if there is runtime remaining in
the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously
for that to be checked.
Cache the throttle attempt in throttle_cfs_rq and reuse that where
needed.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 44 ++++++++++----------------------------------
1 file changed, 10 insertions(+), 34 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96504be6ee14a..60778afbff207 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* 5) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
+pick_next_entity(struct cfs_rq *cfs_rq)
{
#ifdef CONFIG_CFS_BANDWIDTH
/*
@@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
* throttle_cfs_rq.
*/
WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
- if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+ if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
/*
* TODO: you'd want to factor out pick_eevdf to just take
* tasks_timeline, and replace this list with a second rbtree
@@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
* We don't actually throttle, though account() will have made sure to
* resched us so that we pick into a kernel task.
*/
- if (cfs_rq->h_kernel_running)
+ if (cfs_rq->h_kernel_running) {
+ cfs_rq->throttle_pending = true;
return false;
+ }
+
+ cfs_rq->throttle_pending = false;
raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -6666,20 +6670,6 @@ static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c
cfs_rq->h_kernel_running -= count;
}
-/*
- * Returns if the cfs_rq "should" be throttled but might not be because of
- * kernel threads bypassing throttle.
- */
-static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
-{
- if (!cfs_bandwidth_used())
- return false;
-
- if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
- return false;
- return true;
-}
-
static void unthrottle_on_enqueue(struct task_struct *p)
{
struct sched_entity *se = &p->se;
@@ -8546,7 +8536,6 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
- bool throttled = false;
again:
cfs_rq = &rq->cfs;
@@ -8567,10 +8556,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
goto again;
}
- if (cfs_rq_throttled_loose(cfs_rq))
- throttled = true;
-
- se = pick_next_entity(cfs_rq, throttled);
+ se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
@@ -8585,7 +8571,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
- bool throttled;
/*
* We want to handle this before check_cfs_runtime(prev). We'll
@@ -8609,8 +8594,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
* Therefore attempt to avoid putting and setting the entire cgroup
* hierarchy, only change the part that actually changes.
*/
-
- throttled = false;
do {
struct sched_entity *curr = cfs_rq->curr;
@@ -8641,11 +8624,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
goto simple;
}
}
-
- if (cfs_rq_throttled_loose(cfs_rq))
- throttled = true;
-
- se = pick_next_entity(cfs_rq, throttled);
+ se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
@@ -8683,11 +8662,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
if (prev)
put_prev_task(rq, prev);
- throttled = false;
do {
- if (cfs_rq_throttled_loose(cfs_rq))
- throttled = true;
- se = pick_next_entity(cfs_rq, throttled);
+ se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
2024-02-02 8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (2 preceding siblings ...)
2024-02-02 8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
@ 2024-02-02 8:09 ` Valentin Schneider
2024-02-04 3:08 ` kernel test robot
2024-02-02 8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
5 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02 8:09 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
While having a second tree to pick from solves the throttling aspect of things,
it also requires modification of the task count at the cfs_rq level.
.h_nr_running is used throughout load_balance(), and it needs to accurately
reflect the amount of pickable tasks: a cfs_rq with .throttle_pending=1 may have
many tasks in userspace (thus effectively throttled), and this "excess" of tasks
shouldn't cause find_busiest_group() / find_busiest_queue() to pick that
cfs_rq's CPU to pull load from when there are other CPUs with more pickable
tasks to pull.
The approach taken here is to track both the count of tasks in kernelspace and
the count of tasks in userspace (technically tasks-just-about-to-enter-userspace).
When a cfs_rq runs out of runtime, it gets marked as .throttle_pending=1. From
this point on, only tasks executing in kernelspace are pickable, and this is
reflected up the hierarchy by removing that cfs_rq.h_user_running from its
parents' .h_nr_running.
To aid in validating the proper behaviour of the implementation, we assert the
following invariants:
o For any cfs_rq with .throttle_pending == 0:
.h_kernel_running + .h_user_running == .h_nr_running
o For any cfs_rq with .throttle_pending == 1:
.h_kernel_running == .h_nr_running
This means the .h_user_running also needs to be updated as cfs_rq's become
.throttle_pending=1. When a cfs_rq becomes .throttle_pending=1, its
.h_user_running remains untouched, but it is subtracted from its parents'
.h_user_running.
Another way to look at it is that the .h_user_running is "stored" at the level
of the .throttle_pending cfs_rq, and restored to the upper part of the hierarchy
at unthrottle.
An overview of the count logic is:
Consider:
cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
cfs_rq.user := count of user *tasks* enqueued on this cfs_rq
Then, the following logic is implemented:
cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
cfs_rq.h_user_running = Sum(child.user) for all child cfs_rq with !child.throttle_pending
cfs_rq.h_nr_running = Sum(child.kernel) for all child cfs_rq
+ Sum(child.user) for all child cfs_rq with !child.throttle_pending
An application of that logic to an A/B/C cgroup hierarchy:
Initial condition, no throttling
+------+ .h_kernel_running = C.kernel + B.kernel + A.kernel
A |cfs_rq| .h_user_running = C.user + B.user + A.user
+------+ .h_nr_running = C.{kernel+user} + B.{kernel+user} + A.{kernel+user}
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel + B.kernel
B |cfs_rq| .h_user_running = C.user + B.user
+------+ .h_nr_running = C.{kernel+user} + B.{kernel+user}
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running = C.user
+------+ .h_nr_running = C.{kernel+user}
.throttle_pending = 0
C becomes .throttle_pending
+------+ .h_kernel_running = C.kernel + B.kernel + A.kernel <- Untouched
A |cfs_rq| .h_user_running = B.user + A.user <- Decremented by C.user
+------+ .h_nr_running = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel + B.kernel <- Untouched
B |cfs_rq| .h_user_running = B.user <- Decremented by C.user
+------+ .h_nr_running = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running = C.user <- Untouched, the count is "stored" at this level
+------+ .h_nr_running = C.kernel <- Decremented by C.user
.throttle_pending = 1
C becomes throttled
+------+ .h_kernel_running = B.kernel + A.kernel <- Decremented by C.kernel
A |cfs_rq| .h_user_running = B.user + A.user
+------+ .h_nr_running = B.{kernel+user} + A.{kernel+user} <- Decremented by C.kernel
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = B.kernel <- Decremented by C.kernel
B |cfs_rq| .h_user_running = B.user
+------+ .h_nr_running = B.{kernel+user} + A.{kernel+user} <- Decremented by C.kernel
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running = C.user
+------+ .h_nr_running = C.{kernel+user} <- Incremented by C.user
.throttle_pending = 0
Could we get away with just one count, e.g. the user count and not the kernel
count? Technically yes, we could follow this scheme:
if (throttle_pending) => kernel count := h_nr_running - h_user_running
else => kernel count := h_nr_running
this however prevents any sort of assertion or sanity checking on the counts,
which I am not the biggest fan on - CFS group scheduling is enough of a headache
as it is.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 174 ++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 2 +
2 files changed, 151 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60778afbff207..2b54d3813d18d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5785,17 +5785,48 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+ long task_delta, idle_task_delta, kernel_delta, user_delta, dequeue = 1;
+ bool was_pending;
/*
- * We don't actually throttle, though account() will have made sure to
- * resched us so that we pick into a kernel task.
+ * We don't actually throttle just yet, though account_cfs_rq_runtime()
+ * will have made sure to resched us so that we pick into a kernel task.
*/
if (cfs_rq->h_kernel_running) {
+ if (cfs_rq->throttle_pending)
+ return false;
+
+ /*
+ * From now on we're only going to pick tasks that are in the
+ * second tree. Reflect this by discounting tasks that aren't going
+ * to be pickable from the ->h_nr_running counts.
+ */
cfs_rq->throttle_pending = true;
+
+ se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ user_delta = cfs_rq->h_user_running;
+ cfs_rq->h_nr_running -= user_delta;
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
+ if (!se->on_rq)
+ goto done;
+
+ qcfs_rq->h_nr_running -= user_delta;
+ qcfs_rq->h_user_running -= user_delta;
+
+ assert_cfs_rq_counts(qcfs_rq);
+ }
return false;
}
+ /*
+ * Unlikely as it may be, we may only have user tasks as we hit the
+ * throttle, in which case we won't have discount them from the
+ * h_nr_running, and we need to be aware of that.
+ */
+ was_pending = cfs_rq->throttle_pending;
cfs_rq->throttle_pending = false;
raw_spin_lock(&cfs_b->lock);
@@ -5826,9 +5857,27 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();
- task_delta = cfs_rq->h_nr_running;
+ /*
+ * At this point, h_nr_running == h_kernel_running. We add back the
+ * h_user_running to the throttled cfs_rq, and only remove the difference
+ * to the upper cfs_rq's.
+ */
+ if (was_pending) {
+ WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_kernel_running);
+ cfs_rq->h_nr_running += cfs_rq->h_user_running;
+ } else {
+ WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_user_running);
+ }
+
+ /*
+ * We always discount user tasks from h_nr_running when throttle_pending
+ * so only h_kernel_running remains to be removed
+ */
+ task_delta = was_pending ? cfs_rq->h_kernel_running : cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
kernel_delta = cfs_rq->h_kernel_running;
+ user_delta = was_pending ? 0 : cfs_rq->h_user_running;
+
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
@@ -5843,6 +5892,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
dequeue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running -= user_delta;
+
if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5866,6 +5917,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
dequeue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running -= user_delta;
}
/* At this point se is NULL and we are at root level*/
@@ -5888,7 +5940,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, kernel_delta;
+ long task_delta, idle_task_delta, kernel_delta, user_delta;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -5924,6 +5976,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
kernel_delta = cfs_rq->h_kernel_running;
+ user_delta = cfs_rq->h_user_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
@@ -5937,6 +5990,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
enqueue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running += user_delta;
+
+ assert_cfs_rq_counts(qcfs_rq);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -5955,6 +6011,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
enqueue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running += user_delta;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6855,6 +6912,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
bool kernel_task = is_kernel_task(p);
+ bool throttle_pending = false;
/*
* The code below (indirectly) updates schedutil which looks at
@@ -6878,13 +6936,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running++;
- cfs_rq->idle_h_nr_running += idle_h_nr_running;
- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running++;
if (kernel_task)
enqueue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running++;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6900,13 +6965,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);
- cfs_rq->h_nr_running++;
- cfs_rq->idle_h_nr_running += idle_h_nr_running;
- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running++;
if (kernel_task)
enqueue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running++;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6957,6 +7029,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
bool kernel_task = !list_empty(&p->se.kernel_node);
+ bool throttle_pending = false;
util_est_dequeue(&rq->cfs, p);
@@ -6964,13 +7037,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running--;
- cfs_rq->idle_h_nr_running -= idle_h_nr_running;
- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running--;
if (kernel_task)
dequeue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running--;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6998,13 +7078,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);
- cfs_rq->h_nr_running--;
- cfs_rq->idle_h_nr_running -= idle_h_nr_running;
- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running--;
if (kernel_task)
dequeue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running--;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -8503,28 +8590,65 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
resched_curr(rq);
}
+/*
+ * Consider:
+ * cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
+ * cfs_rq.user := count of user *tasks* enqueued on this cfs_rq
+ *
+ * Then, the following logic is implemented:
+ * cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
+ * cfs_rq.h_user_running = Sum(child.user) for all child cfs_rq with !child.throttle_pending
+ * cfs_rq.h_nr_running = Sum(child.kernel) for all child cfs_rq
+ * + Sum(child.user) for all child cfs_rq with !child.throttle_pending
+ *
+ * IOW, count of kernel tasks is always propagated up the hierarchy, and count
+ * of user tasks is only propagated up if the cfs_rq isn't .throttle_pending.
+ */
static void handle_kernel_task_prev(struct task_struct *prev)
{
#ifdef CONFIG_CFS_BANDWIDTH
struct sched_entity *se = &prev->se;
bool p_in_kernel = is_kernel_task(prev);
bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+ bool throttle_pending = false;
/*
* These extra loops are bad and against the whole point of the merged
* PNT, but it's a pain to merge, particularly since we want it to occur
* before check_cfs_runtime().
*/
if (p_in_kernel_tree && !p_in_kernel) {
+ /* Switch from KERNEL -> USER */
WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+
for_each_sched_entity(se) {
- dequeue_kernel(cfs_rq_of(se), se, 1);
- if (cfs_rq_throttled(cfs_rq_of(se)))
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ if (throttle_pending || cfs_rq->throttle_pending)
+ cfs_rq->h_nr_running--;
+ dequeue_kernel(cfs_rq, se, 1);
+ if (!throttle_pending)
+ cfs_rq->h_user_running++;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ if (cfs_rq_throttled(cfs_rq))
break;
}
} else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+ /* Switch from USER -> KERNEL */
+
for_each_sched_entity(se) {
- enqueue_kernel(cfs_rq_of(se), se, 1);
- if (cfs_rq_throttled(cfs_rq_of(se)))
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ if (throttle_pending || cfs_rq->throttle_pending)
+ cfs_rq->h_nr_running++;
+ enqueue_kernel(cfs_rq, se, 1);
+ if (!throttle_pending)
+ cfs_rq->h_user_running--;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ if (cfs_rq_throttled(cfs_rq))
break;
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b33ce2e60555..e8860e0d6fbc7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -660,6 +660,8 @@ struct cfs_rq {
int throttled;
int throttle_count;
int h_kernel_running;
+ int h_user_running;
+ int throttle_pending;
struct list_head throttled_list;
struct list_head throttled_csd_list;
struct list_head kernel_children;
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants
2024-02-02 8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (3 preceding siblings ...)
2024-02-02 8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
@ 2024-02-02 8:09 ` Valentin Schneider
2024-02-04 0:04 ` kernel test robot
2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
5 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02 8:09 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
Previous commits have added .h_kernel_running and .h_user_running to struct
cfs_rq, and are using them to play games with the hierarchical
.h_nr_running.
Assert some count invariants under SCHED_DEBUG to improve debugging.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b54d3813d18d..52d0ee0e4d47c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5780,6 +5780,30 @@ static int tg_throttle_down(struct task_group *tg, void *data)
static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq)
+{
+ lockdep_assert_rq_held(rq_of(cfs_rq));
+
+ /*
+ * When !throttle_pending, this is the normal operating mode, all tasks
+ * are pickable, so:
+ * nr_kernel_tasks + nr_user_tasks == nr_pickable_tasks
+ */
+ SCHED_WARN_ON(!cfs_rq->throttle_pending &&
+ (cfs_rq->h_kernel_running + cfs_rq->h_user_running !=
+ cfs_rq->h_nr_running));
+ /*
+ * When throttle_pending, only kernel tasks are pickable, so:
+ * nr_kernel_tasks == nr_pickable_tasks
+ */
+ SCHED_WARN_ON(cfs_rq->throttle_pending &&
+ (cfs_rq->h_kernel_running != cfs_rq->h_nr_running));
+}
+#else
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) { }
+#endif
+
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
@@ -5894,6 +5918,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
dequeue_kernel(qcfs_rq, se, kernel_delta);
qcfs_rq->h_user_running -= user_delta;
+ assert_cfs_rq_counts(qcfs_rq);
if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5918,6 +5943,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->idle_h_nr_running -= idle_task_delta;
dequeue_kernel(qcfs_rq, se, kernel_delta);
qcfs_rq->h_user_running -= user_delta;
+
+ assert_cfs_rq_counts(qcfs_rq);
}
/* At this point se is NULL and we are at root level*/
@@ -6013,6 +6040,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
enqueue_kernel(qcfs_rq, se, kernel_delta);
qcfs_rq->h_user_running += user_delta;
+ assert_cfs_rq_counts(qcfs_rq);
+
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
goto unthrottle_throttle;
@@ -6950,6 +6979,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ assert_cfs_rq_counts(cfs_rq);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6965,6 +6995,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);
+ assert_cfs_rq_counts(cfs_rq);
if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
cfs_rq->h_nr_running++;
@@ -6979,6 +7010,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ assert_cfs_rq_counts(cfs_rq);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -7051,6 +7083,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ assert_cfs_rq_counts(cfs_rq);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -7092,6 +7125,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ assert_cfs_rq_counts(cfs_rq);
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -8631,6 +8665,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)
throttle_pending |= cfs_rq->throttle_pending;
+ assert_cfs_rq_counts(cfs_rq);
+
if (cfs_rq_throttled(cfs_rq))
break;
}
@@ -8648,6 +8684,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)
throttle_pending |= cfs_rq->throttle_pending;
+ assert_cfs_rq_counts(cfs_rq);
+
if (cfs_rq_throttled(cfs_rq))
break;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
2024-02-02 8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
@ 2024-02-03 19:12 ` kernel test robot
2024-02-03 19:45 ` kernel test robot
1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-03 19:12 UTC (permalink / raw)
To: Valentin Schneider; +Cc: oe-kbuild-all
Hi Valentin,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base: tip/sched/core
patch link: https://lore.kernel.org/r/20240202080920.3337862-2-vschneid%40redhat.com
patch subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240204/202402040304.j8HAHYHG-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040304.j8HAHYHG-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040304.j8HAHYHG-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:6937:17: error: implicit declaration of function 'unthrottle_on_enqueue' [-Werror=implicit-function-declaration]
6937 | unthrottle_on_enqueue(p);
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c: In function 'dequeue_task_fair':
>> kernel/sched/fair.c:6954:46: error: 'struct sched_entity' has no member named 'kernel_node'
6954 | bool kernel_task = !list_empty(&p->se.kernel_node);
| ^
kernel/sched/fair.c: At top level:
kernel/sched/fair.c:13312:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
13312 | void free_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13314:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
13314 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13319:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
13319 | void online_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13321:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
13321 | void unregister_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/unthrottle_on_enqueue +6937 kernel/sched/fair.c
6839
6840 /*
6841 * The enqueue_task method is called before nr_running is
6842 * increased. Here we update the fair scheduling stats and
6843 * then put the task into the rbtree:
6844 */
6845 static void
6846 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
6847 {
6848 struct cfs_rq *cfs_rq;
6849 struct sched_entity *se = &p->se;
6850 int idle_h_nr_running = task_has_idle_policy(p);
6851 int task_new = !(flags & ENQUEUE_WAKEUP);
6852 bool kernel_task = is_kernel_task(p);
6853
6854 /*
6855 * The code below (indirectly) updates schedutil which looks at
6856 * the cfs_rq utilization to select a frequency.
6857 * Let's add the task's estimated utilization to the cfs_rq's
6858 * estimated utilization, before we update schedutil.
6859 */
6860 util_est_enqueue(&rq->cfs, p);
6861
6862 /*
6863 * If in_iowait is set, the code below may not trigger any cpufreq
6864 * utilization updates, so do it here explicitly with the IOWAIT flag
6865 * passed.
6866 */
6867 if (p->in_iowait)
6868 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
6869
6870 for_each_sched_entity(se) {
6871 if (se->on_rq)
6872 break;
6873 cfs_rq = cfs_rq_of(se);
6874 enqueue_entity(cfs_rq, se, flags);
6875
6876 cfs_rq->h_nr_running++;
6877 cfs_rq->idle_h_nr_running += idle_h_nr_running;
6878
6879 if (cfs_rq_is_idle(cfs_rq))
6880 idle_h_nr_running = 1;
6881 if (kernel_task)
6882 enqueue_kernel(cfs_rq, se, 1);
6883
6884 /* end evaluation on encountering a throttled cfs_rq */
6885 if (cfs_rq_throttled(cfs_rq))
6886 goto enqueue_throttle;
6887
6888 flags = ENQUEUE_WAKEUP;
6889 }
6890
6891 for_each_sched_entity(se) {
6892 cfs_rq = cfs_rq_of(se);
6893
6894 update_load_avg(cfs_rq, se, UPDATE_TG);
6895 se_update_runnable(se);
6896 update_cfs_group(se);
6897
6898 cfs_rq->h_nr_running++;
6899 cfs_rq->idle_h_nr_running += idle_h_nr_running;
6900
6901 if (cfs_rq_is_idle(cfs_rq))
6902 idle_h_nr_running = 1;
6903 if (kernel_task)
6904 enqueue_kernel(cfs_rq, se, 1);
6905
6906 /* end evaluation on encountering a throttled cfs_rq */
6907 if (cfs_rq_throttled(cfs_rq))
6908 goto enqueue_throttle;
6909 }
6910
6911 /* At this point se is NULL and we are at root level*/
6912 add_nr_running(rq, 1);
6913
6914 /*
6915 * Since new tasks are assigned an initial util_avg equal to
6916 * half of the spare capacity of their CPU, tiny tasks have the
6917 * ability to cross the overutilized threshold, which will
6918 * result in the load balancer ruining all the task placement
6919 * done by EAS. As a way to mitigate that effect, do not account
6920 * for the first enqueue operation of new tasks during the
6921 * overutilized flag detection.
6922 *
6923 * A better way of solving this problem would be to wait for
6924 * the PELT signals of tasks to converge before taking them
6925 * into account, but that is not straightforward to implement,
6926 * and the following generally works well enough in practice.
6927 */
6928 if (!task_new)
6929 update_overutilized_status(rq);
6930
6931 enqueue_throttle:
6932 assert_list_leaf_cfs_rq(rq);
6933
6934 hrtick_update(rq);
6935
6936 if (kernel_task)
> 6937 unthrottle_on_enqueue(p);
6938 }
6939
6940 static void set_next_buddy(struct sched_entity *se);
6941
6942 /*
6943 * The dequeue_task method is called before nr_running is
6944 * decreased. We remove the task from the rbtree and
6945 * update the fair scheduling stats:
6946 */
6947 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
6948 {
6949 struct cfs_rq *cfs_rq;
6950 struct sched_entity *se = &p->se;
6951 int task_sleep = flags & DEQUEUE_SLEEP;
6952 int idle_h_nr_running = task_has_idle_policy(p);
6953 bool was_sched_idle = sched_idle_rq(rq);
> 6954 bool kernel_task = !list_empty(&p->se.kernel_node);
6955
6956 util_est_dequeue(&rq->cfs, p);
6957
6958 for_each_sched_entity(se) {
6959 cfs_rq = cfs_rq_of(se);
6960 dequeue_entity(cfs_rq, se, flags);
6961
6962 cfs_rq->h_nr_running--;
6963 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
6964
6965 if (cfs_rq_is_idle(cfs_rq))
6966 idle_h_nr_running = 1;
6967 if (kernel_task)
6968 dequeue_kernel(cfs_rq, se, 1);
6969
6970 /* end evaluation on encountering a throttled cfs_rq */
6971 if (cfs_rq_throttled(cfs_rq))
6972 goto dequeue_throttle;
6973
6974 /* Don't dequeue parent if it has other entities besides us */
6975 if (cfs_rq->load.weight) {
6976 /* Avoid re-evaluating load for this entity: */
6977 se = parent_entity(se);
6978 /*
6979 * Bias pick_next to pick a task from this cfs_rq, as
6980 * p is sleeping when it is within its sched_slice.
6981 */
6982 if (task_sleep && se && !throttled_hierarchy(cfs_rq))
6983 set_next_buddy(se);
6984 break;
6985 }
6986 flags |= DEQUEUE_SLEEP;
6987 }
6988
6989 for_each_sched_entity(se) {
6990 cfs_rq = cfs_rq_of(se);
6991
6992 update_load_avg(cfs_rq, se, UPDATE_TG);
6993 se_update_runnable(se);
6994 update_cfs_group(se);
6995
6996 cfs_rq->h_nr_running--;
6997 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
6998
6999 if (cfs_rq_is_idle(cfs_rq))
7000 idle_h_nr_running = 1;
7001 if (kernel_task)
7002 dequeue_kernel(cfs_rq, se, 1);
7003
7004 /* end evaluation on encountering a throttled cfs_rq */
7005 if (cfs_rq_throttled(cfs_rq))
7006 goto dequeue_throttle;
7007
7008 }
7009
7010 /* At this point se is NULL and we are at root level*/
7011 sub_nr_running(rq, 1);
7012
7013 /* balance early to pull high priority tasks */
7014 if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
7015 rq->next_balance = jiffies;
7016
7017 dequeue_throttle:
7018 util_est_update(&rq->cfs, p, task_sleep);
7019 hrtick_update(rq);
7020 }
7021
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
2024-02-02 8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
2024-02-03 19:12 ` kernel test robot
@ 2024-02-03 19:45 ` kernel test robot
1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-03 19:45 UTC (permalink / raw)
To: Valentin Schneider; +Cc: llvm, oe-kbuild-all
Hi Valentin,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base: tip/sched/core
patch link: https://lore.kernel.org/r/20240202080920.3337862-2-vschneid%40redhat.com
patch subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
config: arm-randconfig-002-20240203 (https://download.01.org/0day-ci/archive/20240204/202402040329.OcEA18Jz-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040329.OcEA18Jz-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040329.OcEA18Jz-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/sched/fair.c:6937:3: error: call to undeclared function 'unthrottle_on_enqueue'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
unthrottle_on_enqueue(p);
^
kernel/sched/fair.c:6954:40: error: no member named 'kernel_node' in 'struct sched_entity'
bool kernel_task = !list_empty(&p->se.kernel_node);
~~~~~ ^
kernel/sched/fair.c:13312:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
void free_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:13312:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void free_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:13314:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
kernel/sched/fair.c:13314:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
static
kernel/sched/fair.c:13319:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
void online_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:13319:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void online_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:13321:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
void unregister_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:13321:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void unregister_fair_sched_group(struct task_group *tg) { }
^
static
4 warnings and 2 errors generated.
vim +/unthrottle_on_enqueue +6937 kernel/sched/fair.c
6839
6840 /*
6841 * The enqueue_task method is called before nr_running is
6842 * increased. Here we update the fair scheduling stats and
6843 * then put the task into the rbtree:
6844 */
6845 static void
6846 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
6847 {
6848 struct cfs_rq *cfs_rq;
6849 struct sched_entity *se = &p->se;
6850 int idle_h_nr_running = task_has_idle_policy(p);
6851 int task_new = !(flags & ENQUEUE_WAKEUP);
6852 bool kernel_task = is_kernel_task(p);
6853
6854 /*
6855 * The code below (indirectly) updates schedutil which looks at
6856 * the cfs_rq utilization to select a frequency.
6857 * Let's add the task's estimated utilization to the cfs_rq's
6858 * estimated utilization, before we update schedutil.
6859 */
6860 util_est_enqueue(&rq->cfs, p);
6861
6862 /*
6863 * If in_iowait is set, the code below may not trigger any cpufreq
6864 * utilization updates, so do it here explicitly with the IOWAIT flag
6865 * passed.
6866 */
6867 if (p->in_iowait)
6868 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
6869
6870 for_each_sched_entity(se) {
6871 if (se->on_rq)
6872 break;
6873 cfs_rq = cfs_rq_of(se);
6874 enqueue_entity(cfs_rq, se, flags);
6875
6876 cfs_rq->h_nr_running++;
6877 cfs_rq->idle_h_nr_running += idle_h_nr_running;
6878
6879 if (cfs_rq_is_idle(cfs_rq))
6880 idle_h_nr_running = 1;
6881 if (kernel_task)
6882 enqueue_kernel(cfs_rq, se, 1);
6883
6884 /* end evaluation on encountering a throttled cfs_rq */
6885 if (cfs_rq_throttled(cfs_rq))
6886 goto enqueue_throttle;
6887
6888 flags = ENQUEUE_WAKEUP;
6889 }
6890
6891 for_each_sched_entity(se) {
6892 cfs_rq = cfs_rq_of(se);
6893
6894 update_load_avg(cfs_rq, se, UPDATE_TG);
6895 se_update_runnable(se);
6896 update_cfs_group(se);
6897
6898 cfs_rq->h_nr_running++;
6899 cfs_rq->idle_h_nr_running += idle_h_nr_running;
6900
6901 if (cfs_rq_is_idle(cfs_rq))
6902 idle_h_nr_running = 1;
6903 if (kernel_task)
6904 enqueue_kernel(cfs_rq, se, 1);
6905
6906 /* end evaluation on encountering a throttled cfs_rq */
6907 if (cfs_rq_throttled(cfs_rq))
6908 goto enqueue_throttle;
6909 }
6910
6911 /* At this point se is NULL and we are at root level*/
6912 add_nr_running(rq, 1);
6913
6914 /*
6915 * Since new tasks are assigned an initial util_avg equal to
6916 * half of the spare capacity of their CPU, tiny tasks have the
6917 * ability to cross the overutilized threshold, which will
6918 * result in the load balancer ruining all the task placement
6919 * done by EAS. As a way to mitigate that effect, do not account
6920 * for the first enqueue operation of new tasks during the
6921 * overutilized flag detection.
6922 *
6923 * A better way of solving this problem would be to wait for
6924 * the PELT signals of tasks to converge before taking them
6925 * into account, but that is not straightforward to implement,
6926 * and the following generally works well enough in practice.
6927 */
6928 if (!task_new)
6929 update_overutilized_status(rq);
6930
6931 enqueue_throttle:
6932 assert_list_leaf_cfs_rq(rq);
6933
6934 hrtick_update(rq);
6935
6936 if (kernel_task)
> 6937 unthrottle_on_enqueue(p);
6938 }
6939
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
2024-02-02 8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
@ 2024-02-03 23:01 ` kernel test robot
2024-02-06 21:36 ` Benjamin Segall
1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-03 23:01 UTC (permalink / raw)
To: Valentin Schneider; +Cc: oe-kbuild-all
Hi Valentin,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base: tip/sched/core
patch link: https://lore.kernel.org/r/20240202080920.3337862-4-vschneid%40redhat.com
patch subject: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240204/202402040651.KMZz9c3q-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040651.KMZz9c3q-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040651.KMZz9c3q-lkp@intel.com/
All warnings (new ones prefixed by >>):
kernel/sched/fair.c: In function 'enqueue_task_fair':
kernel/sched/fair.c:6942:17: error: implicit declaration of function 'unthrottle_on_enqueue' [-Werror=implicit-function-declaration]
6942 | unthrottle_on_enqueue(p);
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c: In function 'dequeue_task_fair':
kernel/sched/fair.c:6959:46: error: 'struct sched_entity' has no member named 'kernel_node'
6959 | bool kernel_task = !list_empty(&p->se.kernel_node);
| ^
kernel/sched/fair.c: At top level:
kernel/sched/fair.c:13303:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
13303 | void free_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13305:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
13305 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13310:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
13310 | void online_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13312:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
13312 | void unregister_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/sched/fair.c:6750:13: warning: 'cfs_rq_throttled_loose' defined but not used [-Wunused-function]
6750 | static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
| ^~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/cfs_rq_throttled_loose +6750 kernel/sched/fair.c
ab84d31e15502f kernel/sched_fair.c Paul Turner 2011-07-21 6730
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra 2011-10-25 6731 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra 2011-10-25 6732 {
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra 2011-10-25 6733 return NULL;
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra 2011-10-25 6734 }
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra 2011-10-25 6735 static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
0e59bdaea75f12 kernel/sched/fair.c Kirill Tkhai 2014-06-25 6736 static inline void update_runtime_enabled(struct rq *rq) {}
a4c96ae319b804 kernel/sched/fair.c Peter Boonstoppel 2012-08-09 6737 static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
88c56cfeaec464 kernel/sched/fair.c Phil Auld 2023-07-12 6738 #ifdef CONFIG_CGROUP_SCHED
88c56cfeaec464 kernel/sched/fair.c Phil Auld 2023-07-12 6739 bool cfs_task_bw_constrained(struct task_struct *p)
88c56cfeaec464 kernel/sched/fair.c Phil Auld 2023-07-12 6740 {
88c56cfeaec464 kernel/sched/fair.c Phil Auld 2023-07-12 6741 return false;
88c56cfeaec464 kernel/sched/fair.c Phil Auld 2023-07-12 6742 }
88c56cfeaec464 kernel/sched/fair.c Phil Auld 2023-07-12 6743 #endif
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6744 static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6745 static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6746 static inline bool is_kernel_task(struct task_struct *p)
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6747 {
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6748 return false;
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6749 }
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 @6750 static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6751 {
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6752 return false;
f3743713c72187 kernel/sched/fair.c Benjamin Segall 2024-02-02 6753 }
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra 2011-10-25 6754 #endif /* CONFIG_CFS_BANDWIDTH */
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra 2011-10-25 6755
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants
2024-02-02 8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
@ 2024-02-04 0:04 ` kernel test robot
0 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-04 0:04 UTC (permalink / raw)
To: Valentin Schneider; +Cc: llvm, oe-kbuild-all
Hi Valentin,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base: tip/sched/core
patch link: https://lore.kernel.org/r/20240202080920.3337862-6-vschneid%40redhat.com
patch subject: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants
config: arm-randconfig-002-20240203 (https://download.01.org/0day-ci/archive/20240204/202402040702.ytDsOZ2e-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040702.ytDsOZ2e-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040702.ytDsOZ2e-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/sched/fair.c:6969:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
~~~~~~ ^
kernel/sched/fair.c:6974:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
cfs_rq->h_user_running++;
^~~~~~~~~~~~~~
h_nr_running
kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
^
kernel/sched/fair.c:6976:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
throttle_pending |= cfs_rq->throttle_pending;
~~~~~~ ^
>> kernel/sched/fair.c:6982:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
assert_cfs_rq_counts(cfs_rq);
^
kernel/sched/fair.c:6998:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
assert_cfs_rq_counts(cfs_rq);
^
kernel/sched/fair.c:7000:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
~~~~~~ ^
kernel/sched/fair.c:7005:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
cfs_rq->h_user_running++;
^~~~~~~~~~~~~~
h_nr_running
kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
^
kernel/sched/fair.c:7007:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
throttle_pending |= cfs_rq->throttle_pending;
~~~~~~ ^
kernel/sched/fair.c:7046:3: error: call to undeclared function 'unthrottle_on_enqueue'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
unthrottle_on_enqueue(p);
^
kernel/sched/fair.c:7063:40: error: no member named 'kernel_node' in 'struct sched_entity'
bool kernel_task = !list_empty(&p->se.kernel_node);
~~~~~ ^
kernel/sched/fair.c:7073:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
~~~~~~ ^
kernel/sched/fair.c:7078:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
cfs_rq->h_user_running--;
^~~~~~~~~~~~~~
h_nr_running
kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
^
kernel/sched/fair.c:7080:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
throttle_pending |= cfs_rq->throttle_pending;
~~~~~~ ^
kernel/sched/fair.c:7086:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
assert_cfs_rq_counts(cfs_rq);
^
kernel/sched/fair.c:7115:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
~~~~~~ ^
kernel/sched/fair.c:7120:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
cfs_rq->h_user_running--;
^~~~~~~~~~~~~~
h_nr_running
kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
^
kernel/sched/fair.c:7122:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
throttle_pending |= cfs_rq->throttle_pending;
~~~~~~ ^
kernel/sched/fair.c:7128:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
assert_cfs_rq_counts(cfs_rq);
^
kernel/sched/fair.c:13465:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
void free_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:13465:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void free_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:13467:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
kernel/sched/fair.c:13467:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
static
kernel/sched/fair.c:13472:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
void online_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:13472:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void online_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:13474:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
void unregister_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:13474:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void unregister_fair_sched_group(struct task_group *tg) { }
^
static
4 warnings and 18 errors generated.
vim +/assert_cfs_rq_counts +6982 kernel/sched/fair.c
6930
6931 /*
6932 * The enqueue_task method is called before nr_running is
6933 * increased. Here we update the fair scheduling stats and
6934 * then put the task into the rbtree:
6935 */
6936 static void
6937 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
6938 {
6939 struct cfs_rq *cfs_rq;
6940 struct sched_entity *se = &p->se;
6941 int idle_h_nr_running = task_has_idle_policy(p);
6942 int task_new = !(flags & ENQUEUE_WAKEUP);
6943 bool kernel_task = is_kernel_task(p);
6944 bool throttle_pending = false;
6945
6946 /*
6947 * The code below (indirectly) updates schedutil which looks at
6948 * the cfs_rq utilization to select a frequency.
6949 * Let's add the task's estimated utilization to the cfs_rq's
6950 * estimated utilization, before we update schedutil.
6951 */
6952 util_est_enqueue(&rq->cfs, p);
6953
6954 /*
6955 * If in_iowait is set, the code below may not trigger any cpufreq
6956 * utilization updates, so do it here explicitly with the IOWAIT flag
6957 * passed.
6958 */
6959 if (p->in_iowait)
6960 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
6961
6962 for_each_sched_entity(se) {
6963 if (se->on_rq)
6964 break;
6965 cfs_rq = cfs_rq_of(se);
6966 enqueue_entity(cfs_rq, se, flags);
6967
6968
6969 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
6970 cfs_rq->h_nr_running++;
6971 if (kernel_task)
6972 enqueue_kernel(cfs_rq, se, 1);
6973 else if (!throttle_pending)
6974 cfs_rq->h_user_running++;
6975
6976 throttle_pending |= cfs_rq->throttle_pending;
6977
6978 cfs_rq->idle_h_nr_running += idle_h_nr_running;
6979 if (cfs_rq_is_idle(cfs_rq))
6980 idle_h_nr_running = 1;
6981
> 6982 assert_cfs_rq_counts(cfs_rq);
6983
6984 /* end evaluation on encountering a throttled cfs_rq */
6985 if (cfs_rq_throttled(cfs_rq))
6986 goto enqueue_throttle;
6987
6988 flags = ENQUEUE_WAKEUP;
6989 }
6990
6991 for_each_sched_entity(se) {
6992 cfs_rq = cfs_rq_of(se);
6993
6994 update_load_avg(cfs_rq, se, UPDATE_TG);
6995 se_update_runnable(se);
6996 update_cfs_group(se);
6997
6998 assert_cfs_rq_counts(cfs_rq);
6999
7000 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
7001 cfs_rq->h_nr_running++;
7002 if (kernel_task)
7003 enqueue_kernel(cfs_rq, se, 1);
7004 else if (!throttle_pending)
7005 cfs_rq->h_user_running++;
7006
7007 throttle_pending |= cfs_rq->throttle_pending;
7008
7009 cfs_rq->idle_h_nr_running += idle_h_nr_running;
7010 if (cfs_rq_is_idle(cfs_rq))
7011 idle_h_nr_running = 1;
7012
7013 assert_cfs_rq_counts(cfs_rq);
7014
7015 /* end evaluation on encountering a throttled cfs_rq */
7016 if (cfs_rq_throttled(cfs_rq))
7017 goto enqueue_throttle;
7018 }
7019
7020 /* At this point se is NULL and we are at root level*/
7021 add_nr_running(rq, 1);
7022
7023 /*
7024 * Since new tasks are assigned an initial util_avg equal to
7025 * half of the spare capacity of their CPU, tiny tasks have the
7026 * ability to cross the overutilized threshold, which will
7027 * result in the load balancer ruining all the task placement
7028 * done by EAS. As a way to mitigate that effect, do not account
7029 * for the first enqueue operation of new tasks during the
7030 * overutilized flag detection.
7031 *
7032 * A better way of solving this problem would be to wait for
7033 * the PELT signals of tasks to converge before taking them
7034 * into account, but that is not straightforward to implement,
7035 * and the following generally works well enough in practice.
7036 */
7037 if (!task_new)
7038 update_overutilized_status(rq);
7039
7040 enqueue_throttle:
7041 assert_list_leaf_cfs_rq(rq);
7042
7043 hrtick_update(rq);
7044
7045 if (kernel_task)
7046 unthrottle_on_enqueue(p);
7047 }
7048
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
2024-02-02 8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
@ 2024-02-04 3:08 ` kernel test robot
0 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-04 3:08 UTC (permalink / raw)
To: Valentin Schneider; +Cc: oe-kbuild-all
Hi Valentin,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base: tip/sched/core
patch link: https://lore.kernel.org/r/20240202080920.3337862-5-vschneid%40redhat.com
patch subject: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240204/202402041155.ZPTGxQSD-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402041155.ZPTGxQSD-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402041155.ZPTGxQSD-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:6940:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
6940 | if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
| ^~
>> kernel/sched/fair.c:6945:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
6945 | cfs_rq->h_user_running++;
| ^~~~~~~~~~~~~~
| h_nr_running
kernel/sched/fair.c:6947:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
6947 | throttle_pending |= cfs_rq->throttle_pending;
| ^~
kernel/sched/fair.c:6969:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
6969 | if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
| ^~
kernel/sched/fair.c:6974:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
6974 | cfs_rq->h_user_running++;
| ^~~~~~~~~~~~~~
| h_nr_running
kernel/sched/fair.c:6976:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
6976 | throttle_pending |= cfs_rq->throttle_pending;
| ^~
kernel/sched/fair.c:7014:17: error: implicit declaration of function 'unthrottle_on_enqueue' [-Werror=implicit-function-declaration]
7014 | unthrottle_on_enqueue(p);
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c: In function 'dequeue_task_fair':
kernel/sched/fair.c:7031:46: error: 'struct sched_entity' has no member named 'kernel_node'
7031 | bool kernel_task = !list_empty(&p->se.kernel_node);
| ^
kernel/sched/fair.c:7041:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
7041 | if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
| ^~
kernel/sched/fair.c:7046:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
7046 | cfs_rq->h_user_running--;
| ^~~~~~~~~~~~~~
| h_nr_running
kernel/sched/fair.c:7048:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
7048 | throttle_pending |= cfs_rq->throttle_pending;
| ^~
kernel/sched/fair.c:7082:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
7082 | if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
| ^~
kernel/sched/fair.c:7087:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
7087 | cfs_rq->h_user_running--;
| ^~~~~~~~~~~~~~
| h_nr_running
kernel/sched/fair.c:7089:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
7089 | throttle_pending |= cfs_rq->throttle_pending;
| ^~
kernel/sched/fair.c: At top level:
kernel/sched/fair.c:13427:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
13427 | void free_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13429:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
13429 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13434:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
13434 | void online_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:13436:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
13436 | void unregister_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:6807:13: warning: 'cfs_rq_throttled_loose' defined but not used [-Wunused-function]
6807 | static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
| ^~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +6940 kernel/sched/fair.c
6901
6902 /*
6903 * The enqueue_task method is called before nr_running is
6904 * increased. Here we update the fair scheduling stats and
6905 * then put the task into the rbtree:
6906 */
6907 static void
6908 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
6909 {
6910 struct cfs_rq *cfs_rq;
6911 struct sched_entity *se = &p->se;
6912 int idle_h_nr_running = task_has_idle_policy(p);
6913 int task_new = !(flags & ENQUEUE_WAKEUP);
6914 bool kernel_task = is_kernel_task(p);
6915 bool throttle_pending = false;
6916
6917 /*
6918 * The code below (indirectly) updates schedutil which looks at
6919 * the cfs_rq utilization to select a frequency.
6920 * Let's add the task's estimated utilization to the cfs_rq's
6921 * estimated utilization, before we update schedutil.
6922 */
6923 util_est_enqueue(&rq->cfs, p);
6924
6925 /*
6926 * If in_iowait is set, the code below may not trigger any cpufreq
6927 * utilization updates, so do it here explicitly with the IOWAIT flag
6928 * passed.
6929 */
6930 if (p->in_iowait)
6931 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
6932
6933 for_each_sched_entity(se) {
6934 if (se->on_rq)
6935 break;
6936 cfs_rq = cfs_rq_of(se);
6937 enqueue_entity(cfs_rq, se, flags);
6938
6939
> 6940 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
6941 cfs_rq->h_nr_running++;
6942 if (kernel_task)
6943 enqueue_kernel(cfs_rq, se, 1);
6944 else if (!throttle_pending)
> 6945 cfs_rq->h_user_running++;
6946
6947 throttle_pending |= cfs_rq->throttle_pending;
6948
6949 cfs_rq->idle_h_nr_running += idle_h_nr_running;
6950 if (cfs_rq_is_idle(cfs_rq))
6951 idle_h_nr_running = 1;
6952
6953
6954 /* end evaluation on encountering a throttled cfs_rq */
6955 if (cfs_rq_throttled(cfs_rq))
6956 goto enqueue_throttle;
6957
6958 flags = ENQUEUE_WAKEUP;
6959 }
6960
6961 for_each_sched_entity(se) {
6962 cfs_rq = cfs_rq_of(se);
6963
6964 update_load_avg(cfs_rq, se, UPDATE_TG);
6965 se_update_runnable(se);
6966 update_cfs_group(se);
6967
6968
6969 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
6970 cfs_rq->h_nr_running++;
6971 if (kernel_task)
6972 enqueue_kernel(cfs_rq, se, 1);
6973 else if (!throttle_pending)
6974 cfs_rq->h_user_running++;
6975
6976 throttle_pending |= cfs_rq->throttle_pending;
6977
6978 cfs_rq->idle_h_nr_running += idle_h_nr_running;
6979 if (cfs_rq_is_idle(cfs_rq))
6980 idle_h_nr_running = 1;
6981
6982
6983 /* end evaluation on encountering a throttled cfs_rq */
6984 if (cfs_rq_throttled(cfs_rq))
6985 goto enqueue_throttle;
6986 }
6987
6988 /* At this point se is NULL and we are at root level*/
6989 add_nr_running(rq, 1);
6990
6991 /*
6992 * Since new tasks are assigned an initial util_avg equal to
6993 * half of the spare capacity of their CPU, tiny tasks have the
6994 * ability to cross the overutilized threshold, which will
6995 * result in the load balancer ruining all the task placement
6996 * done by EAS. As a way to mitigate that effect, do not account
6997 * for the first enqueue operation of new tasks during the
6998 * overutilized flag detection.
6999 *
7000 * A better way of solving this problem would be to wait for
7001 * the PELT signals of tasks to converge before taking them
7002 * into account, but that is not straightforward to implement,
7003 * and the following generally works well enough in practice.
7004 */
7005 if (!task_new)
7006 update_overutilized_status(rq);
7007
7008 enqueue_throttle:
7009 assert_list_leaf_cfs_rq(rq);
7010
7011 hrtick_update(rq);
7012
7013 if (kernel_task)
7014 unthrottle_on_enqueue(p);
7015 }
7016
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
2024-02-02 8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
2024-02-03 23:01 ` kernel test robot
@ 2024-02-06 21:36 ` Benjamin Segall
2024-02-07 13:34 ` Valentin Schneider
1 sibling, 1 reply; 15+ messages in thread
From: Benjamin Segall @ 2024-02-06 21:36 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
Valentin Schneider <vschneid@redhat.com> writes:
> cfs_rq_throttled_loose() does not check if there is runtime remaining in
> the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously
> for that to be checked.
>
> Cache the throttle attempt in throttle_cfs_rq and reuse that where
> needed.
The general idea of throttle_pending rather than constantly checking
runtime_remaining seems reasonable...
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> kernel/sched/fair.c | 44 ++++++++++----------------------------------
> 1 file changed, 10 insertions(+), 34 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 96504be6ee14a..60778afbff207 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> * 5) do not run the "skip" process, if something else is available
> */
> static struct sched_entity *
> -pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
> +pick_next_entity(struct cfs_rq *cfs_rq)
> {
> #ifdef CONFIG_CFS_BANDWIDTH
> /*
> @@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
> * throttle_cfs_rq.
> */
> WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
> - if (throttled && !list_empty(&cfs_rq->kernel_children)) {
> + if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
... but we still need to know here if any of our parents are throttled
as well, ie a "throttled_pending_count", or to keep the "throttled"
parameter tracking in pnt_fair. (ie just replace the implementation of
cfs_rq_throttled_loose).
> /*
> * TODO: you'd want to factor out pick_eevdf to just take
> * tasks_timeline, and replace this list with a second rbtree
> @@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> * We don't actually throttle, though account() will have made sure to
> * resched us so that we pick into a kernel task.
> */
> - if (cfs_rq->h_kernel_running)
> + if (cfs_rq->h_kernel_running) {
> + cfs_rq->throttle_pending = true;
> return false;
> + }
> +
> + cfs_rq->throttle_pending = false;
We also need to clear throttle_pending if quota refills and our
runtime_remaining goes positive. (And do the appropriate h_* accounting in
patch 4/5)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry
2024-02-02 8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
` (4 preceding siblings ...)
2024-02-02 8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
@ 2024-02-06 21:55 ` Benjamin Segall
2024-02-07 13:34 ` Valentin Schneider
5 siblings, 1 reply; 15+ messages in thread
From: Benjamin Segall @ 2024-02-06 21:55 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
Valentin Schneider <vschneid@redhat.com> writes:
> Proposed approach
> =================
>
> Peter mentioned [1] that there have been discussions on changing /when/ the
> throttling happens: rather than have it be done immediately upon updating
> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
> for the task to be about to return to userspace: if it's in userspace, it can't
> hold any in-kernel lock.
>
> I submitted an initial jab at this [2] and Ben Segall added his own version to
> the conversation [3]. This series contains Ben's patch plus my additions. The
> main change here is updating the .h_nr_running counts throughout the cfs_rq
> hierachies to improve the picture given to load_balance().
>
> The main thing that remains doing for this series is making the second cfs_rq
> tree an actual RB tree (it's just a plain list ATM).
>
> This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
> to expose this outside of CFS, but it is another field that's used by load balance.
Then there's also all the load values as well; I don't know the load
balance code well, but it looks like the main thing would be
runnable_avg and that it isn't doing anything that would particularly
care about h_nr_running and runnable_avg being out of sync.
Maybe pulling a pending-throttle user task and then not seeing the
update in h_nr_running could be a bit of trouble?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry
2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
@ 2024-02-07 13:34 ` Valentin Schneider
0 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-07 13:34 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
On 06/02/24 13:55, Benjamin Segall wrote:
> Valentin Schneider <vschneid@redhat.com> writes:
>
>
>> Proposed approach
>> =================
>>
>> Peter mentioned [1] that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
>> for the task to be about to return to userspace: if it's in userspace, it can't
>> hold any in-kernel lock.
>>
>> I submitted an initial jab at this [2] and Ben Segall added his own version to
>> the conversation [3]. This series contains Ben's patch plus my additions. The
>> main change here is updating the .h_nr_running counts throughout the cfs_rq
>> hierachies to improve the picture given to load_balance().
>>
>> The main thing that remains doing for this series is making the second cfs_rq
>> tree an actual RB tree (it's just a plain list ATM).
>>
>> This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
>> to expose this outside of CFS, but it is another field that's used by load balance.
>
> Then there's also all the load values as well; I don't know the load
> balance code well, but it looks like the main thing would be
> runnable_avg and that it isn't doing anything that would particularly
> care about h_nr_running and runnable_avg being out of sync.
>
Yes, all of the runnable, load and util averages are still going to be an
issue unfortunately. AFAICT tackling this would imply pretty much dequeuing
the throttle_pending user tasks, which was my earlier attempt.
> Maybe pulling a pending-throttle user task and then not seeing the
> update in h_nr_running could be a bit of trouble?
That too is something I hadn't considered. Given the h_nr_running count is
updated accordingly, we could change can_migrate_task() to only allow
kernel tasks to be pulled if the hierarchy is ->throttle_pending. That
would probably require implementing a throttle_pending_count (as you
suggested in the other email) so we don't waste too much time checking up
the hierarchy for every task.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
2024-02-06 21:36 ` Benjamin Segall
@ 2024-02-07 13:34 ` Valentin Schneider
0 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-07 13:34 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
Tomas Glozar
On 06/02/24 13:36, Benjamin Segall wrote:
> Valentin Schneider <vschneid@redhat.com> writes:
>
>> cfs_rq_throttled_loose() does not check if there is runtime remaining in
>> the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously
>> for that to be checked.
>>
>> Cache the throttle attempt in throttle_cfs_rq and reuse that where
>> needed.
>
> The general idea of throttle_pending rather than constantly checking
> runtime_remaining seems reasonable...
>
>>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>> kernel/sched/fair.c | 44 ++++++++++----------------------------------
>> 1 file changed, 10 insertions(+), 34 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 96504be6ee14a..60778afbff207 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> * 5) do not run the "skip" process, if something else is available
>> */
>> static struct sched_entity *
>> -pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>> +pick_next_entity(struct cfs_rq *cfs_rq)
>> {
>> #ifdef CONFIG_CFS_BANDWIDTH
>> /*
>> @@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>> * throttle_cfs_rq.
>> */
>> WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
>> - if (throttled && !list_empty(&cfs_rq->kernel_children)) {
>> + if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
>
> ... but we still need to know here if any of our parents are throttled
> as well, ie a "throttled_pending_count", or to keep the "throttled"
> parameter tracking in pnt_fair. (ie just replace the implementation of
> cfs_rq_throttled_loose).
>
Hm, good point. We should be good with reinstoring the throttled parameter
and feeding it a ->throttle_pending accumulator.
>> /*
>> * TODO: you'd want to factor out pick_eevdf to just take
>> * tasks_timeline, and replace this list with a second rbtree
>> @@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> * We don't actually throttle, though account() will have made sure to
>> * resched us so that we pick into a kernel task.
>> */
>> - if (cfs_rq->h_kernel_running)
>> + if (cfs_rq->h_kernel_running) {
>> + cfs_rq->throttle_pending = true;
>> return false;
>> + }
>> +
>> + cfs_rq->throttle_pending = false;
>
> We also need to clear throttle_pending if quota refills and our
> runtime_remaining goes positive. (And do the appropriate h_* accounting in
> patch 4/5)
Right, so we could move the throttle_pending logic to after
__assign_cfs_rq_runtime(), and then modify distribute_cfs_runtime() to
catch the !throttled but throttle_pending case.
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-02-07 13:34 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-02 8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
2024-02-03 19:12 ` kernel test robot
2024-02-03 19:45 ` kernel test robot
2024-02-02 8:09 ` [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
2024-02-03 23:01 ` kernel test robot
2024-02-06 21:36 ` Benjamin Segall
2024-02-07 13:34 ` Valentin Schneider
2024-02-02 8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
2024-02-04 3:08 ` kernel test robot
2024-02-02 8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
2024-02-04 0:04 ` kernel test robot
2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
2024-02-07 13:34 ` Valentin Schneider
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.