[RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry
@ 2024-02-02  8:09 Valentin Schneider
  2024-02-02  8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02  8:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

Hi folks,

Problem statement
=================

CFS tasks can end up throttled while holding locks that other, non-throttled
tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency due to the throttling causing a
resource acquisition denial.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter the
  slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued on
the same CPU as one where ktimers/ksoftirqd is blocked on read_lock(&lock),
this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.

The linux-rt tree has had
  1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.

Proposed approach
=================

Peter mentioned [1] that there have been discussions on changing /when/ the
throttling happens: rather than have it be done immediately upon updating
the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
for the task to be about to return to userspace: if it's in userspace, it can't
hold any in-kernel lock.

I submitted an initial jab at this [2] and Ben Segall added his own version to
the conversation [3]. This series contains Ben's patch plus my additions. The
main change here is updating the .h_nr_running counts throughout the cfs_rq
hierachies to improve the picture given to load_balance().

The main thing that remains doing for this series is making the second cfs_rq
tree an actual RB tree (it's just a plain list ATM).

This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
to expose this outside of CFS, but it is another field that's used by load balance.

Testing
=======

Tested on QEMU via:

  mount -t cgroup -o cpu none /root/cpu

  mkdir /root/cpu/cg0
  echo 10000 >  /root/cpu/cg0/cpu.cfs_period_us
  echo 1000 > /root/cpu/cg0/cpu.cfs_quota_us

  mkdir /root/cpu/cg0/cg00
  mkdir /root/cpu/cg0/cg01

  mkdir /root/cpu/cg0/cg00/cg000
  mkdir /root/cpu/cg0/cg00/cg001

  spawn() {
      while true; do cat /sys/devices/system/cpu/smt/active &>/dev/null; done &
      PID=$!
      echo "Starting PID${PID}"
      echo $PID > $1
  }

  spawn cpu/cg0/tasks
  spawn cpu/cg0/tasks
  spawn cpu/cg0/tasks
  spawn cpu/cg0/tasks

  spawn cpu/cg0/cg01/tasks

  spawn cpu/cg0/cg00/cg000/tasks
  spawn cpu/cg0/cg00/cg001/tasks

  sleep 120

  kill $(jobs -p)  

Links
=====

[1]: https://lore.kernel.org/all/20231031160120.GE15024@noisy.programming.kicks-ass.net/
[2]: http://lore.kernel.org/r/20231130161245.3894682-1-vschneid@redhat.com
[3]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com

Benjamin Segall (1):
  sched/fair: Only throttle CFS tasks on return to userspace

Valentin Schneider (4):
  sched: Note schedule() invocations at return-to-user with SM_USER
  sched/fair: Delete cfs_rq_throttled_loose(), use
    cfs_rq->throttle_pending instead
  sched/fair: Track count of tasks running in userspace
  sched/fair: Assert user/kernel/total nr invariants

 include/linux/sched.h |   7 +
 kernel/entry/common.c |   2 +-
 kernel/entry/kvm.c    |   2 +-
 kernel/sched/core.c   |  45 ++++-
 kernel/sched/debug.c  |  28 +++
 kernel/sched/fair.c   | 399 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |   5 +
 7 files changed, 466 insertions(+), 22 deletions(-)

--
2.43.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
  2024-02-02  8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
@ 2024-02-02  8:09 ` Valentin Schneider
  2024-02-03 19:12   ` kernel test robot
  2024-02-03 19:45   ` kernel test robot
  2024-02-02  8:09 ` [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Valentin Schneider
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02  8:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

From: Benjamin Segall <bsegall@google.com>

The basic idea of this implementation is to maintain duplicate runqueues
in each cfs_rq that contain duplicate pointers to sched_entitys which
should bypass throttling. Then we can skip throttling cfs_rqs that have
any such children, and when we pick inside any not-actually-throttled
cfs_rq, we only look at this duplicated list.

"Which tasks should bypass throttling" here is "all schedule() calls
that don't set a special flag", but could instead involve the lockdep
markers (except for the problem of percpu-rwsem and similar) or explicit
flags around syscalls and faults, or something else.

This approach avoids any O(tasks) loops, but leaves partially-throttled
cfs_rqs still contributing their full h_nr_running to their parents,
which might result in worse balancing. Also it adds more (generally
still small) overhead to the common enqueue/dequeue/pick paths.

The very basic debug test added is to run a cpusoaker and "cat
/sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
cgroup with a quota < 1 cpu.

Not-signed-off-by: Benjamin Segall <bsegall@google.com>
[Slight comment / naming changes]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/sched.h |   7 ++
 kernel/entry/common.c |   2 +-
 kernel/entry/kvm.c    |   2 +-
 kernel/sched/core.c   |  20 ++++
 kernel/sched/debug.c  |  28 +++++
 kernel/sched/fair.c   | 232 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |   3 +
 7 files changed, 281 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab29511..4a0105d1eaa21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,6 +303,7 @@ extern long schedule_timeout_killable(long timeout);
 extern long schedule_timeout_uninterruptible(long timeout);
 extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
+asmlinkage void schedule_usermode(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
 #ifdef CONFIG_PREEMPT_RT
@@ -553,6 +554,9 @@ struct sched_entity {
 	struct cfs_rq			*my_q;
 	/* cached value of my_q->h_nr_running */
 	unsigned long			runnable_weight;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct list_head		kernel_node;
+#endif
 #endif
 
 #ifdef CONFIG_SMP
@@ -1539,6 +1543,9 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+	atomic_t			in_return_to_user;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d7ee4bc3f2ba3..16b5432a62c6f 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -156,7 +156,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & _TIF_NEED_RESCHED)
-			schedule();
+			schedule_usermode(); /* TODO: also all of the arch/ loops that don't use this yet */
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..fc4b73de07539 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -14,7 +14,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		}
 
 		if (ti_work & _TIF_NEED_RESCHED)
-			schedule();
+			schedule_usermode();
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
 			resume_user_mode_work(NULL);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f0..a7c028fad5a89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4529,6 +4529,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	INIT_LIST_HEAD(&p->se.kernel_node);
+	atomic_set(&p->in_return_to_user, 0);
+#endif
 
 #ifdef CONFIG_SCHEDSTATS
 	/* Even if schedstat is disabled, there should not be garbage */
@@ -6818,6 +6822,22 @@ asmlinkage __visible void __sched schedule(void)
 }
 EXPORT_SYMBOL(schedule);
 
+asmlinkage __visible void __sched schedule_usermode(void)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+	/*
+	 * This is only atomic because of this simple implementation. We could
+	 * do something with an SM_USER to avoid other-cpu scheduler operations
+	 * racing against these writes.
+	 */
+	atomic_set(&current->in_return_to_user, true);
+	schedule();
+	atomic_set(&current->in_return_to_user, false);
+#else
+	schedule();
+#endif
+}
+
 /*
  * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
  * state (have scheduled out non-voluntarily) by making sure that all
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834df..4a89dbc3ddfcd 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -319,6 +319,32 @@ static const struct file_operations sched_verbose_fops = {
 	.llseek =       default_llseek,
 };
 
+static DEFINE_MUTEX(sched_debug_spin_mutex);
+static int sched_debug_spin_show(struct seq_file *m, void *v) {
+	int count;
+	mutex_lock(&sched_debug_spin_mutex);
+	for (count = 0; count < 1000; count++) {
+		u64 start2;
+		start2 = jiffies;
+		while (jiffies == start2)
+			cpu_relax();
+		schedule();
+	}
+	mutex_unlock(&sched_debug_spin_mutex);
+	return 0;
+}
+static int sched_debug_spin_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_debug_spin_show, NULL);
+}
+
+static const struct file_operations sched_debug_spin_fops = {
+	.open		= sched_debug_spin_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static const struct seq_operations sched_debug_sops;
 
 static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -374,6 +400,8 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+	debugfs_create_file("sched_locked_spin", 0444, NULL, NULL,
+			    &sched_debug_spin_fops);
 	return 0;
 }
 late_initcall(sched_init_debug);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a037..a1808459a5acc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -128,6 +128,7 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: 5 msec, units: microseconds)
  */
 static unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
+static unsigned int sysctl_sched_cfs_bandwidth_kernel_bypass	= 1;
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -146,6 +147,15 @@ static struct ctl_table sched_fair_sysctls[] = {
 		.proc_handler   = proc_dointvec_minmax,
 		.extra1         = SYSCTL_ONE,
 	},
+	{
+		.procname       = "sched_cfs_bandwidth_kernel_bypass",
+		.data           = &sysctl_sched_cfs_bandwidth_kernel_bypass,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1         = SYSCTL_ZERO,
+		.extra2         = SYSCTL_ONE,
+	},
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	{
@@ -5445,14 +5455,34 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 /*
  * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
+ * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
+ * 2) keep things fair between processes/task groups
+ * 3) pick the "next" process, since someone really wants that to run
+ * 4) pick the "last" process, for cache locality
+ * 5) do not run the "skip" process, if something else is available
  */
 static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq)
+pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
 {
+#ifdef CONFIG_CFS_BANDWIDTH
+	/*
+	 * TODO: This might trigger, I'm not sure/don't remember. Regardless,
+	 * while we do not explicitly handle the case where h_kernel_running
+	 * goes to 0, we will call account/check_cfs_rq_runtime at worst in
+	 * entity_tick and notice that we can now properly do the full
+	 * throttle_cfs_rq.
+	 */
+	WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
+	if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+		/*
+		 * TODO: you'd want to factor out pick_eevdf to just take
+		 * tasks_timeline, and replace this list with a second rbtree
+		 * and a call to pick_eevdf.
+		 */
+		return list_first_entry(&cfs_rq->kernel_children,
+					struct sched_entity, kernel_node);
+	}
+#endif
 	/*
 	 * Enabling NEXT_BUDDY will affect latency but not fairness.
 	 */
@@ -5651,8 +5681,14 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
+	 *
+	 * Don't resched_curr() if curr is in the kernel. We won't throttle the
+	 * cfs_rq if any task is in the kernel, and if curr in particular is we
+	 * don't need to preempt it in favor of whatever other task is in the
+	 * kernel.
 	 */
-	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr) &&
+	    list_empty(&rq_of(cfs_rq)->curr->se.kernel_node))
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -5741,12 +5777,22 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	return 0;
 }
 
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+
 static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta, dequeue = 1;
+	long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+
+	/*
+	 * We don't actually throttle, though account() will have made sure to
+	 * resched us so that we pick into a kernel task.
+	 */
+	if (cfs_rq->h_kernel_running)
+		return false;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5778,6 +5824,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
+	kernel_delta = cfs_rq->h_kernel_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		/* throttled entity or throttle-on-deactivate */
@@ -5791,6 +5838,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
+		dequeue_kernel(qcfs_rq, se, kernel_delta);
 
 		if (qcfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5813,6 +5861,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
+		dequeue_kernel(qcfs_rq, se, kernel_delta);
 	}
 
 	/* At this point se is NULL and we are at root level*/
@@ -5835,7 +5884,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta;
+	long task_delta, idle_task_delta, kernel_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
@@ -5870,6 +5919,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
+	kernel_delta = cfs_rq->h_kernel_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
@@ -5882,6 +5932,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
+		enqueue_kernel(qcfs_rq, se, kernel_delta);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -5899,6 +5950,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
+		enqueue_kernel(qcfs_rq, se, kernel_delta);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -6557,6 +6609,86 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 }
 #endif
 
+/*
+ * We keep track of all children that are runnable in the kernel with a count of
+ * all descendants. The state is checked on enqueue and put_prev (and hard
+ * cleared on dequeue), and is stored just as the filled/empty state of the
+ * kernel_node list entry.
+ *
+ * These are simple helpers that do both parts, and should be called bottom-up
+ * until hitting a throttled cfs_rq whenever a task changes state (or a cfs_rq
+ * is (un)throttled).
+ */
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+	if (count == 0)
+		return;
+
+	if (list_empty(&se->kernel_node))
+		list_add(&se->kernel_node, &cfs_rq->kernel_children);
+	cfs_rq->h_kernel_running += count;
+}
+
+static bool is_kernel_task(struct task_struct *p)
+{
+	return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+}
+
+/*
+ * When called on a task this always transitions it to a !kernel state.
+ *
+ * When called on a group it is just synchronizing the state with the new
+ * h_kernel_waiters, unless this it has been throttled and is !on_rq
+ */
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+	if (count == 0)
+		return;
+
+	if (!se->on_rq || entity_is_task(se) ||
+	    !group_cfs_rq(se)->h_kernel_running)
+		list_del_init(&se->kernel_node);
+	cfs_rq->h_kernel_running -= count;
+}
+
+/*
+ * Returns if the cfs_rq "should" be throttled but might not be because of
+ * kernel threads bypassing throttle.
+ */
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_bandwidth_used())
+		return false;
+
+	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+		return false;
+	return true;
+}
+
+static void unthrottle_on_enqueue(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	if (!cfs_bandwidth_used() || !sysctl_sched_cfs_bandwidth_kernel_bypass)
+		return;
+	if (!cfs_rq_of(&p->se)->throttle_count)
+		return;
+
+	/*
+	 * MAYBE TODO: doing it this simple way is O(throttle_count *
+	 * cgroup_depth). We could optimize that into a single pass, but making
+	 * a mostly-copy of unthrottle_cfs_rq that does that is a pain and easy
+	 * to get wrong. (And even without unthrottle_on_enqueue it's O(nm),
+	 * just not while holding rq->lock the whole time)
+	 */
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->throttled)
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+
 #else /* CONFIG_CFS_BANDWIDTH */
 
 static inline bool cfs_bandwidth_used(void)
@@ -6604,6 +6736,16 @@ bool cfs_task_bw_constrained(struct task_struct *p)
 	return false;
 }
 #endif
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static inline bool is_kernel_task(struct task_struct *p)
+{
+	return false;
+}
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+	return false;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 
 #if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
@@ -6707,6 +6849,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
+	bool kernel_task = is_kernel_task(p);
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6735,6 +6878,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			enqueue_kernel(cfs_rq, se, 1);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6755,6 +6900,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			enqueue_kernel(cfs_rq, se, 1);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6785,6 +6932,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
+
+	if (kernel_task)
+		unthrottle_on_enqueue(p);
 }
 
 static void set_next_buddy(struct sched_entity *se);
@@ -6801,6 +6951,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int task_sleep = flags & DEQUEUE_SLEEP;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
+	bool kernel_task = !list_empty(&p->se.kernel_node);
 
 	util_est_dequeue(&rq->cfs, p);
 
@@ -6813,6 +6964,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			dequeue_kernel(cfs_rq, se, 1);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6845,6 +6998,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			dequeue_kernel(cfs_rq, se, 1);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -8343,11 +8498,40 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	resched_curr(rq);
 }
 
+static void handle_kernel_task_prev(struct task_struct *prev)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct sched_entity *se = &prev->se;
+	bool p_in_kernel = is_kernel_task(prev);
+	bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+	/*
+	 * These extra loops are bad and against the whole point of the merged
+	 * PNT, but it's a pain to merge, particularly since we want it to occur
+	 * before check_cfs_runtime().
+	 */
+	if (p_in_kernel_tree && !p_in_kernel) {
+		WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+		for_each_sched_entity(se) {
+			dequeue_kernel(cfs_rq_of(se), se, 1);
+			if (cfs_rq_throttled(cfs_rq_of(se)))
+				break;
+		}
+	} else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+		for_each_sched_entity(se) {
+			enqueue_kernel(cfs_rq_of(se), se, 1);
+			if (cfs_rq_throttled(cfs_rq_of(se)))
+				break;
+		}
+	}
+#endif
+}
+
 #ifdef CONFIG_SMP
 static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
+	bool throttled = false;
 
 again:
 	cfs_rq = &rq->cfs;
@@ -8368,7 +8552,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 				goto again;
 		}
 
-		se = pick_next_entity(cfs_rq);
+		if (cfs_rq_throttled_loose(cfs_rq))
+			throttled = true;
+
+		se = pick_next_entity(cfs_rq, throttled);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
@@ -8383,6 +8570,14 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct sched_entity *se;
 	struct task_struct *p;
 	int new_tasks;
+	bool throttled;
+
+	/*
+	 * We want to handle this before check_cfs_runtime(prev). We'll
+	 * duplicate a little work in the goto simple case, but that's fine
+	 */
+	if (prev)
+		handle_kernel_task_prev(prev);
 
 again:
 	if (!sched_fair_runnable(rq))
@@ -8400,6 +8595,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	 * hierarchy, only change the part that actually changes.
 	 */
 
+	throttled = false;
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
@@ -8431,7 +8627,10 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 			}
 		}
 
-		se = pick_next_entity(cfs_rq);
+		if (cfs_rq_throttled_loose(cfs_rq))
+			throttled = true;
+
+		se = pick_next_entity(cfs_rq, throttled);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
@@ -8469,8 +8668,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	if (prev)
 		put_prev_task(rq, prev);
 
+	throttled = false;
 	do {
-		se = pick_next_entity(cfs_rq);
+		if (cfs_rq_throttled_loose(cfs_rq))
+			throttled = true;
+		se = pick_next_entity(cfs_rq, throttled);
 		set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
@@ -8534,6 +8736,8 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
 
+	handle_kernel_task_prev(prev);
+
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
@@ -12818,6 +13022,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	INIT_LIST_HEAD(&cfs_rq->kernel_children);
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -12970,6 +13177,9 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	/* guarantee group entities always have weight */
 	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
+#ifdef CONFIG_CFS_BANDWIDTH
+	INIT_LIST_HEAD(&se->kernel_node);
+#endif
 }
 
 static DEFINE_MUTEX(shares_mutex);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77de..0b33ce2e60555 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -580,6 +580,7 @@ struct cfs_rq {
 
 	struct rb_root_cached	tasks_timeline;
 
+
 	/*
 	 * 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
@@ -658,8 +659,10 @@ struct cfs_rq {
 	u64			throttled_clock_self_time;
 	int			throttled;
 	int			throttle_count;
+	int			h_kernel_running;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	kernel_children;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER
  2024-02-02  8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
  2024-02-02  8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
@ 2024-02-02  8:09 ` Valentin Schneider
  2024-02-02  8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02  8:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

task_struct.in_return_to_user is currently updated via atomic operations in
schedule_usermode().

However, one can note:
o .in_return_to_user is only updated for the current task
o There are no remote (smp_processor_id() != task_cpu(p)) accesses to
  .in_return_to_user

Add schedule_with_mode() to factorize schedule() with different flags to
pass down to __schedule_loop().

Add SM_USER to denote schedule() calls from return-to-userspace points.

Update .in_return_to_user from within the preemption-disabled, rq_lock-held
part of __schedule().

Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/sched.h |  2 +-
 kernel/sched/core.c   | 43 ++++++++++++++++++++++++++++++++-----------
 kernel/sched/fair.c   | 17 ++++++++++++++++-
 3 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a0105d1eaa21..1b6f17b2150a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1544,7 +1544,7 @@ struct task_struct {
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
-	atomic_t			in_return_to_user;
+	int				in_return_to_user;
 #endif
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7c028fad5a89..54e6690626b13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4531,7 +4531,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	INIT_LIST_HEAD(&p->se.kernel_node);
-	atomic_set(&p->in_return_to_user, 0);
+	p->in_return_to_user            = false;
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
@@ -5147,6 +5147,9 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 
 static inline void finish_lock_switch(struct rq *rq)
 {
+#ifdef CONFIG_CFS_BANDWIDTH
+	current->in_return_to_user = false;
+#endif
 	/*
 	 * If we are tracking spinlock dependencies then we have to
 	 * fix up the runqueue lock - which gets 'carried over' from
@@ -6562,6 +6565,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 #define SM_PREEMPT		0x1
 #define SM_RTLOCK_WAIT		0x2
 
+/*
+ * Special case for CFS_BANDWIDTH where we need to know if the call to
+ * __schedule() is directely preceding an entry into userspace.
+ * It is removed from the mode argument as soon as it is used to not go against
+ * the SM_MASK_PREEMPT optimisation below.
+ */
+#ifdef CONFIG_CFS_BANDWIDTH
+# define SM_USER                0x4
+#else
+# define SM_USER                SM_NONE
+#endif
+
 #ifndef CONFIG_PREEMPT_RT
 # define SM_MASK_PREEMPT	(~0U)
 #else
@@ -6646,6 +6661,14 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
 
+#ifdef CONFIG_CFS_BANDWIDTH
+	if (sched_mode & SM_USER) {
+		prev->in_return_to_user = true;
+		sched_mode &= ~SM_USER;
+	}
+#endif
+	SCHED_WARN_ON(sched_mode & SM_USER);
+
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
@@ -6807,7 +6830,7 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
 	} while (need_resched());
 }
 
-asmlinkage __visible void __sched schedule(void)
+static __always_inline void schedule_with_mode(unsigned int sched_mode)
 {
 	struct task_struct *tsk = current;
 
@@ -6817,22 +6840,20 @@ asmlinkage __visible void __sched schedule(void)
 
 	if (!task_is_running(tsk))
 		sched_submit_work(tsk);
-	__schedule_loop(SM_NONE);
+	__schedule_loop(sched_mode);
 	sched_update_worker(tsk);
 }
+
+asmlinkage __visible void __sched schedule(void)
+{
+	schedule_with_mode(SM_NONE);
+}
 EXPORT_SYMBOL(schedule);
 
 asmlinkage __visible void __sched schedule_usermode(void)
 {
 #ifdef CONFIG_CFS_BANDWIDTH
-	/*
-	 * This is only atomic because of this simple implementation. We could
-	 * do something with an SM_USER to avoid other-cpu scheduler operations
-	 * racing against these writes.
-	 */
-	atomic_set(&current->in_return_to_user, true);
-	schedule();
-	atomic_set(&current->in_return_to_user, false);
+	schedule_with_mode(SM_USER);
 #else
 	schedule();
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1808459a5acc..96504be6ee14a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6631,7 +6631,22 @@ static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c
 
 static bool is_kernel_task(struct task_struct *p)
 {
-	return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+	/*
+	 * The flag is updated within __schedule() with preemption disabled,
+	 * under the rq lock, and only when the task is current.
+	 *
+	 * Holding the rq lock for that task's CPU is thus sufficient for the
+	 * value to be stable, if the task is enqueued.
+	 *
+	 * If the task is dequeued, then task_cpu(p) *can* change, but this
+	 * so far only happens in enqueue_task_fair() which means either:
+	 * - the task is being activated, its CPU has been set previously in ttwu()
+	 * - the task is going through a "change" cycle (e.g. sched_move_task()),
+	 *   the pi_lock is also held so the CPU is stable.
+	 */
+	lockdep_assert_rq_held(cpu_rq(task_cpu(p)));
+
+	return sysctl_sched_cfs_bandwidth_kernel_bypass && !p->in_return_to_user;
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
  2024-02-02  8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
  2024-02-02  8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
  2024-02-02  8:09 ` [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Valentin Schneider
@ 2024-02-02  8:09 ` Valentin Schneider
  2024-02-03 23:01   ` kernel test robot
  2024-02-06 21:36   ` Benjamin Segall
  2024-02-02  8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02  8:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

cfs_rq_throttled_loose() does not check if there is runtime remaining in
the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously
for that to be checked.

Cache the throttle attempt in throttle_cfs_rq and reuse that where
needed.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/fair.c | 44 ++++++++++----------------------------------
 1 file changed, 10 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96504be6ee14a..60778afbff207 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * 5) do not run the "skip" process, if something else is available
  */
 static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
+pick_next_entity(struct cfs_rq *cfs_rq)
 {
 #ifdef CONFIG_CFS_BANDWIDTH
 	/*
@@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
 	 * throttle_cfs_rq.
 	 */
 	WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
-	if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+	if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
 		/*
 		 * TODO: you'd want to factor out pick_eevdf to just take
 		 * tasks_timeline, and replace this list with a second rbtree
@@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * We don't actually throttle, though account() will have made sure to
 	 * resched us so that we pick into a kernel task.
 	 */
-	if (cfs_rq->h_kernel_running)
+	if (cfs_rq->h_kernel_running) {
+		cfs_rq->throttle_pending = true;
 		return false;
+	}
+
+	cfs_rq->throttle_pending = false;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -6666,20 +6670,6 @@ static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c
 	cfs_rq->h_kernel_running -= count;
 }
 
-/*
- * Returns if the cfs_rq "should" be throttled but might not be because of
- * kernel threads bypassing throttle.
- */
-static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
-{
-	if (!cfs_bandwidth_used())
-		return false;
-
-	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return false;
-	return true;
-}
-
 static void unthrottle_on_enqueue(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -8546,7 +8536,6 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
-	bool throttled = false;
 
 again:
 	cfs_rq = &rq->cfs;
@@ -8567,10 +8556,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 				goto again;
 		}
 
-		if (cfs_rq_throttled_loose(cfs_rq))
-			throttled = true;
-
-		se = pick_next_entity(cfs_rq, throttled);
+		se = pick_next_entity(cfs_rq);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
@@ -8585,7 +8571,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct sched_entity *se;
 	struct task_struct *p;
 	int new_tasks;
-	bool throttled;
 
 	/*
 	 * We want to handle this before check_cfs_runtime(prev). We'll
@@ -8609,8 +8594,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	 * Therefore attempt to avoid putting and setting the entire cgroup
 	 * hierarchy, only change the part that actually changes.
 	 */
-
-	throttled = false;
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
@@ -8641,11 +8624,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 				goto simple;
 			}
 		}
-
-		if (cfs_rq_throttled_loose(cfs_rq))
-			throttled = true;
-
-		se = pick_next_entity(cfs_rq, throttled);
+		se = pick_next_entity(cfs_rq);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
@@ -8683,11 +8662,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	if (prev)
 		put_prev_task(rq, prev);
 
-	throttled = false;
 	do {
-		if (cfs_rq_throttled_loose(cfs_rq))
-			throttled = true;
-		se = pick_next_entity(cfs_rq, throttled);
+		se = pick_next_entity(cfs_rq);
 		set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
  2024-02-02  8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
                   ` (2 preceding siblings ...)
  2024-02-02  8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
@ 2024-02-02  8:09 ` Valentin Schneider
  2024-02-04  3:08   ` kernel test robot
  2024-02-02  8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
  2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
  5 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02  8:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

While having a second tree to pick from solves the throttling aspect of things,
it also requires modification of the task count at the cfs_rq level.

.h_nr_running is used throughout load_balance(), and it needs to accurately
reflect the amount of pickable tasks: a cfs_rq with .throttle_pending=1 may have
many tasks in userspace (thus effectively throttled), and this "excess" of tasks
shouldn't cause find_busiest_group() / find_busiest_queue() to pick that
cfs_rq's CPU to pull load from when there are other CPUs with more pickable
tasks to pull.

The approach taken here is to track both the count of tasks in kernelspace and
the count of tasks in userspace (technically tasks-just-about-to-enter-userspace).

When a cfs_rq runs out of runtime, it gets marked as .throttle_pending=1. From
this point on, only tasks executing in kernelspace are pickable, and this is
reflected up the hierarchy by removing that cfs_rq.h_user_running from its
parents' .h_nr_running.

To aid in validating the proper behaviour of the implementation, we assert the
following invariants:
  o For any cfs_rq with .throttle_pending == 0:
    .h_kernel_running + .h_user_running == .h_nr_running
  o For any cfs_rq with .throttle_pending == 1:
    .h_kernel_running == .h_nr_running

This means the .h_user_running also needs to be updated as cfs_rq's become
.throttle_pending=1. When a cfs_rq becomes .throttle_pending=1, its
.h_user_running remains untouched, but it is subtracted from its parents'
.h_user_running.

Another way to look at it is that the .h_user_running is "stored" at the level
of the .throttle_pending cfs_rq, and restored to the upper part of the hierarchy
at unthrottle.

An overview of the count logic is:

 Consider:
   cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
   cfs_rq.user   := count of user   *tasks* enqueued on this cfs_rq

 Then, the following logic is implemented:
   cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
   cfs_rq.h_user_running   = Sum(child.user)   for all child cfs_rq with !child.throttle_pending
   cfs_rq.h_nr_running     = Sum(child.kernel) for all child cfs_rq
			   + Sum(child.user)   for all child cfs_rq with !child.throttle_pending

An application of that logic to an A/B/C cgroup hierarchy:

  Initial condition, no throttling

    +------+ .h_kernel_running = C.kernel + B.kernel + A.kernel
  A |cfs_rq| .h_user_running   = C.user   + B.user   + A.user
    +------+ .h_nr_running     = C.{kernel+user} + B.{kernel+user} + A.{kernel+user}
       ^     .throttle_pending = 0
       |
       | parent
       |
    +------+ .h_kernel_running = C.kernel + B.kernel
  B |cfs_rq| .h_user_running   = C.user   + B.user
    +------+ .h_nr_running     = C.{kernel+user} + B.{kernel+user}
       ^     .throttle_pending = 0
       |
       | parent
       |
    +------+ .h_kernel_running = C.kernel
  C |cfs_rq| .h_user_running   = C.user
    +------+ .h_nr_running     = C.{kernel+user}
	     .throttle_pending = 0

  C becomes .throttle_pending

    +------+ .h_kernel_running = C.kernel + B.kernel + A.kernel               <- Untouched
  A |cfs_rq| .h_user_running   = B.user   + A.user                            <- Decremented by C.user
    +------+ .h_nr_running     = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
       ^     .throttle_pending = 0
       |
       | parent
       |
    +------+ .h_kernel_running = C.kernel + B.kernel                          <- Untouched
  B |cfs_rq| .h_user_running   = B.user                                       <- Decremented by C.user
    +------+ .h_nr_running     = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
       ^     .throttle_pending = 0
       |
       | parent
       |
    +------+ .h_kernel_running = C.kernel
  C |cfs_rq| .h_user_running   = C.user   <- Untouched, the count is "stored" at this level
    +------+ .h_nr_running     = C.kernel <- Decremented by C.user
	     .throttle_pending = 1

  C becomes throttled

    +------+ .h_kernel_running = B.kernel + A.kernel               <- Decremented by C.kernel
  A |cfs_rq| .h_user_running   = B.user   + A.user
    +------+ .h_nr_running     = B.{kernel+user} + A.{kernel+user} <- Decremented by C.kernel
       ^     .throttle_pending = 0
       |
       | parent
       |
    +------+ .h_kernel_running = B.kernel                          <- Decremented by C.kernel
  B |cfs_rq| .h_user_running   = B.user
    +------+ .h_nr_running     = B.{kernel+user} + A.{kernel+user} <- Decremented by C.kernel
       ^     .throttle_pending = 0
       |
       | parent
       |
    +------+ .h_kernel_running = C.kernel
  C |cfs_rq| .h_user_running   = C.user
    +------+ .h_nr_running     = C.{kernel+user} <- Incremented by C.user
	     .throttle_pending = 0

Could we get away with just one count, e.g. the user count and not the kernel
count? Technically yes, we could follow this scheme:
  if (throttle_pending) => kernel count := h_nr_running - h_user_running
  else                  => kernel count := h_nr_running
this however prevents any sort of assertion or sanity checking on the counts,
which I am not the biggest fan on - CFS group scheduling is enough of a headache
as it is.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/fair.c  | 174 ++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |   2 +
 2 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60778afbff207..2b54d3813d18d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5785,17 +5785,48 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+	long task_delta, idle_task_delta, kernel_delta, user_delta, dequeue = 1;
+	bool was_pending;
 
 	/*
-	 * We don't actually throttle, though account() will have made sure to
-	 * resched us so that we pick into a kernel task.
+	 * We don't actually throttle just yet, though account_cfs_rq_runtime()
+	 * will have made sure to resched us so that we pick into a kernel task.
 	 */
 	if (cfs_rq->h_kernel_running) {
+		if (cfs_rq->throttle_pending)
+			return false;
+
+		/*
+		 * From now on we're only going to pick tasks that are in the
+		 * second tree. Reflect this by discounting tasks that aren't going
+		 * to be pickable from the ->h_nr_running counts.
+		 */
 		cfs_rq->throttle_pending = true;
+
+		se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+		user_delta = cfs_rq->h_user_running;
+		cfs_rq->h_nr_running -= user_delta;
+
+		for_each_sched_entity(se) {
+			struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
+			if (!se->on_rq)
+				goto done;
+
+			qcfs_rq->h_nr_running -= user_delta;
+			qcfs_rq->h_user_running -= user_delta;
+
+			assert_cfs_rq_counts(qcfs_rq);
+		}
 		return false;
 	}
 
+	/*
+	 * Unlikely as it may be, we may only have user tasks as we hit the
+	 * throttle, in which case we won't have discount them from the
+	 * h_nr_running, and we need to be aware of that.
+	 */
+	was_pending = cfs_rq->throttle_pending;
 	cfs_rq->throttle_pending = false;
 
 	raw_spin_lock(&cfs_b->lock);
@@ -5826,9 +5857,27 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
 
-	task_delta = cfs_rq->h_nr_running;
+	/*
+	 * At this point, h_nr_running == h_kernel_running. We add back the
+	 * h_user_running to the throttled cfs_rq, and only remove the difference
+	 * to the upper cfs_rq's.
+	 */
+	if (was_pending) {
+		WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_kernel_running);
+		cfs_rq->h_nr_running += cfs_rq->h_user_running;
+	} else {
+		WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_user_running);
+	}
+
+	/*
+	 * We always discount user tasks from h_nr_running when throttle_pending
+	 * so only h_kernel_running remains to be removed
+	 */
+	task_delta = was_pending ? cfs_rq->h_kernel_running : cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	kernel_delta = cfs_rq->h_kernel_running;
+	user_delta   = was_pending ? 0 : cfs_rq->h_user_running;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		/* throttled entity or throttle-on-deactivate */
@@ -5843,6 +5892,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running -= user_delta;
+
 
 		if (qcfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5866,6 +5917,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running -= user_delta;
 	}
 
 	/* At this point se is NULL and we are at root level*/
@@ -5888,7 +5940,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta, kernel_delta;
+	long task_delta, idle_task_delta, kernel_delta, user_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
@@ -5924,6 +5976,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	kernel_delta = cfs_rq->h_kernel_running;
+	user_delta = cfs_rq->h_user_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
@@ -5937,6 +5990,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
 		enqueue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running += user_delta;
+
+		assert_cfs_rq_counts(qcfs_rq);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -5955,6 +6011,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
 		enqueue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running += user_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -6855,6 +6912,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 	bool kernel_task = is_kernel_task(p);
+	bool throttle_pending = false;
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6878,13 +6936,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
 
-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running++;
 		if (kernel_task)
 			enqueue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running++;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6900,13 +6965,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);
 
-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running++;
 		if (kernel_task)
 			enqueue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running++;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6957,6 +7029,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
 	bool kernel_task = !list_empty(&p->se.kernel_node);
+	bool throttle_pending = false;
 
 	util_est_dequeue(&rq->cfs, p);
 
@@ -6964,13 +7037,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
 
-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running--;
 		if (kernel_task)
 			dequeue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running--;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6998,13 +7078,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);
 
-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running--;
 		if (kernel_task)
 			dequeue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running--;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -8503,28 +8590,65 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	resched_curr(rq);
 }
 
+/*
+ * Consider:
+ *   cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
+ *   cfs_rq.user   := count of user   *tasks* enqueued on this cfs_rq
+ *
+ * Then, the following logic is implemented:
+ *   cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
+ *   cfs_rq.h_user_running   = Sum(child.user)   for all child cfs_rq with !child.throttle_pending
+ *   cfs_rq.h_nr_running     = Sum(child.kernel) for all child cfs_rq
+ *			     + Sum(child.user)   for all child cfs_rq with !child.throttle_pending
+ *
+ * IOW, count of kernel tasks is always propagated up the hierarchy, and count
+ * of user tasks is only propagated up if the cfs_rq isn't .throttle_pending.
+ */
 static void handle_kernel_task_prev(struct task_struct *prev)
 {
 #ifdef CONFIG_CFS_BANDWIDTH
 	struct sched_entity *se = &prev->se;
 	bool p_in_kernel = is_kernel_task(prev);
 	bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+	bool throttle_pending = false;
 	/*
 	 * These extra loops are bad and against the whole point of the merged
 	 * PNT, but it's a pain to merge, particularly since we want it to occur
 	 * before check_cfs_runtime().
 	 */
 	if (p_in_kernel_tree && !p_in_kernel) {
+		/* Switch from KERNEL -> USER */
 		WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+
 		for_each_sched_entity(se) {
-			dequeue_kernel(cfs_rq_of(se), se, 1);
-			if (cfs_rq_throttled(cfs_rq_of(se)))
+			struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+			if (throttle_pending || cfs_rq->throttle_pending)
+				cfs_rq->h_nr_running--;
+			dequeue_kernel(cfs_rq, se, 1);
+			if (!throttle_pending)
+				cfs_rq->h_user_running++;
+
+			throttle_pending |= cfs_rq->throttle_pending;
+
+			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}
 	} else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+		/* Switch from USER -> KERNEL */
+
 		for_each_sched_entity(se) {
-			enqueue_kernel(cfs_rq_of(se), se, 1);
-			if (cfs_rq_throttled(cfs_rq_of(se)))
+			struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+			if (throttle_pending || cfs_rq->throttle_pending)
+				cfs_rq->h_nr_running++;
+			enqueue_kernel(cfs_rq, se, 1);
+			if (!throttle_pending)
+				cfs_rq->h_user_running--;
+
+			throttle_pending |= cfs_rq->throttle_pending;
+
+			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b33ce2e60555..e8860e0d6fbc7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -660,6 +660,8 @@ struct cfs_rq {
 	int			throttled;
 	int			throttle_count;
 	int			h_kernel_running;
+	int			h_user_running;
+	int                     throttle_pending;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
 	struct list_head	kernel_children;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants
  2024-02-02  8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
                   ` (3 preceding siblings ...)
  2024-02-02  8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
@ 2024-02-02  8:09 ` Valentin Schneider
  2024-02-04  0:04   ` kernel test robot
  2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
  5 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2024-02-02  8:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

Previous commits have added .h_kernel_running and .h_user_running to struct
cfs_rq, and are using them to play games with the hierarchical
.h_nr_running.

Assert some count invariants under SCHED_DEBUG to improve debugging.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b54d3813d18d..52d0ee0e4d47c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5780,6 +5780,30 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
 static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq)
+{
+	lockdep_assert_rq_held(rq_of(cfs_rq));
+
+	/*
+	 * When !throttle_pending, this is the normal operating mode, all tasks
+	 * are pickable, so:
+	 * nr_kernel_tasks + nr_user_tasks == nr_pickable_tasks
+	 */
+	SCHED_WARN_ON(!cfs_rq->throttle_pending &&
+		      (cfs_rq->h_kernel_running + cfs_rq->h_user_running !=
+		       cfs_rq->h_nr_running));
+	/*
+	 * When throttle_pending, only kernel tasks are pickable, so:
+	 * nr_kernel_tasks == nr_pickable_tasks
+	 */
+	SCHED_WARN_ON(cfs_rq->throttle_pending &&
+		      (cfs_rq->h_kernel_running != cfs_rq->h_nr_running));
+}
+#else
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) { }
+#endif
+
 static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -5894,6 +5918,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
 		qcfs_rq->h_user_running -= user_delta;
 
+		assert_cfs_rq_counts(qcfs_rq);
 
 		if (qcfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5918,6 +5943,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
 		qcfs_rq->h_user_running -= user_delta;
+
+		assert_cfs_rq_counts(qcfs_rq);
 	}
 
 	/* At this point se is NULL and we are at root level*/
@@ -6013,6 +6040,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		enqueue_kernel(qcfs_rq, se, kernel_delta);
 		qcfs_rq->h_user_running += user_delta;
 
+		assert_cfs_rq_counts(qcfs_rq);
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
@@ -6950,6 +6979,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
 
+		assert_cfs_rq_counts(cfs_rq);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6965,6 +6995,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);
 
+		assert_cfs_rq_counts(cfs_rq);
 
 		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
 			cfs_rq->h_nr_running++;
@@ -6979,6 +7010,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
 
+		assert_cfs_rq_counts(cfs_rq);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -7051,6 +7083,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
 
+		assert_cfs_rq_counts(cfs_rq);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -7092,6 +7125,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
 
+		assert_cfs_rq_counts(cfs_rq);
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -8631,6 +8665,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)
 
 			throttle_pending |= cfs_rq->throttle_pending;
 
+			assert_cfs_rq_counts(cfs_rq);
+
 			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}
@@ -8648,6 +8684,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)
 
 			throttle_pending |= cfs_rq->throttle_pending;
 
+			assert_cfs_rq_counts(cfs_rq);
+
 			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
  2024-02-02  8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
@ 2024-02-03 19:12   ` kernel test robot
  2024-02-03 19:45   ` kernel test robot
  1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-03 19:12 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: oe-kbuild-all

Hi Valentin,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20240202080920.3337862-2-vschneid%40redhat.com
patch subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240204/202402040304.j8HAHYHG-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040304.j8HAHYHG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040304.j8HAHYHG-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:6937:17: error: implicit declaration of function 'unthrottle_on_enqueue' [-Werror=implicit-function-declaration]
    6937 |                 unthrottle_on_enqueue(p);
         |                 ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c: In function 'dequeue_task_fair':
>> kernel/sched/fair.c:6954:46: error: 'struct sched_entity' has no member named 'kernel_node'
    6954 |         bool kernel_task = !list_empty(&p->se.kernel_node);
         |                                              ^
   kernel/sched/fair.c: At top level:
   kernel/sched/fair.c:13312:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
   13312 | void free_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13314:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
   13314 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
         |     ^~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13319:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
   13319 | void online_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13321:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
   13321 | void unregister_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/unthrottle_on_enqueue +6937 kernel/sched/fair.c

  6839	
  6840	/*
  6841	 * The enqueue_task method is called before nr_running is
  6842	 * increased. Here we update the fair scheduling stats and
  6843	 * then put the task into the rbtree:
  6844	 */
  6845	static void
  6846	enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  6847	{
  6848		struct cfs_rq *cfs_rq;
  6849		struct sched_entity *se = &p->se;
  6850		int idle_h_nr_running = task_has_idle_policy(p);
  6851		int task_new = !(flags & ENQUEUE_WAKEUP);
  6852		bool kernel_task = is_kernel_task(p);
  6853	
  6854		/*
  6855		 * The code below (indirectly) updates schedutil which looks at
  6856		 * the cfs_rq utilization to select a frequency.
  6857		 * Let's add the task's estimated utilization to the cfs_rq's
  6858		 * estimated utilization, before we update schedutil.
  6859		 */
  6860		util_est_enqueue(&rq->cfs, p);
  6861	
  6862		/*
  6863		 * If in_iowait is set, the code below may not trigger any cpufreq
  6864		 * utilization updates, so do it here explicitly with the IOWAIT flag
  6865		 * passed.
  6866		 */
  6867		if (p->in_iowait)
  6868			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
  6869	
  6870		for_each_sched_entity(se) {
  6871			if (se->on_rq)
  6872				break;
  6873			cfs_rq = cfs_rq_of(se);
  6874			enqueue_entity(cfs_rq, se, flags);
  6875	
  6876			cfs_rq->h_nr_running++;
  6877			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  6878	
  6879			if (cfs_rq_is_idle(cfs_rq))
  6880				idle_h_nr_running = 1;
  6881			if (kernel_task)
  6882				enqueue_kernel(cfs_rq, se, 1);
  6883	
  6884			/* end evaluation on encountering a throttled cfs_rq */
  6885			if (cfs_rq_throttled(cfs_rq))
  6886				goto enqueue_throttle;
  6887	
  6888			flags = ENQUEUE_WAKEUP;
  6889		}
  6890	
  6891		for_each_sched_entity(se) {
  6892			cfs_rq = cfs_rq_of(se);
  6893	
  6894			update_load_avg(cfs_rq, se, UPDATE_TG);
  6895			se_update_runnable(se);
  6896			update_cfs_group(se);
  6897	
  6898			cfs_rq->h_nr_running++;
  6899			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  6900	
  6901			if (cfs_rq_is_idle(cfs_rq))
  6902				idle_h_nr_running = 1;
  6903			if (kernel_task)
  6904				enqueue_kernel(cfs_rq, se, 1);
  6905	
  6906			/* end evaluation on encountering a throttled cfs_rq */
  6907			if (cfs_rq_throttled(cfs_rq))
  6908				goto enqueue_throttle;
  6909		}
  6910	
  6911		/* At this point se is NULL and we are at root level*/
  6912		add_nr_running(rq, 1);
  6913	
  6914		/*
  6915		 * Since new tasks are assigned an initial util_avg equal to
  6916		 * half of the spare capacity of their CPU, tiny tasks have the
  6917		 * ability to cross the overutilized threshold, which will
  6918		 * result in the load balancer ruining all the task placement
  6919		 * done by EAS. As a way to mitigate that effect, do not account
  6920		 * for the first enqueue operation of new tasks during the
  6921		 * overutilized flag detection.
  6922		 *
  6923		 * A better way of solving this problem would be to wait for
  6924		 * the PELT signals of tasks to converge before taking them
  6925		 * into account, but that is not straightforward to implement,
  6926		 * and the following generally works well enough in practice.
  6927		 */
  6928		if (!task_new)
  6929			update_overutilized_status(rq);
  6930	
  6931	enqueue_throttle:
  6932		assert_list_leaf_cfs_rq(rq);
  6933	
  6934		hrtick_update(rq);
  6935	
  6936		if (kernel_task)
> 6937			unthrottle_on_enqueue(p);
  6938	}
  6939	
  6940	static void set_next_buddy(struct sched_entity *se);
  6941	
  6942	/*
  6943	 * The dequeue_task method is called before nr_running is
  6944	 * decreased. We remove the task from the rbtree and
  6945	 * update the fair scheduling stats:
  6946	 */
  6947	static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  6948	{
  6949		struct cfs_rq *cfs_rq;
  6950		struct sched_entity *se = &p->se;
  6951		int task_sleep = flags & DEQUEUE_SLEEP;
  6952		int idle_h_nr_running = task_has_idle_policy(p);
  6953		bool was_sched_idle = sched_idle_rq(rq);
> 6954		bool kernel_task = !list_empty(&p->se.kernel_node);
  6955	
  6956		util_est_dequeue(&rq->cfs, p);
  6957	
  6958		for_each_sched_entity(se) {
  6959			cfs_rq = cfs_rq_of(se);
  6960			dequeue_entity(cfs_rq, se, flags);
  6961	
  6962			cfs_rq->h_nr_running--;
  6963			cfs_rq->idle_h_nr_running -= idle_h_nr_running;
  6964	
  6965			if (cfs_rq_is_idle(cfs_rq))
  6966				idle_h_nr_running = 1;
  6967			if (kernel_task)
  6968				dequeue_kernel(cfs_rq, se, 1);
  6969	
  6970			/* end evaluation on encountering a throttled cfs_rq */
  6971			if (cfs_rq_throttled(cfs_rq))
  6972				goto dequeue_throttle;
  6973	
  6974			/* Don't dequeue parent if it has other entities besides us */
  6975			if (cfs_rq->load.weight) {
  6976				/* Avoid re-evaluating load for this entity: */
  6977				se = parent_entity(se);
  6978				/*
  6979				 * Bias pick_next to pick a task from this cfs_rq, as
  6980				 * p is sleeping when it is within its sched_slice.
  6981				 */
  6982				if (task_sleep && se && !throttled_hierarchy(cfs_rq))
  6983					set_next_buddy(se);
  6984				break;
  6985			}
  6986			flags |= DEQUEUE_SLEEP;
  6987		}
  6988	
  6989		for_each_sched_entity(se) {
  6990			cfs_rq = cfs_rq_of(se);
  6991	
  6992			update_load_avg(cfs_rq, se, UPDATE_TG);
  6993			se_update_runnable(se);
  6994			update_cfs_group(se);
  6995	
  6996			cfs_rq->h_nr_running--;
  6997			cfs_rq->idle_h_nr_running -= idle_h_nr_running;
  6998	
  6999			if (cfs_rq_is_idle(cfs_rq))
  7000				idle_h_nr_running = 1;
  7001			if (kernel_task)
  7002				dequeue_kernel(cfs_rq, se, 1);
  7003	
  7004			/* end evaluation on encountering a throttled cfs_rq */
  7005			if (cfs_rq_throttled(cfs_rq))
  7006				goto dequeue_throttle;
  7007	
  7008		}
  7009	
  7010		/* At this point se is NULL and we are at root level*/
  7011		sub_nr_running(rq, 1);
  7012	
  7013		/* balance early to pull high priority tasks */
  7014		if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
  7015			rq->next_balance = jiffies;
  7016	
  7017	dequeue_throttle:
  7018		util_est_update(&rq->cfs, p, task_sleep);
  7019		hrtick_update(rq);
  7020	}
  7021	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
  2024-02-02  8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
  2024-02-03 19:12   ` kernel test robot
@ 2024-02-03 19:45   ` kernel test robot
  1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-03 19:45 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: llvm, oe-kbuild-all

Hi Valentin,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20240202080920.3337862-2-vschneid%40redhat.com
patch subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
config: arm-randconfig-002-20240203 (https://download.01.org/0day-ci/archive/20240204/202402040329.OcEA18Jz-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040329.OcEA18Jz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040329.OcEA18Jz-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/sched/fair.c:6937:3: error: call to undeclared function 'unthrottle_on_enqueue'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                   unthrottle_on_enqueue(p);
                   ^
   kernel/sched/fair.c:6954:40: error: no member named 'kernel_node' in 'struct sched_entity'
           bool kernel_task = !list_empty(&p->se.kernel_node);
                                           ~~~~~ ^
   kernel/sched/fair.c:13312:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
   void free_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13312:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void free_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:13314:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
       ^
   kernel/sched/fair.c:13314:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
   ^
   static 
   kernel/sched/fair.c:13319:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
   void online_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13319:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void online_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:13321:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
   void unregister_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13321:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void unregister_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   4 warnings and 2 errors generated.


vim +/unthrottle_on_enqueue +6937 kernel/sched/fair.c

  6839	
  6840	/*
  6841	 * The enqueue_task method is called before nr_running is
  6842	 * increased. Here we update the fair scheduling stats and
  6843	 * then put the task into the rbtree:
  6844	 */
  6845	static void
  6846	enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  6847	{
  6848		struct cfs_rq *cfs_rq;
  6849		struct sched_entity *se = &p->se;
  6850		int idle_h_nr_running = task_has_idle_policy(p);
  6851		int task_new = !(flags & ENQUEUE_WAKEUP);
  6852		bool kernel_task = is_kernel_task(p);
  6853	
  6854		/*
  6855		 * The code below (indirectly) updates schedutil which looks at
  6856		 * the cfs_rq utilization to select a frequency.
  6857		 * Let's add the task's estimated utilization to the cfs_rq's
  6858		 * estimated utilization, before we update schedutil.
  6859		 */
  6860		util_est_enqueue(&rq->cfs, p);
  6861	
  6862		/*
  6863		 * If in_iowait is set, the code below may not trigger any cpufreq
  6864		 * utilization updates, so do it here explicitly with the IOWAIT flag
  6865		 * passed.
  6866		 */
  6867		if (p->in_iowait)
  6868			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
  6869	
  6870		for_each_sched_entity(se) {
  6871			if (se->on_rq)
  6872				break;
  6873			cfs_rq = cfs_rq_of(se);
  6874			enqueue_entity(cfs_rq, se, flags);
  6875	
  6876			cfs_rq->h_nr_running++;
  6877			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  6878	
  6879			if (cfs_rq_is_idle(cfs_rq))
  6880				idle_h_nr_running = 1;
  6881			if (kernel_task)
  6882				enqueue_kernel(cfs_rq, se, 1);
  6883	
  6884			/* end evaluation on encountering a throttled cfs_rq */
  6885			if (cfs_rq_throttled(cfs_rq))
  6886				goto enqueue_throttle;
  6887	
  6888			flags = ENQUEUE_WAKEUP;
  6889		}
  6890	
  6891		for_each_sched_entity(se) {
  6892			cfs_rq = cfs_rq_of(se);
  6893	
  6894			update_load_avg(cfs_rq, se, UPDATE_TG);
  6895			se_update_runnable(se);
  6896			update_cfs_group(se);
  6897	
  6898			cfs_rq->h_nr_running++;
  6899			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  6900	
  6901			if (cfs_rq_is_idle(cfs_rq))
  6902				idle_h_nr_running = 1;
  6903			if (kernel_task)
  6904				enqueue_kernel(cfs_rq, se, 1);
  6905	
  6906			/* end evaluation on encountering a throttled cfs_rq */
  6907			if (cfs_rq_throttled(cfs_rq))
  6908				goto enqueue_throttle;
  6909		}
  6910	
  6911		/* At this point se is NULL and we are at root level*/
  6912		add_nr_running(rq, 1);
  6913	
  6914		/*
  6915		 * Since new tasks are assigned an initial util_avg equal to
  6916		 * half of the spare capacity of their CPU, tiny tasks have the
  6917		 * ability to cross the overutilized threshold, which will
  6918		 * result in the load balancer ruining all the task placement
  6919		 * done by EAS. As a way to mitigate that effect, do not account
  6920		 * for the first enqueue operation of new tasks during the
  6921		 * overutilized flag detection.
  6922		 *
  6923		 * A better way of solving this problem would be to wait for
  6924		 * the PELT signals of tasks to converge before taking them
  6925		 * into account, but that is not straightforward to implement,
  6926		 * and the following generally works well enough in practice.
  6927		 */
  6928		if (!task_new)
  6929			update_overutilized_status(rq);
  6930	
  6931	enqueue_throttle:
  6932		assert_list_leaf_cfs_rq(rq);
  6933	
  6934		hrtick_update(rq);
  6935	
  6936		if (kernel_task)
> 6937			unthrottle_on_enqueue(p);
  6938	}
  6939	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
  2024-02-02  8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
@ 2024-02-03 23:01   ` kernel test robot
  2024-02-06 21:36   ` Benjamin Segall
  1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-03 23:01 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: oe-kbuild-all

Hi Valentin,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20240202080920.3337862-4-vschneid%40redhat.com
patch subject: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240204/202402040651.KMZz9c3q-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040651.KMZz9c3q-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040651.KMZz9c3q-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'enqueue_task_fair':
   kernel/sched/fair.c:6942:17: error: implicit declaration of function 'unthrottle_on_enqueue' [-Werror=implicit-function-declaration]
    6942 |                 unthrottle_on_enqueue(p);
         |                 ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c: In function 'dequeue_task_fair':
   kernel/sched/fair.c:6959:46: error: 'struct sched_entity' has no member named 'kernel_node'
    6959 |         bool kernel_task = !list_empty(&p->se.kernel_node);
         |                                              ^
   kernel/sched/fair.c: At top level:
   kernel/sched/fair.c:13303:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
   13303 | void free_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13305:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
   13305 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
         |     ^~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13310:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
   13310 | void online_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13312:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
   13312 | void unregister_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/sched/fair.c:6750:13: warning: 'cfs_rq_throttled_loose' defined but not used [-Wunused-function]
    6750 | static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
         |             ^~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/cfs_rq_throttled_loose +6750 kernel/sched/fair.c

ab84d31e15502f kernel/sched_fair.c Paul Turner       2011-07-21  6730  
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra    2011-10-25  6731  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra    2011-10-25  6732  {
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra    2011-10-25  6733  	return NULL;
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra    2011-10-25  6734  }
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra    2011-10-25  6735  static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
0e59bdaea75f12 kernel/sched/fair.c Kirill Tkhai      2014-06-25  6736  static inline void update_runtime_enabled(struct rq *rq) {}
a4c96ae319b804 kernel/sched/fair.c Peter Boonstoppel 2012-08-09  6737  static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
88c56cfeaec464 kernel/sched/fair.c Phil Auld         2023-07-12  6738  #ifdef CONFIG_CGROUP_SCHED
88c56cfeaec464 kernel/sched/fair.c Phil Auld         2023-07-12  6739  bool cfs_task_bw_constrained(struct task_struct *p)
88c56cfeaec464 kernel/sched/fair.c Phil Auld         2023-07-12  6740  {
88c56cfeaec464 kernel/sched/fair.c Phil Auld         2023-07-12  6741  	return false;
88c56cfeaec464 kernel/sched/fair.c Phil Auld         2023-07-12  6742  }
88c56cfeaec464 kernel/sched/fair.c Phil Auld         2023-07-12  6743  #endif
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6744  static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6745  static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6746  static inline bool is_kernel_task(struct task_struct *p)
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6747  {
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6748  	return false;
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6749  }
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02 @6750  static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6751  {
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6752  	return false;
f3743713c72187 kernel/sched/fair.c Benjamin Segall   2024-02-02  6753  }
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra    2011-10-25  6754  #endif /* CONFIG_CFS_BANDWIDTH */
029632fbb7b7c9 kernel/sched_fair.c Peter Zijlstra    2011-10-25  6755  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants
  2024-02-02  8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
@ 2024-02-04  0:04   ` kernel test robot
  0 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-04  0:04 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: llvm, oe-kbuild-all

Hi Valentin,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20240202080920.3337862-6-vschneid%40redhat.com
patch subject: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants
config: arm-randconfig-002-20240203 (https://download.01.org/0day-ci/archive/20240204/202402040702.ytDsOZ2e-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402040702.ytDsOZ2e-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402040702.ytDsOZ2e-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/sched/fair.c:6969:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
                                                             ~~~~~~  ^
   kernel/sched/fair.c:6974:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
                           cfs_rq->h_user_running++;
                                   ^~~~~~~~~~~~~~
                                   h_nr_running
   kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
           unsigned int            h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
                                   ^
   kernel/sched/fair.c:6976:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   throttle_pending |= cfs_rq->throttle_pending;
                                       ~~~~~~  ^
>> kernel/sched/fair.c:6982:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                   assert_cfs_rq_counts(cfs_rq);
                   ^
   kernel/sched/fair.c:6998:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                   assert_cfs_rq_counts(cfs_rq);
                   ^
   kernel/sched/fair.c:7000:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
                                                             ~~~~~~  ^
   kernel/sched/fair.c:7005:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
                           cfs_rq->h_user_running++;
                                   ^~~~~~~~~~~~~~
                                   h_nr_running
   kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
           unsigned int            h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
                                   ^
   kernel/sched/fair.c:7007:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   throttle_pending |= cfs_rq->throttle_pending;
                                       ~~~~~~  ^
   kernel/sched/fair.c:7046:3: error: call to undeclared function 'unthrottle_on_enqueue'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                   unthrottle_on_enqueue(p);
                   ^
   kernel/sched/fair.c:7063:40: error: no member named 'kernel_node' in 'struct sched_entity'
           bool kernel_task = !list_empty(&p->se.kernel_node);
                                           ~~~~~ ^
   kernel/sched/fair.c:7073:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
                                                             ~~~~~~  ^
   kernel/sched/fair.c:7078:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
                           cfs_rq->h_user_running--;
                                   ^~~~~~~~~~~~~~
                                   h_nr_running
   kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
           unsigned int            h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
                                   ^
   kernel/sched/fair.c:7080:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   throttle_pending |= cfs_rq->throttle_pending;
                                       ~~~~~~  ^
   kernel/sched/fair.c:7086:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                   assert_cfs_rq_counts(cfs_rq);
                   ^
   kernel/sched/fair.c:7115:53: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
                                                             ~~~~~~  ^
   kernel/sched/fair.c:7120:12: error: no member named 'h_user_running' in 'struct cfs_rq'; did you mean 'h_nr_running'?
                           cfs_rq->h_user_running--;
                                   ^~~~~~~~~~~~~~
                                   h_nr_running
   kernel/sched/sched.h:563:16: note: 'h_nr_running' declared here
           unsigned int            h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
                                   ^
   kernel/sched/fair.c:7122:31: error: no member named 'throttle_pending' in 'struct cfs_rq'
                   throttle_pending |= cfs_rq->throttle_pending;
                                       ~~~~~~  ^
   kernel/sched/fair.c:7128:3: error: call to undeclared function 'assert_cfs_rq_counts'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                   assert_cfs_rq_counts(cfs_rq);
                   ^
   kernel/sched/fair.c:13465:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
   void free_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13465:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void free_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:13467:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
       ^
   kernel/sched/fair.c:13467:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
   ^
   static 
   kernel/sched/fair.c:13472:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
   void online_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13472:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void online_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:13474:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
   void unregister_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13474:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void unregister_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   4 warnings and 18 errors generated.


vim +/assert_cfs_rq_counts +6982 kernel/sched/fair.c

  6930	
  6931	/*
  6932	 * The enqueue_task method is called before nr_running is
  6933	 * increased. Here we update the fair scheduling stats and
  6934	 * then put the task into the rbtree:
  6935	 */
  6936	static void
  6937	enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  6938	{
  6939		struct cfs_rq *cfs_rq;
  6940		struct sched_entity *se = &p->se;
  6941		int idle_h_nr_running = task_has_idle_policy(p);
  6942		int task_new = !(flags & ENQUEUE_WAKEUP);
  6943		bool kernel_task = is_kernel_task(p);
  6944		bool throttle_pending = false;
  6945	
  6946		/*
  6947		 * The code below (indirectly) updates schedutil which looks at
  6948		 * the cfs_rq utilization to select a frequency.
  6949		 * Let's add the task's estimated utilization to the cfs_rq's
  6950		 * estimated utilization, before we update schedutil.
  6951		 */
  6952		util_est_enqueue(&rq->cfs, p);
  6953	
  6954		/*
  6955		 * If in_iowait is set, the code below may not trigger any cpufreq
  6956		 * utilization updates, so do it here explicitly with the IOWAIT flag
  6957		 * passed.
  6958		 */
  6959		if (p->in_iowait)
  6960			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
  6961	
  6962		for_each_sched_entity(se) {
  6963			if (se->on_rq)
  6964				break;
  6965			cfs_rq = cfs_rq_of(se);
  6966			enqueue_entity(cfs_rq, se, flags);
  6967	
  6968	
  6969			if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
  6970				cfs_rq->h_nr_running++;
  6971			if (kernel_task)
  6972				enqueue_kernel(cfs_rq, se, 1);
  6973			else if (!throttle_pending)
  6974				cfs_rq->h_user_running++;
  6975	
  6976			throttle_pending |= cfs_rq->throttle_pending;
  6977	
  6978			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  6979			if (cfs_rq_is_idle(cfs_rq))
  6980				idle_h_nr_running = 1;
  6981	
> 6982			assert_cfs_rq_counts(cfs_rq);
  6983	
  6984			/* end evaluation on encountering a throttled cfs_rq */
  6985			if (cfs_rq_throttled(cfs_rq))
  6986				goto enqueue_throttle;
  6987	
  6988			flags = ENQUEUE_WAKEUP;
  6989		}
  6990	
  6991		for_each_sched_entity(se) {
  6992			cfs_rq = cfs_rq_of(se);
  6993	
  6994			update_load_avg(cfs_rq, se, UPDATE_TG);
  6995			se_update_runnable(se);
  6996			update_cfs_group(se);
  6997	
  6998			assert_cfs_rq_counts(cfs_rq);
  6999	
  7000			if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
  7001				cfs_rq->h_nr_running++;
  7002			if (kernel_task)
  7003				enqueue_kernel(cfs_rq, se, 1);
  7004			else if (!throttle_pending)
  7005				cfs_rq->h_user_running++;
  7006	
  7007			throttle_pending |= cfs_rq->throttle_pending;
  7008	
  7009			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  7010			if (cfs_rq_is_idle(cfs_rq))
  7011				idle_h_nr_running = 1;
  7012	
  7013			assert_cfs_rq_counts(cfs_rq);
  7014	
  7015			/* end evaluation on encountering a throttled cfs_rq */
  7016			if (cfs_rq_throttled(cfs_rq))
  7017				goto enqueue_throttle;
  7018		}
  7019	
  7020		/* At this point se is NULL and we are at root level*/
  7021		add_nr_running(rq, 1);
  7022	
  7023		/*
  7024		 * Since new tasks are assigned an initial util_avg equal to
  7025		 * half of the spare capacity of their CPU, tiny tasks have the
  7026		 * ability to cross the overutilized threshold, which will
  7027		 * result in the load balancer ruining all the task placement
  7028		 * done by EAS. As a way to mitigate that effect, do not account
  7029		 * for the first enqueue operation of new tasks during the
  7030		 * overutilized flag detection.
  7031		 *
  7032		 * A better way of solving this problem would be to wait for
  7033		 * the PELT signals of tasks to converge before taking them
  7034		 * into account, but that is not straightforward to implement,
  7035		 * and the following generally works well enough in practice.
  7036		 */
  7037		if (!task_new)
  7038			update_overutilized_status(rq);
  7039	
  7040	enqueue_throttle:
  7041		assert_list_leaf_cfs_rq(rq);
  7042	
  7043		hrtick_update(rq);
  7044	
  7045		if (kernel_task)
  7046			unthrottle_on_enqueue(p);
  7047	}
  7048	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
  2024-02-02  8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
@ 2024-02-04  3:08   ` kernel test robot
  0 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-02-04  3:08 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: oe-kbuild-all

Hi Valentin,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/core/entry tip/master linus/master v6.8-rc2 next-20240202]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/sched-fair-Only-throttle-CFS-tasks-on-return-to-userspace/20240202-162620
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20240202080920.3337862-5-vschneid%40redhat.com
patch subject: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240204/202402041155.ZPTGxQSD-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240204/202402041155.ZPTGxQSD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402041155.ZPTGxQSD-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:6940:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
    6940 |                 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
         |                                                                 ^~
>> kernel/sched/fair.c:6945:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
    6945 |                         cfs_rq->h_user_running++;
         |                                 ^~~~~~~~~~~~~~
         |                                 h_nr_running
   kernel/sched/fair.c:6947:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
    6947 |                 throttle_pending |= cfs_rq->throttle_pending;
         |                                           ^~
   kernel/sched/fair.c:6969:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
    6969 |                 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
         |                                                                 ^~
   kernel/sched/fair.c:6974:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
    6974 |                         cfs_rq->h_user_running++;
         |                                 ^~~~~~~~~~~~~~
         |                                 h_nr_running
   kernel/sched/fair.c:6976:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
    6976 |                 throttle_pending |= cfs_rq->throttle_pending;
         |                                           ^~
   kernel/sched/fair.c:7014:17: error: implicit declaration of function 'unthrottle_on_enqueue' [-Werror=implicit-function-declaration]
    7014 |                 unthrottle_on_enqueue(p);
         |                 ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c: In function 'dequeue_task_fair':
   kernel/sched/fair.c:7031:46: error: 'struct sched_entity' has no member named 'kernel_node'
    7031 |         bool kernel_task = !list_empty(&p->se.kernel_node);
         |                                              ^
   kernel/sched/fair.c:7041:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
    7041 |                 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
         |                                                                 ^~
   kernel/sched/fair.c:7046:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
    7046 |                         cfs_rq->h_user_running--;
         |                                 ^~~~~~~~~~~~~~
         |                                 h_nr_running
   kernel/sched/fair.c:7048:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
    7048 |                 throttle_pending |= cfs_rq->throttle_pending;
         |                                           ^~
   kernel/sched/fair.c:7082:65: error: 'struct cfs_rq' has no member named 'throttle_pending'
    7082 |                 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
         |                                                                 ^~
   kernel/sched/fair.c:7087:33: error: 'struct cfs_rq' has no member named 'h_user_running'; did you mean 'h_nr_running'?
    7087 |                         cfs_rq->h_user_running--;
         |                                 ^~~~~~~~~~~~~~
         |                                 h_nr_running
   kernel/sched/fair.c:7089:43: error: 'struct cfs_rq' has no member named 'throttle_pending'
    7089 |                 throttle_pending |= cfs_rq->throttle_pending;
         |                                           ^~
   kernel/sched/fair.c: At top level:
   kernel/sched/fair.c:13427:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
   13427 | void free_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13429:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
   13429 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
         |     ^~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13434:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
   13434 | void online_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:13436:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
   13436 | void unregister_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:6807:13: warning: 'cfs_rq_throttled_loose' defined but not used [-Wunused-function]
    6807 | static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
         |             ^~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +6940 kernel/sched/fair.c

  6901	
  6902	/*
  6903	 * The enqueue_task method is called before nr_running is
  6904	 * increased. Here we update the fair scheduling stats and
  6905	 * then put the task into the rbtree:
  6906	 */
  6907	static void
  6908	enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  6909	{
  6910		struct cfs_rq *cfs_rq;
  6911		struct sched_entity *se = &p->se;
  6912		int idle_h_nr_running = task_has_idle_policy(p);
  6913		int task_new = !(flags & ENQUEUE_WAKEUP);
  6914		bool kernel_task = is_kernel_task(p);
  6915		bool throttle_pending = false;
  6916	
  6917		/*
  6918		 * The code below (indirectly) updates schedutil which looks at
  6919		 * the cfs_rq utilization to select a frequency.
  6920		 * Let's add the task's estimated utilization to the cfs_rq's
  6921		 * estimated utilization, before we update schedutil.
  6922		 */
  6923		util_est_enqueue(&rq->cfs, p);
  6924	
  6925		/*
  6926		 * If in_iowait is set, the code below may not trigger any cpufreq
  6927		 * utilization updates, so do it here explicitly with the IOWAIT flag
  6928		 * passed.
  6929		 */
  6930		if (p->in_iowait)
  6931			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
  6932	
  6933		for_each_sched_entity(se) {
  6934			if (se->on_rq)
  6935				break;
  6936			cfs_rq = cfs_rq_of(se);
  6937			enqueue_entity(cfs_rq, se, flags);
  6938	
  6939	
> 6940			if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
  6941				cfs_rq->h_nr_running++;
  6942			if (kernel_task)
  6943				enqueue_kernel(cfs_rq, se, 1);
  6944			else if (!throttle_pending)
> 6945				cfs_rq->h_user_running++;
  6946	
  6947			throttle_pending |= cfs_rq->throttle_pending;
  6948	
  6949			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  6950			if (cfs_rq_is_idle(cfs_rq))
  6951				idle_h_nr_running = 1;
  6952	
  6953	
  6954			/* end evaluation on encountering a throttled cfs_rq */
  6955			if (cfs_rq_throttled(cfs_rq))
  6956				goto enqueue_throttle;
  6957	
  6958			flags = ENQUEUE_WAKEUP;
  6959		}
  6960	
  6961		for_each_sched_entity(se) {
  6962			cfs_rq = cfs_rq_of(se);
  6963	
  6964			update_load_avg(cfs_rq, se, UPDATE_TG);
  6965			se_update_runnable(se);
  6966			update_cfs_group(se);
  6967	
  6968	
  6969			if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
  6970				cfs_rq->h_nr_running++;
  6971			if (kernel_task)
  6972				enqueue_kernel(cfs_rq, se, 1);
  6973			else if (!throttle_pending)
  6974				cfs_rq->h_user_running++;
  6975	
  6976			throttle_pending |= cfs_rq->throttle_pending;
  6977	
  6978			cfs_rq->idle_h_nr_running += idle_h_nr_running;
  6979			if (cfs_rq_is_idle(cfs_rq))
  6980				idle_h_nr_running = 1;
  6981	
  6982	
  6983			/* end evaluation on encountering a throttled cfs_rq */
  6984			if (cfs_rq_throttled(cfs_rq))
  6985				goto enqueue_throttle;
  6986		}
  6987	
  6988		/* At this point se is NULL and we are at root level*/
  6989		add_nr_running(rq, 1);
  6990	
  6991		/*
  6992		 * Since new tasks are assigned an initial util_avg equal to
  6993		 * half of the spare capacity of their CPU, tiny tasks have the
  6994		 * ability to cross the overutilized threshold, which will
  6995		 * result in the load balancer ruining all the task placement
  6996		 * done by EAS. As a way to mitigate that effect, do not account
  6997		 * for the first enqueue operation of new tasks during the
  6998		 * overutilized flag detection.
  6999		 *
  7000		 * A better way of solving this problem would be to wait for
  7001		 * the PELT signals of tasks to converge before taking them
  7002		 * into account, but that is not straightforward to implement,
  7003		 * and the following generally works well enough in practice.
  7004		 */
  7005		if (!task_new)
  7006			update_overutilized_status(rq);
  7007	
  7008	enqueue_throttle:
  7009		assert_list_leaf_cfs_rq(rq);
  7010	
  7011		hrtick_update(rq);
  7012	
  7013		if (kernel_task)
  7014			unthrottle_on_enqueue(p);
  7015	}
  7016	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
  2024-02-02  8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
  2024-02-03 23:01   ` kernel test robot
@ 2024-02-06 21:36   ` Benjamin Segall
  2024-02-07 13:34     ` Valentin Schneider
  1 sibling, 1 reply; 15+ messages in thread
From: Benjamin Segall @ 2024-02-06 21:36 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

Valentin Schneider <vschneid@redhat.com> writes:

> cfs_rq_throttled_loose() does not check if there is runtime remaining in
> the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously
> for that to be checked.
>
> Cache the throttle attempt in throttle_cfs_rq and reuse that where
> needed.

The general idea of throttle_pending rather than constantly checking
runtime_remaining seems reasonable...

>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/sched/fair.c | 44 ++++++++++----------------------------------
>  1 file changed, 10 insertions(+), 34 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 96504be6ee14a..60778afbff207 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * 5) do not run the "skip" process, if something else is available
>   */
>  static struct sched_entity *
> -pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
> +pick_next_entity(struct cfs_rq *cfs_rq)
>  {
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	/*
> @@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>  	 * throttle_cfs_rq.
>  	 */
>  	WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
> -	if (throttled && !list_empty(&cfs_rq->kernel_children)) {
> +	if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {

... but we still need to know here if any of our parents are throttled
as well, ie a "throttled_pending_count", or to keep the "throttled"
parameter tracking in pnt_fair. (ie just replace the implementation of
cfs_rq_throttled_loose).

>  		/*
>  		 * TODO: you'd want to factor out pick_eevdf to just take
>  		 * tasks_timeline, and replace this list with a second rbtree
> @@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  	 * We don't actually throttle, though account() will have made sure to
>  	 * resched us so that we pick into a kernel task.
>  	 */
> -	if (cfs_rq->h_kernel_running)
> +	if (cfs_rq->h_kernel_running) {
> +		cfs_rq->throttle_pending = true;
>  		return false;
> +	}
> +
> +	cfs_rq->throttle_pending = false;

We also need to clear throttle_pending if quota refills and our
runtime_remaining goes positive. (And do the appropriate h_* accounting in
patch 4/5)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry
  2024-02-02  8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
                   ` (4 preceding siblings ...)
  2024-02-02  8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
@ 2024-02-06 21:55 ` Benjamin Segall
  2024-02-07 13:34   ` Valentin Schneider
  5 siblings, 1 reply; 15+ messages in thread
From: Benjamin Segall @ 2024-02-06 21:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

Valentin Schneider <vschneid@redhat.com> writes:


> Proposed approach
> =================
>
> Peter mentioned [1] that there have been discussions on changing /when/ the
> throttling happens: rather than have it be done immediately upon updating
> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
> for the task to be about to return to userspace: if it's in userspace, it can't
> hold any in-kernel lock.
>
> I submitted an initial jab at this [2] and Ben Segall added his own version to
> the conversation [3]. This series contains Ben's patch plus my additions. The
> main change here is updating the .h_nr_running counts throughout the cfs_rq
> hierachies to improve the picture given to load_balance().
>
> The main thing that remains doing for this series is making the second cfs_rq
> tree an actual RB tree (it's just a plain list ATM).
>
> This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
> to expose this outside of CFS, but it is another field that's used by load balance.

Then there's also all the load values as well; I don't know the load
balance code well, but it looks like the main thing would be
runnable_avg and that it isn't doing anything that would particularly
care about h_nr_running and runnable_avg being out of sync.

Maybe pulling a pending-throttle user task and then not seeing the
update in h_nr_running could be a bit of trouble?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry
  2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
@ 2024-02-07 13:34   ` Valentin Schneider
  0 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-07 13:34 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

On 06/02/24 13:55, Benjamin Segall wrote:
> Valentin Schneider <vschneid@redhat.com> writes:
>
>
>> Proposed approach
>> =================
>>
>> Peter mentioned [1] that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
>> for the task to be about to return to userspace: if it's in userspace, it can't
>> hold any in-kernel lock.
>>
>> I submitted an initial jab at this [2] and Ben Segall added his own version to
>> the conversation [3]. This series contains Ben's patch plus my additions. The
>> main change here is updating the .h_nr_running counts throughout the cfs_rq
>> hierachies to improve the picture given to load_balance().
>>
>> The main thing that remains doing for this series is making the second cfs_rq
>> tree an actual RB tree (it's just a plain list ATM).
>>
>> This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
>> to expose this outside of CFS, but it is another field that's used by load balance.
>
> Then there's also all the load values as well; I don't know the load
> balance code well, but it looks like the main thing would be
> runnable_avg and that it isn't doing anything that would particularly
> care about h_nr_running and runnable_avg being out of sync.
>

Yes, all of the runnable, load and util averages are still going to be an
issue unfortunately. AFAICT tackling this would imply pretty much dequeuing
the throttle_pending user tasks, which was my earlier attempt.

> Maybe pulling a pending-throttle user task and then not seeing the
> update in h_nr_running could be a bit of trouble?

That too is something I hadn't considered. Given the h_nr_running count is
updated accordingly, we could change can_migrate_task() to only allow
kernel tasks to be pulled if the hierarchy is ->throttle_pending. That
would probably require implementing a throttle_pending_count (as you
suggested in the other email) so we don't waste too much time checking up
the hierarchy for every task.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead
  2024-02-06 21:36   ` Benjamin Segall
@ 2024-02-07 13:34     ` Valentin Schneider
  0 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2024-02-07 13:34 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Daniel Bristot de Oliveira, Phil Auld, Clark Williams,
	Tomas Glozar

On 06/02/24 13:36, Benjamin Segall wrote:
> Valentin Schneider <vschneid@redhat.com> writes:
>
>> cfs_rq_throttled_loose() does not check if there is runtime remaining in
>> the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously
>> for that to be checked.
>>
>> Cache the throttle attempt in throttle_cfs_rq and reuse that where
>> needed.
>
> The general idea of throttle_pending rather than constantly checking
> runtime_remaining seems reasonable...
>
>>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>  kernel/sched/fair.c | 44 ++++++++++----------------------------------
>>  1 file changed, 10 insertions(+), 34 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 96504be6ee14a..60778afbff207 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>   * 5) do not run the "skip" process, if something else is available
>>   */
>>  static struct sched_entity *
>> -pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>> +pick_next_entity(struct cfs_rq *cfs_rq)
>>  {
>>  #ifdef CONFIG_CFS_BANDWIDTH
>>      /*
>> @@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>>       * throttle_cfs_rq.
>>       */
>>      WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
>> -	if (throttled && !list_empty(&cfs_rq->kernel_children)) {
>> +	if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
>
> ... but we still need to know here if any of our parents are throttled
> as well, ie a "throttled_pending_count", or to keep the "throttled"
> parameter tracking in pnt_fair. (ie just replace the implementation of
> cfs_rq_throttled_loose).
>

Hm, good point. We should be good with reinstoring the throttled parameter
and feeding it a ->throttle_pending accumulator.

>>              /*
>>               * TODO: you'd want to factor out pick_eevdf to just take
>>               * tasks_timeline, and replace this list with a second rbtree
>> @@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>>       * We don't actually throttle, though account() will have made sure to
>>       * resched us so that we pick into a kernel task.
>>       */
>> -	if (cfs_rq->h_kernel_running)
>> +	if (cfs_rq->h_kernel_running) {
>> +		cfs_rq->throttle_pending = true;
>>              return false;
>> +	}
>> +
>> +	cfs_rq->throttle_pending = false;
>
> We also need to clear throttle_pending if quota refills and our
> runtime_remaining goes positive. (And do the appropriate h_* accounting in
> patch 4/5)

Right, so we could move the throttle_pending logic to after
__assign_cfs_rq_runtime(), and then modify distribute_cfs_runtime() to
catch the !throttled but throttle_pending case.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-02-07 13:34 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-02  8:09 [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Valentin Schneider
2024-02-02  8:09 ` [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
2024-02-03 19:12   ` kernel test robot
2024-02-03 19:45   ` kernel test robot
2024-02-02  8:09 ` [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Valentin Schneider
2024-02-02  8:09 ` [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Valentin Schneider
2024-02-03 23:01   ` kernel test robot
2024-02-06 21:36   ` Benjamin Segall
2024-02-07 13:34     ` Valentin Schneider
2024-02-02  8:09 ` [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace Valentin Schneider
2024-02-04  3:08   ` kernel test robot
2024-02-02  8:09 ` [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Valentin Schneider
2024-02-04  0:04   ` kernel test robot
2024-02-06 21:55 ` [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry Benjamin Segall
2024-02-07 13:34   ` Valentin Schneider

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.