public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH v2 0/7] Defer throttle when task exits to user
@ 2025-04-09 12:07 Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
                   ` (9 more replies)
  0 siblings, 10 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

This is a continuation of the work Valentin Schneider posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/

Valentin has described the problem very well in the above link. We also
see task hung problems from time to time in our environment due to cfs
quota. It is most visible with rwsem: when a reader is throttled, a
writer that comes in has to wait, and the waiting writer in turn makes
all subsequent readers wait, causing priority inversion or even a
whole-system hang.

To improve this situation, change the throttle model to a task-based
one, i.e. when a cfs_rq is throttled, mark its throttled status but do
not remove it from the cpu's rq. Instead, for tasks that belong to this
cfs_rq, add a task work to them when they get picked, so that they can
be dequeued when they return to user. This way, throttled tasks will
not hold any kernel resources. When the cfs_rq gets unthrottled,
enqueue those throttled tasks back.

This new throttle model has consequences. E.g. for a cfs_rq with 3
tasks attached, when 2 tasks have been throttled on their
return-to-user path while one task is still running in kernel mode,
this cfs_rq is in a partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?

For the pelt clock, I chose to keep the current behavior and freeze it
at the cfs_rq's throttle time. The assumption is that tasks running in
kernel mode should not run for long; freezing the cfs_rq's pelt clock
preserves its load and its corresponding sched_entity's weight.
Hopefully, this results in a stable situation that lets the remaining
running tasks quickly finish their jobs in kernel mode.

For throttle time accounting, I can see several possibilities:
- Similar to current behavior: start accounting when the cfs_rq gets
  throttled (if cfs_rq->nr_queued > 0) and stop accounting when it gets
  unthrottled. This has one drawback: e.g. if this cfs_rq has one task
  when it gets throttled and that task eventually blocks instead of
  returning to user, then this cfs_rq has no tasks on its throttled
  list but the time is still accounted as throttled. Patch2 and patch3
  implement this accounting (simple, fewer code changes).
- Start accounting when the throttled cfs_rq has at least one task on
  its throttled list; stop accounting when it's unthrottled. This
  over-accounts throttled time because the partial throttle state is
  accounted.
- Start accounting when the throttled cfs_rq has no tasks left and its
  throttled list is not empty; stop accounting when this cfs_rq is
  unthrottled. This under-accounts throttled time because the partial
  throttle state is not accounted. Patch7 implements this accounting.
I do not have a strong feeling about which accounting is best; it's
open for discussion.

There is also the concern from v1 about the increased duration of
(un)throttle operations. I've done some tests with a 2000-cgroup/20K
runnable task setup on a 2-socket/384-CPU AMD server, and the longest
duration of distribute_cfs_runtime() is in the 2ms-4ms range. For
details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the task
work setup from throttle time to pick time, it is not an issue anymore.

Patches:
Patch1 is preparation work;

Patch2-3 provide the main functionality.
Patch2 deals with the throttle path: when a cfs_rq is to be throttled,
mark its throttled status, and when a task in the throttled hierarchy
gets picked, add a task work to it so that when the task returns to
user space, the task work can throttle it by dequeuing it, remembering
this by adding the task to its cfs_rq's limbo list.
Patch3 deals with the unthrottle path: when a cfs_rq is to be
unthrottled, enqueue back the tasks on its limbo list.

Patch4 deals with the dequeue path when a task changes group, sched
class etc. A throttled task has been dequeued in fair, but task->on_rq
is still set, so when it changes task group or sched class, or its
affinity setting changes, core will first dequeue it. Since the task is
already dequeued in the fair class, this patch handles that situation.

Patch5-6 are cleanups. Some code is obsolete after switching to the
task based throttle mechanism.

Patch7 implements an alternative accounting mechanism for task based
throttle.

Changes since v1:
- Move "add task work" from throttle time to pick time, suggested by
  Chengming Zhou;
- Use scoped_guard() and cond_resched_tasks_rcu_qs() in
  throttle_cfs_rq_work(), suggested by K Prateek Nayak;
- Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
- Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
  suggested by K Prateek Nayak;
- Fix h_nr_runnable accounting for delayed dequeue case when task based
  throttle is in use;
- Implemented an alternative way of throttle time accounting for
  discussion purpose;
- Make !CONFIG_CFS_BANDWIDTH build.
I hope I didn't omit any feedback I've received, but feel free to let
me know if I did.

As in v1, all changelogs were written by me; if they read poorly, it's
my fault.

Comments are welcome.

Base commit: tip/sched/core, commit 6432e163ba1b ("sched/isolation:
Make use of more than one housekeeping cpu").

Aaron Lu (4):
  sched/fair: Take care of group/affinity/sched_class change for
    throttled task
  sched/fair: get rid of throttled_lb_pair()
  sched/fair: fix h_nr_runnable accounting with per-task throttle
  sched/fair: alternative way of accounting throttle time

Valentin Schneider (3):
  sched/fair: Add related data structure for task based throttle
  sched/fair: Handle throttle path for task based throttle
  sched/fair: Handle unthrottle path for task based throttle

 include/linux/sched.h |   4 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 449 ++++++++++++++++++++++--------------------
 kernel/sched/sched.h  |   7 +
 4 files changed, 248 insertions(+), 215 deletions(-)

-- 
2.39.5


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
@ 2025-04-09 12:07 ` Aaron Lu
  2025-04-14  3:58   ` K Prateek Nayak
  2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

From: Valentin Schneider <vschneid@redhat.com>

Add related data structures for this new throttle functionality.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 ++++++++++++
 kernel/sched/sched.h  |  2 ++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac19828934..0b55c79fee209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -880,6 +880,10 @@ struct task_struct {
 
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct callback_head		sched_throttle_work;
+	struct list_head		throttle_node;
+#endif
 #endif
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 79692f85643fe..3b8735bc527da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4492,6 +4492,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_throttle_work(p);
+#endif
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c19459c80422..894202d232efd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5823,6 +5823,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+	INIT_LIST_HEAD(&p->throttle_node);
+}
+
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6de..921527327f107 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2703,6 +2703,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
+extern void init_cfs_throttle_work(struct task_struct *p);
+
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
 #define RATIO_SHIFT		8
-- 
2.39.5



* [RFC PATCH v2 2/7] sched/fair: Handle throttle path for task based throttle
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
@ 2025-04-09 12:07 ` Aaron Lu
  2025-04-14  8:54   ` Florian Bezdeka
                     ` (2 more replies)
  2025-04-09 12:07 ` [RFC PATCH v2 3/7] sched/fair: Handle unthrottle " Aaron Lu
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

From: Valentin Schneider <vschneid@redhat.com>

In the current throttle model, when a cfs_rq is throttled, its entity
is dequeued from the cpu's rq, making tasks attached to it unable to
run, thus achieving the throttle target.

This has a drawback though: assume a task is a reader of a
percpu_rwsem and is waiting. When it gets woken up, it cannot run until
its task group's next period comes, which can be a relatively long
time. The waiting writer has to wait longer because of this, and it
also makes further readers build up, eventually triggering a task hung.

To improve this situation, change the throttle model to a task-based
one, i.e. when a cfs_rq is throttled, record its throttled status but
do not remove it from the cpu's rq. Instead, for tasks that belong to
this cfs_rq, add a task work to them when they get picked, so that they
can be dequeued when they return to user. This way, throttled tasks
will not hold any kernel resources.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c  | 185 +++++++++++++++++++++----------------------
 kernel/sched/sched.h |   1 +
 2 files changed, 93 insertions(+), 93 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 894202d232efd..c566a5a90d065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5516,8 +5516,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
-	if (cfs_rq->nr_queued == 0)
+	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
+		if (throttled_hierarchy(cfs_rq))
+			list_del_leaf_cfs_rq(cfs_rq);
+	}
 
 	return true;
 }
@@ -5598,7 +5601,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	return se;
 }
 
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -5823,8 +5826,48 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
+	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+
+	WARN_ON_ONCE(p != current);
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+
+	/*
+	 * If task is exiting, then there won't be a return to userspace, so we
+	 * don't have to bother with any of this.
+	 */
+	if ((p->flags & PF_EXITING))
+		return;
+
+	scoped_guard(task_rq_lock, p) {
+		se = &p->se;
+		cfs_rq = cfs_rq_of(se);
+
+		/* Raced, forget */
+		if (p->sched_class != &fair_sched_class)
+			return;
+
+		/*
+		 * If not in limbo, then either replenish has happened or this
+		 * task got migrated out of the throttled cfs_rq, move along.
+		 */
+		if (!cfs_rq->throttle_count)
+			return;
+
+		rq = scope.rq;
+		update_rq_clock(rq);
+		WARN_ON_ONCE(!list_empty(&p->throttle_node));
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		resched_curr(rq);
+	}
+
+	cond_resched_tasks_rcu_qs();
 }
 
 void init_cfs_throttle_work(struct task_struct *p)
@@ -5864,32 +5907,53 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	return 0;
 }
 
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+	return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+	if (task_has_throttle_work(p))
+		return;
+
+	/*
+	 * Kthreads and exiting tasks don't return to userspace, so adding the
+	 * work is pointless
+	 */
+	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+		return;
+
+	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
+	cfs_rq->throttle_count++;
+	if (cfs_rq->throttle_count > 1)
+		return 0;
+
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
-		list_del_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
 
-		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-		if (cfs_rq->nr_queued)
-			cfs_rq->throttled_clock_self = rq_clock(rq);
-	}
-	cfs_rq->throttle_count++;
+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+	if (cfs_rq->nr_queued)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+	else
+		list_del_leaf_cfs_rq(cfs_rq);
 
+	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
 
-static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	int dequeue = 1;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5910,74 +5974,13 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (!dequeue)
-		return false;  /* Throttle no longer required. */
-
-	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+		return;  /* Throttle no longer required. */
 
 	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		int flags;
-
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		/*
-		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
-		 * This avoids teaching dequeue_entities() about throttled
-		 * entities and keeps things relatively simple.
-		 */
-		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
-		if (se->sched_delayed)
-			flags |= DEQUEUE_DELAYED;
-		dequeue_entity(qcfs_rq, se, flags);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-
-		if (qcfs_rq->load.weight) {
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			break;
-		}
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		update_load_avg(qcfs_rq, se, 0);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-	}
-
-	/* At this point se is NULL and we are at root level*/
-	sub_nr_running(rq, queued_delta);
-
-	/* Stop the fair server if throttling resulted in no runnable tasks */
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
-done:
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5986,7 +5989,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	WARN_ON_ONCE(cfs_rq->throttled_clock);
 	if (cfs_rq->nr_queued)
 		cfs_rq->throttled_clock = rq_clock(rq);
-	return true;
+	return;
 }
 
 void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -6462,22 +6465,22 @@ static void sync_throttle(struct task_group *tg, int cpu)
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_bandwidth_used())
-		return false;
+		return;
 
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return false;
+		return;
 
 	/*
 	 * it's possible for a throttled entity to be forced into a running
 	 * state (e.g. set_curr_task), in this case we're finished.
 	 */
 	if (cfs_rq_throttled(cfs_rq))
-		return true;
+		return;
 
-	return throttle_cfs_rq(cfs_rq);
+	throttle_cfs_rq(cfs_rq);
 }
 
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
@@ -6573,6 +6576,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@ -6738,10 +6742,11 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 #else /* CONFIG_CFS_BANDWIDTH */
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void task_throttle_setup_work(struct task_struct *p) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7108,10 +7113,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
-
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7148,10 +7149,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -8860,8 +8857,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
 
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto again;
+		check_cfs_rq_runtime(cfs_rq);
 
 		se = pick_next_entity(rq, cfs_rq);
 		if (!se)
@@ -8888,6 +8884,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		goto idle;
 	se = &p->se;
 
+	if (throttled_hierarchy(cfs_rq_of(se)))
+		task_throttle_setup_work(p);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (prev->sched_class != &fair_sched_class)
 		goto simple;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 921527327f107..97be6a6f53b9c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -736,6 +736,7 @@ struct cfs_rq {
 	int			throttle_count;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	throttled_limbo_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.39.5



* [RFC PATCH v2 3/7] sched/fair: Handle unthrottle path for task based throttle
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
@ 2025-04-09 12:07 ` Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

From: Valentin Schneider <vschneid@redhat.com>

On unthrottle, enqueue throttled tasks back so they can continue to run.

Note that with this task based throttling, the only throttle point is
when a task returns to user space, so as long as a task is enqueued, it
will be allowed to run until it reaches that throttle point, no matter
whether its cfs_rq is throttled or not.

The leaf_cfs_rq list is handled differently now: as long as a task is
enqueued to a cfs_rq, throttled or not, that cfs_rq is added to the
list, and when a cfs_rq is throttled and all its tasks have been
dequeued, it is removed from the list. I think this is easy to reason
about, so I chose to do it this way.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 129 ++++++++++++++++----------------------------
 1 file changed, 45 insertions(+), 84 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c566a5a90d065..4152088fc0546 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5357,18 +5357,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (cfs_rq->nr_queued == 1) {
 		check_enqueue_throttle(cfs_rq);
-		if (!throttled_hierarchy(cfs_rq)) {
-			list_add_leaf_cfs_rq(cfs_rq);
-		} else {
+		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
 			struct rq *rq = rq_of(cfs_rq);
 
 			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
 				cfs_rq->throttled_clock = rq_clock(rq);
 			if (!cfs_rq->throttled_clock_self)
 				cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
 		}
+#endif
 	}
 }
 
@@ -5826,6 +5825,11 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static inline bool task_is_throttled(struct task_struct *p)
+{
+	return !list_empty(&p->throttle_node);
+}
+
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
@@ -5878,32 +5882,41 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }
 
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+	struct task_struct *p, *tmp;
 
 	cfs_rq->throttle_count--;
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
-					     cfs_rq->throttled_clock_pelt;
+	if (cfs_rq->throttle_count)
+		return 0;
 
-		/* Add cfs_rq with load or one or more already running entities to the list */
-		if (!cfs_rq_is_decayed(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+		cfs_rq->throttled_clock_pelt;
 
-		if (cfs_rq->throttled_clock_self) {
-			u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+	if (cfs_rq->throttled_clock_self) {
+		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
 
-			cfs_rq->throttled_clock_self = 0;
+		cfs_rq->throttled_clock_self = 0;
 
-			if (WARN_ON_ONCE((s64)delta < 0))
-				delta = 0;
+		if (WARN_ON_ONCE((s64)delta < 0))
+			delta = 0;
 
-			cfs_rq->throttled_clock_self_time += delta;
-		}
+		cfs_rq->throttled_clock_self_time += delta;
+	}
+
+	/* Re-enqueue the tasks that have been throttled at this level. */
+	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+		list_del_init(&p->throttle_node);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
 	}
 
+	/* Add cfs_rq with load or one or more already running entities to the list */
+	if (!cfs_rq_is_decayed(cfs_rq))
+		list_add_leaf_cfs_rq(cfs_rq);
+
 	return 0;
 }
 
@@ -5996,11 +6009,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
+	/*
+	 * It's possible we are called with !runtime_remaining due to things
+	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
+	 * unthrottled us with a positive runtime_remaining but other still
+	 * running entities consumed those runtime before we reach here.
+	 *
+	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * because any enqueue below will immediately trigger a throttle, which
+	 * is not supposed to happen on unthrottle path.
+	 */
+	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+		return;
 
 	cfs_rq->throttled = 0;
 
@@ -6028,62 +6050,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 			if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
 				break;
 		}
-		goto unthrottle_throttle;
 	}
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		/* Handle any unfinished DELAY_DEQUEUE business first. */
-		if (se->sched_delayed) {
-			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
-			dequeue_entity(qcfs_rq, se, flags);
-		} else if (se->on_rq)
-			break;
-		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		update_load_avg(qcfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	/* Start the fair server if un-throttling resulted in new runnable tasks */
-	if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
-		dl_server_start(&rq->fair_server);
-
-	/* At this point se is NULL and we are at root level*/
-	add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
@@ -6747,6 +6715,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
+static bool task_is_throttled(struct task_struct *p) { return false; }
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -6955,6 +6924,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		util_est_enqueue(&rq->cfs, p);
 
 	if (flags & ENQUEUE_DELAYED) {
+		WARN_ON_ONCE(task_is_throttled(p));
 		requeue_delayed_entity(se);
 		return;
 	}
@@ -6997,10 +6967,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
-
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -7022,10 +6988,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
 	}
 
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7055,7 +7017,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		check_update_overutilized_status(rq);
 
-enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
-- 
2.39.5



* [RFC PATCH v2 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (2 preceding siblings ...)
  2025-04-09 12:07 ` [RFC PATCH v2 3/7] sched/fair: Handle unthrottle " Aaron Lu
@ 2025-04-09 12:07 ` Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 5/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
the core will dequeue it and then requeue it.

A throttled task is still considered queued by the core because p->on_rq
is still set, so the core will dequeue it; but since the task is already
dequeued on throttle in fair, handle this case properly.

Affinity and sched class changes are handled similarly.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4152088fc0546..76b8a5ffcbdd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5882,6 +5882,20 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }
 
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+	/*
+	 * Task is throttled and someone wants to dequeue it again:
+	 * it must be sched/core when core needs to do things like
+	 * task affinity change, task group change, task sched class
+	 * change etc.
+	 */
+	WARN_ON_ONCE(p->se.on_rq);
+	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
+
+	list_del_init(&p->throttle_node);
+}
+
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
@@ -6716,6 +6730,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7146,6 +7161,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (unlikely(task_is_throttled(p))) {
+		dequeue_throttled_task(p, flags);
+		return true;
+	}
+
 	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
 		util_est_dequeue(&rq->cfs, p);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v2 5/7] sched/fair: get rid of throttled_lb_pair()
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (3 preceding siblings ...)
  2025-04-09 12:07 ` [RFC PATCH v2 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
@ 2025-04-09 12:07 ` Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 6/7] sched/fair: fix h_nr_runnable accounting with per-task throttle Aaron Lu
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Now that throttled tasks are dequeued and cannot stay on the rq's
cfs_tasks list, there is no need to take special care of them during
load balancing anymore.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76b8a5ffcbdd2..ff4252995d677 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5808,23 +5808,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttle_count;
 }
 
-/*
- * Ensure that neither of the group entities corresponding to src_cpu or
- * dest_cpu are members of a throttled hierarchy when performing group
- * load-balance operations.
- */
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
-
-	src_cfs_rq = tg->cfs_rq[src_cpu];
-	dest_cfs_rq = tg->cfs_rq[dest_cpu];
-
-	return throttled_hierarchy(src_cfs_rq) ||
-	       throttled_hierarchy(dest_cfs_rq);
-}
-
 static inline bool task_is_throttled(struct task_struct *p)
 {
 	return !list_empty(&p->throttle_node);
@@ -6742,12 +6725,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	return 0;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9377,17 +9354,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) delayed dequeued unless we migrate load, or
-	 * 2) throttled_lb_pair, or
-	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 4) running (obviously), or
-	 * 5) are cache-hot on their current CPU.
+	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
+	 * 3) running (obviously), or
+	 * 4) are cache-hot on their current CPU.
 	 */
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
-	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
-		return 0;
-
 	/*
 	 * We want to prioritize the migration of eligible tasks.
 	 * For ineligible tasks we soft-limit them and only allow
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v2 6/7] sched/fair: fix h_nr_runnable accounting with per-task throttle
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (4 preceding siblings ...)
  2025-04-09 12:07 ` [RFC PATCH v2 5/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
@ 2025-04-09 12:07 ` Aaron Lu
  2025-04-09 12:07 ` [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time Aaron Lu
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Task based throttle no longer adjusts cfs_rq's h_nr_runnable on
throttle but relies on the standard en/dequeue_entity() paths, so there
is no need to take special care of h_nr_runnable in delayed dequeue
operations.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff4252995d677..20471a3aa35e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5406,8 +5406,6 @@ static void set_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable--;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
 
@@ -5428,8 +5426,6 @@ static void clear_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable++;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (5 preceding siblings ...)
  2025-04-09 12:07 ` [RFC PATCH v2 6/7] sched/fair: fix h_nr_runnable accounting with per-task throttle Aaron Lu
@ 2025-04-09 12:07 ` Aaron Lu
  2025-04-09 14:24   ` Aaron Lu
  2025-04-17 14:06   ` Florian Bezdeka
  2025-04-14  3:05 ` [RFC PATCH v2 0/7] Defer throttle when task exits to user Chengming Zhou
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 12:07 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Implement an alternative way of accounting cfs_rq throttle time which:
- starts accounting when a throttled cfs_rq has no tasks enqueued and its
  throttled list is not empty;
- stops accounting when this cfs_rq gets unthrottled or a task gets
  enqueued.

This way, throttle time is accounted only while the cfs_rq has
absolutely no tasks enqueued but still has tasks throttled.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c  | 112 ++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h |   4 ++
 2 files changed, 89 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20471a3aa35e6..70f7de82d1d9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5300,6 +5300,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+static void account_cfs_rq_throttle_self(struct cfs_rq *cfs_rq);
 
 static void
 requeue_delayed_entity(struct sched_entity *se);
@@ -5362,10 +5363,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (throttled_hierarchy(cfs_rq)) {
 			struct rq *rq = rq_of(cfs_rq);
 
-			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
-				cfs_rq->throttled_clock = rq_clock(rq);
-			if (!cfs_rq->throttled_clock_self)
-				cfs_rq->throttled_clock_self = rq_clock(rq);
+			if (cfs_rq->throttled_clock) {
+				cfs_rq->throttled_time +=
+					rq_clock(rq) - cfs_rq->throttled_clock;
+				cfs_rq->throttled_clock = 0;
+			}
+
+			if (cfs_rq->throttled_clock_self)
+				account_cfs_rq_throttle_self(cfs_rq);
 		}
 #endif
 	}
@@ -5453,7 +5458,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		 * DELAY_DEQUEUE relies on spurious wakeups, special task
 		 * states must not suffer spurious wakeups, excempt them.
 		 */
-		if (flags & DEQUEUE_SPECIAL)
+		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
 			delay = false;
 
 		WARN_ON_ONCE(delay && se->sched_delayed);
@@ -5513,8 +5518,24 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
-		if (throttled_hierarchy(cfs_rq))
+
+#ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
 			list_del_leaf_cfs_rq(cfs_rq);
+
+			if (cfs_rq->h_nr_throttled) {
+				struct rq *rq = rq_of(cfs_rq);
+
+				WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+				cfs_rq->throttled_clock_self = rq_clock(rq);
+
+				if (cfs_rq_throttled(cfs_rq)) {
+					WARN_ON_ONCE(cfs_rq->throttled_clock);
+					cfs_rq->throttled_clock = rq_clock(rq);
+				}
+			}
+		}
+#endif
 	}
 
 	return true;
@@ -5809,6 +5830,18 @@ static inline bool task_is_throttled(struct task_struct *p)
 	return !list_empty(&p->throttle_node);
 }
 
+static inline void
+cfs_rq_inc_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr)
+{
+	cfs_rq->h_nr_throttled += nr;
+}
+
+static inline void
+cfs_rq_dec_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr)
+{
+	cfs_rq->h_nr_throttled -= nr;
+}
+
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
@@ -5845,7 +5878,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		rq = scope.rq;
 		update_rq_clock(rq);
 		WARN_ON_ONCE(!list_empty(&p->throttle_node));
-		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		resched_curr(rq);
 	}
@@ -5863,16 +5896,37 @@ void init_cfs_throttle_work(struct task_struct *p)
 
 static void dequeue_throttled_task(struct task_struct *p, int flags)
 {
+	struct sched_entity *se = &p->se;
+
 	/*
 	 * Task is throttled and someone wants to dequeue it again:
 	 * it must be sched/core when core needs to do things like
 	 * task affinity change, task group change, task sched class
 	 * change etc.
 	 */
-	WARN_ON_ONCE(p->se.on_rq);
-	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
+	WARN_ON_ONCE(se->on_rq);
+	WARN_ON_ONCE(flags & DEQUEUE_THROTTLE);
 
 	list_del_init(&p->throttle_node);
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		cfs_rq->h_nr_throttled--;
+	}
+}
+
+static void account_cfs_rq_throttle_self(struct cfs_rq *cfs_rq)
+{
+	/* account self time */
+	u64 delta = rq_clock(rq_of(cfs_rq)) - cfs_rq->throttled_clock_self;
+
+	cfs_rq->throttled_clock_self = 0;
+
+	if (WARN_ON_ONCE((s64)delta < 0))
+		delta = 0;
+
+	cfs_rq->throttled_clock_self_time += delta;
 }
 
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
@@ -5889,27 +5943,21 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
 		cfs_rq->throttled_clock_pelt;
 
-	if (cfs_rq->throttled_clock_self) {
-		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
-
-		cfs_rq->throttled_clock_self = 0;
-
-		if (WARN_ON_ONCE((s64)delta < 0))
-			delta = 0;
-
-		cfs_rq->throttled_clock_self_time += delta;
-	}
+	if (cfs_rq->throttled_clock_self)
+		account_cfs_rq_throttle_self(cfs_rq);
 
 	/* Re-enqueue the tasks that have been throttled at this level. */
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
 	if (!cfs_rq_is_decayed(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
+	WARN_ON_ONCE(cfs_rq->h_nr_throttled);
+
 	return 0;
 }
 
@@ -5945,10 +5993,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	/* group is entering throttled state, stop time */
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
 
-	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock_self = rq_clock(rq);
-	else
+	if (!cfs_rq->nr_queued)
 		list_del_leaf_cfs_rq(cfs_rq);
 
 	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
@@ -5992,9 +6037,6 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * throttled-list.  rq->lock protects completion.
 	 */
 	cfs_rq->throttled = 1;
-	WARN_ON_ONCE(cfs_rq->throttled_clock);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock = rq_clock(rq);
 	return;
 }
 
@@ -6026,6 +6068,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;
 		cfs_rq->throttled_clock = 0;
 	}
+	if (cfs_rq->throttled_time) {
+		cfs_b->throttled_time += cfs_rq->throttled_time;
+		cfs_rq->throttled_time = 0;
+	}
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -6710,6 +6756,8 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
 static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static void cfs_rq_inc_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr) {}
+static void cfs_rq_dec_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -6898,6 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int h_nr_idle = task_has_idle_policy(p);
 	int h_nr_runnable = 1;
+	int h_nr_throttled = (flags & ENQUEUE_THROTTLE) ? 1 : 0;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	u64 slice = 0;
@@ -6951,6 +7000,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_runnable += h_nr_runnable;
 		cfs_rq->h_nr_queued++;
 		cfs_rq->h_nr_idle += h_nr_idle;
+		cfs_rq_dec_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
@@ -6973,6 +7023,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_runnable += h_nr_runnable;
 		cfs_rq->h_nr_queued++;
 		cfs_rq->h_nr_idle += h_nr_idle;
+		cfs_rq_dec_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
@@ -7027,10 +7078,12 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
+	bool task_throttle = flags & DEQUEUE_THROTTLE;
 	struct task_struct *p = NULL;
 	int h_nr_idle = 0;
 	int h_nr_queued = 0;
 	int h_nr_runnable = 0;
+	int h_nr_throttled = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
 
@@ -7040,6 +7093,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		h_nr_idle = task_has_idle_policy(p);
 		if (task_sleep || task_delayed || !se->sched_delayed)
 			h_nr_runnable = 1;
+
+		if (task_throttle)
+			h_nr_throttled = 1;
 	} else {
 		cfs_rq = group_cfs_rq(se);
 		slice = cfs_rq_min_slice(cfs_rq);
@@ -7058,6 +7114,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
 		cfs_rq->h_nr_queued -= h_nr_queued;
 		cfs_rq->h_nr_idle -= h_nr_idle;
+		cfs_rq_inc_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
@@ -7095,6 +7152,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
 		cfs_rq->h_nr_queued -= h_nr_queued;
 		cfs_rq->h_nr_idle -= h_nr_idle;
+		cfs_rq_inc_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 97be6a6f53b9c..54cdec21aa5c2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -721,6 +721,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
+	unsigned int		h_nr_throttled;
 	s64			runtime_remaining;
 
 	u64			throttled_pelt_idle;
@@ -732,6 +733,7 @@ struct cfs_rq {
 	u64			throttled_clock_pelt_time;
 	u64			throttled_clock_self;
 	u64			throttled_clock_self_time;
+	u64			throttled_time;
 	int			throttled;
 	int			throttle_count;
 	struct list_head	throttled_list;
@@ -2360,6 +2362,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_SPECIAL		0x10
 #define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 #define DEQUEUE_DELAYED		0x200 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_THROTTLE	0x800 /* Matches ENQUEUE_THROTTLE */
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
@@ -2377,6 +2380,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATING	0x100
 #define ENQUEUE_DELAYED		0x200
 #define ENQUEUE_RQ_SELECTED	0x400
+#define ENQUEUE_THROTTLE	0x800
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-09 12:07 ` [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time Aaron Lu
@ 2025-04-09 14:24   ` Aaron Lu
  2025-04-17 14:06   ` Florian Bezdeka
  1 sibling, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-09 14:24 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, Apr 09, 2025 at 08:07:46PM +0800, Aaron Lu wrote:
> Implement an alternative way of accounting cfs_rq throttle time which:
> - starts accounting when a throttled cfs_rq has no tasks enqueued and its
>   throttled list is not empty;
> - stops accounting when this cfs_rq gets unthrottled or a task gets
>   enqueued.
> 
> This way, the accounted throttle time is when the cfs_rq has absolutely
> no tasks enqueued and has tasks throttled.
> 
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
>  kernel/sched/fair.c  | 112 ++++++++++++++++++++++++++++++++-----------
>  kernel/sched/sched.h |   4 ++
>  2 files changed, 89 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 20471a3aa35e6..70f7de82d1d9d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5300,6 +5300,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
> +static void account_cfs_rq_throttle_self(struct cfs_rq *cfs_rq);
>  
>  static void
>  requeue_delayed_entity(struct sched_entity *se);
> @@ -5362,10 +5363,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		if (throttled_hierarchy(cfs_rq)) {
>  			struct rq *rq = rq_of(cfs_rq);
>  
> -			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
> -				cfs_rq->throttled_clock = rq_clock(rq);
> -			if (!cfs_rq->throttled_clock_self)
> -				cfs_rq->throttled_clock_self = rq_clock(rq);
> +			if (cfs_rq->throttled_clock) {
> +				cfs_rq->throttled_time +=
> +					rq_clock(rq) - cfs_rq->throttled_clock;
> +				cfs_rq->throttled_clock = 0;
> +			}

This probably needs more explanation.

We could also take cfs_b->lock and directly account the time into
cfs_b->throttled_time, but since enqueue can be frequent, to avoid
possible lock contention I chose to account this time to the CPU-local
cfs_rq and, on unthrottle, add the locally accounted time to
cfs_b->throttled_time.

This has a side effect though: when reading cpu.stat and cpu.stat.local
for a task group with a quota set, throttled_usec in cpu.stat can be
slightly smaller than throttled_usec in cpu.stat.local, since some
throttled time may not have been accounted to cfs_b yet...

> +
> +			if (cfs_rq->throttled_clock_self)
> +				account_cfs_rq_throttle_self(cfs_rq);
>  		}
>  #endif
>  	}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (6 preceding siblings ...)
  2025-04-09 12:07 ` [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time Aaron Lu
@ 2025-04-14  3:05 ` Chengming Zhou
  2025-04-14 11:47   ` Aaron Lu
  2025-04-14  8:54 ` Florian Bezdeka
  2025-04-14 16:34 ` K Prateek Nayak
  9 siblings, 1 reply; 45+ messages in thread
From: Chengming Zhou @ 2025-04-14  3:05 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka

On 2025/4/9 20:07, Aaron Lu wrote:
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> 
> Valentin has described the problem very well in the above link. We also
> have task hung problem from time to time in our environment due to cfs quota.
> It is mostly visible with rwsem: when a reader is throttled, writer comes in
> and has to wait, the writer also makes all subsequent readers wait,
> causing problems of priority inversion or even whole system hung.
> 
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, mark its throttled status but do not
> remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> when they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> those throttled tasks.
> 
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
> 
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.

Seems reasonable to me, although I'm wondering is it possible or desirable
to implement per-task PELT freeze?

> 
> For throttle time accounting, I can see several possibilities:
> - Similar to current behavior: starts accounting when cfs_rq gets
>    throttled(if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
>    gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
>    task when it gets throttled and eventually, that task doesn't return
>    to user but blocks, then this cfs_rq has no tasks on throttled list
>    but time is accounted as throttled; Patch2 and patch3 implements this
>    accounting(simple, fewer code change).
> - Starts accounting when the throttled cfs_rq has at least one task on
>    its throttled list; stops accounting when it's unthrottled. This kind
>    of over accounts throttled time because partial throttle state is
>    accounted.
> - Starts accounting when the throttled cfs_rq has no tasks left and its
>    throttled list is not empty; stops accounting when this cfs_rq is
>    unthrottled; This kind of under accounts throttled time because partial
>    throttle state is not accounted. Patch7 implements this accounting.
> I do not have a strong feeling which accounting is the best, it's open
> for discussion.

I personally prefer option 2, which gives a more useful throttled time:
it tells us how long some tasks were actually throttled.

Thanks!

> 
> There is also the concern of increased duration of (un)throttle operations
> in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> setup on a 2sockets/384cpus AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For throttle path, with Chengming's suggestion to move "task work setup"
> from throttle time to pick time, it's not an issue anymore.
> 
> Patches:
> Patch1 is preparation work;
> 
> Patch2-3 provide the main functionality.
> Patch2 deals with throttle path: when a cfs_rq is to be throttled, mark
> throttled status for this cfs_rq and when tasks in throttled hierarchy
> gets picked, add a task work to them so that when those tasks return to
> user space, the task work can throttle it by dequeuing the task and
> remember this by adding the task to its cfs_rq's limbo list;
> Patch3 deals with unthrottle path: when a cfs_rq is to be unthrottled,
> enqueue back those tasks in limbo list;
> 
> Patch4 deals with the dequeue path when task changes group, sched class
> etc. Task that is throttled is dequeued in fair, but task->on_rq is
> still set so when it changes task group or sched class or has affinity
> setting change, core will firstly dequeue it. But since this task is
> already dequeued in fair class, this patch handle this situation.
> 
> Patch5-6 are clean ups. Some code are obsolete after switching to task
> based throttle mechanism.
> 
> Patch7 implements an alternative accounting mechanism for task based
> throttle.
> 
> Changes since v1:
> - Move "add task work" from throttle time to pick time, suggested by
>    Chengming Zhou;
> - Use scope_gard() and cond_resched_tasks_rcu_qs() in
>    throttle_cfs_rq_work(), suggested by K Prateek Nayak;
> - Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
> - Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
>    suggested by K Prateek Nayak;
> - Fix h_nr_runnable accounting for delayed dequeue case when task based
>    throttle is in use;
> - Implemented an alternative way of throttle time accounting for
>    discussion purpose;
> - Make !CONFIG_CFS_BANDWIDTH build.
> I hope I didn't omit any feedbacks I've received, but feel free to let me
> know if I did.
> 
> As in v1, all change logs are written by me and if they read bad, it's
> my fault.
> 
> Comments are welcome.
> 
> Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
> use of more than one housekeeping cpu").
> 
> Aaron Lu (4):
>    sched/fair: Take care of group/affinity/sched_class change for
>      throttled task
>    sched/fair: get rid of throttled_lb_pair()
>    sched/fair: fix h_nr_runnable accounting with per-task throttle
>    sched/fair: alternative way of accounting throttle time
> 
> Valentin Schneider (3):
>    sched/fair: Add related data structure for task based throttle
>    sched/fair: Handle throttle path for task based throttle
>    sched/fair: Handle unthrottle path for task based throttle
> 
>   include/linux/sched.h |   4 +
>   kernel/sched/core.c   |   3 +
>   kernel/sched/fair.c   | 449 ++++++++++++++++++++++--------------------
>   kernel/sched/sched.h  |   7 +
>   4 files changed, 248 insertions(+), 215 deletions(-)
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
  2025-04-09 12:07 ` [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
@ 2025-04-14  3:58   ` K Prateek Nayak
  2025-04-14 11:55     ` Aaron Lu
  0 siblings, 1 reply; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-14  3:58 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hello Aaron,

On 4/9/2025 5:37 PM, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
> 
> Add related data structures for this new throttle functionality.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
>   include/linux/sched.h |  4 ++++
>   kernel/sched/core.c   |  3 +++
>   kernel/sched/fair.c   | 12 ++++++++++++
>   kernel/sched/sched.h  |  2 ++
>   4 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac19828934..0b55c79fee209 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -880,6 +880,10 @@ struct task_struct {
>   
>   #ifdef CONFIG_CGROUP_SCHED
>   	struct task_group		*sched_task_group;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	struct callback_head		sched_throttle_work;
> +	struct list_head		throttle_node;

Since throttled tasks are fully dequeued before being placed on the
"throttled_limbo_list", is it possible to reuse "p->se.group_node"?

Currently, it is used to track the task on "rq->cfs_tasks" and during
load balancing when moving a bunch of tasks between CPUs, but since a
fully throttled task is tracked by neither, it should be safe to reuse
this field (CONFIG_DEBUG_LIST will scream if I'm wrong) and save some
space in task_struct.

Thoughts?

-- 
Thanks and Regards,
Prateek

> +#endif
>   #endif
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (7 preceding siblings ...)
  2025-04-14  3:05 ` [RFC PATCH v2 0/7] Defer throttle when task exits to user Chengming Zhou
@ 2025-04-14  8:54 ` Florian Bezdeka
  2025-04-14 12:04   ` Aaron Lu
  2025-04-14 16:34 ` K Prateek Nayak
  9 siblings, 1 reply; 45+ messages in thread
From: Florian Bezdeka @ 2025-04-14  8:54 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hi Aaron, Hi Valentin,

On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> 
> Valentin has described the problem very well in the above link. We also
> have task hung problem from time to time in our environment due to cfs quota.
> It is mostly visible with rwsem: when a reader is throttled, writer comes in
> and has to wait, the writer also makes all subsequent readers wait,
> causing problems of priority inversion or even whole system hung.

For testing purposes, I backported this series to 6.14. We're currently
hunting a sporadic bug with PREEMPT_RT enabled: we see RCU stalls and
complete system freezes after a couple of days with some container
workload deployed. See [1].

It's too early to report "success", but this series seems to fix the
circular dependency / system hang. Testing is still ongoing.

While backporting I noticed some minor code style "issues". I will post
them afterwards. Feel free to ignore...

Best regards,
Florian

[1] https://lore.kernel.org/linux-rt-users/20250409135720.YuroItHp@linutronix.de/T/#t



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 2/7] sched/fair: Handle throttle path for task based throttle
  2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
@ 2025-04-14  8:54   ` Florian Bezdeka
  2025-04-14 12:10     ` Aaron Lu
  2025-04-14 14:39   ` Florian Bezdeka
  2025-04-30 10:01   ` Aaron Lu
  2 siblings, 1 reply; 45+ messages in thread
From: Florian Bezdeka @ 2025-04-14  8:54 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
> 
> In the current throttle model, when a cfs_rq is throttled, its entity is
> dequeued from the cpu's rq, making tasks attached to it unable to run,
> thus achieving the throttle target.
> 
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken up, it cannot run until its task
> group's next period comes, which can be a relatively long time. A
> waiting writer will have to wait longer because of this, and it also
> makes further readers build up, eventually triggering a task hung.
> 
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, record its throttled status but do not
> remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> when they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
>  kernel/sched/fair.c  | 185 +++++++++++++++++++++----------------------
>  kernel/sched/sched.h |   1 +
>  2 files changed, 93 insertions(+), 93 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 894202d232efd..c566a5a90d065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5516,8 +5516,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	if (flags & DEQUEUE_DELAYED)
>  		finish_delayed_dequeue_entity(se);
>  
> -	if (cfs_rq->nr_queued == 0)
> +	if (cfs_rq->nr_queued == 0) {
>  		update_idle_cfs_rq_clock_pelt(cfs_rq);
> +		if (throttled_hierarchy(cfs_rq))
> +			list_del_leaf_cfs_rq(cfs_rq);
> +	}
>  
>  	return true;
>  }
> @@ -5598,7 +5601,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>  	return se;
>  }
>  
> -static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> +static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
>  
>  static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>  {
> @@ -5823,8 +5826,48 @@ static inline int throttled_lb_pair(struct task_group *tg,
>  	       throttled_hierarchy(dest_cfs_rq);
>  }
>  
> +static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
>  static void throttle_cfs_rq_work(struct callback_head *work)
>  {
> +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> +	struct sched_entity *se;
> +	struct cfs_rq *cfs_rq;
> +	struct rq *rq;
> +
> +	WARN_ON_ONCE(p != current);
> +	p->sched_throttle_work.next = &p->sched_throttle_work;
> +
> +	/*
> +	 * If task is exiting, then there won't be a return to userspace, so we
> +	 * don't have to bother with any of this.
> +	 */
> +	if ((p->flags & PF_EXITING))
> +		return;
> +
> +	scoped_guard(task_rq_lock, p) {
> +		se = &p->se;
> +		cfs_rq = cfs_rq_of(se);
> +
> +		/* Raced, forget */
> +		if (p->sched_class != &fair_sched_class)
> +			return;
> +
> +		/*
> +		 * If not in limbo, then either replenish has happened or this
> +		 * task got migrated out of the throttled cfs_rq, move along.
> +		 */
> +		if (!cfs_rq->throttle_count)
> +			return;
> +
> +		rq = scope.rq;
> +		update_rq_clock(rq);
> +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> +		resched_curr(rq);
> +	}
> +
> +	cond_resched_tasks_rcu_qs();
>  }
>  
>  void init_cfs_throttle_work(struct task_struct *p)
> @@ -5864,32 +5907,53 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>  	return 0;
>  }
>  
> +static inline bool task_has_throttle_work(struct task_struct *p)
> +{
> +	return p->sched_throttle_work.next != &p->sched_throttle_work;
> +}
> +
> +static inline void task_throttle_setup_work(struct task_struct *p)
> +{
> +	if (task_has_throttle_work(p))
> +		return;
> +
> +	/*
> +	 * Kthreads and exiting tasks don't return to userspace, so adding the
> +	 * work is pointless
> +	 */
> +	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
> +		return;
> +
> +	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
> +}
> +
>  static int tg_throttle_down(struct task_group *tg, void *data)
>  {
>  	struct rq *rq = data;
>  	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>  
> +	cfs_rq->throttle_count++;
> +	if (cfs_rq->throttle_count > 1)
> +		return 0;
> +
>  	/* group is entering throttled state, stop time */
> -	if (!cfs_rq->throttle_count) {
> -		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> -		list_del_leaf_cfs_rq(cfs_rq);
> +	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>  
> -		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> -		if (cfs_rq->nr_queued)
> -			cfs_rq->throttled_clock_self = rq_clock(rq);
> -	}
> -	cfs_rq->throttle_count++;
> +	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> +	if (cfs_rq->nr_queued)
> +		cfs_rq->throttled_clock_self = rq_clock(rq);
> +	else
> +		list_del_leaf_cfs_rq(cfs_rq);
>  
> +	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
>  	return 0;
>  }

tg_throttle_down() is touched twice in this series. Some code added
here (as part of patch 2) is later removed again in patch 7.

Maybe there is some room for improvement...

>  
> -static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -	struct sched_entity *se;
> -	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
> -	long rq_h_nr_queued = rq->cfs.h_nr_queued;
> +	int dequeue = 1;
>  
>  	raw_spin_lock(&cfs_b->lock);
>  	/* This will start the period timer if necessary */
> @@ -5910,74 +5974,13 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  	raw_spin_unlock(&cfs_b->lock);
>  
>  	if (!dequeue)
> -		return false;  /* Throttle no longer required. */
> -
> -	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +		return;  /* Throttle no longer required. */
>  
>  	/* freeze hierarchy runnable averages while throttled */
>  	rcu_read_lock();
>  	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
>  	rcu_read_unlock();
>  
> -	queued_delta = cfs_rq->h_nr_queued;
> -	runnable_delta = cfs_rq->h_nr_runnable;
> -	idle_delta = cfs_rq->h_nr_idle;
> -	for_each_sched_entity(se) {
> -		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> -		int flags;
> -
> -		/* throttled entity or throttle-on-deactivate */
> -		if (!se->on_rq)
> -			goto done;
> -
> -		/*
> -		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
> -		 * This avoids teaching dequeue_entities() about throttled
> -		 * entities and keeps things relatively simple.
> -		 */
> -		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
> -		if (se->sched_delayed)
> -			flags |= DEQUEUE_DELAYED;
> -		dequeue_entity(qcfs_rq, se, flags);
> -
> -		if (cfs_rq_is_idle(group_cfs_rq(se)))
> -			idle_delta = cfs_rq->h_nr_queued;
> -
> -		qcfs_rq->h_nr_queued -= queued_delta;
> -		qcfs_rq->h_nr_runnable -= runnable_delta;
> -		qcfs_rq->h_nr_idle -= idle_delta;
> -
> -		if (qcfs_rq->load.weight) {
> -			/* Avoid re-evaluating load for this entity: */
> -			se = parent_entity(se);
> -			break;
> -		}
> -	}
> -
> -	for_each_sched_entity(se) {
> -		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> -		/* throttled entity or throttle-on-deactivate */
> -		if (!se->on_rq)
> -			goto done;
> -
> -		update_load_avg(qcfs_rq, se, 0);
> -		se_update_runnable(se);
> -
> -		if (cfs_rq_is_idle(group_cfs_rq(se)))
> -			idle_delta = cfs_rq->h_nr_queued;
> -
> -		qcfs_rq->h_nr_queued -= queued_delta;
> -		qcfs_rq->h_nr_runnable -= runnable_delta;
> -		qcfs_rq->h_nr_idle -= idle_delta;
> -	}
> -
> -	/* At this point se is NULL and we are at root level*/
> -	sub_nr_running(rq, queued_delta);
> -
> -	/* Stop the fair server if throttling resulted in no runnable tasks */
> -	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
> -		dl_server_stop(&rq->fair_server);
> -done:
>  	/*
>  	 * Note: distribution will already see us throttled via the
>  	 * throttled-list.  rq->lock protects completion.
> @@ -5986,7 +5989,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  	WARN_ON_ONCE(cfs_rq->throttled_clock);
>  	if (cfs_rq->nr_queued)
>  		cfs_rq->throttled_clock = rq_clock(rq);
> -	return true;
> +	return;

Obsolete now, could be removed.

>  }
>  
>  void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> @@ -6462,22 +6465,22 @@ static void sync_throttle(struct task_group *tg, int cpu)
>  }
>  
>  /* conditionally throttle active cfs_rq's from put_prev_entity() */
> -static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> +static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	if (!cfs_bandwidth_used())
> -		return false;
> +		return;
>  
>  	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
> -		return false;
> +		return;
>  
>  	/*
>  	 * it's possible for a throttled entity to be forced into a running
>  	 * state (e.g. set_curr_task), in this case we're finished.
>  	 */
>  	if (cfs_rq_throttled(cfs_rq))
> -		return true;
> +		return;
>  
> -	return throttle_cfs_rq(cfs_rq);
> +	throttle_cfs_rq(cfs_rq);
>  }
>  
>  static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> @@ -6573,6 +6576,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  	cfs_rq->runtime_enabled = 0;
>  	INIT_LIST_HEAD(&cfs_rq->throttled_list);
>  	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
> +	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
>  }
>  
>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> @@ -6738,10 +6742,11 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
>  #else /* CONFIG_CFS_BANDWIDTH */
>  
>  static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
> -static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> +static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
>  static inline void sync_throttle(struct task_group *tg, int cpu) {}
>  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> +static void task_throttle_setup_work(struct task_struct *p) {}
>  
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>  {
> @@ -7108,10 +7113,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  		if (cfs_rq_is_idle(cfs_rq))
>  			h_nr_idle = h_nr_queued;
>  
> -		/* end evaluation on encountering a throttled cfs_rq */
> -		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> -
>  		/* Don't dequeue parent if it has other entities besides us */
>  		if (cfs_rq->load.weight) {
>  			slice = cfs_rq_min_slice(cfs_rq);
> @@ -7148,10 +7149,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  
>  		if (cfs_rq_is_idle(cfs_rq))
>  			h_nr_idle = h_nr_queued;
> -
> -		/* end evaluation on encountering a throttled cfs_rq */
> -		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
>  	}
>  
>  	sub_nr_running(rq, h_nr_queued);
> @@ -8860,8 +8857,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>  		if (cfs_rq->curr && cfs_rq->curr->on_rq)
>  			update_curr(cfs_rq);
>  
> -		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> -			goto again;
> +		check_cfs_rq_runtime(cfs_rq);
>  
>  		se = pick_next_entity(rq, cfs_rq);
>  		if (!se)
> @@ -8888,6 +8884,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  		goto idle;
>  	se = &p->se;
>  
> +	if (throttled_hierarchy(cfs_rq_of(se)))
> +		task_throttle_setup_work(p);
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	if (prev->sched_class != &fair_sched_class)
>  		goto simple;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 921527327f107..97be6a6f53b9c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -736,6 +736,7 @@ struct cfs_rq {
>  	int			throttle_count;
>  	struct list_head	throttled_list;
>  	struct list_head	throttled_csd_list;
> +	struct list_head	throttled_limbo_list;
>  #endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  };


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-14  3:05 ` [RFC PATCH v2 0/7] Defer throttle when task exits to user Chengming Zhou
@ 2025-04-14 11:47   ` Aaron Lu
  0 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-14 11:47 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka

On Mon, Apr 14, 2025 at 11:05:30AM +0800, Chengming Zhou wrote:
> On 2025/4/9 20:07, Aaron Lu wrote:
> > This is a continuous work based on Valentin Schneider's posting here:
> > Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> > https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> > 
> > Valentin has described the problem very well in the above link. We also
> > see task hung problems from time to time in our environment due to cfs
> > quota. It is mostly visible with rwsem: when a reader is throttled, a
> > writer that comes in has to wait; the writer also makes all subsequent
> > readers wait, causing priority inversion or even a whole-system hang.
> > 
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, mark its throttled status but do not
> > remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> > when they get picked, add a task work to them so that when they return
> > to user, they can be dequeued. In this way, tasks throttled will not
> > hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> > those throttled tasks.
> > 
> > There are consequences because of this new throttle model. E.g. for a
> > cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> > return2user path while one task is still running in kernel mode, this
> > cfs_rq is in a partially throttled state:
> > - Should its pelt clock be frozen?
> > - Should this state be accounted into throttled_time?
> > 
> > For the pelt clock, I chose to keep the current behavior and freeze it
> > at the cfs_rq's throttle time. The assumption is that tasks running in
> > kernel mode should not last too long; freezing the cfs_rq's pelt clock
> > keeps its load and its corresponding sched_entity's weight. Hopefully,
> > this can result in a stable situation that lets the remaining running
> > tasks quickly finish their jobs in kernel mode.
> 
> Seems reasonable to me, although I'm wondering whether it is possible
> or desirable to implement a per-task PELT freeze?

Interesting idea.

One thing I'm wondering: would a per-task PELT freeze cause the task's
and its cfs_rq's pelt clocks to go out of sync? If so, I feel it would
create some headaches, but I haven't thought this through yet.

> > 
> > For throttle time accounting, I can see several possibilities:
> > - Similar to current behavior: starts accounting when the cfs_rq gets
> >    throttled (if cfs_rq->nr_queued > 0) and stops accounting when the
> >    cfs_rq gets unthrottled. This has one drawback: e.g. if this cfs_rq
> >    has one task when it gets throttled and that task eventually blocks
> >    instead of returning to user, then this cfs_rq has no tasks on its
> >    throttled list but the time is still accounted as throttled. Patch2
> >    and patch3 implement this accounting (simple, fewer code changes).
> > - Starts accounting when the throttled cfs_rq has at least one task on
> >    its throttled list; stops accounting when it's unthrottled. This
> >    somewhat over-accounts throttled time because the partial throttle
> >    state is accounted.
> > - Starts accounting when the throttled cfs_rq has no tasks left and its
> >    throttled list is not empty; stops accounting when this cfs_rq is
> >    unthrottled. This somewhat under-accounts throttled time because the
> >    partial throttle state is not accounted. Patch7 implements this
> >    accounting.
> > I do not have a strong feeling about which accounting is best; it's
> > open for discussion.
> 
> I personally prefer option 2, which has a more practical throttled time,
> so we can know how long there are some tasks throttled in fact.
> 
> Thanks!

Thanks for the input.

Now that I think about this more, I feel option 2 is essentially a
better version of option 1 because it doesn't have the drawback I
mentioned above, so option 1 should probably just be ruled out.

Then there are only 2 options to consider and their difference is
basically whether to treat partial throttle state as throttled or not.
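
To make that trade-off concrete, the two candidates can be modeled as a
tiny accounting state machine driven by a fake clock. This is only an
illustrative userspace sketch; all names (struct acct, opt2_*, opt3_*)
are made up for discussion and none of this is the kernel's actual code:

```c
#include <stdbool.h>

/* Toy throttled-time accumulator driven by an artificial clock. */
struct acct {
	bool accounting;        /* currently accumulating throttled time */
	unsigned long start;    /* clock value when accounting started */
	unsigned long total;    /* accumulated throttled time */
};

static void acct_start(struct acct *a, unsigned long now)
{
	if (!a->accounting) {
		a->accounting = true;
		a->start = now;
	}
}

static void acct_stop(struct acct *a, unsigned long now)
{
	if (a->accounting) {
		a->accounting = false;
		a->total += now - a->start;
	}
}

/*
 * Option 2: start accounting as soon as the first task lands on the
 * limbo list; the partial throttle state is counted. Stops at
 * unthrottle via acct_stop().
 */
static void opt2_task_limbo(struct acct *a, int nr_limbo, unsigned long now)
{
	if (nr_limbo == 1)	/* first throttled task */
		acct_start(a, now);
}

/*
 * Option 3: account only once the cfs_rq has no queued tasks left and
 * its limbo list is non-empty; the partial throttle state is not
 * counted.
 */
static void opt3_update(struct acct *a, int nr_queued, int nr_limbo,
			unsigned long now)
{
	if (nr_queued == 0 && nr_limbo > 0)
		acct_start(a, now);
	else
		acct_stop(a, now);
}
```

With a cfs_rq throttled at t=0, two of its three tasks reaching the
limbo list at t=10, the last one at t=30 and unthrottle at t=100,
option 2 accounts 90 time units while option 3 accounts 70; the 20-unit
gap is exactly the partially throttled window.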

Thanks,
Aaron

> > 
> > There is also the concern of increased duration of (un)throttle operations
> > in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> > setup on a 2sockets/384cpus AMD server, the longest duration of
> > distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> > https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> > For throttle path, with Chengming's suggestion to move "task work setup"
> > from throttle time to pick time, it's not an issue anymore.
> > 
> > Patches:
> > Patch1 is preparation work;
> > 
> > Patch2-3 provide the main functionality.
> > Patch2 deals with the throttle path: when a cfs_rq is to be throttled,
> > mark the throttled status for this cfs_rq, and when tasks in the
> > throttled hierarchy get picked, add a task work to them so that when
> > those tasks return to user space, the task work can throttle them by
> > dequeuing the task, remembering this by adding the task to its
> > cfs_rq's limbo list;
> > Patch3 deals with the unthrottle path: when a cfs_rq is to be
> > unthrottled, enqueue back the tasks on its limbo list;
> > 
> > Patch4 deals with the dequeue path when a task changes group, sched
> > class etc. A throttled task is dequeued in fair, but task->on_rq is
> > still set, so when it changes task group or sched class, or has an
> > affinity setting change, the core will first dequeue it. Since this
> > task is already dequeued in the fair class, this patch handles that
> > situation.
> > 
> > Patch5-6 are cleanups. Some code is obsolete after switching to the
> > task based throttle mechanism.
> > 
> > Patch7 implements an alternative accounting mechanism for task based
> > throttle.
> > 
> > Changes since v1:
> > - Move "add task work" from throttle time to pick time, suggested by
> >    Chengming Zhou;
> > - Use scope_gard() and cond_resched_tasks_rcu_qs() in
> >    throttle_cfs_rq_work(), suggested by K Prateek Nayak;
> > - Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
> > - Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
> >    suggested by K Prateek Nayak;
> > - Fix h_nr_runnable accounting for delayed dequeue case when task based
> >    throttle is in use;
> > - Implemented an alternative way of throttle time accounting for
> >    discussion purpose;
> > - Make !CONFIG_CFS_BANDWIDTH build.
> > I hope I didn't omit any feedback I've received, but feel free to let
> > me know if I did.
> > 
> > As in v1, all change logs are written by me, and if they read badly,
> > it's my fault.
> > 
> > Comments are welcome.
> > 
> > Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
> > use of more than one housekeeping cpu").
> > 
> > Aaron Lu (4):
> >    sched/fair: Take care of group/affinity/sched_class change for
> >      throttled task
> >    sched/fair: get rid of throttled_lb_pair()
> >    sched/fair: fix h_nr_runnable accounting with per-task throttle
> >    sched/fair: alternative way of accounting throttle time
> > 
> > Valentin Schneider (3):
> >    sched/fair: Add related data structure for task based throttle
> >    sched/fair: Handle throttle path for task based throttle
> >    sched/fair: Handle unthrottle path for task based throttle
> > 
> >   include/linux/sched.h |   4 +
> >   kernel/sched/core.c   |   3 +
> >   kernel/sched/fair.c   | 449 ++++++++++++++++++++++--------------------
> >   kernel/sched/sched.h  |   7 +
> >   4 files changed, 248 insertions(+), 215 deletions(-)
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
  2025-04-14  3:58   ` K Prateek Nayak
@ 2025-04-14 11:55     ` Aaron Lu
  2025-04-14 13:37       ` K Prateek Nayak
  0 siblings, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-04-14 11:55 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka

On Mon, Apr 14, 2025 at 09:28:36AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 4/9/2025 5:37 PM, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> > 
> > Add related data structures for this new throttle functionality.
> > 
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> >   include/linux/sched.h |  4 ++++
> >   kernel/sched/core.c   |  3 +++
> >   kernel/sched/fair.c   | 12 ++++++++++++
> >   kernel/sched/sched.h  |  2 ++
> >   4 files changed, 21 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index f96ac19828934..0b55c79fee209 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -880,6 +880,10 @@ struct task_struct {
> >   #ifdef CONFIG_CGROUP_SCHED
> >   	struct task_group		*sched_task_group;
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +	struct callback_head		sched_throttle_work;
> > +	struct list_head		throttle_node;
> 
> Since throttled tasks are fully dequeued before placing on the
> "throttled_limbo_list", is it possible to reuse "p->se.group_node"?

I think it might be possible.

> Currently, it is used to track the task on "rq->cfs_tasks" and during
> load-balancing when moving a bunch of tasks between CPUs but since a
> fully throttled task is not tracked by either, it should be safe to
> reuse this bit (CONFIG_DEBUG_LIST will scream if I'm wrong) and save
> up on some space in the  task_struct.
> 
> Thoughts?

Is it that adding throttle_node would cause task_struct to just cross a
cacheline boundary? :-)

Or is it mainly a concern that a system could have many tasks, so any
saving in task_struct is worth trying?

I can see that reusing another field would make task_is_throttled() more
obscure to digest and implement, but I think it is doable.
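
For illustration, here is a minimal userspace sketch of the underlying
idea, with stand-in list primitives and hypothetical struct task /
struct cfs_rq types rather than the kernel's actual ones:

```c
#include <stdbool.h>

/* Minimal stand-ins for the kernel's intrusive list primitives. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->next->prev = n->prev;
	n->prev->next = n->next;
	INIT_LIST_HEAD(n);	/* leave the node self-linked again */
}

/* Hypothetical stand-ins for the relevant task_struct/cfs_rq fields. */
struct task { struct list_head throttle_node; };
struct cfs_rq { struct list_head throttled_limbo_list; };

/*
 * Membership on the limbo list doubles as the "is throttled" state: the
 * node stays self-linked (empty) whenever the task is not throttled, so
 * no extra flag is needed. This is what would make reusing an existing
 * node such as se.group_node conceivable, at the cost of a less obvious
 * predicate.
 */
static bool task_is_throttled(struct task *p)
{
	return !list_empty(&p->throttle_node);
}
```

Throttling then amounts to list_add() onto throttled_limbo_list and
unthrottling to list_del_init(), which flips task_is_throttled() back
to false without any separate bookkeeping.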

Thanks,
Aaron

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-14  8:54 ` Florian Bezdeka
@ 2025-04-14 12:04   ` Aaron Lu
  2025-04-15  5:29     ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-04-14 12:04 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hi Florian,

On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
> Hi Aaron, Hi Valentin,
> 
> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> > This is a continuous work based on Valentin Schneider's posting here:
> > Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> > https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> > 
> > Valentin has described the problem very well in the above link. We also
> > see task hung problems from time to time in our environment due to cfs
> > quota. It is mostly visible with rwsem: when a reader is throttled, a
> > writer that comes in has to wait; the writer also makes all subsequent
> > readers wait, causing priority inversion or even a whole-system hang.
> 
> For testing purposes, I backported this series to 6.14. We're currently
> hunting a sporadic bug with PREEMPT_RT enabled: we see RCU stalls and
> complete system freezes after a couple of days with some container
> workload deployed. See [1].

I tried to put together a setup last week to reproduce the RT/cfs
throttle deadlock issue Valentin described, but haven't succeeded yet...

> It's too early to report "success", but this series seems to fix the
> circular dependency / system hang. Testing is still ongoing.

Good to know this and thanks for giving it a try.

> While backporting I noticed some minor code style "issues". I will post
> them afterwards. Feel free to ignore...

Your comments are welcome.

Best regards,
Aaron

> 
> [1] https://lore.kernel.org/linux-rt-users/20250409135720.YuroItHp@linutronix.de/T/#t

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 2/7] sched/fair: Handle throttle path for task based throttle
  2025-04-14  8:54   ` Florian Bezdeka
@ 2025-04-14 12:10     ` Aaron Lu
  0 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-14 12:10 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Mon, Apr 14, 2025 at 10:54:59AM +0200, Florian Bezdeka wrote:
> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> > 
> > In the current throttle model, when a cfs_rq is throttled, its entity is
> > dequeued from the cpu's rq, making tasks attached to it unable to run,
> > thus achieving the throttle target.
> > 
> > This has a drawback though: assume a task is a reader of percpu_rwsem
> > and is waiting. When it gets woken up, it cannot run until its task
> > group's next period comes, which can be a relatively long time. A
> > waiting writer will have to wait longer because of this, and it also
> > makes further readers build up, eventually triggering a task hung.
> > 
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, record its throttled status but do not
> > remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> > when they get picked, add a task work to them so that when they return
> > to user, they can be dequeued. In this way, tasks throttled will not
> > hold any kernel resources.
> > 
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> >  kernel/sched/fair.c  | 185 +++++++++++++++++++++----------------------
> >  kernel/sched/sched.h |   1 +
> >  2 files changed, 93 insertions(+), 93 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 894202d232efd..c566a5a90d065 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5516,8 +5516,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  	if (flags & DEQUEUE_DELAYED)
> >  		finish_delayed_dequeue_entity(se);
> >  
> > -	if (cfs_rq->nr_queued == 0)
> > +	if (cfs_rq->nr_queued == 0) {
> >  		update_idle_cfs_rq_clock_pelt(cfs_rq);
> > +		if (throttled_hierarchy(cfs_rq))
> > +			list_del_leaf_cfs_rq(cfs_rq);
> > +	}
> >  
> >  	return true;
> >  }
> > @@ -5598,7 +5601,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> >  	return se;
> >  }
> >  
> > -static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> > +static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> >  
> >  static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
> >  {
> > @@ -5823,8 +5826,48 @@ static inline int throttled_lb_pair(struct task_group *tg,
> >  	       throttled_hierarchy(dest_cfs_rq);
> >  }
> >  
> > +static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> >  static void throttle_cfs_rq_work(struct callback_head *work)
> >  {
> > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > +	struct sched_entity *se;
> > +	struct cfs_rq *cfs_rq;
> > +	struct rq *rq;
> > +
> > +	WARN_ON_ONCE(p != current);
> > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > +
> > +	/*
> > +	 * If task is exiting, then there won't be a return to userspace, so we
> > +	 * don't have to bother with any of this.
> > +	 */
> > +	if ((p->flags & PF_EXITING))
> > +		return;
> > +
> > +	scoped_guard(task_rq_lock, p) {
> > +		se = &p->se;
> > +		cfs_rq = cfs_rq_of(se);
> > +
> > +		/* Raced, forget */
> > +		if (p->sched_class != &fair_sched_class)
> > +			return;
> > +
> > +		/*
> > +		 * If not in limbo, then either replenish has happened or this
> > +		 * task got migrated out of the throttled cfs_rq, move along.
> > +		 */
> > +		if (!cfs_rq->throttle_count)
> > +			return;
> > +
> > +		rq = scope.rq;
> > +		update_rq_clock(rq);
> > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > +		resched_curr(rq);
> > +	}
> > +
> > +	cond_resched_tasks_rcu_qs();
> >  }
> >  
> >  void init_cfs_throttle_work(struct task_struct *p)
> > @@ -5864,32 +5907,53 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> >  	return 0;
> >  }
> >  
> > +static inline bool task_has_throttle_work(struct task_struct *p)
> > +{
> > +	return p->sched_throttle_work.next != &p->sched_throttle_work;
> > +}
> > +
> > +static inline void task_throttle_setup_work(struct task_struct *p)
> > +{
> > +	if (task_has_throttle_work(p))
> > +		return;
> > +
> > +	/*
> > +	 * Kthreads and exiting tasks don't return to userspace, so adding the
> > +	 * work is pointless
> > +	 */
> > +	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
> > +		return;
> > +
> > +	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
> > +}
> > +
> >  static int tg_throttle_down(struct task_group *tg, void *data)
> >  {
> >  	struct rq *rq = data;
> >  	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> >  
> > +	cfs_rq->throttle_count++;
> > +	if (cfs_rq->throttle_count > 1)
> > +		return 0;
> > +
> >  	/* group is entering throttled state, stop time */
> > -	if (!cfs_rq->throttle_count) {
> > -		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > -		list_del_leaf_cfs_rq(cfs_rq);
> > +	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> >  
> > -		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> > -		if (cfs_rq->nr_queued)
> > -			cfs_rq->throttled_clock_self = rq_clock(rq);
> > -	}
> > -	cfs_rq->throttle_count++;
> > +	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> > +	if (cfs_rq->nr_queued)
> > +		cfs_rq->throttled_clock_self = rq_clock(rq);
> > +	else
> > +		list_del_leaf_cfs_rq(cfs_rq);
> >  
> > +	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
> >  	return 0;
> >  }
> 
> tg_throttle_down() is touched twice in this series. Some code added
> here (as part of patch 2) is later removed again in patch 7.
> 
> Maybe there is some room for improvement...

Yes.

The purpose of patch7 is to show an alternative accounting for this
new per-task throttle model; since we haven't decided on the proper way
to do the accounting yet, I chose to separate it out. Another rationale
is that I want to keep the core of the patchset (patch2 and patch3) as
simple as possible to ease reviewing. Does this make sense? If folding
them together is better, I can do that for the next version.

> >  
> > -static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> > +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> >  	struct rq *rq = rq_of(cfs_rq);
> >  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > -	struct sched_entity *se;
> > -	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
> > -	long rq_h_nr_queued = rq->cfs.h_nr_queued;
> > +	int dequeue = 1;
> >  
> >  	raw_spin_lock(&cfs_b->lock);
> >  	/* This will start the period timer if necessary */
> > @@ -5910,74 +5974,13 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >  	raw_spin_unlock(&cfs_b->lock);
> >  
> >  	if (!dequeue)
> > -		return false;  /* Throttle no longer required. */
> > -
> > -	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > +		return;  /* Throttle no longer required. */
> >  
> >  	/* freeze hierarchy runnable averages while throttled */
> >  	rcu_read_lock();
> >  	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
> >  	rcu_read_unlock();
> >  
> > -	queued_delta = cfs_rq->h_nr_queued;
> > -	runnable_delta = cfs_rq->h_nr_runnable;
> > -	idle_delta = cfs_rq->h_nr_idle;
> > -	for_each_sched_entity(se) {
> > -		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> > -		int flags;
> > -
> > -		/* throttled entity or throttle-on-deactivate */
> > -		if (!se->on_rq)
> > -			goto done;
> > -
> > -		/*
> > -		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
> > -		 * This avoids teaching dequeue_entities() about throttled
> > -		 * entities and keeps things relatively simple.
> > -		 */
> > -		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
> > -		if (se->sched_delayed)
> > -			flags |= DEQUEUE_DELAYED;
> > -		dequeue_entity(qcfs_rq, se, flags);
> > -
> > -		if (cfs_rq_is_idle(group_cfs_rq(se)))
> > -			idle_delta = cfs_rq->h_nr_queued;
> > -
> > -		qcfs_rq->h_nr_queued -= queued_delta;
> > -		qcfs_rq->h_nr_runnable -= runnable_delta;
> > -		qcfs_rq->h_nr_idle -= idle_delta;
> > -
> > -		if (qcfs_rq->load.weight) {
> > -			/* Avoid re-evaluating load for this entity: */
> > -			se = parent_entity(se);
> > -			break;
> > -		}
> > -	}
> > -
> > -	for_each_sched_entity(se) {
> > -		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> > -		/* throttled entity or throttle-on-deactivate */
> > -		if (!se->on_rq)
> > -			goto done;
> > -
> > -		update_load_avg(qcfs_rq, se, 0);
> > -		se_update_runnable(se);
> > -
> > -		if (cfs_rq_is_idle(group_cfs_rq(se)))
> > -			idle_delta = cfs_rq->h_nr_queued;
> > -
> > -		qcfs_rq->h_nr_queued -= queued_delta;
> > -		qcfs_rq->h_nr_runnable -= runnable_delta;
> > -		qcfs_rq->h_nr_idle -= idle_delta;
> > -	}
> > -
> > -	/* At this point se is NULL and we are at root level*/
> > -	sub_nr_running(rq, queued_delta);
> > -
> > -	/* Stop the fair server if throttling resulted in no runnable tasks */
> > -	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
> > -		dl_server_stop(&rq->fair_server);
> > -done:
> >  	/*
> >  	 * Note: distribution will already see us throttled via the
> >  	 * throttled-list.  rq->lock protects completion.
> > @@ -5986,7 +5989,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >  	WARN_ON_ONCE(cfs_rq->throttled_clock);
> >  	if (cfs_rq->nr_queued)
> >  		cfs_rq->throttled_clock = rq_clock(rq);
> > -	return true;
> > +	return;
> 
> Obsolete now, could be removed.

Indeed and one less line of code :-)

Thanks,
Aaron

> >  }
> >  
> >  void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > @@ -6462,22 +6465,22 @@ static void sync_throttle(struct task_group *tg, int cpu)
> >  }
> >  
> >  /* conditionally throttle active cfs_rq's from put_prev_entity() */
> > -static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > +static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  {
> >  	if (!cfs_bandwidth_used())
> > -		return false;
> > +		return;
> >  
> >  	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
> > -		return false;
> > +		return;
> >  
> >  	/*
> >  	 * it's possible for a throttled entity to be forced into a running
> >  	 * state (e.g. set_curr_task), in this case we're finished.
> >  	 */
> >  	if (cfs_rq_throttled(cfs_rq))
> > -		return true;
> > +		return;
> >  
> > -	return throttle_cfs_rq(cfs_rq);
> > +	throttle_cfs_rq(cfs_rq);
> >  }
> >  
> >  static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> > @@ -6573,6 +6576,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  	cfs_rq->runtime_enabled = 0;
> >  	INIT_LIST_HEAD(&cfs_rq->throttled_list);
> >  	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
> > +	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
> >  }
> >  
> >  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > @@ -6738,10 +6742,11 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> >  #else /* CONFIG_CFS_BANDWIDTH */
> >  
> >  static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
> > -static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> > +static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> >  static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> >  static inline void sync_throttle(struct task_group *tg, int cpu) {}
> >  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> > +static void task_throttle_setup_work(struct task_struct *p) {}
> >  
> >  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> >  {
> > @@ -7108,10 +7113,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >  		if (cfs_rq_is_idle(cfs_rq))
> >  			h_nr_idle = h_nr_queued;
> >  
> > -		/* end evaluation on encountering a throttled cfs_rq */
> > -		if (cfs_rq_throttled(cfs_rq))
> > -			return 0;
> > -
> >  		/* Don't dequeue parent if it has other entities besides us */
> >  		if (cfs_rq->load.weight) {
> >  			slice = cfs_rq_min_slice(cfs_rq);
> > @@ -7148,10 +7149,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >  
> >  		if (cfs_rq_is_idle(cfs_rq))
> >  			h_nr_idle = h_nr_queued;
> > -
> > -		/* end evaluation on encountering a throttled cfs_rq */
> > -		if (cfs_rq_throttled(cfs_rq))
> > -			return 0;
> >  	}
> >  
> >  	sub_nr_running(rq, h_nr_queued);
> > @@ -8860,8 +8857,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> >  		if (cfs_rq->curr && cfs_rq->curr->on_rq)
> >  			update_curr(cfs_rq);
> >  
> > -		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> > -			goto again;
> > +		check_cfs_rq_runtime(cfs_rq);
> >  
> >  		se = pick_next_entity(rq, cfs_rq);
> >  		if (!se)
> > @@ -8888,6 +8884,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  		goto idle;
> >  	se = &p->se;
> >  
> > +	if (throttled_hierarchy(cfs_rq_of(se)))
> > +		task_throttle_setup_work(p);
> > +
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  	if (prev->sched_class != &fair_sched_class)
> >  		goto simple;
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 921527327f107..97be6a6f53b9c 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -736,6 +736,7 @@ struct cfs_rq {
> >  	int			throttle_count;
> >  	struct list_head	throttled_list;
> >  	struct list_head	throttled_csd_list;
> > +	struct list_head	throttled_limbo_list;
> >  #endif /* CONFIG_CFS_BANDWIDTH */
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> >  };
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
  2025-04-14 11:55     ` Aaron Lu
@ 2025-04-14 13:37       ` K Prateek Nayak
  0 siblings, 0 replies; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-14 13:37 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka

Hello Aaron,

On 4/14/2025 5:25 PM, Aaron Lu wrote:
> On Mon, Apr 14, 2025 at 09:28:36AM +0530, K Prateek Nayak wrote:
>> Hello Aaron,
>>
>> On 4/9/2025 5:37 PM, Aaron Lu wrote:
>>> From: Valentin Schneider <vschneid@redhat.com>
>>>
>>> Add related data structures for this new throttle functionality.
>>>
>>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>> ---
>>>    include/linux/sched.h |  4 ++++
>>>    kernel/sched/core.c   |  3 +++
>>>    kernel/sched/fair.c   | 12 ++++++++++++
>>>    kernel/sched/sched.h  |  2 ++
>>>    4 files changed, 21 insertions(+)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index f96ac19828934..0b55c79fee209 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -880,6 +880,10 @@ struct task_struct {
>>>    #ifdef CONFIG_CGROUP_SCHED
>>>    	struct task_group		*sched_task_group;
>>> +#ifdef CONFIG_CFS_BANDWIDTH
>>> +	struct callback_head		sched_throttle_work;
>>> +	struct list_head		throttle_node;
>>
>> Since throttled tasks are fully dequeued before being placed on the
>> "throttled_limbo_list", is it possible to reuse "p->se.group_node"?
> 
> I think it might be possible.
> 
>> Currently, it is used to track the task on "rq->cfs_tasks" and during
>> load-balancing when moving a bunch of tasks between CPUs but since a
>> fully throttled task is not tracked by either, it should be safe to
>> reuse this bit (CONFIG_DEBUG_LIST will scream if I'm wrong) and save
>> some space in the task_struct.
>>
>> Thoughts?
> 
> Is it that adding throttle_node would cause task_struct to just cross a
> cacheline boundary? :-)
> 
> Or is it mainly a concern that the system could have many tasks, so any
> saving in task_struct is worth trying?

Mostly this :)

> 
> I can see that reusing another field would make task_is_throttled() more
> obscure to digest and implement, but I think it is doable.

I completely overlooked task_is_throttled() use-case. I think the
current implementation is much cleaner in that aspect; no need to
overload "p->se.group_node" and over-complicate this.

If we really want some space saving, declaring an "unsigned char
sched_throttled" in the hole next to "sched_delayed" would be cleaner,
but I'd wait on Valentin's and Peter's comments before going down that
path.

> 
> Thanks,
> Aaron

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 2/7] sched/fair: Handle throttle path for task based throttle
  2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
  2025-04-14  8:54   ` Florian Bezdeka
@ 2025-04-14 14:39   ` Florian Bezdeka
  2025-04-14 15:02     ` K Prateek Nayak
  2025-04-30 10:01   ` Aaron Lu
  2 siblings, 1 reply; 45+ messages in thread
From: Florian Bezdeka @ 2025-04-14 14:39 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> @@ -8888,6 +8884,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  		goto idle;
>  	se = &p->se;
>  
> +	if (throttled_hierarchy(cfs_rq_of(se)))
> +		task_throttle_setup_work(p);
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	if (prev->sched_class != &fair_sched_class)
>  		goto simple;

For testing purposes I would like to backport this to 6.1-stable. The
situation around pick_next_task_fair() seems to have changed in the
meantime:

- It moved out of the CONFIG_SMP guard
- The implementation is completely different

Backporting to 6.12 looks doable, but 6.6 and below look challenging at
first glance. Do you have any insights that could help with the
backport, especially for this hunk, but maybe also in general?

Best regards,
Florian

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 2/7] sched/fair: Handle throttle path for task based throttle
  2025-04-14 14:39   ` Florian Bezdeka
@ 2025-04-14 15:02     ` K Prateek Nayak
  0 siblings, 0 replies; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-14 15:02 UTC (permalink / raw)
  To: Florian Bezdeka, Aaron Lu, Valentin Schneider, Ben Segall,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hello Florian,

On 4/14/2025 8:09 PM, Florian Bezdeka wrote:
> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
>> @@ -8888,6 +8884,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>>   		goto idle;
>>   	se = &p->se;
>>   
>> +	if (throttled_hierarchy(cfs_rq_of(se)))
>> +		task_throttle_setup_work(p);
>> +
>>   #ifdef CONFIG_FAIR_GROUP_SCHED
>>   	if (prev->sched_class != &fair_sched_class)
>>   		goto simple;
> 
> For testing purposes I would like to backport that to 6.1-stable. The
> situation around pick_next_task_fair() seems to have changed meanwhile:
> 
> - it moved out of the CONFIG_SMP guard
> - Completely different implementation
> 
> Backporting to 6.12 looks doable, but 6.6 and below looks challenging

v6.6 introduced the EEVDF algorithm, which changes a fair bit of
fair.c, but the bandwidth control bits are mostly the same and they all
get ripped out in Patch 2 and Patch 3.

> at first glance. Do you have any insights that could help backporting,
> especially for this hunk, but maybe even in general?

For the particular hunk, on v6.5, you can do:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..2a8d9f19d0db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8173,6 +8173,11 @@ done: __maybe_unused;
  
  	update_misfit_status(p, rq);
  
+#ifdef CONFIG_CFS_BANDWIDTH
+	if (throttled_hierarchy(cfs_rq_of(&p->se)))
+		task_throttle_setup_work(p);
+#endif
+
  	return p;
  
  idle:
--

Add task work just before you return "p" after the "done" label.

For the most part, this should be easily portable since the bandwidth
control mechanism hasn't seen many changes except for the async
throttling and a few bits around throttled time accounting. Also, you can
drop all the bits that refer to "delayed" or "DEQUEUE_DELAYED" since those
are EEVDF specific (Patch 6 can be fully dropped on versions < v6.6).

> 
> Best regards,
> Florian

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (8 preceding siblings ...)
  2025-04-14  8:54 ` Florian Bezdeka
@ 2025-04-14 16:34 ` K Prateek Nayak
  2025-04-15 11:25   ` Aaron Lu
  9 siblings, 1 reply; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-14 16:34 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hello Aaron,

On 4/9/2025 5:37 PM, Aaron Lu wrote:
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> 
> Valentin has described the problem very well in the above link. We also
> have task hung problem from time to time in our environment due to cfs quota.
> It is mostly visible with rwsem: when a reader is throttled, writer comes in
> and has to wait, the writer also makes all subsequent readers wait,
> causing problems of priority inversion or even whole system hung.
> 
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, mark its throttled status but do not
> remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> when they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> those throttled tasks.

I tried to reproduce the scenario that Valentin describes in the
parallel thread [1] and I haven't run into a stall yet with this
series applied on top of v6.15-rc1 [2].

So for Patch 1-6, feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I'm slowly crawling through the series and haven't gotten to Patch 7
yet, but so far I haven't seen anything unexpected in my initial testing,
and it seems to solve the possible circular dependency on PREEMPT_RT with
bandwidth replenishment (Sebastian has some doubts about whether my
reproducer is correct, but that discussion is for the other thread).

Thank you for working on this and a big thanks to Valentin for the solid
groundwork.

[1] https://lore.kernel.org/linux-rt-users/f2e2c74c-b15d-4185-a6ea-4a19eee02417@amd.com/
[2] https://lore.kernel.org/linux-rt-users/534df953-3cfb-4b3d-8953-5ed9ef24eabc@amd.com/

> 
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
> 
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
> 
> For throttle time accounting, I can see several possibilities:
> - Similar to current behavior: starts accounting when cfs_rq gets
>    throttled(if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
>    gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
>    task when it gets throttled and eventually, that task doesn't return
>    to user but blocks, then this cfs_rq has no tasks on throttled list
>    but time is accounted as throttled; Patch2 and patch3 implements this
>    accounting(simple, fewer code change).
> - Starts accounting when the throttled cfs_rq has at least one task on
>    its throttled list; stops accounting when it's unthrottled. This kind
>    of over accounts throttled time because partial throttle state is
>    accounted.
> - Starts accounting when the throttled cfs_rq has no tasks left and its
>    throttled list is not empty; stops accounting when this cfs_rq is
>    unthrottled; This kind of under accounts throttled time because partial
>    throttle state is not accounted. Patch7 implements this accounting.
> I do not have a strong feeling which accounting is the best, it's open
> for discussion.
> 
> There is also the concern of increased duration of (un)throttle operations
> in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> setup on a 2sockets/384cpus AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For throttle path, with Chengming's suggestion to move "task work setup"
> from throttle time to pick time, it's not an issue anymore.
> 

[..snip..]

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-14 12:04   ` Aaron Lu
@ 2025-04-15  5:29     ` Jan Kiszka
  2025-04-15  6:05       ` K Prateek Nayak
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2025-04-15  5:29 UTC (permalink / raw)
  To: Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou

[-- Attachment #1: Type: text/plain, Size: 1588 bytes --]

On 14.04.25 14:04, Aaron Lu wrote:
> Hi Florian,
> 
> On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
>> Hi Aaron, Hi Valentin,
>>
>> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
>>> This is a continuous work based on Valentin Schneider's posting here:
>>> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
>>> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>>>
>>> Valentin has described the problem very well in the above link. We also
>>> have task hung problem from time to time in our environment due to cfs quota.
>>> It is mostly visible with rwsem: when a reader is throttled, writer comes in
>>> and has to wait, the writer also makes all subsequent readers wait,
>>> causing problems of priority inversion or even whole system hung.
>>
>> for testing purposes I backported this series to 6.14. We're currently
>> hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
>> and complete system freezes after a couple of days with some container
>> workload deployed. See [1]. 
> 
> I tried to make a setup last week to reproduce the RT/cfs throttle
> deadlock issue Valentin described but haven't succeeded yet...
> 

Attached are the bits with which we succeeded, sometimes. Setup: Debian 12,
RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h of
patience. Since we have to win at least 3 race conditions in a row, that
is still not bad... But maybe someone has an idea how to increase the
probabilities further.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

[-- Attachment #2: epoll-stall.c --]
[-- Type: text/x-csrc, Size: 1173 bytes --]

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>

int main(int argc, char *argv[])
{
	int pipe, timerfd, epoll;
	struct epoll_event ev[2];
	struct itimerspec it;
	int ret;

	assert(argc == 2);
	pipe = open(argv[1], O_RDONLY);
	assert(pipe >= 0);

	timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
	assert(timerfd >= 0);
	it.it_value.tv_sec = 0;
	it.it_value.tv_nsec = 1;
	it.it_interval.tv_sec = 0;
	it.it_interval.tv_nsec = 50000;
	ret = timerfd_settime(timerfd, 0, &it, NULL);
	assert(ret == 0);

	epoll = epoll_create1(0);
	assert(epoll >= 0);

	ev[0].events = EPOLLIN;
	ev[0].data.fd = pipe;
	ret = epoll_ctl(epoll, EPOLL_CTL_ADD, pipe, &ev[0]);
	assert(ret == 0);

	ev[1].events = EPOLLIN;
	ev[1].data.fd = timerfd;
	ret = epoll_ctl(epoll, EPOLL_CTL_ADD, timerfd, &ev[1]);
	assert(ret == 0);

	printf("starting loop\n");
	while (1) {
		struct epoll_event event;
		char buffer[8];
		size_t size;

		ret = epoll_wait(epoll, &event, 1, -1);
		assert(ret == 1);
		if (event.data.fd == timerfd)
			size = 8;
		else
			size = 1;
		ret = read(event.data.fd, buffer, size);
		assert(ret == size);
	}
}

[-- Attachment #3: epoll-stall-writer.c --]
[-- Type: text/x-csrc, Size: 302 bytes --]

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int pipe, ret;

	assert(argc == 2);
	pipe = open(argv[1], O_WRONLY);
	assert(pipe >= 0);

	printf("starting writer\n");
	while (1) {
		ret = write(pipe, "x", 1);
		assert(ret == 1);
	}
}

[-- Attachment #4: run.sh --]
[-- Type: application/x-shellscript, Size: 384 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-15  5:29     ` Jan Kiszka
@ 2025-04-15  6:05       ` K Prateek Nayak
  2025-04-15  6:09         ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-15  6:05 UTC (permalink / raw)
  To: Jan Kiszka, Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou

Hello Jan,

On 4/15/2025 10:59 AM, Jan Kiszka wrote:
> On 14.04.25 14:04, Aaron Lu wrote:
>> Hi Florian,
>>
>> On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
>>> Hi Aaron, Hi Valentin,
>>>
>>> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
>>>> This is a continuous work based on Valentin Schneider's posting here:
>>>> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
>>>> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>>>>
>>>> Valentin has described the problem very well in the above link. We also
>>>> have task hung problem from time to time in our environment due to cfs quota.
>>>> It is mostly visible with rwsem: when a reader is throttled, writer comes in
>>>> and has to wait, the writer also makes all subsequent readers wait,
>>>> causing problems of priority inversion or even whole system hung.
>>>
>>> for testing purposes I backported this series to 6.14. We're currently
>>> hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
>>> and complete system freezes after a couple of days with some container
>>> workload deployed. See [1].
>>
>> I tried to make a setup last week to reproduce the RT/cfs throttle
>> deadlock issue Valentin described but haven't succeeded yet...
>>
> 
> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
> patience. As we have to succeed with at least 3 race conditions in a
> row, that is still not bad... But maybe someone has an idea how to
> increase probabilities further.

Looking at run.sh, there are only fair tasks with one of them being run
with cfs bandwidth constraints. Are you saying something goes wrong on
PREEMPT_RT as a result of using bandwidth control on fair tasks?

What exactly is the symptom you are observing? Does one of the assert()
calls trip during the run? Do you see a stall logged in dmesg? Can you
provide more information on what to expect in this 2 min - 2 hr window?

Additionally, do you have RT throttling enabled in your setup? Can long
running RT tasks starve fair tasks on your setup?

> 
> Jan
> 

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-15  6:05       ` K Prateek Nayak
@ 2025-04-15  6:09         ` Jan Kiszka
  2025-04-15  8:45           ` K Prateek Nayak
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2025-04-15  6:09 UTC (permalink / raw)
  To: K Prateek Nayak, Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou

On 15.04.25 08:05, K Prateek Nayak wrote:
> Hello Jan,
> 
> On 4/15/2025 10:59 AM, Jan Kiszka wrote:
>> On 14.04.25 14:04, Aaron Lu wrote:
>>> Hi Florian,
>>>
>>> On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
>>>> Hi Aaron, Hi Valentin,
>>>>
>>>> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
>>>>> This is a continuous work based on Valentin Schneider's posting here:
>>>>> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to
>>>>> user entry
>>>>> https://lore.kernel.org/lkml/20240711130004.2157737-1-
>>>>> vschneid@redhat.com/
>>>>>
>>>>> Valentin has described the problem very well in the above link. We
>>>>> also
>>>>> have task hung problem from time to time in our environment due to
>>>>> cfs quota.
>>>>> It is mostly visible with rwsem: when a reader is throttled, writer
>>>>> comes in
>>>>> and has to wait, the writer also makes all subsequent readers wait,
>>>>> causing problems of priority inversion or even whole system hung.
>>>>
>>>> for testing purposes I backported this series to 6.14. We're currently
>>>> hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
>>>> and complete system freezes after a couple of days with some container
>>>> workload deployed. See [1].
>>>
>>> I tried to make a setup last week to reproduce the RT/cfs throttle
>>> deadlock issue Valentin described but haven't succeeded yet...
>>>
>>
>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
>> patience. As we have to succeed with at least 3 race conditions in a
>> row, that is still not bad... But maybe someone has an idea how to
>> increase probabilities further.
> 
> Looking at run.sh, there are only fair tasks with one of them being run
> with cfs bandwidth constraints. Are you saying something goes wrong on
> PREEMPT_RT as a result of using bandwidth control on fair tasks?

Yes, exactly. Also, our in-field workload that (most likely) triggers
this issue is not using RT tasks itself. Only kernel threads are RT here.

> 
> What exactly is the symptom you are observing? Does one of the assert()
> calls trip during the run? Do you see a stall logged in dmesg? Can you
> provide more information on what to expect in this 2 min - 2 hr window?

I've just lost my traces from yesterday ("you have 0 minutes to find a
power adapter"), but I got nice RCU stall warnings in the VM, including
backtraces from the involved tasks (minus the read-lock holder IIRC).
Maybe Florian can drop one of his dumps.

> 
> Additionally, do you have RT throttling enabled in your setup? Can long
> running RT tasks starve fair tasks on your setup?

RT throttling is enabled (default settings) but was not kicking in - why
should it in that scenario? The only RT thread, ktimers, ran into the
held lock and stopped.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-15  6:09         ` Jan Kiszka
@ 2025-04-15  8:45           ` K Prateek Nayak
  2025-04-15 10:21             ` Jan Kiszka
  2025-04-15 10:34             ` K Prateek Nayak
  0 siblings, 2 replies; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-15  8:45 UTC (permalink / raw)
  To: Jan Kiszka, Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Sebastian Andrzej Siewior,

(+ Sebastian)

Hello Jan,

On 4/15/2025 11:39 AM, Jan Kiszka wrote:
>>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
>>> patience. As we have to succeed with at least 3 race conditions in a
>>> row, that is still not bad... But maybe someone has an idea how to
>>> increase probabilities further.
>>
>> Looking at run.sh, there are only fair tasks with one of them being run
>> with cfs bandwidth constraints. Are you saying something goes wrong on
>> PREEMPT_RT as a result of using bandwidth control on fair tasks?
> 
> Yes, exactly. Also our in-field workload that triggers (most likely)
> this issue is not using RT tasks itself. Only kernel threads are RT here.
> 
>>
>> What exactly is the symptom you are observing? Does one of the assert()
>> trip during the run? Do you see a stall logged on dmesg? Can you provide
>> more information on what to expect in this 2min - 2hr window?
> 
> I've just lost my traces from yesterday ("you have 0 minutes to find a
> power adapter"), but I got nice RCU stall warnings in the VM, including
> backtraces from the involved tasks (minus the read-lock holder IIRC).
> Maybe Florian can drop one of his dumps.

So I ran your reproducer on a 2vCPU VM running v6.15-rc1 PREEMPT_RT
and I saw:

     rcu: INFO: rcu_preempt self-detected stall on CPU
     rcu:     0-...!: (15000 ticks this GP) idle=8a74/0/0x1 softirq=0/0 fqs=0
     rcu:     (t=15001 jiffies g=12713 q=24 ncpus=2)
     rcu: rcu_preempt kthread timer wakeup didn't happen for 15000 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
     rcu:     Possible timer handling issue on cpu=0 timer-softirq=17688
     rcu: rcu_preempt kthread starved for 15001 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
     rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
     rcu: RCU grace-period kthread stack dump:
     task:rcu_preempt     state:I stack:0     pid:17    tgid:17    ppid:2      task_flags:0x208040 flags:0x00004000
     Call Trace:
      <TASK>
      __schedule+0x401/0x15a0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? lock_timer_base+0x77/0xb0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? __pfx_rcu_gp_kthread+0x10/0x10
      schedule+0x27/0xd0
      schedule_timeout+0x76/0x100
      ? __pfx_process_timeout+0x10/0x10
      rcu_gp_fqs_loop+0x10a/0x4b0
      rcu_gp_kthread+0xd3/0x160
      kthread+0xff/0x210
      ? rt_spin_lock+0x3c/0xc0
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x34/0x50
      ? __pfx_kthread+0x10/0x10
      ret_from_fork_asm+0x1a/0x30
      </TASK>
     CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-test-dirty #746 PREEMPT_{RT,(full)}
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
     RIP: 0010:pv_native_safe_halt+0xf/0x20
     Code: 22 df e9 1f 08 e5 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 85 96 15 00 fb f4 <e9> f7 07 e5 fe 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
     RSP: 0018:ffffffff95803e50 EFLAGS: 00000216
     RAX: ffff8e2d61534000 RBX: 0000000000000000 RCX: 0000000000000000
     RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000081f8a6c
     RBP: ffffffff9581d280 R08: 0000000000000000 R09: ffff8e2cf7d32301
     R10: ffff8e2be11ae5c8 R11: 0000000000000001 R12: 0000000000000000
     R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000147b0
     FS:  0000000000000000(0000) GS:ffff8e2d61534000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 000055e77c3a5128 CR3: 000000010ff78003 CR4: 0000000000770ef0
     PKRU: 55555554
     Call Trace:
      <TASK>
      default_idle+0x9/0x20
      default_idle_call+0x30/0x100
      do_idle+0x20f/0x250
      ? do_idle+0xb/0x250
      cpu_startup_entry+0x29/0x30
      rest_init+0xde/0x100
      start_kernel+0x733/0xb20
      ? copy_bootdata+0x9/0xb0
      x86_64_start_reservations+0x18/0x30
      x86_64_start_kernel+0xba/0x110
      common_startup_64+0x13e/0x141
      </TASK>

Is this in line with what you are seeing?

> 
>>
>> Additionally, do you have RT throttling enabled in your setup? Can long
>> running RT tasks starve fair tasks on your setup?
> 
> RT throttling is enabled (default settings) but was not kicking in - why
> should it in that scenario? The only RT thread, ktimers, ran into the
> held lock and stopped.
> 
> Jan
> 

-- 
Thanks and Regards,
Prateek



* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-15  8:45           ` K Prateek Nayak
@ 2025-04-15 10:21             ` Jan Kiszka
  2025-04-15 11:14               ` K Prateek Nayak
       [not found]               ` <ec2cea83-07fe-472f-8320-911d215473fd@amd.com>
  2025-04-15 10:34             ` K Prateek Nayak
  1 sibling, 2 replies; 45+ messages in thread
From: Jan Kiszka @ 2025-04-15 10:21 UTC (permalink / raw)
  To: K Prateek Nayak, Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Sebastian Andrzej Siewior,

On 15.04.25 10:45, K Prateek Nayak wrote:
> (+ Sebastian)
> 
> Hello Jan,
> 
> On 4/15/2025 11:39 AM, Jan Kiszka wrote:
>>>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>>>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
>>>> patience. As we have to succeed with at least 3 race conditions in a
>>>> row, that is still not bad... But maybe someone has an idea how to
>>>> increase probabilities further.
>>>
>>> Looking at run.sh, there are only fair tasks with one of them being run
>>> with cfs bandwidth constraints. Are you saying something goes wrong on
>>> PREEMPT_RT as a result of using bandwidth control on fair tasks?
>>
>> Yes, exactly. Also our in-field workload that triggers (most likely)
>> this issue is not using RT tasks itself. Only kernel threads are RT here.
>>
>>>
>>> What exactly is the symptom you are observing? Does one of the assert()
>>> trip during the run? Do you see a stall logged on dmesg? Can you provide
>>> more information on what to expect in this 2min - 2hr window?
>>
>> I've just lost my traces from yesterday ("you have 0 minutes to find a
>> power adapter"), but I got nice RCU stall warnings in the VM, including
>> backtraces from the involved tasks (minus the read-lock holder IIRC).
>> Maybe Florian can drop one of his dumps.
> 
> So I ran your reproducer on a 2vCPU VM running v6.15-rc1 PREEMPT_RT
> and I saw:
> 
>     rcu: INFO: rcu_preempt self-detected stall on CPU
>     rcu:     0-...!: (15000 ticks this GP) idle=8a74/0/0x1 softirq=0/0
> fqs=0
>     rcu:     (t=15001 jiffies g=12713 q=24 ncpus=2)
>     rcu: rcu_preempt kthread timer wakeup didn't happen for 15000
> jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
>     rcu:     Possible timer handling issue on cpu=0 timer-softirq=17688
>     rcu: rcu_preempt kthread starved for 15001 jiffies! g12713 f0x0
> RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
>     rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is
> now expected behavior.
>     rcu: RCU grace-period kthread stack dump:
>     task:rcu_preempt     state:I stack:0     pid:17    tgid:17   
> ppid:2      task_flags:0x208040 flags:0x00004000
>     Call Trace:
>      <TASK>
>      __schedule+0x401/0x15a0
>      ? srso_alias_return_thunk+0x5/0xfbef5
>      ? lock_timer_base+0x77/0xb0
>      ? srso_alias_return_thunk+0x5/0xfbef5
>      ? __pfx_rcu_gp_kthread+0x10/0x10
>      schedule+0x27/0xd0
>      schedule_timeout+0x76/0x100
>      ? __pfx_process_timeout+0x10/0x10
>      rcu_gp_fqs_loop+0x10a/0x4b0
>      rcu_gp_kthread+0xd3/0x160
>      kthread+0xff/0x210
>      ? rt_spin_lock+0x3c/0xc0
>      ? __pfx_kthread+0x10/0x10
>      ret_from_fork+0x34/0x50
>      ? __pfx_kthread+0x10/0x10
>      ret_from_fork_asm+0x1a/0x30
>      </TASK>
>     CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-test-
> dirty #746 PREEMPT_{RT,(full)}
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>     RIP: 0010:pv_native_safe_halt+0xf/0x20
>     Code: 22 df e9 1f 08 e5 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 85 96 15 00 fb f4 <e9> f7
> 07 e5 fe 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
>     RSP: 0018:ffffffff95803e50 EFLAGS: 00000216
>     RAX: ffff8e2d61534000 RBX: 0000000000000000 RCX: 0000000000000000
>     RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000081f8a6c
>     RBP: ffffffff9581d280 R08: 0000000000000000 R09: ffff8e2cf7d32301
>     R10: ffff8e2be11ae5c8 R11: 0000000000000001 R12: 0000000000000000
>     R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000147b0
>     FS:  0000000000000000(0000) GS:ffff8e2d61534000(0000)
> knlGS:0000000000000000
>     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     CR2: 000055e77c3a5128 CR3: 000000010ff78003 CR4: 0000000000770ef0
>     PKRU: 55555554
>     Call Trace:
>      <TASK>
>      default_idle+0x9/0x20
>      default_idle_call+0x30/0x100
>      do_idle+0x20f/0x250
>      ? do_idle+0xb/0x250
>      cpu_startup_entry+0x29/0x30
>      rest_init+0xde/0x100
>      start_kernel+0x733/0xb20
>      ? copy_bootdata+0x9/0xb0
>      x86_64_start_reservations+0x18/0x30
>      x86_64_start_kernel+0xba/0x110
>      common_startup_64+0x13e/0x141
>      </TASK>
> 
> Is this in line with what you are seeing?
> 

Yes, and if you wait a bit longer for the second reporting round, you
should get more task backtraces as well.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center


* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-15  8:45           ` K Prateek Nayak
  2025-04-15 10:21             ` Jan Kiszka
@ 2025-04-15 10:34             ` K Prateek Nayak
  1 sibling, 0 replies; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-15 10:34 UTC (permalink / raw)
  To: Jan Kiszka, Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Sebastian Andrzej Siewior,

On 4/15/2025 2:15 PM, K Prateek Nayak wrote:
> (+ Sebastian)
> 
> Hello Jan,
> 
> On 4/15/2025 11:39 AM, Jan Kiszka wrote:
>>>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>>>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h

To improve the reproducibility, I pinned the two tasks to the same CPU
as the bandwidth timer, and I could hit this consistently within a few
minutes at most.

>>>> patience. As we have to succeed with at least 3 race conditions in a
>>>> row, that is still not bad... But maybe someone has an idea how to
>>>> increase probabilities further.
>>>
>>> Looking at run.sh, there are only fair tasks with one of them being run
>>> with cfs bandwidth constraints. Are you saying something goes wrong on
>>> PREEMPT_RT as a result of using bandwidth control on fair tasks?
>>
>> Yes, exactly. Also our in-field workload that triggers (most likely)
>> this issue is not using RT tasks itself. Only kernel threads are RT here.
>>
>>>
>>> What exactly is the symptom you are observing? Does one of the assert()
>>> trip during the run? Do you see a stall logged on dmesg? Can you provide
>>> more information on what to expect in this 2min - 2hr window?
>>
>> I've just lost my traces from yesterday ("you have 0 minutes to find a
>> power adapter"), but I got nice RCU stall warnings in the VM, including
>> backtraces from the involved tasks (minus the read-lock holder IIRC).
>> Maybe Florian can drop one of his dumps.
> 
> So I ran your reproducer on a 2vCPU VM running v6.15-rc1 PREEMPT_RT
> and I saw:
> 
>      rcu: INFO: rcu_preempt self-detected stall on CPU
>      rcu:     0-...!: (15000 ticks this GP) idle=8a74/0/0x1 softirq=0/0 fqs=0
>      rcu:     (t=15001 jiffies g=12713 q=24 ncpus=2)
>      rcu: rcu_preempt kthread timer wakeup didn't happen for 15000 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
>      rcu:     Possible timer handling issue on cpu=0 timer-softirq=17688
>      rcu: rcu_preempt kthread starved for 15001 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
>      rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
>      rcu: RCU grace-period kthread stack dump:
>      task:rcu_preempt     state:I stack:0     pid:17    tgid:17    ppid:2      task_flags:0x208040 flags:0x00004000
>      Call Trace:
>       <TASK>
>       __schedule+0x401/0x15a0
>       ? srso_alias_return_thunk+0x5/0xfbef5
>       ? lock_timer_base+0x77/0xb0
>       ? srso_alias_return_thunk+0x5/0xfbef5
>       ? __pfx_rcu_gp_kthread+0x10/0x10
>       schedule+0x27/0xd0
>       schedule_timeout+0x76/0x100
>       ? __pfx_process_timeout+0x10/0x10
>       rcu_gp_fqs_loop+0x10a/0x4b0
>       rcu_gp_kthread+0xd3/0x160
>       kthread+0xff/0x210
>       ? rt_spin_lock+0x3c/0xc0
>       ? __pfx_kthread+0x10/0x10
>       ret_from_fork+0x34/0x50
>       ? __pfx_kthread+0x10/0x10
>       ret_from_fork_asm+0x1a/0x30
>       </TASK>
>      CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-test-dirty #746 PREEMPT_{RT,(full)}
>      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>      RIP: 0010:pv_native_safe_halt+0xf/0x20
>      Code: 22 df e9 1f 08 e5 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 85 96 15 00 fb f4 <e9> f7 07 e5 fe 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
>      RSP: 0018:ffffffff95803e50 EFLAGS: 00000216
>      RAX: ffff8e2d61534000 RBX: 0000000000000000 RCX: 0000000000000000
>      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000081f8a6c
>      RBP: ffffffff9581d280 R08: 0000000000000000 R09: ffff8e2cf7d32301
>      R10: ffff8e2be11ae5c8 R11: 0000000000000001 R12: 0000000000000000
>      R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000147b0
>      FS:  0000000000000000(0000) GS:ffff8e2d61534000(0000) knlGS:0000000000000000
>      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>      CR2: 000055e77c3a5128 CR3: 000000010ff78003 CR4: 0000000000770ef0
>      PKRU: 55555554
>      Call Trace:
>       <TASK>
>       default_idle+0x9/0x20
>       default_idle_call+0x30/0x100
>       do_idle+0x20f/0x250
>       ? do_idle+0xb/0x250
>       cpu_startup_entry+0x29/0x30
>       rest_init+0xde/0x100
>       start_kernel+0x733/0xb20
>       ? copy_bootdata+0x9/0xb0
>       x86_64_start_reservations+0x18/0x30
>       x86_64_start_kernel+0xba/0x110
>       common_startup_64+0x13e/0x141
>       </TASK>
> 
> Is this in line with what you are seeing?

These are the backtrace for timer and the individual
epoll-stall threads:

[  539.155042] task:ktimers/1       state:D stack:0     pid:31    tgid:31    ppid:2      task_flags:0x4208040 flags:0x00004000
[  539.155047] Call Trace:
[  539.155049]  <TASK>
[  539.155051]  __schedule+0x401/0x15a0
[  539.155055]  ? srso_alias_return_thunk+0x5/0xfbef5
[  539.155059]  ? propagate_entity_cfs_rq+0x115/0x290
[  539.155063]  ? srso_alias_return_thunk+0x5/0xfbef5
[  539.155067]  ? srso_alias_return_thunk+0x5/0xfbef5
[  539.155070]  ? rt_mutex_setprio+0x1c2/0x480
[  539.155075]  schedule_rtlock+0x1e/0x40
[  539.155078]  rtlock_slowlock_locked+0x20e/0xc60
[  539.155088]  ? srso_alias_return_thunk+0x5/0xfbef5
[  539.155093]  rt_read_lock+0x8f/0x190
[  539.155099]  ep_poll_callback+0x37/0x2b0
[  539.155105]  __wake_up_common+0x78/0xa0
[  539.155110]  timerfd_tmrproc+0x43/0x60
[  539.155114]  ? __pfx_timerfd_tmrproc+0x10/0x10
[  539.155116]  __hrtimer_run_queues+0xfd/0x2e0
[  539.155124]  hrtimer_run_softirq+0x9d/0xf0
[  539.155128]  handle_softirqs.constprop.0+0xc1/0x2a0
[  539.155134]  ? __pfx_smpboot_thread_fn+0x10/0x10
[  539.155139]  run_ktimerd+0x3e/0x80
[  539.155142]  smpboot_thread_fn+0xf3/0x220
[  539.155147]  kthread+0xff/0x210
[  539.155151]  ? rt_spin_lock+0x3c/0xc0
[  539.155155]  ? __pfx_kthread+0x10/0x10
[  539.155159]  ret_from_fork+0x34/0x50
[  539.155165]  ? __pfx_kthread+0x10/0x10
[  539.155168]  ret_from_fork_asm+0x1a/0x30
[  539.155176]  </TASK>

[  557.323846] task:epoll-stall     state:D stack:0     pid:885   tgid:885   ppid:1      task_flags:0x400000 flags:0x00004002
[  557.323848] Call Trace:
[  557.323849]  <TASK>
[  557.323851]  __schedule+0x401/0x15a0
[  557.323853]  ? rt_write_lock+0x108/0x260
[  557.323858]  schedule_rtlock+0x1e/0x40
[  557.323860]  rt_write_lock+0xaa/0x260
[  557.323864]  do_epoll_wait+0x21f/0x4a0
[  557.323869]  ? __pfx_ep_autoremove_wake_function+0x10/0x10
[  557.323872]  __x64_sys_epoll_wait+0x63/0x100
[  557.323876]  do_syscall_64+0x6f/0x120
[  557.323879]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323881]  ? ksys_read+0x6b/0xe0
[  557.323883]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323885]  ? syscall_exit_to_user_mode+0x51/0x1a0
[  557.323887]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323889]  ? do_syscall_64+0x7b/0x120
[  557.323890]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323892]  ? ep_send_events+0x26d/0x2b0
[  557.323896]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323898]  ? do_epoll_wait+0x17e/0x4a0
[  557.323900]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323902]  ? __rseq_handle_notify_resume+0xa7/0x500
[  557.323905]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323907]  ? aa_file_perm+0x123/0x4e0
[  557.323911]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323912]  ? get_nohz_timer_target+0x2a/0x180
[  557.323914]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323916]  ? _copy_to_iter+0xa3/0x630
[  557.323920]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323922]  ? timerqueue_add+0x6a/0xc0
[  557.323924]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323926]  ? hrtimer_start_range_ns+0x2e7/0x4a0
[  557.323931]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323932]  ? timerfd_read_iter+0x141/0x2b0
[  557.323934]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323936]  ? security_file_permission+0x123/0x140
[  557.323940]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323942]  ? vfs_read+0x264/0x340
[  557.323946]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323948]  ? ksys_read+0x6b/0xe0
[  557.323950]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323951]  ? syscall_exit_to_user_mode+0x51/0x1a0
[  557.323953]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323955]  ? do_syscall_64+0x7b/0x120
[  557.323957]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323958]  ? syscall_exit_to_user_mode+0x168/0x1a0
[  557.323960]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323962]  ? do_syscall_64+0x7b/0x120
[  557.323963]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.323965]  ? do_syscall_64+0x7b/0x120
[  557.323967]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  557.323968] RIP: 0033:0x7f0371fd3dea
[  557.323970] RSP: 002b:00007ffd1062cd68 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[  557.323971] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0371fd3dea
[  557.323972] RDX: 0000000000000001 RSI: 00007ffd1062cda4 RDI: 0000000000000005
[  557.323973] RBP: 00007ffd1062ce00 R08: 0000000000000000 R09: 000055ea1d8f12a0
[  557.323974] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ffd1062cf18
[  557.323975] R13: 000055ea03253249 R14: 000055ea03255d80 R15: 00007f0372121040
[  557.323979]  </TASK>

[  557.431402] task:epoll-stall-wri state:R  running task     stack:0     pid:887   tgid:887   ppid:1      task_flags:0x400100 flags:0x00000002
[  557.431405] Call Trace:
[  557.431406]  <TASK>
[  557.431408]  __schedule+0x401/0x15a0
[  557.431410]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431412]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431414]  ? psi_group_change+0x212/0x460
[  557.431417]  ? pick_eevdf+0x71/0x180
[  557.431419]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431421]  ? update_curr+0x8d/0x240
[  557.431425]  preempt_schedule+0x41/0x60
[  557.431427]  preempt_schedule_thunk+0x16/0x30
[  557.431431]  try_to_wake_up+0x2f6/0x6e0
[  557.431433]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431437]  ep_autoremove_wake_function+0x12/0x40
[  557.431439]  __wake_up_common+0x78/0xa0
[  557.431443]  __wake_up_sync+0x34/0x50
[  557.431445]  ep_poll_callback+0x13e/0x2b0
[  557.431448]  ? aa_file_perm+0x123/0x4e0
[  557.431451]  __wake_up_common+0x78/0xa0
[  557.431454]  __wake_up_sync_key+0x38/0x50
[  557.431456]  anon_pipe_write+0x43b/0x6d0
[  557.431461]  fifo_pipe_write+0x13/0xe0
[  557.431463]  vfs_write+0x374/0x420
[  557.431468]  ksys_write+0xc9/0xe0
[  557.431471]  do_syscall_64+0x6f/0x120
[  557.431473]  ? current_time+0x30/0x130
[  557.431476]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431479]  ? vfs_write+0x1bd/0x420
[  557.431481]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431483]  ? vfs_write+0x1bd/0x420
[  557.431487]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431489]  ? ksys_write+0xc9/0xe0
[  557.431491]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431492]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431494]  ? syscall_exit_to_user_mode+0x51/0x1a0
[  557.431496]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431498]  ? do_syscall_64+0x7b/0x120
[  557.431501]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431502]  ? ksys_write+0xc9/0xe0
[  557.431504]  ? do_syscall_64+0x7b/0x120
[  557.431506]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431508]  ? syscall_exit_to_user_mode+0x51/0x1a0
[  557.431510]  ? srso_alias_return_thunk+0x5/0xfbef5
[  557.431511]  ? do_syscall_64+0x7b/0x120
[  557.431513]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  557.431515] RIP: 0033:0x7f0ef191c887
[  557.431516] RSP: 002b:00007ffc15f50948 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  557.431517] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0ef191c887
[  557.431518] RDX: 0000000000000001 RSI: 0000561a16a0203d RDI: 0000000000000003
[  557.431519] RBP: 00007ffc15f50970 R08: 0000000000000065 R09: 0000561a45f0e2a0
[  557.431520] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffc15f50a88
[  557.431521] R13: 0000561a16a011a9 R14: 0000561a16a03da8 R15: 00007f0ef1a7b040
[  557.431525]  </TASK>

> 
>>
>>>
>>> Additionally, do you have RT throttling enabled in your setup? Can long
>>> running RT tasks starve fair tasks on your setup?
>>
>> RT throttling is enabled (default settings) but was not kicking in - why
>> should it in that scenario? The only RT thread, ktimers, ran into the
>> held lock and stopped.
>>
>> Jan
>>
> 

-- 
Thanks and Regards,
Prateek



* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-15 10:21             ` Jan Kiszka
@ 2025-04-15 11:14               ` K Prateek Nayak
       [not found]               ` <ec2cea83-07fe-472f-8320-911d215473fd@amd.com>
  1 sibling, 0 replies; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-15 11:14 UTC (permalink / raw)
  To: Jan Kiszka, Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Sebastian Andrzej Siewior,

Hello Jan,

On 4/15/2025 3:51 PM, Jan Kiszka wrote:
>> Is this in line with what you are seeing?
>>
> 
> Yes, and if you wait a bit longer for the second reporting round, you
> should get more task backtraces as well.

So looking at the backtrace [1], Aaron's patch should help with the
stalls you are seeing.

timerfd queues a hrtimer whose expiry uses ep_poll_callback() to wake
up the epoll waiter; that hrtimer queues ahead of the bandwidth timer
and requires the read lock, but since the writer tried to grab the
lock, subsequent readers are pushed onto the slowpath. If
epoll-stall-writer is now throttled, it needs ktimers to replenish its
bandwidth, which cannot happen without ktimers grabbing the read lock
first.

# epoll-stall-writer

ep_poll()
{
	...
	/*
	 * Does not disable IRQ / preemption on PREEMPT_RT; sends future readers on
	 * rwlock slowpath and they have to wait until epoll-stall-writer acquires
	 * and drops the write lock.
	 */
	write_lock_irq(&ep->lock);

	__set_current_state(TASK_INTERRUPTIBLE);

	/************** Preempted due to lack of bandwidth **************/

	...
	eavail = ep_events_available(ep);
	if (!eavail)
		__add_wait_queue_exclusive(&ep->wq, &wait);

	/* Never reaches here waiting for bandwidth */
	write_unlock_irq(&ep->lock);
}


# ktimers

ep_poll_callback(...)
{
	...

	/*
	 * Does not disable interrupts on PREEMPT_RT; ktimers needs the
	 * epoll-stall-writer to take the write lock and drop it to
	 * proceed but epoll-stall-writer requires ktimers to run the
	 * bandwidth timer to be runnable again. Deadlock!
  	 */
	read_lock_irqsave(&ep->lock, flags);

	...

	/* wakeup within read side critical section */
	if (sync)
		wake_up_sync(&ep->wq);
	else
		wake_up(&ep->wq);

	...

	read_unlock_irqrestore(&ep->lock, flags);
}

[1] https://lore.kernel.org/all/62304351-7fc0-48b6-883b-d346886dac8e@amd.com/

> 
> Jan
> 

-- 
Thanks and Regards,
Prateek



* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-14 16:34 ` K Prateek Nayak
@ 2025-04-15 11:25   ` Aaron Lu
  0 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-15 11:25 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka

Hi Prateek,

On Mon, Apr 14, 2025 at 10:04:02PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 4/9/2025 5:37 PM, Aaron Lu wrote:
> > This is a continuous work based on Valentin Schneider's posting here:
> > Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> > https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> > 
> > Valentin has described the problem very well in the above link. We also
> > have task hung problem from time to time in our environment due to cfs quota.
> > It is mostly visible with rwsem: when a reader is throttled, writer comes in
> > and has to wait, the writer also makes all subsequent readers wait,
> > causing problems of priority inversion or even whole system hung.
> > 
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, mark its throttled status but do not
> > remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> > when they get picked, add a task work to them so that when they return
> > to user, they can be dequeued. In this way, tasks throttled will not
> > hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> > those throttled tasks.
> 
> I tried to reproduce the scenario that Valentin describes in the
> parallel thread [1] and I haven't run into a stall yet with this
> series applied on top of v6.15-rc1 [2].

Great to hear this.

> So for Patch 1-6, feel free to add:
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks!

> I'm slowly crawling through the series and haven't gotten to Patch 7
> yet, but so far I haven't seen anything unexpected in my initial
> testing, and it seems to solve the possible circular dependency on
> PREEMPT_RT with bandwidth replenishment (Sebastian has some doubts
> about whether my reproducer is correct, but that discussion is for the
> other thread).
> 
> Thank you for working on this and a big thanks to Valentin for the solid
> groundwork.

Thank you a ton for your review and test.

> [1] https://lore.kernel.org/linux-rt-users/f2e2c74c-b15d-4185-a6ea-4a19eee02417@amd.com/
> [2] https://lore.kernel.org/linux-rt-users/534df953-3cfb-4b3d-8953-5ed9ef24eabc@amd.com/

Best regards,
Aaron


* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
       [not found]               ` <ec2cea83-07fe-472f-8320-911d215473fd@amd.com>
@ 2025-04-15 15:49                 ` K Prateek Nayak
  2025-04-22  2:10                   ` Aaron Lu
  0 siblings, 1 reply; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-15 15:49 UTC (permalink / raw)
  To: Jan Kiszka, Aaron Lu, Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Sebastian Andrzej Siewior,

Hello Jan,

Sorry for the noise.

On 4/15/2025 4:46 PM, K Prateek Nayak wrote:
> Hello Jan,
> 
> On 4/15/2025 3:51 PM, Jan Kiszka wrote:
>>> Is this in line with what you are seeing?
>>>
>>
>> Yes, and if you wait a bit longer for the second reporting round, you
>> should get more task backtraces as well.
> 
> So looking at the backtrace [1], Aaron's patch should help with the
> stalls you are seeing.
> 
> timerfd that queues a hrtimer also uses ep_poll_callback() to wakeup
> the epoll waiter which queues ahead of the bandwidth timer and
> requires the read lock but now since the writer tried to grab the
> lock pushing readers on the slowpath. if epoll-stall-writer is now
> throttled, it needs ktimer to replenish its bandwidth which cannot
> happen without it grabbing the read lock first.
> 
> # epoll-stall-writer

So I got confused between "epoll-stall" and "epoll-stall-writer" here.
Turns out the actual series of events (based on traces, and hopefully
correct this time) is slightly longer. The correct sequence is:

# epoll-stall-writer

anon_pipe_write()
   __wake_up_common()
     ep_poll_callback() {
       read_lock_irq(&ep->lock)		/* Read lock acquired here */
       __wake_up_common()
         ep_autoremove_wake_function()
           try_to_wake_up()		/* Wakes up "epoll-stall" */
             preempt_schedule()
             ...

# "epoll-stall-writer" has run out of bandwidth, needs replenish to run
# sched_switch: "epoll-stall-writer" => "epoll-stall"

     ... /* Resumes from epoll_wait() */
     epoll_wait() => 1			/* Write to FIFO */
     read() 				/* Reads one byte of data */
     epoll_wait()
       write_lock_irq()			/* Tries to grab write lock; "epoll-stall-writer" still has read lock */
         schedule_rtlock()		/* Sleeps but put next readers on slowpath */
         ...

# sched_switch: "epoll-stall" => "swapper"
# CPU is idle

...

# Timer interrupt schedules ktimers
# sched_switch: "swapper" => "ktimers"

hrtimer_run_softirq()
   timerfd_tmrproc()
     __wake_up_common()
       ep_poll_callback() {
         read_lock_irq(&ep->lock)	/* Blocks since we are in rwlock slowpath */
           schedule_rtlock()
           ...

# sched_switch: "ktimers" => "swapper"
# Bandwidth replenish never happens
# Stall

 From a second look at the trace, this should be the right series of
events, since "epoll-stall-writer" with bandwidth control seems to
have been cut off while doing the wakeup and hasn't run again.

Sorry for the noise.

[..snip..]

> 
> [1] https://lore.kernel.org/all/62304351-7fc0-48b6-883b-d346886dac8e@amd.com/
> 
>>
>> Jan
>>
> 

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-09 12:07 ` [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time Aaron Lu
  2025-04-09 14:24   ` Aaron Lu
@ 2025-04-17 14:06   ` Florian Bezdeka
  2025-04-18  3:15     ` Aaron Lu
  2025-05-07  9:09     ` Aaron Lu
  1 sibling, 2 replies; 45+ messages in thread
From: Florian Bezdeka @ 2025-04-17 14:06 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hi Aaron,

On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> @@ -5889,27 +5943,21 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>  	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
>  		cfs_rq->throttled_clock_pelt;
>  
> -	if (cfs_rq->throttled_clock_self) {
> -		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> -
> -		cfs_rq->throttled_clock_self = 0;
> -
> -		if (WARN_ON_ONCE((s64)delta < 0))
> -			delta = 0;
> -
> -		cfs_rq->throttled_clock_self_time += delta;
> -	}
> +	if (cfs_rq->throttled_clock_self)
> +		account_cfs_rq_throttle_self(cfs_rq);
>  
>  	/* Re-enqueue the tasks that have been throttled at this level. */
>  	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>  		list_del_init(&p->throttle_node);
> -		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
> +		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
>  	}
>  
>  	/* Add cfs_rq with load or one or more already running entities to the list */
>  	if (!cfs_rq_is_decayed(cfs_rq))
>  		list_add_leaf_cfs_rq(cfs_rq);
>  
> +	WARN_ON_ONCE(cfs_rq->h_nr_throttled);
> +
>  	return 0;
>  }
>  

I got this warning while testing in our virtual environment:

Any idea?

[   26.639641] ------------[ cut here ]------------
[   26.639644] WARNING: CPU: 5 PID: 0 at kernel/sched/fair.c:5967 tg_unthrottle_up+0x1a6/0x3d0
[   26.639653] Modules linked in: veth xt_nat nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc xt_recent rfkill ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt vsock_loopback vmw_vsock_virtio_transport_common ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog vmw_vsock_vmci_transport xt_comment vsock nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables intel_rapl_msr intel_rapl_common nfnetlink binfmt_misc intel_uncore_frequency_common isst_if_mbox_msr isst_if_common skx_edac_common nfit libnvdimm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel snd_pcm crypto_simd cryptd snd_timer rapl snd soundcore vmw_balloon vmwgfx pcspkr drm_ttm_helper ttm drm_client_lib button ac drm_kms_helper sg vmw_vmci evdev joydev serio_raw drm loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 overlay nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 squashfs dm_verity dm_bufio reed_solomon dm_mod
[   26.639715]  sd_mod ata_generic mptspi mptscsih ata_piix mptbase libata scsi_transport_spi psmouse scsi_mod vmxnet3 i2c_piix4 i2c_smbus scsi_common
[   26.639726] CPU: 5 UID: 0 PID: 0 Comm: swapper/5 Not tainted 6.14.2-CFSfixes #1
[   26.639729] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.24224532.B64.2408191458 08/19/2024
[   26.639731] RIP: 0010:tg_unthrottle_up+0x1a6/0x3d0
[   26.639735] Code: 00 00 48 39 ca 74 14 48 8b 52 10 49 8b 8e 58 01 00 00 48 39 8a 28 01 00 00 74 24 41 8b 86 68 01 00 00 85 c0 0f 84 8d fe ff ff <0f> 0b e9 86 fe ff ff 49 8b 9e 38 01 00 00 41 8b 86 40 01 00 00 48
[   26.639737] RSP: 0000:ffffa5df8029cec8 EFLAGS: 00010002
[   26.639739] RAX: 0000000000000001 RBX: ffff981c6fcb6a80 RCX: ffff981943752e40
[   26.639741] RDX: 0000000000000005 RSI: ffff981c6fcb6a80 RDI: ffff981943752d00
[   26.639742] RBP: ffff9819607dc708 R08: ffff981c6fcb6a80 R09: 0000000000000000
[   26.639744] R10: 0000000000000001 R11: ffff981969936a10 R12: ffff9819607dc708
[   26.639745] R13: ffff9819607dc9d8 R14: ffff9819607dc800 R15: ffffffffad913fb0
[   26.639747] FS:  0000000000000000(0000) GS:ffff981c6fc80000(0000) knlGS:0000000000000000
[   26.639749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.639750] CR2: 00007ff1292dc44c CR3: 000000015350e006 CR4: 00000000007706f0
[   26.639779] PKRU: 55555554
[   26.639781] Call Trace:
[   26.639783]  <IRQ>
[   26.639787]  ? __pfx_tg_unthrottle_up+0x10/0x10
[   26.639790]  ? __pfx_tg_nop+0x10/0x10
[   26.639793]  walk_tg_tree_from+0x58/0xb0
[   26.639797]  unthrottle_cfs_rq+0xf0/0x360
[   26.639800]  ? sched_clock_cpu+0xf/0x190
[   26.639808]  __cfsb_csd_unthrottle+0x11c/0x170
[   26.639812]  ? __pfx___cfsb_csd_unthrottle+0x10/0x10
[   26.639816]  __flush_smp_call_function_queue+0x103/0x410
[   26.639822]  __sysvec_call_function_single+0x1c/0xb0
[   26.639826]  sysvec_call_function_single+0x6c/0x90
[   26.639832]  </IRQ>
[   26.639833]  <TASK>
[   26.639834]  asm_sysvec_call_function_single+0x1a/0x20
[   26.639840] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   26.639844] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 45 c1 13 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[   26.639846] RSP: 0000:ffffa5df80117ed8 EFLAGS: 00000242
[   26.639848] RAX: 0000000000000005 RBX: ffff981940804000 RCX: ffff9819a9df7000
[   26.639849] RDX: 0000000000000005 RSI: 0000000000000005 RDI: 000000000005c514
[   26.639851] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000001
[   26.639852] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[   26.639853] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   26.639858]  default_idle+0x9/0x20
[   26.639861]  default_idle_call+0x30/0x100
[   26.639863]  do_idle+0x1fd/0x240
[   26.639869]  cpu_startup_entry+0x29/0x30
[   26.639872]  start_secondary+0x11e/0x140
[   26.639875]  common_startup_64+0x13e/0x141
[   26.639881]  </TASK>
[   26.639882] ---[ end trace 0000000000000000 ]---

Best regards,
Florian

-- 
Siemens AG, Foundational Technologies
Linux Expert Center




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-17 14:06   ` Florian Bezdeka
@ 2025-04-18  3:15     ` Aaron Lu
  2025-04-22 15:03       ` Florian Bezdeka
  2025-05-07  9:09     ` Aaron Lu
  1 sibling, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-04-18  3:15 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hi Florian,

On Thu, Apr 17, 2025 at 04:06:16PM +0200, Florian Bezdeka wrote:
> Hi Aaron,
> 
> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> > @@ -5889,27 +5943,21 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> >  	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> >  		cfs_rq->throttled_clock_pelt;
> >  
> > -	if (cfs_rq->throttled_clock_self) {
> > -		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> > -
> > -		cfs_rq->throttled_clock_self = 0;
> > -
> > -		if (WARN_ON_ONCE((s64)delta < 0))
> > -			delta = 0;
> > -
> > -		cfs_rq->throttled_clock_self_time += delta;
> > -	}
> > +	if (cfs_rq->throttled_clock_self)
> > +		account_cfs_rq_throttle_self(cfs_rq);
> >  
> >  	/* Re-enqueue the tasks that have been throttled at this level. */
> >  	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> >  		list_del_init(&p->throttle_node);
> > -		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
> > +		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
> >  	}
> >  
> >  	/* Add cfs_rq with load or one or more already running entities to the list */
> >  	if (!cfs_rq_is_decayed(cfs_rq))
> >  		list_add_leaf_cfs_rq(cfs_rq);
> >  
> > +	WARN_ON_ONCE(cfs_rq->h_nr_throttled);
> > +
> >  	return 0;
> >  }
> >  
> 
> I got this warning while testing in our virtual environment:

Thanks for the report.

> 
> Any idea?
>

Most likely the accounting of h_nr_throttled is incorrect somewhere.

> [   26.639641] ------------[ cut here ]------------
> [   26.639644] WARNING: CPU: 5 PID: 0 at kernel/sched/fair.c:5967 tg_unthrottle_up+0x1a6/0x3d0

The line number doesn't match the code though; the warning below should
be at line 5959:
WARN_ON_ONCE(cfs_rq->h_nr_throttled); 

> [   26.639653] Modules linked in: veth xt_nat nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc xt_recent rfkill ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt vsock_loopback vmw_vsock_virtio_transport_common ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog vmw_vsock_vmci_transport xt_comment vsock nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables intel_rapl_msr intel_rapl_common nfnetlink binfmt_misc intel_uncore_frequency_common isst_if_mbox_msr isst_if_common skx_edac_common nfit libnvdimm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel snd_pcm crypto_simd cryptd snd_timer rapl snd soundcore vmw_balloon vmwgfx pcspkr drm_ttm_helper ttm drm_client_lib button ac drm_kms_helper sg vmw_vmci evdev joydev serio_raw drm loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 overlay nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 squashfs dm_verity dm_bufio reed_solomon dm_mod
> [   26.639715]  sd_mod ata_generic mptspi mptscsih ata_piix mptbase libata scsi_transport_spi psmouse scsi_mod vmxnet3 i2c_piix4 i2c_smbus scsi_common
> [   26.639726] CPU: 5 UID: 0 PID: 0 Comm: swapper/5 Not tainted 6.14.2-CFSfixes #1

6.14.2-CFSfixes seems to be a backported kernel?
Do you also see this warning when using this series on top of the said
base commit 6432e163ba1b("sched/isolation: Make use of more than one
housekeeping cpu")? Just want to make sure it's not a problem due to
backport.

Thanks,
Aaron

> [   26.639729] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.24224532.B64.2408191458 08/19/2024
> [   26.639731] RIP: 0010:tg_unthrottle_up+0x1a6/0x3d0
> [   26.639735] Code: 00 00 48 39 ca 74 14 48 8b 52 10 49 8b 8e 58 01 00 00 48 39 8a 28 01 00 00 74 24 41 8b 86 68 01 00 00 85 c0 0f 84 8d fe ff ff <0f> 0b e9 86 fe ff ff 49 8b 9e 38 01 00 00 41 8b 86 40 01 00 00 48
> [   26.639737] RSP: 0000:ffffa5df8029cec8 EFLAGS: 00010002
> [   26.639739] RAX: 0000000000000001 RBX: ffff981c6fcb6a80 RCX: ffff981943752e40
> [   26.639741] RDX: 0000000000000005 RSI: ffff981c6fcb6a80 RDI: ffff981943752d00
> [   26.639742] RBP: ffff9819607dc708 R08: ffff981c6fcb6a80 R09: 0000000000000000
> [   26.639744] R10: 0000000000000001 R11: ffff981969936a10 R12: ffff9819607dc708
> [   26.639745] R13: ffff9819607dc9d8 R14: ffff9819607dc800 R15: ffffffffad913fb0
> [   26.639747] FS:  0000000000000000(0000) GS:ffff981c6fc80000(0000) knlGS:0000000000000000
> [   26.639749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   26.639750] CR2: 00007ff1292dc44c CR3: 000000015350e006 CR4: 00000000007706f0
> [   26.639779] PKRU: 55555554
> [   26.639781] Call Trace:
> [   26.639783]  <IRQ>
> [   26.639787]  ? __pfx_tg_unthrottle_up+0x10/0x10
> [   26.639790]  ? __pfx_tg_nop+0x10/0x10
> [   26.639793]  walk_tg_tree_from+0x58/0xb0
> [   26.639797]  unthrottle_cfs_rq+0xf0/0x360
> [   26.639800]  ? sched_clock_cpu+0xf/0x190
> [   26.639808]  __cfsb_csd_unthrottle+0x11c/0x170
> [   26.639812]  ? __pfx___cfsb_csd_unthrottle+0x10/0x10
> [   26.639816]  __flush_smp_call_function_queue+0x103/0x410
> [   26.639822]  __sysvec_call_function_single+0x1c/0xb0
> [   26.639826]  sysvec_call_function_single+0x6c/0x90
> [   26.639832]  </IRQ>
> [   26.639833]  <TASK>
> [   26.639834]  asm_sysvec_call_function_single+0x1a/0x20
> [   26.639840] RIP: 0010:pv_native_safe_halt+0xf/0x20
> [   26.639844] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 45 c1 13 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> [   26.639846] RSP: 0000:ffffa5df80117ed8 EFLAGS: 00000242
> [   26.639848] RAX: 0000000000000005 RBX: ffff981940804000 RCX: ffff9819a9df7000
> [   26.639849] RDX: 0000000000000005 RSI: 0000000000000005 RDI: 000000000005c514
> [   26.639851] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000001
> [   26.639852] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [   26.639853] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [   26.639858]  default_idle+0x9/0x20
> [   26.639861]  default_idle_call+0x30/0x100
> [   26.639863]  do_idle+0x1fd/0x240
> [   26.639869]  cpu_startup_entry+0x29/0x30
> [   26.639872]  start_secondary+0x11e/0x140
> [   26.639875]  common_startup_64+0x13e/0x141
> [   26.639881]  </TASK>
> [   26.639882] ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-15 15:49                 ` K Prateek Nayak
@ 2025-04-22  2:10                   ` Aaron Lu
  2025-04-22  2:54                     ` K Prateek Nayak
  0 siblings, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-04-22  2:10 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Jan Kiszka, Florian Bezdeka, Valentin Schneider, Ben Segall,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou,
	Sebastian Andrzej Siewior,

Hi Prateek,

On Tue, Apr 15, 2025 at 09:19:20PM +0530, K Prateek Nayak wrote:
> Hello Jan,
> 
> Sorry for the noise.
> 
> On 4/15/2025 4:46 PM, K Prateek Nayak wrote:
> > Hello Jan,
> > 
> > On 4/15/2025 3:51 PM, Jan Kiszka wrote:
> > > > Is this in line with what you are seeing?
> > > > 
> > > 
> > > Yes, and if you wait a bit longer for the second reporting round, you
> > > should get more task backtraces as well.
> > 
> > So looking at the backtrace [1], Aaron's patch should help with the
> > stalls you are seeing.
> > 
> > timerfd that queues a hrtimer also uses ep_poll_callback() to wakeup
> > the epoll waiter which queues ahead of the bandwidth timer and
> > requires the read lock but now since the writer tried to grab the
> > lock pushing readers on the slowpath. if epoll-stall-writer is now
> > throttled, it needs ktimer to replenish its bandwidth which cannot
> > happen without it grabbing the read lock first.
> > 
> > # epoll-stall-writer
> 
> So I got confused between "epoll-stall" and "epoll-stall-writer" here.
> Turns out the actual series of events (based on traces, and hopefully
> correct this time) are slightly longer. The correct series of events
> are:
> 
> # epoll-stall-writer
> 
> anon_pipe_write()
>   __wake_up_common()
>     ep_poll_callback() {
>       read_lock_irq(&ep->lock)		/* Read lock acquired here */

I was confused by this function's name. I had thought irqs were off, but
then realized that under PREEMPT_RT, read_lock_irq() doesn't disable
irqs...

>       __wake_up_common()
>         ep_autoremove_wake_function()
>           try_to_wake_up()		/* Wakes up "epoll-stall" */
>             preempt_schedule()
>             ...
> 
> # "epoll-stall-writer" has run out of bandwidth, needs replenish to run

Luckily, in this "only throttle when ret2user" model, epoll-stall-writer
does not need a replenish to run again (and then unblock the others).

> # sched_switch: "epoll-stall-writer" => "epoll-stall"
> 
>     ... /* Resumes from epoll_wait() */
>     epoll_wait() => 1			/* Write to FIFO */
>     read() 				/* Reads one byte of data */
>     epoll_wait()
>       write_lock_irq()			/* Tries to grab write lock; "epoll-stall-writer" still has read lock */
>         schedule_rtlock()		/* Sleeps but put next readers on slowpath */
>         ...
> 
> # sched_switch: "epoll-stall" => "swapper"
> # CPU is idle
> 
> ...
> 
> # Timer interrupt schedules ktimers
> # sched_switch: "swapper" => "ktimers"
> 
> hrtimer_run_softirq()
>   timerfd_tmrproc()
>     __wake_up_common()
>       ep_poll_callback() {
>         read_lock_irq(&ep->lock)	/* Blocks since we are in rwlock slowpath */
>           schedule_rtlock()
>           ...
> 
> # sched_switch: "ktimers" => "swapper"
> # Bandwidth replenish never happens
> # Stall
> 
> From a second look at the trace, this should be the right series of
> events, since "epoll-stall-writer" with bandwidth control seems
> to have been cut off while doing the wakeup and hasn't run
> again.
> 
> Sorry for the noise.
> 

Thanks for the analysis.

I'm testing this reproducer with this series and haven't noticed any
issue yet; I'll report back if anything unexpected happens.

Best wishes,
Aaron

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-22  2:10                   ` Aaron Lu
@ 2025-04-22  2:54                     ` K Prateek Nayak
  2025-04-22 14:54                       ` Florian Bezdeka
  0 siblings, 1 reply; 45+ messages in thread
From: K Prateek Nayak @ 2025-04-22  2:54 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Jan Kiszka, Florian Bezdeka, Valentin Schneider, Ben Segall,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou,
	Sebastian Andrzej Siewior,

Hello Aaron,

On 4/22/2025 7:40 AM, Aaron Lu wrote:
>> anon_pipe_write()
>>    __wake_up_common()
>>      ep_poll_callback() {
>>        read_lock_irq(&ep->lock)		/* Read lock acquired here */
> I was confused by this function's name. I had thought irq is off but
> then realized under PREEMPT_RT, read_lock_irq() doesn't disable irq...

Yup! Most of the interrupt handlers are run by IRQ threads on
PREEMPT_RT, and the ones that do run in interrupt context have all
been adapted to use non-blocking locks whose *_irq variants disable
interrupts on PREEMPT_RT too.

> 
>>        __wake_up_common()
>>          ep_autoremove_wake_function()
>>            try_to_wake_up()		/* Wakes up "epoll-stall" */
>>              preempt_schedule()
>>              ...
>>
>> # "epoll-stall-writer" has run out of bandwidth, needs replenish to run
> Luckily in this "only throttle when ret2user" model, epoll-stall-writer
> does not need replenish to run again(and then unblock the others).

I can confirm that throttle deferral solves this issue. I have run Jan's
reproducer for a long time without seeing any hangs on your series. I
hope Florian can confirm the same.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
  2025-04-22  2:54                     ` K Prateek Nayak
@ 2025-04-22 14:54                       ` Florian Bezdeka
  0 siblings, 0 replies; 45+ messages in thread
From: Florian Bezdeka @ 2025-04-22 14:54 UTC (permalink / raw)
  To: K Prateek Nayak, Aaron Lu
  Cc: Jan Kiszka, Valentin Schneider, Ben Segall, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Sebastian Andrzej Siewior,

On Tue, 2025-04-22 at 08:24 +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 4/22/2025 7:40 AM, Aaron Lu wrote:
> > > anon_pipe_write()
> > >    __wake_up_common()
> > >      ep_poll_callback() {
> > >        read_lock_irq(&ep->lock)		/* Read lock acquired here */
> > I was confused by this function's name. I had thought irq is off but
> > then realized under PREEMPT_RT, read_lock_irq() doesn't disable irq...
> 
> Yup! Most of the interrupt handlers are run by the IRQ threads on
> PREEMPT_RT and the ones that do run in the interrupt context have all
> been adapted to use non-blocking locks whose *_irq variants disables
> interrupts on PREEMPT_RT too.
> 
> > 
> > >        __wake_up_common()
> > >          ep_autoremove_wake_function()
> > >            try_to_wake_up()		/* Wakes up "epoll-stall" */
> > >              preempt_schedule()
> > >              ...
> > > 
> > > # "epoll-stall-writer" has run out of bandwidth, needs replenish to run
> > Luckily in this "only throttle when ret2user" model, epoll-stall-writer
> > does not need replenish to run again(and then unblock the others).
> 
> I can confirm that throttle deferral solves this issue. I have run Jan's
> reproducer for a long time without seeing any hangs on your series. I
> hope Florian can confirm the same.
> 

Partially, yes.

First, let me clarify what I am testing: I'm testing with PREEMPT_RT
enabled, as that is the setup that makes problems in the field. For
those setups it's not a performance/jitter optimization it's a critical
fix. The system locks up completely.

I ported the series to 6.14. The background was stability and the
possibility of replacing one of the devices in the field with a patched
version. We do not trust anything newer yet.

The test results: 6.14 + backport has now been running fine for ~10 days
on a system where the reproducer (that Jan posted already) crashed an
unpatched 6.14 within a couple of minutes. Success.

But: I also started a test with 6.14 vanilla (so unpatched) on a
different system. This one crashes within a couple of minutes. This is
a completely different story - as the series we're discussing here is
not even applied - but to be complete, this is the last message we get
from the device:

The device is completely locked up afterwards. PID 34 is ktimers on
CPU1.

kernel: ------------[ cut here ]------------
kernel: !se->on_rq
kernel: WARNING: CPU: 1 PID: 34 at kernel/sched/fair.c:699 update_entity_lag+0x7d/0x90
kernel: Modules linked in: veth xt_nat nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfr>
kernel:  sd_mod mptspi ata_generic mptscsih mptbase psmouse scsi_transport_spi ata_piix libata scs>
kernel: CPU: 1 UID: 0 PID: 34 Comm: ktimers/1 Not tainted 6.14.0 #1
kernel: Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.242>
kernel: RIP: 0010:update_entity_lag+0x7d/0x90
kernel: Code: 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc 80 3d e7 f4 dd 01 00 75 a9 48 c7 c7 d0 81 >
kernel: RSP: 0018:ffffacf58012fbe8 EFLAGS: 00010082
kernel: RAX: 0000000000000000 RBX: ffff9ee43ca00080 RCX: 0000000000000027
kernel: RDX: ffff9ee6efd21988 RSI: 0000000000000001 RDI: ffff9ee6efd21980
kernel: RBP: ffff9ee421929800 R08: 00000000462951bd R09: ffffffff8e654811
kernel: R10: ffffffff8e654811 R11: ffffffff8e608a2a R12: 000000000000000e
kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000e
kernel: FS:  0000000000000000(0000) GS:ffff9ee6efd00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000000c00082a000 CR3: 0000000113416002 CR4: 00000000007706f0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? __warn+0x91/0x190
kernel:  ? update_entity_lag+0x7d/0x90
kernel:  ? report_bug+0x164/0x190
kernel:  ? handle_bug+0x58/0x90
kernel:  ? exc_invalid_op+0x17/0x70
kernel:  ? asm_exc_invalid_op+0x1a/0x20
kernel:  ? ret_from_fork_asm+0x1a/0x30
kernel:  ? ret_from_fork+0x31/0x50
kernel:  ? ret_from_fork+0x31/0x50
kernel:  ? update_entity_lag+0x7d/0x90
kernel:  ? update_entity_lag+0x7d/0x90
kernel:  dequeue_entity+0x90/0x5a0
kernel:  dequeue_entities+0x121/0x640
kernel:  dequeue_task_fair+0xbf/0x290
kernel:  rt_mutex_setprio+0x37c/0x690
kernel:  rtlock_slowlock_locked+0xca1/0x1860
kernel:  ? lock_acquire+0xcb/0x2e0
kernel:  ? run_ktimerd+0xe/0x80
kernel:  ? __pfx_smpboot_thread_fn+0x10/0x10
kernel:  rt_spin_lock+0x86/0x160
kernel:  __local_bh_disable_ip+0x9d/0x190
kernel:  ksoftirqd_run_begin+0xe/0x30
kernel:  run_ktimerd+0xe/0x80
kernel:  smpboot_thread_fn+0xda/0x1d0



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-18  3:15     ` Aaron Lu
@ 2025-04-22 15:03       ` Florian Bezdeka
  2025-04-23 11:26         ` Aaron Lu
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Bezdeka @ 2025-04-22 15:03 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

[-- Attachment #1: Type: text/plain, Size: 4049 bytes --]

On Fri, 2025-04-18 at 11:15 +0800, Aaron Lu wrote:
> Hi Florian,
> 
> On Thu, Apr 17, 2025 at 04:06:16PM +0200, Florian Bezdeka wrote:
> > Hi Aaron,
> > 
> > On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> > > @@ -5889,27 +5943,21 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> > >  	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > >  		cfs_rq->throttled_clock_pelt;
> > >  
> > > -	if (cfs_rq->throttled_clock_self) {
> > > -		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> > > -
> > > -		cfs_rq->throttled_clock_self = 0;
> > > -
> > > -		if (WARN_ON_ONCE((s64)delta < 0))
> > > -			delta = 0;
> > > -
> > > -		cfs_rq->throttled_clock_self_time += delta;
> > > -	}
> > > +	if (cfs_rq->throttled_clock_self)
> > > +		account_cfs_rq_throttle_self(cfs_rq);
> > >  
> > >  	/* Re-enqueue the tasks that have been throttled at this level. */
> > >  	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> > >  		list_del_init(&p->throttle_node);
> > > -		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
> > > +		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
> > >  	}
> > >  
> > >  	/* Add cfs_rq with load or one or more already running entities to the list */
> > >  	if (!cfs_rq_is_decayed(cfs_rq))
> > >  		list_add_leaf_cfs_rq(cfs_rq);
> > >  
> > > +	WARN_ON_ONCE(cfs_rq->h_nr_throttled);
> > > +
> > >  	return 0;
> > >  }
> > >  
> > 
> > I got this warning while testing in our virtual environment:
> 
> Thanks for the report.
> 
> > 
> > Any idea?
> > 
> 
> Most likely the accounting of h_nr_throttled is incorrect somewhere.
> 
> > [   26.639641] ------------[ cut here ]------------
> > [   26.639644] WARNING: CPU: 5 PID: 0 at kernel/sched/fair.c:5967 tg_unthrottle_up+0x1a6/0x3d0
> 
> The line doesn't match the code though, the below warning should be at
> line 5959:
> WARN_ON_ONCE(cfs_rq->h_nr_throttled);

See below.

> 
> > [   26.639653] Modules linked in: veth xt_nat nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc xt_recent rfkill ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt vsock_loopback vmw_vsock_virtio_transport_common ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog vmw_vsock_vmci_transport xt_comment vsock nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables intel_rapl_msr intel_rapl_common nfnetlink binfmt_misc intel_uncore_frequency_common isst_if_mbox_msr isst_if_common skx_edac_common nfit libnvdimm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel snd_pcm crypto_simd cryptd snd_timer rapl snd soundcore vmw_balloon vmwgfx pcspkr drm_ttm_helper ttm drm_client_lib button ac drm_kms_helper sg vmw_vmci evdev joydev serio_raw drm loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 overlay nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 squashfs dm_verity dm_bufio reed_solomon dm_mod
> > [   26.639715]  sd_mod ata_generic mptspi mptscsih ata_piix mptbase libata scsi_transport_spi psmouse scsi_mod vmxnet3 i2c_piix4 i2c_smbus scsi_common
> > [   26.639726] CPU: 5 UID: 0 PID: 0 Comm: swapper/5 Not tainted 6.14.2-CFSfixes #1
> 
> 6.14.2-CFSfixes seems to be a backported kernel?
> Do you also see this warning when using this series on top of the said
> base commit 6432e163ba1b("sched/isolation: Make use of more than one
> housekeeping cpu")? Just want to make sure it's not a problem due to
> backport.

Right, I should have mentioned that crucial detail. Sorry.

I ported your series to 6.14.2 because we did/do not trust anything
newer yet for testing. The problematic workload was not available in
our lab at that time, so we had to be very careful about which kernel
versions we deployed.

I'm attaching the backported patches now, so you can compare / review
if you like. Spoiler: The only differences are line numbers ;-)


Best regards,
Florian


[-- Attachment #2: 0001-sched-fair-Add-related-data-structure-for-task-based.patch --]
[-- Type: text/x-patch, Size: 2702 bytes --]

From 8d97fe0ceb34367714aa5c11110b5f0264eff130 Mon Sep 17 00:00:00 2001
From: Valentin Schneider <vschneid@redhat.com>
Date: Wed, 9 Apr 2025 20:07:40 +0800
Subject: [PATCH 1/7] sched/fair: Add related data structure for task based
 throttle

From: Valentin Schneider <vschneid@redhat.com>

Add related data structures for this new throttle functionality.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
---
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 ++++++++++++
 kernel/sched/sched.h  |  2 ++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e5c38718ff56..35a7b61300979 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -864,6 +864,10 @@ struct task_struct {
 
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct callback_head		sched_throttle_work;
+	struct list_head		throttle_node;
+#endif
 #endif
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c7c942c7c429..c4bb3ad52ce3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4494,6 +4494,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_throttle_work(p);
+#endif
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89c7260103e18..d27dd55b65dc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5832,6 +5832,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+	INIT_LIST_HEAD(&p->throttle_node);
+}
+
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1aa65a0ac5864..81196d6888b66 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2724,6 +2724,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
+extern void init_cfs_throttle_work(struct task_struct *p);
+
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
 #define RATIO_SHIFT		8
-- 
2.39.5


[-- Attachment #3: 0002-sched-fair-Handle-throttle-path-for-task-based-throt.patch --]
[-- Type: text/x-patch, Size: 11529 bytes --]

From 1cbef381c60422b966d8f6b4466a7adb8b3bbdca Mon Sep 17 00:00:00 2001
From: Florian Bezdeka <florian.bezdeka@siemens.com>
Date: Wed, 9 Apr 2025 22:59:53 +0200
Subject: [PATCH 2/7] sched/fair: Handle throttle path for task based throttle

In the current throttle model, when a cfs_rq is throttled, its sched
entity is dequeued from the cpu's rq, making tasks attached to it unable
to run, thus achieving the throttle target.

This has a drawback though: assume a task is a reader of a percpu_rwsem
and is waiting. When it gets woken up, it can not run till its task
group's next period comes, which can be a relatively long time. The
waiting writer has to wait longer because of this, and it also makes
further readers build up, eventually triggering task hung.

To improve this situation, change the throttle model to be task based,
i.e. when a cfs_rq is throttled, record its throttled status but do not
remove it from the cpu's rq. Instead, for tasks that belong to this
cfs_rq, add a task work to them when they get picked so that when they
return to user space, they can be dequeued. In this way, throttled tasks
will not hold any kernel resources.

[Florian: manual backport to 6.14]

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
---
 kernel/sched/fair.c  | 185 +++++++++++++++++++++----------------------
 kernel/sched/sched.h |   1 +
 2 files changed, 92 insertions(+), 94 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d27dd55b65dc2..62810cd3faa2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5525,9 +5525,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
-	if (cfs_rq->nr_queued == 0)
+	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
-
+		if (throttled_hierarchy(cfs_rq))
+			list_del_leaf_cfs_rq(cfs_rq);
+	}
 	return true;
 }
 
@@ -5607,7 +5609,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	return se;
 }
 
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -5832,8 +5834,48 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
+	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+
+	WARN_ON_ONCE(p != current);
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+
+	/*
+	 * If task is exiting, then there won't be a return to userspace, so we
+	 * don't have to bother with any of this.
+	 */
+	if ((p->flags & PF_EXITING))
+		return;
+
+	scoped_guard(task_rq_lock, p) {
+		se = &p->se;
+		cfs_rq = cfs_rq_of(se);
+
+		/* Raced, forget */
+		if (p->sched_class != &fair_sched_class)
+			return;
+
+		/*
+		 * If not in limbo, then either replenish has happened or this
+		 * task got migrated out of the throttled cfs_rq, move along.
+		 */
+		if (!cfs_rq->throttle_count)
+			return;
+
+		rq = scope.rq;
+		update_rq_clock(rq);
+		WARN_ON_ONCE(!list_empty(&p->throttle_node));
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		resched_curr(rq);
+	}
+
+	cond_resched_tasks_rcu_qs();
 }
 
 void init_cfs_throttle_work(struct task_struct *p)
@@ -5873,32 +5915,53 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	return 0;
 }
 
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+	return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+	if (task_has_throttle_work(p))
+		return;
+
+	/*
+	 * Kthreads and exiting tasks don't return to userspace, so adding the
+	 * work is pointless
+	 */
+	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+		return;
+
+	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
+	cfs_rq->throttle_count++;
+	if (cfs_rq->throttle_count > 1)
+		return 0;
+
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
-		list_del_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
 
-		SCHED_WARN_ON(cfs_rq->throttled_clock_self);
-		if (cfs_rq->nr_queued)
-			cfs_rq->throttled_clock_self = rq_clock(rq);
-	}
-	cfs_rq->throttle_count++;
+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+	if (cfs_rq->nr_queued)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+	else
+		list_del_leaf_cfs_rq(cfs_rq);
 
+	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
 
-static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	int dequeue = 1;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5919,74 +5982,13 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (!dequeue)
-		return false;  /* Throttle no longer required. */
-
-	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+		return;  /* Throttle no longer required. */
 
 	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		int flags;
-
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		/*
-		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
-		 * This avoids teaching dequeue_entities() about throttled
-		 * entities and keeps things relatively simple.
-		 */
-		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
-		if (se->sched_delayed)
-			flags |= DEQUEUE_DELAYED;
-		dequeue_entity(qcfs_rq, se, flags);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-
-		if (qcfs_rq->load.weight) {
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			break;
-		}
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		update_load_avg(qcfs_rq, se, 0);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-	}
-
-	/* At this point se is NULL and we are at root level*/
-	sub_nr_running(rq, queued_delta);
-
-	/* Stop the fair server if throttling resulted in no runnable tasks */
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
-done:
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5995,7 +5997,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	SCHED_WARN_ON(cfs_rq->throttled_clock);
 	if (cfs_rq->nr_queued)
 		cfs_rq->throttled_clock = rq_clock(rq);
-	return true;
 }
 
 void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -6471,22 +6472,22 @@ static void sync_throttle(struct task_group *tg, int cpu)
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_bandwidth_used())
-		return false;
+		return;
 
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return false;
+		return;
 
 	/*
 	 * it's possible for a throttled entity to be forced into a running
 	 * state (e.g. set_curr_task), in this case we're finished.
 	 */
 	if (cfs_rq_throttled(cfs_rq))
-		return true;
+		return;
 
-	return throttle_cfs_rq(cfs_rq);
+	throttle_cfs_rq(cfs_rq);
 }
 
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
@@ -6582,6 +6583,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@ -6747,10 +6749,11 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 #else /* CONFIG_CFS_BANDWIDTH */
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void task_throttle_setup_work(struct task_struct *p) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7117,10 +7120,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
-
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7157,10 +7156,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -8869,8 +8864,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
 
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto again;
+		check_cfs_rq_runtime(cfs_rq);
 
 		se = pick_next_entity(rq, cfs_rq);
 		if (!se)
@@ -8897,6 +8891,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		goto idle;
 	se = &p->se;
 
+	if (throttled_hierarchy(cfs_rq_of(se)))
+		task_throttle_setup_work(p);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (prev->sched_class != &fair_sched_class)
 		goto simple;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81196d6888b66..a7ae74366c078 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -742,6 +742,7 @@ struct cfs_rq {
 	int			throttle_count;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	throttled_limbo_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.39.5


[-- Attachment #4: 0003-sched-fair-Handle-unthrottle-path-for-task-based-thr.patch --]
[-- Type: text/x-patch, Size: 8385 bytes --]

From 44468ef7c3701c780bec4f7c09e1869c301f39c8 Mon Sep 17 00:00:00 2001
From: Florian Bezdeka <florian.bezdeka@siemens.com>
Date: Wed, 9 Apr 2025 23:29:46 +0200
Subject: [PATCH 3/7] sched/fair: Handle unthrottle path for task based
 throttle

On unthrottle, enqueue throttled tasks back so they can continue to run.

Note that with task based throttling, the only place a task gets
throttled is on its return to user space, so as long as a task is
enqueued, no matter whether its cfs_rq is throttled or not, it is
allowed to run until it reaches that throttle point.

The leaf_cfs_rq list is handled differently now: as long as a task is
enqueued to a cfs_rq, throttled or not, that cfs_rq is added to the
list; when a cfs_rq is throttled and all its tasks are dequeued, it is
removed from the list. I think this is easy to reason about, so I chose
to do so.

[Florian: manual backport to 6.14]

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
---
 kernel/sched/fair.c | 129 ++++++++++++++++----------------------------
 1 file changed, 45 insertions(+), 84 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62810cd3faa2e..8e3c2ae8da64a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5366,18 +5366,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (cfs_rq->nr_queued == 1) {
 		check_enqueue_throttle(cfs_rq);
-		if (!throttled_hierarchy(cfs_rq)) {
-			list_add_leaf_cfs_rq(cfs_rq);
-		} else {
+		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
 			struct rq *rq = rq_of(cfs_rq);
 
 			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
 				cfs_rq->throttled_clock = rq_clock(rq);
 			if (!cfs_rq->throttled_clock_self)
 				cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
 		}
+#endif
 	}
 }
 
@@ -5834,6 +5833,11 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static inline bool task_is_throttled(struct task_struct *p)
+{
+	return !list_empty(&p->throttle_node);
+}
+
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
@@ -5886,32 +5890,41 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }
 
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+	struct task_struct *p, *tmp;
 
 	cfs_rq->throttle_count--;
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
-					     cfs_rq->throttled_clock_pelt;
+	if (cfs_rq->throttle_count)
+		return 0;
 
-		/* Add cfs_rq with load or one or more already running entities to the list */
-		if (!cfs_rq_is_decayed(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+		cfs_rq->throttled_clock_pelt;
 
-		if (cfs_rq->throttled_clock_self) {
-			u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+	if (cfs_rq->throttled_clock_self) {
+		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
 
-			cfs_rq->throttled_clock_self = 0;
+		cfs_rq->throttled_clock_self = 0;
 
-			if (SCHED_WARN_ON((s64)delta < 0))
-				delta = 0;
+		if (SCHED_WARN_ON((s64)delta < 0))
+			delta = 0;
 
-			cfs_rq->throttled_clock_self_time += delta;
-		}
+		cfs_rq->throttled_clock_self_time += delta;
+	}
+
+	/* Re-enqueue the tasks that have been throttled at this level. */
+	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+		list_del_init(&p->throttle_node);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
 	}
 
+	/* Add cfs_rq with load or one or more already running entities to the list */
+	if (!cfs_rq_is_decayed(cfs_rq))
+		list_add_leaf_cfs_rq(cfs_rq);
+
 	return 0;
 }
 
@@ -6003,11 +6016,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
+	/*
+	 * It's possible we are called with !runtime_remaining due to things
+	 * like the user changing the quota setting (see tg_set_cfs_bandwidth())
+	 * or an async unthrottle gave us a positive runtime_remaining but other
+	 * still running entities consumed that runtime before we reached here.
+	 *
+	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * because any enqueue below will immediately trigger a throttle, which
+	 * is not supposed to happen on unthrottle path.
+	 */
+	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+		return;
 
 	cfs_rq->throttled = 0;
 
@@ -6035,62 +6057,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 			if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
 				break;
 		}
-		goto unthrottle_throttle;
 	}
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		/* Handle any unfinished DELAY_DEQUEUE business first. */
-		if (se->sched_delayed) {
-			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
-			dequeue_entity(qcfs_rq, se, flags);
-		} else if (se->on_rq)
-			break;
-		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		update_load_avg(qcfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	/* Start the fair server if un-throttling resulted in new runnable tasks */
-	if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
-		dl_server_start(&rq->fair_server);
-
-	/* At this point se is NULL and we are at root level*/
-	add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
@@ -6754,6 +6722,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
+static bool task_is_throttled(struct task_struct *p) { return false; }
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -6962,6 +6931,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		util_est_enqueue(&rq->cfs, p);
 
 	if (flags & ENQUEUE_DELAYED) {
+		WARN_ON_ONCE(task_is_throttled(p));
 		requeue_delayed_entity(se);
 		return;
 	}
@@ -7004,10 +6974,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
-
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -7029,10 +6995,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
 	}
 
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7062,7 +7024,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		check_update_overutilized_status(rq);
 
-enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
-- 
2.39.5


[-- Attachment #5: 0004-sched-fair-Take-care-of-group-affinity-sched_class-c.patch --]
[-- Type: text/x-patch, Size: 2528 bytes --]

From 18957647d10a17b36cd8ecdb31f661aa931ea8a1 Mon Sep 17 00:00:00 2001
From: Florian Bezdeka <florian.bezdeka@siemens.com>
Date: Wed, 9 Apr 2025 23:33:42 +0200
Subject: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class
 change for throttled task

On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
core will dequeue it and then requeue it.

A throttled task is still considered queued by core because p->on_rq is
still set, so core will dequeue it; but since the task was already
dequeued on throttle in fair, handle this case properly.

Affinity and sched class changes are similar.

[Florian: manual backport to 6.14]

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
---
 kernel/sched/fair.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e3c2ae8da64a..fc42ffcfcccb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5890,6 +5890,20 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }
 
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+	/*
+	 * Task is throttled and someone wants to dequeue it again:
+	 * it must be sched/core when core needs to do things like
+	 * task affinity change, task group change, task sched class
+	 * change etc.
+	 */
+	WARN_ON_ONCE(p->se.on_rq);
+	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
+
+	list_del_init(&p->throttle_node);
+}
+
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
@@ -6723,6 +6737,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7153,6 +7168,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (unlikely(task_is_throttled(p))) {
+		dequeue_throttled_task(p, flags);
+		return true;
+	}
+
 	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
 		util_est_dequeue(&rq->cfs, p);
 
-- 
2.39.5


[-- Attachment #6: 0005-sched-fair-get-rid-of-throttled_lb_pair.patch --]
[-- Type: text/x-patch, Size: 2795 bytes --]

From 4d42e8cc7e88965f5386bfdf1f8d5988a8e8a496 Mon Sep 17 00:00:00 2001
From: Florian Bezdeka <florian.bezdeka@siemens.com>
Date: Wed, 9 Apr 2025 23:40:53 +0200
Subject: [PATCH 5/7] sched/fair: get rid of throttled_lb_pair()

Now that throttled tasks are dequeued and can not stay on the rq's
cfs_tasks list, there is no need to take special care of these throttled
tasks in load balance anymore.

[Florian: manual backport to 6.14]

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
---
 kernel/sched/fair.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc42ffcfcccb9..a12d2fb98d083 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5816,23 +5816,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttle_count;
 }
 
-/*
- * Ensure that neither of the group entities corresponding to src_cpu or
- * dest_cpu are members of a throttled hierarchy when performing group
- * load-balance operations.
- */
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
-
-	src_cfs_rq = tg->cfs_rq[src_cpu];
-	dest_cfs_rq = tg->cfs_rq[dest_cpu];
-
-	return throttled_hierarchy(src_cfs_rq) ||
-	       throttled_hierarchy(dest_cfs_rq);
-}
-
 static inline bool task_is_throttled(struct task_struct *p)
 {
 	return !list_empty(&p->throttle_node);
@@ -6749,12 +6732,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	return 0;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9384,17 +9361,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) delayed dequeued unless we migrate load, or
-	 * 2) throttled_lb_pair, or
-	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 4) running (obviously), or
-	 * 5) are cache-hot on their current CPU.
+	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
+	 * 3) running (obviously), or
+	 * 4) are cache-hot on their current CPU.
 	 */
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
-	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
-		return 0;
-
 	/*
 	 * We want to prioritize the migration of eligible tasks.
 	 * For ineligible tasks we soft-limit them and only allow
-- 
2.39.5


[-- Attachment #7: 0006-sched-fair-fix-h_nr_runnable-accounting-with-per-tas.patch --]
[-- Type: text/x-patch, Size: 1228 bytes --]

From d88498597846c0e79d8da196a738816c7290d2d4 Mon Sep 17 00:00:00 2001
From: Florian Bezdeka <florian.bezdeka@siemens.com>
Date: Wed, 9 Apr 2025 23:42:42 +0200
Subject: [PATCH 6/7] sched/fair: fix h_nr_runnable accounting with per-task
 throttle

Task based throttle does not adjust cfs_rq's h_nr_runnable on throttle
anymore but relies on standard en/dequeue_entity(), so there is no need
to take special care of h_nr_runnable in delayed dequeue operations.

[Florian: manual backport to 6.14]

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a12d2fb98d083..4e9079f2e3a6a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5415,8 +5415,6 @@ static void set_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable--;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
 
@@ -5437,8 +5435,6 @@ static void clear_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable++;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
 
-- 
2.39.5


[-- Attachment #8: 0007-sched-fair-alternative-way-of-accounting-throttle-ti.patch --]
[-- Type: text/x-patch, Size: 11369 bytes --]

From ebd646f8eabb245e5d6e04389febc513b47a5b9a Mon Sep 17 00:00:00 2001
From: Florian Bezdeka <florian.bezdeka@siemens.com>
Date: Wed, 16 Apr 2025 12:41:03 +0200
Subject: [PATCH 7/7] sched/fair: alternative way of accounting throttle time

Implement an alternative way of accounting cfs_rq throttle time which:
- starts accounting when a throttled cfs_rq has no tasks enqueued and its
  throttled limbo list is not empty;
- stops accounting when this cfs_rq gets unthrottled or a task gets
  enqueued.

This way, throttle time is only accounted while the cfs_rq has
absolutely no tasks enqueued but still has throttled tasks.

[Florian: manual backport to 6.14]

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
---
 kernel/sched/fair.c  | 112 ++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h |   4 ++
 2 files changed, 89 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e9079f2e3a6a..e515a1b43bba8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5309,6 +5309,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+static void account_cfs_rq_throttle_self(struct cfs_rq *cfs_rq);
 
 static void
 requeue_delayed_entity(struct sched_entity *se);
@@ -5371,10 +5372,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (throttled_hierarchy(cfs_rq)) {
 			struct rq *rq = rq_of(cfs_rq);
 
-			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
-				cfs_rq->throttled_clock = rq_clock(rq);
-			if (!cfs_rq->throttled_clock_self)
-				cfs_rq->throttled_clock_self = rq_clock(rq);
+			if (cfs_rq->throttled_clock) {
+				cfs_rq->throttled_time +=
+					rq_clock(rq) - cfs_rq->throttled_clock;
+				cfs_rq->throttled_clock = 0;
+			}
+
+			if (cfs_rq->throttled_clock_self)
+				account_cfs_rq_throttle_self(cfs_rq);
 		}
 #endif
 	}
@@ -5462,7 +5467,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		 * DELAY_DEQUEUE relies on spurious wakeups, special task
 		 * states must not suffer spurious wakeups, excempt them.
 		 */
-		if (flags & DEQUEUE_SPECIAL)
+		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
 			delay = false;
 
 		SCHED_WARN_ON(delay && se->sched_delayed);
@@ -5522,8 +5527,24 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
-		if (throttled_hierarchy(cfs_rq))
+
+#ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
 			list_del_leaf_cfs_rq(cfs_rq);
+
+			if (cfs_rq->h_nr_throttled) {
+				struct rq *rq = rq_of(cfs_rq);
+
+				WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+				cfs_rq->throttled_clock_self = rq_clock(rq);
+
+				if (cfs_rq_throttled(cfs_rq)) {
+					WARN_ON_ONCE(cfs_rq->throttled_clock);
+					cfs_rq->throttled_clock = rq_clock(rq);
+				}
+			}
+		}
+#endif
 	}
 	return true;
 }
@@ -5817,6 +5838,18 @@ static inline bool task_is_throttled(struct task_struct *p)
 	return !list_empty(&p->throttle_node);
 }
 
+static inline void
+cfs_rq_inc_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr)
+{
+	cfs_rq->h_nr_throttled += nr;
+}
+
+static inline void
+cfs_rq_dec_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr)
+{
+	cfs_rq->h_nr_throttled -= nr;
+}
+
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
@@ -5853,7 +5886,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		rq = scope.rq;
 		update_rq_clock(rq);
 		WARN_ON_ONCE(!list_empty(&p->throttle_node));
-		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		resched_curr(rq);
 	}
@@ -5871,16 +5904,37 @@ void init_cfs_throttle_work(struct task_struct *p)
 
 static void dequeue_throttled_task(struct task_struct *p, int flags)
 {
+	struct sched_entity *se = &p->se;
+
 	/*
 	 * Task is throttled and someone wants to dequeue it again:
 	 * it must be sched/core when core needs to do things like
 	 * task affinity change, task group change, task sched class
 	 * change etc.
 	 */
-	WARN_ON_ONCE(p->se.on_rq);
-	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
+	WARN_ON_ONCE(se->on_rq);
+	WARN_ON_ONCE(flags & DEQUEUE_THROTTLE);
 
 	list_del_init(&p->throttle_node);
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		cfs_rq->h_nr_throttled--;
+	}
+}
+
+static void account_cfs_rq_throttle_self(struct cfs_rq *cfs_rq)
+{
+	/* account self time */
+	u64 delta = rq_clock(rq_of(cfs_rq)) - cfs_rq->throttled_clock_self;
+
+	cfs_rq->throttled_clock_self = 0;
+
+	if (WARN_ON_ONCE((s64)delta < 0))
+		delta = 0;
+
+	cfs_rq->throttled_clock_self_time += delta;
 }
 
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
@@ -5897,27 +5951,21 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
 		cfs_rq->throttled_clock_pelt;
 
-	if (cfs_rq->throttled_clock_self) {
-		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
-
-		cfs_rq->throttled_clock_self = 0;
-
-		if (SCHED_WARN_ON((s64)delta < 0))
-			delta = 0;
-
-		cfs_rq->throttled_clock_self_time += delta;
-	}
+	if (cfs_rq->throttled_clock_self)
+		account_cfs_rq_throttle_self(cfs_rq);
 
 	/* Re-enqueue the tasks that have been throttled at this level. */
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
 	if (!cfs_rq_is_decayed(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
+	WARN_ON_ONCE(cfs_rq->h_nr_throttled);
+
 	return 0;
 }
 
@@ -5953,10 +6001,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	/* group is entering throttled state, stop time */
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
 
-	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock_self = rq_clock(rq);
-	else
+	if (!cfs_rq->nr_queued)
 		list_del_leaf_cfs_rq(cfs_rq);
 
 	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
@@ -6000,9 +6045,6 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * throttled-list.  rq->lock protects completion.
 	 */
 	cfs_rq->throttled = 1;
-	SCHED_WARN_ON(cfs_rq->throttled_clock);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock = rq_clock(rq);
 }
 
 void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -6033,6 +6075,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;
 		cfs_rq->throttled_clock = 0;
 	}
+	if (cfs_rq->throttled_time) {
+		cfs_b->throttled_time += cfs_rq->throttled_time;
+		cfs_rq->throttled_time = 0;
+	}
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -6717,6 +6763,8 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
 static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static void cfs_rq_inc_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr) {}
+static void cfs_rq_dec_h_nr_throttled(struct cfs_rq *cfs_rq, unsigned int nr) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -6905,6 +6953,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int h_nr_idle = task_has_idle_policy(p);
 	int h_nr_runnable = 1;
+	int h_nr_throttled = (flags & ENQUEUE_THROTTLE) ? 1 : 0;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	u64 slice = 0;
@@ -6958,6 +7007,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_runnable += h_nr_runnable;
 		cfs_rq->h_nr_queued++;
 		cfs_rq->h_nr_idle += h_nr_idle;
+		cfs_rq_dec_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
@@ -6980,6 +7030,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_runnable += h_nr_runnable;
 		cfs_rq->h_nr_queued++;
 		cfs_rq->h_nr_idle += h_nr_idle;
+		cfs_rq_dec_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
@@ -7034,10 +7085,12 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
+	bool task_throttle = flags & DEQUEUE_THROTTLE;
 	struct task_struct *p = NULL;
 	int h_nr_idle = 0;
 	int h_nr_queued = 0;
 	int h_nr_runnable = 0;
+	int h_nr_throttled = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
 
@@ -7047,6 +7100,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		h_nr_idle = task_has_idle_policy(p);
 		if (task_sleep || task_delayed || !se->sched_delayed)
 			h_nr_runnable = 1;
+
+		if (task_throttle)
+			h_nr_throttled = 1;
 	} else {
 		cfs_rq = group_cfs_rq(se);
 		slice = cfs_rq_min_slice(cfs_rq);
@@ -7065,6 +7121,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
 		cfs_rq->h_nr_queued -= h_nr_queued;
 		cfs_rq->h_nr_idle -= h_nr_idle;
+		cfs_rq_inc_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
@@ -7102,6 +7159,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
 		cfs_rq->h_nr_queued -= h_nr_queued;
 		cfs_rq->h_nr_idle -= h_nr_idle;
+		cfs_rq_inc_h_nr_throttled(cfs_rq, h_nr_throttled);
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7ae74366c078..f994123c327b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -727,6 +727,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
+	unsigned int		h_nr_throttled;
 	s64			runtime_remaining;
 
 	u64			throttled_pelt_idle;
@@ -738,6 +739,7 @@ struct cfs_rq {
 	u64			throttled_clock_pelt_time;
 	u64			throttled_clock_self;
 	u64			throttled_clock_self_time;
+	u64			throttled_time;
 	int			throttled;
 	int			throttle_count;
 	struct list_head	throttled_list;
@@ -2381,6 +2383,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_SPECIAL		0x10
 #define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 #define DEQUEUE_DELAYED		0x200 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_THROTTLE	0x800 /* Matches ENQUEUE_THROTTLE */
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
@@ -2398,6 +2401,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATING	0x100
 #define ENQUEUE_DELAYED		0x200
 #define ENQUEUE_RQ_SELECTED	0x400
+#define ENQUEUE_THROTTLE	0x800
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-22 15:03       ` Florian Bezdeka
@ 2025-04-23 11:26         ` Aaron Lu
  2025-04-23 12:15           ` Florian Bezdeka
  0 siblings, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-04-23 11:26 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Tue, Apr 22, 2025 at 05:03:19PM +0200, Florian Bezdeka wrote:
... ...

> Right, I should have mentioned that crucial detail. Sorry.
> 
> I ported your series to 6.14.2 because we did/do not trust anything
> newer yet for testing. The problematic workload was not available in
> our lab at that time, so we had to be very carefully about deployed
> kernel versions.
> 
> I'm attaching the backported patches now, so you can compare / review
> if you like. Spoiler: The only differences are line numbers ;-)

I didn't notice any problem with the backport after a quick look.

May I know what kind of workload triggered this warning? I haven't been
able to trigger it, I'll have to stare harder at the code.

Thanks,
Aaron


* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-23 11:26         ` Aaron Lu
@ 2025-04-23 12:15           ` Florian Bezdeka
  2025-04-24  2:26             ` Aaron Lu
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Bezdeka @ 2025-04-23 12:15 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, 2025-04-23 at 19:26 +0800, Aaron Lu wrote:
> On Tue, Apr 22, 2025 at 05:03:19PM +0200, Florian Bezdeka wrote:
> ... ...
> 
> > Right, I should have mentioned that crucial detail. Sorry.
> > 
> > I ported your series to 6.14.2 because we did/do not trust anything
> > newer yet for testing. The problematic workload was not available in
> > our lab at that time, so we had to be very careful about deployed
> > kernel versions.
> > 
> > I'm attaching the backported patches now, so you can compare / review
> > if you like. Spoiler: The only differences are line numbers ;-)
> 
> I didn't notice any problem regarding backport after a quick look.
> 
> May I know what kind of workload triggered this warning? I haven't been
> able to trigger it, I'll have to stare harder at the code.

There are a couple of containers running. Nothing special as far as I
can tell. Network, IO, at least one container heavily using the epoll
interface.

The system is still operating fine though...

Once again: PREEMPT_RT enabled, so maybe handling an IRQ over the
accounting code could happen? Looking at the warning again it looks
like unthrottle_cfs_rq() is called from IRQ context. Is that expected?

Best regards,
Florian


* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-23 12:15           ` Florian Bezdeka
@ 2025-04-24  2:26             ` Aaron Lu
  0 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-24  2:26 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, Apr 23, 2025 at 02:15:55PM +0200, Florian Bezdeka wrote:
> On Wed, 2025-04-23 at 19:26 +0800, Aaron Lu wrote:
> > On Tue, Apr 22, 2025 at 05:03:19PM +0200, Florian Bezdeka wrote:
> > ... ...
> > 
> > > Right, I should have mentioned that crucial detail. Sorry.
> > > 
> > > I ported your series to 6.14.2 because we did/do not trust anything
> > > newer yet for testing. The problematic workload was not available in
> > > our lab at that time, so we had to be very careful about deployed
> > > kernel versions.
> > > 
> > > I'm attaching the backported patches now, so you can compare / review
> > > if you like. Spoiler: The only differences are line numbers ;-)
> > 
> > I didn't notice any problem regarding backport after a quick look.
> > 
> > May I know what kind of workload triggered this warning? I haven't been
> > able to trigger it, I'll have to stare harder at the code.
> 
> There are a couple of containers running. Nothing special as far as I
> can tell. Network, IO, at least one container heavily using the epoll
> interface.

Thanks for the info, I'll run with PREEMPT_RT enabled and see if I can
find anything.

> 
> The system is still operating fine though...
>

So that means only the h_nr_throttled accounting is incorrect. The throttle
time accounting will be affected, but it looks like the functionality is OK.

> Once again: PREEMPT_RT enabled, so maybe handling an IRQ over the
> accounting code could happen? Looking at the warning again it looks
> like unthrottle_cfs_rq() is called from IRQ context. Is that expected?

Yes it is.

The period timer handler distributes runtime to the individual cfs_rqs
of this task_group, and those cfs_rqs are per-cpu. The timer handler
does this asynchronously, i.e. it sends an IPI to the corresponding CPU
to let it deal with unthrottling its own cfs_rq, to reduce the time this
timer handler runs. See commit 8ad075c2eb1f ("sched: Async unthrottling
for cfs bandwidth").

I think this creates an interesting result in PREEMPT_RT: the CPU that
runs the hrtimer handler unthrottles its cfs_rq in ktimerd context while
all others unthrottle their cfs_rqs in hardirq context. I don't see any
problem with this; it just seems inconsistent.

Thanks,
Aaron


* Re: [RFC PATCH v2 2/7] sched/fair: Handle throttle path for task based throttle
  2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
  2025-04-14  8:54   ` Florian Bezdeka
  2025-04-14 14:39   ` Florian Bezdeka
@ 2025-04-30 10:01   ` Aaron Lu
  2 siblings, 0 replies; 45+ messages in thread
From: Aaron Lu @ 2025-04-30 10:01 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, Apr 09, 2025 at 08:07:41PM +0800, Aaron Lu wrote:
... ...
> @@ -8888,6 +8884,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  		goto idle;
>  	se = &p->se;
>  
> +	if (throttled_hierarchy(cfs_rq_of(se)))
> +		task_throttle_setup_work(p);
> +

Looks like this will miss the core scheduling case, where the task pick is
done in pick_task_fair().

I plan to do something below on top to fix core scheduling case:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70f7de82d1d9d..500b41f9aea72 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8858,6 +8858,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
+	struct task_struct *p;
 
 again:
 	cfs_rq = &rq->cfs;
@@ -8877,7 +8878,11 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
-	return task_of(se);
+	p = task_of(se);
+	if (throttled_hierarchy(cfs_rq_of(se)))
+		task_throttle_setup_work(p);
+
+	return p;
 }
 
 static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
@@ -8896,9 +8901,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		goto idle;
 	se = &p->se;
 
-	if (throttled_hierarchy(cfs_rq_of(se)))
-		task_throttle_setup_work(p);
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (prev->sched_class != &fair_sched_class)
 		goto simple;

For non-core-scheduling, this has the same effect as the current code; for
core-scheduling, it makes sure the picked task also gets the throttle
task work added. It could add the throttle task work to a task unnecessarily
because, in the core scheduling case, a picked task may not be able to run
due to cookie and priority reasons, but at least it will not miss the
throttle work this way.

Alternatively, I could add a task_throttle_setup_work(p) somewhere in
set_next_task_fair(), but that would add one more call site of
task_throttle_setup_work() and is not as clean and simple as the above diff.

Feel free to let me know your thoughts, thanks!

>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	if (prev->sched_class != &fair_sched_class)
>  		goto simple;


* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-04-17 14:06   ` Florian Bezdeka
  2025-04-18  3:15     ` Aaron Lu
@ 2025-05-07  9:09     ` Aaron Lu
  2025-05-07  9:33       ` Florian Bezdeka
  1 sibling, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-05-07  9:09 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

Hi Florian,

On Thu, Apr 17, 2025 at 04:06:16PM +0200, Florian Bezdeka wrote:
> Hi Aaron,
> 
> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> > @@ -5889,27 +5943,21 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> >  	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> >  		cfs_rq->throttled_clock_pelt;
> >  
> > -	if (cfs_rq->throttled_clock_self) {
> > -		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> > -
> > -		cfs_rq->throttled_clock_self = 0;
> > -
> > -		if (WARN_ON_ONCE((s64)delta < 0))
> > -			delta = 0;
> > -
> > -		cfs_rq->throttled_clock_self_time += delta;
> > -	}
> > +	if (cfs_rq->throttled_clock_self)
> > +		account_cfs_rq_throttle_self(cfs_rq);
> >  
> >  	/* Re-enqueue the tasks that have been throttled at this level. */
> >  	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> >  		list_del_init(&p->throttle_node);
> > -		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
> > +		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
> >  	}
> >  
> >  	/* Add cfs_rq with load or one or more already running entities to the list */
> >  	if (!cfs_rq_is_decayed(cfs_rq))
> >  		list_add_leaf_cfs_rq(cfs_rq);
> >  
> > +	WARN_ON_ONCE(cfs_rq->h_nr_throttled);
> > +
> >  	return 0;
> >  }
> >  
> 
> I got this warning while testing in our virtual environment:
> 
> Any idea?
>

I made a stupid mistake here: I thought when a cfs_rq gets unthrottled,
it should have no tasks in throttled state, hence I added that check in
tg_unthrottle_up():
        WARN_ON_ONCE(cfs_rq->h_nr_throttled);

But h_nr_throttled tracks the hierarchical throttled task count, which
means if this cfs_rq has descendant cfs_rqs that are still in throttled
state, its h_nr_throttled can be > 0 when it gets unthrottled.

I just made a setup to emulate this scenario and can reproduce this
warning. I guess in your setup, there are multiple cpu.max settings in a
cgroup hierarchy.

It's just that the WARN_ON_ONCE() itself is incorrect; I'll remove it in
the next version. Thanks for the report!

Best regards,
Aaron

> [   26.639641] ------------[ cut here ]------------
> [   26.639644] WARNING: CPU: 5 PID: 0 at kernel/sched/fair.c:5967 tg_unthrottle_up+0x1a6/0x3d0
> [   26.639653] Modules linked in: veth xt_nat nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc xt_recent rfkill ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt vsock_loopback vmw_vsock_virtio_transport_common ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog vmw_vsock_vmci_transport xt_comment vsock nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables intel_rapl_msr intel_rapl_common nfnetlink binfmt_misc intel_uncore_frequency_common isst_if_mbox_msr isst_if_common skx_edac_common nfit libnvdimm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel snd_pcm crypto_simd cryptd snd_timer rapl snd soundcore vmw_balloon vmwgfx pcspkr drm_ttm_helper ttm drm_client_lib button ac drm_kms_helper sg vmw_vmci evdev joydev serio_raw drm loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 overlay nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 squashfs dm_verity dm_bufio reed_solomon dm_mod
> [   26.639715]  sd_mod ata_generic mptspi mptscsih ata_piix mptbase libata scsi_transport_spi psmouse scsi_mod vmxnet3 i2c_piix4 i2c_smbus scsi_common
> [   26.639726] CPU: 5 UID: 0 PID: 0 Comm: swapper/5 Not tainted 6.14.2-CFSfixes #1
> [   26.639729] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.24224532.B64.2408191458 08/19/2024
> [   26.639731] RIP: 0010:tg_unthrottle_up+0x1a6/0x3d0
> [   26.639735] Code: 00 00 48 39 ca 74 14 48 8b 52 10 49 8b 8e 58 01 00 00 48 39 8a 28 01 00 00 74 24 41 8b 86 68 01 00 00 85 c0 0f 84 8d fe ff ff <0f> 0b e9 86 fe ff ff 49 8b 9e 38 01 00 00 41 8b 86 40 01 00 00 48
> [   26.639737] RSP: 0000:ffffa5df8029cec8 EFLAGS: 00010002
> [   26.639739] RAX: 0000000000000001 RBX: ffff981c6fcb6a80 RCX: ffff981943752e40
> [   26.639741] RDX: 0000000000000005 RSI: ffff981c6fcb6a80 RDI: ffff981943752d00
> [   26.639742] RBP: ffff9819607dc708 R08: ffff981c6fcb6a80 R09: 0000000000000000
> [   26.639744] R10: 0000000000000001 R11: ffff981969936a10 R12: ffff9819607dc708
> [   26.639745] R13: ffff9819607dc9d8 R14: ffff9819607dc800 R15: ffffffffad913fb0
> [   26.639747] FS:  0000000000000000(0000) GS:ffff981c6fc80000(0000) knlGS:0000000000000000
> [   26.639749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   26.639750] CR2: 00007ff1292dc44c CR3: 000000015350e006 CR4: 00000000007706f0
> [   26.639779] PKRU: 55555554
> [   26.639781] Call Trace:
> [   26.639783]  <IRQ>
> [   26.639787]  ? __pfx_tg_unthrottle_up+0x10/0x10
> [   26.639790]  ? __pfx_tg_nop+0x10/0x10
> [   26.639793]  walk_tg_tree_from+0x58/0xb0
> [   26.639797]  unthrottle_cfs_rq+0xf0/0x360
> [   26.639800]  ? sched_clock_cpu+0xf/0x190
> [   26.639808]  __cfsb_csd_unthrottle+0x11c/0x170
> [   26.639812]  ? __pfx___cfsb_csd_unthrottle+0x10/0x10
> [   26.639816]  __flush_smp_call_function_queue+0x103/0x410
> [   26.639822]  __sysvec_call_function_single+0x1c/0xb0
> [   26.639826]  sysvec_call_function_single+0x6c/0x90
> [   26.639832]  </IRQ>
> [   26.639833]  <TASK>
> [   26.639834]  asm_sysvec_call_function_single+0x1a/0x20
> [   26.639840] RIP: 0010:pv_native_safe_halt+0xf/0x20
> [   26.639844] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 45 c1 13 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> [   26.639846] RSP: 0000:ffffa5df80117ed8 EFLAGS: 00000242
> [   26.639848] RAX: 0000000000000005 RBX: ffff981940804000 RCX: ffff9819a9df7000
> [   26.639849] RDX: 0000000000000005 RSI: 0000000000000005 RDI: 000000000005c514
> [   26.639851] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000001
> [   26.639852] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [   26.639853] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [   26.639858]  default_idle+0x9/0x20
> [   26.639861]  default_idle_call+0x30/0x100
> [   26.639863]  do_idle+0x1fd/0x240
> [   26.639869]  cpu_startup_entry+0x29/0x30
> [   26.639872]  start_secondary+0x11e/0x140
> [   26.639875]  common_startup_64+0x13e/0x141
> [   26.639881]  </TASK>
> [   26.639882] ---[ end trace 0000000000000000 ]---


* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-05-07  9:09     ` Aaron Lu
@ 2025-05-07  9:33       ` Florian Bezdeka
  2025-05-08  2:45         ` Aaron Lu
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Bezdeka @ 2025-05-07  9:33 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, 2025-05-07 at 17:09 +0800, Aaron Lu wrote:
> Hi Florian,
> 
> On Thu, Apr 17, 2025 at 04:06:16PM +0200, Florian Bezdeka wrote:
> > Hi Aaron,
> > 
> > On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> > > @@ -5889,27 +5943,21 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> > >  	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > >  		cfs_rq->throttled_clock_pelt;
> > >  
> > > -	if (cfs_rq->throttled_clock_self) {
> > > -		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> > > -
> > > -		cfs_rq->throttled_clock_self = 0;
> > > -
> > > -		if (WARN_ON_ONCE((s64)delta < 0))
> > > -			delta = 0;
> > > -
> > > -		cfs_rq->throttled_clock_self_time += delta;
> > > -	}
> > > +	if (cfs_rq->throttled_clock_self)
> > > +		account_cfs_rq_throttle_self(cfs_rq);
> > >  
> > >  	/* Re-enqueue the tasks that have been throttled at this level. */
> > >  	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> > >  		list_del_init(&p->throttle_node);
> > > -		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
> > > +		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
> > >  	}
> > >  
> > >  	/* Add cfs_rq with load or one or more already running entities to the list */
> > >  	if (!cfs_rq_is_decayed(cfs_rq))
> > >  		list_add_leaf_cfs_rq(cfs_rq);
> > >  
> > > +	WARN_ON_ONCE(cfs_rq->h_nr_throttled);
> > > +
> > >  	return 0;
> > >  }
> > >  
> > 
> > I got this warning while testing in our virtual environment:
> > 
> > Any idea?
> > 
> 
> I made a stupid mistake here: I thought when a cfs_rq gets unthrottled,
> it should have no tasks in throttled state, hence I added that check in
> tg_unthrottle_up():
>         WARN_ON_ONCE(cfs_rq->h_nr_throttled);
> 
> But h_nr_throttled tracks hierarchical throttled task number, which
> means if this cfs_rq has descendent cfs_rqs that are still in throttled
> state, its h_nr_throttled can be > 0 when it gets unthrottled.
> 
> I just made a setup to emulate this scenario and can reproduce this
> warning. I guess in your setup, there are multiple cpu.max settings in a
> cgroup hierarchy.

I will have a look.

> 
> It's just the warn_on_once() itself is incorrect, I'll remove it in next
> version, thanks for the report!

You're welcome. IOW: I can ignore the warning. Great.

I meanwhile forward ported the 5.15 based series that you provided to
6.1 and applied massive testing in our lab. It looks very promising up
to now. Our freeze seems solved now.

Thanks for your help! Very much appreciated!

We updated one device in the field today - at customer site. It will
take another week until I can report success. Let's hope.

The tests based on 6.14 are also looking good.

To sum up: This series fixes (or seems to fix, let's wait for one more
week to be sure) a critical RT issue. Is there a chance that, once this
series makes it into mainline, we will see (official) backports? 6.12 or 6.1
would be nice.

I could paste my 6.1 and 6.12 series, if that would help. But as there
will be at least one more iteration that work needs a refresh as well.

Best regards,
Florian



* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-05-07  9:33       ` Florian Bezdeka
@ 2025-05-08  2:45         ` Aaron Lu
  2025-05-08  6:13           ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Aaron Lu @ 2025-05-08  2:45 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, May 07, 2025 at 11:33:42AM +0200, Florian Bezdeka wrote:
> On Wed, 2025-05-07 at 17:09 +0800, Aaron Lu wrote:
... ...
> > It's just the warn_on_once() itself is incorrect, I'll remove it in next
> > version, thanks for the report!
> 
> You're welcome. IOW: I can ignore the warning. Great.
>

Right :-)

> I meanwhile forward ported the 5.15 based series that you provided to
> 6.1 and applied massive testing in our lab. It looks very promising up
> to now. Our freeze seems solved now.
> 

Good to know this.

> Thanks for your help! Very much appreciated!
> 

You are welcome.

> We updated one device in the field today - at customer site. It will
> take another week until I can report success. Let's hope.
> 
> The tests based on 6.14 are also looking good.
> 
> To sum up: This series fixes (or seems to fix, let's wait for one more
> week to be sure) a critical RT issue. Is there a chance that, once this
> makes it into mainline, we will see (official) backports? 6.12 or 6.1
> would be nice.

I don't think there will be official backports even if this series makes
it into mainline, because stable kernels only take fixes while this
series changes throttle behavior dramatically. Of course, this is just
my personal view, and the maintainer will make the final decision.

Thanks,
Aaron

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-05-08  2:45         ` Aaron Lu
@ 2025-05-08  6:13           ` Jan Kiszka
  2025-05-08 13:43             ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2025-05-08  6:13 UTC (permalink / raw)
  To: Aaron Lu, Florian Bezdeka, Sebastian Andrzej Siewior
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou

On 08.05.25 04:45, Aaron Lu wrote:
> On Wed, May 07, 2025 at 11:33:42AM +0200, Florian Bezdeka wrote:
>> To sum up: This series fixes (or seems to fix, let's wait for one more
>> week to be sure) a critical RT issue. Is there a chance that, once this
>> makes it into mainline, we will see (official) backports? 6.12 or 6.1
>> would be nice.
> 
> I don't think there will be official backports even if this series makes
> it into mainline, because stable kernels only take fixes while this
> series changes throttle behavior dramatically. Of course, this is just
> my personal view, and the maintainer will make the final decision.

With 6.12 carrying RT in-tree and this series fixing a serious hard
lock-up of that configuration, a backport to 6.12-stable would be
required IMHO. Backports beyond that should be a topic for the
(separate) rt-stable trees.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time
  2025-05-08  6:13           ` Jan Kiszka
@ 2025-05-08 13:43             ` Steven Rostedt
  0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-05-08 13:43 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Aaron Lu, Florian Bezdeka, Sebastian Andrzej Siewior,
	Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Clark Williams, daniel.wagner, josephtsalisbury,
	lgoncalv, Tom Zanussi, williams, dwagner

On Thu, 8 May 2025 08:13:39 +0200
Jan Kiszka <jan.kiszka@siemens.com> wrote:

> On 08.05.25 04:45, Aaron Lu wrote:
> > On Wed, May 07, 2025 at 11:33:42AM +0200, Florian Bezdeka wrote:  
> >> To sum up: This series fixes (or seems to fix, let's wait for one more
> >> week to be sure) a critical RT issue. Is there a chance that, once this
> >> makes it into mainline, we will see (official) backports? 6.12 or 6.1
> >> would be nice.  
> > 
> > I don't think there will be official backports even if this series makes
> > it into mainline, because stable kernels only take fixes while this
> > series changes throttle behavior dramatically. Of course, this is just
> > my personal view, and the maintainer will make the final decision.  
> 
> With 6.12 carrying RT in-tree and this series fixing a serious hard
> lock-up of that configuration, a backport to 6.12-stable would be
> required IMHO. Backports beyond that should be a topic for the
> (separate) rt-stable trees.
>

Agreed, and I'm adding the stable RT maintainers as well in case this needs
to go earlier than 6.12.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2025-05-08 13:43 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-04-14  3:58   ` K Prateek Nayak
2025-04-14 11:55     ` Aaron Lu
2025-04-14 13:37       ` K Prateek Nayak
2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
2025-04-14  8:54   ` Florian Bezdeka
2025-04-14 12:10     ` Aaron Lu
2025-04-14 14:39   ` Florian Bezdeka
2025-04-14 15:02     ` K Prateek Nayak
2025-04-30 10:01   ` Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 3/7] sched/fair: Handle unthrottle " Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 5/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 6/7] sched/fair: fix h_nr_runnable accounting with per-task throttle Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time Aaron Lu
2025-04-09 14:24   ` Aaron Lu
2025-04-17 14:06   ` Florian Bezdeka
2025-04-18  3:15     ` Aaron Lu
2025-04-22 15:03       ` Florian Bezdeka
2025-04-23 11:26         ` Aaron Lu
2025-04-23 12:15           ` Florian Bezdeka
2025-04-24  2:26             ` Aaron Lu
2025-05-07  9:09     ` Aaron Lu
2025-05-07  9:33       ` Florian Bezdeka
2025-05-08  2:45         ` Aaron Lu
2025-05-08  6:13           ` Jan Kiszka
2025-05-08 13:43             ` Steven Rostedt
2025-04-14  3:05 ` [RFC PATCH v2 0/7] Defer throttle when task exits to user Chengming Zhou
2025-04-14 11:47   ` Aaron Lu
2025-04-14  8:54 ` Florian Bezdeka
2025-04-14 12:04   ` Aaron Lu
2025-04-15  5:29     ` Jan Kiszka
2025-04-15  6:05       ` K Prateek Nayak
2025-04-15  6:09         ` Jan Kiszka
2025-04-15  8:45           ` K Prateek Nayak
2025-04-15 10:21             ` Jan Kiszka
2025-04-15 11:14               ` K Prateek Nayak
     [not found]               ` <ec2cea83-07fe-472f-8320-911d215473fd@amd.com>
2025-04-15 15:49                 ` K Prateek Nayak
2025-04-22  2:10                   ` Aaron Lu
2025-04-22  2:54                     ` K Prateek Nayak
2025-04-22 14:54                       ` Florian Bezdeka
2025-04-15 10:34             ` K Prateek Nayak
2025-04-14 16:34 ` K Prateek Nayak
2025-04-15 11:25   ` Aaron Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox