* [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
@ 2025-08-08 20:02 Kuba Piecuch
  2025-08-08 20:02 ` [RFC PATCH 1/3] sched: add bool return value to sched_class::yield_task() Kuba Piecuch
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Kuba Piecuch @ 2025-08-08 20:02 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot, dietmar.eggemann, joshdon
  Cc: linux-kernel, Kuba Piecuch

Problem statement
=================

Calls to sched_yield() can touch data shared with other threads.
Because of this, userspace threads could generate high levels of contention
by calling sched_yield() in a tight loop from multiple cores.

For example, if cputimer is enabled for a process (e.g. through
setitimer(ITIMER_PROF, ...)), all threads of that process
will do an atomic add on the per-process field
p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
account_group_exec_runtime(), which is called inside update_curr().
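
For illustration, the timer can be armed from userspace roughly as follows
(values are arbitrary; it is the act of enabling the cputimer, not the
signal rate, that activates the shared accounting):

#include <sys/time.h>

/* Illustrative only: arm the process-wide profiling timer.  While it is
 * armed, the per-process cputimer is active, so update_curr() on any
 * thread performs the shared atomic add described above. */
static void enable_profiling_timer(void)
{
	struct itimerval it = {
		.it_interval = { .tv_sec = 0, .tv_usec = 10000 },	/* 10 ms */
		.it_value    = { .tv_sec = 0, .tv_usec = 10000 },
	};

	setitimer(ITIMER_PROF, &it, NULL);
}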

Currently, calling sched_yield() will always call update_curr() at least
once in schedule(), and potentially one more time in yield_task_fair().
Thus, userspace threads can generate quite a lot of contention for the
cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
a process call sched_yield() in a tight loop.

At Google, we suspect that this contention led to a full machine lockup in
at least one instance, with ~50% of CPU cycles spent in the atomic add
inside account_group_exec_runtime() according to
`perf record -a -e cycles`.


Proposed solution
=================

To alleviate the contention, this patchset introduces the ability to limit
how frequently a thread is allowed to yield. It adds a new sched debugfs
knob called yield_interval_ns. A thread is allowed to yield at most once
every yield_interval_ns nanoseconds. Subsequent calls to sched_yield()
within the interval simply return without calling schedule().

The default value of the knob is 0, meaning the throttling feature is
disabled by default.
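
For illustration, assuming debugfs is mounted at its usual location, the
knob can be set from a small helper like this (a sketch; the path follows
the standard /sys/kernel/debug/sched/ layout of the other sched knobs):

#include <stdio.h>

/* Sketch: write the throttle interval in nanoseconds.  For example,
 * set_yield_interval_ns(10000) matches the setting used in the numbers
 * below, and set_yield_interval_ns(0) disables throttling again. */
static int set_yield_interval_ns(unsigned int ns)
{
	FILE *f = fopen("/sys/kernel/debug/sched/yield_interval_ns", "w");

	if (!f)
		return -1;
	fprintf(f, "%u\n", ns);
	return fclose(f);
}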


Performance
===========

To test the impact on performance and contention, we used a benchmark
consisting of a process with a profiling timer enabled and N threads
sequentially assigned to logical cores, with 2 threads per core. Each
thread calls sched_yield() in a tight loop. We measured the total number
of unthrottled sched_yield() calls made by all threads within a fixed time.
In addition, we recorded the benchmark runs with
`perf record -a -g -e cycles`. We used the perf data to determine the
percentage of CPU time spent in the problematic atomic add instruction and
used that as a measure of contention.
We ran the benchmark on an Intel Emerald Rapids CPU with 60 physical cores.
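
The per-thread loop is essentially the following sketch (not the exact
benchmark code; thread creation, arming the profiling timer, and the
fixed-duration timing are omitted).  The number of yields that actually
reached schedule() can be derived e.g. from the yld_count schedstat,
which with this series is only incremented on the unthrottled path.

#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

static atomic_long total_yields;
static atomic_int stop;

/* Sketch of one benchmark thread: pin to its assigned logical CPU
 * (two threads share each physical core) and call sched_yield() in a
 * tight loop until the measurement window ends. */
static void benchmark_thread(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* 0 == calling thread */

	while (!atomic_load(&stop)) {
		sched_yield();
		atomic_fetch_add(&total_yields, 1);
	}
}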

With throttling disabled, there was no measurable performance impact on
sched_yield().
Setting the interval to 1ns, which enables the throttling code but doesn't
actually throttle any calls to sched_yield(), results in a 1-3% penalty
for sched_yield() at low thread counts; the penalty disappears quickly as
the thread count grows and contention becomes more of a factor.

With throttling disabled, the CPU time spent in the atomic add instruction
for N=80 threads is roughly 80%.
Setting yield_interval_ns to 10000 decreases that percentage to 1-2%, but
also decreases the total number of unthrottled sched_yield() calls by ~60%.


Alternatives considered
=======================

An alternative we considered was to make the cputime accounting more
scalable by accumulating a thread's cputime locally in task_struct and
flushing it to the process-wide cputime when it reaches some threshold
value or when the thread is taken off the CPU. However, we determined that
the implementation was too intrusive relative to the benefit it provided.
It also wouldn't address other potential points of contention on the
sched_yield() path.


Kuba Piecuch (3):
  sched: add bool return value to sched_class::yield_task()
  sched/fair: don't schedule() in yield if nr_running == 1
  sched/fair: add debugfs knob for yield throttling

 include/linux/sched.h    |  2 ++
 kernel/sched/core.c      |  1 +
 kernel/sched/deadline.c  |  4 +++-
 kernel/sched/debug.c     |  2 ++
 kernel/sched/ext.c       |  4 +++-
 kernel/sched/fair.c      | 35 +++++++++++++++++++++++++++++++++--
 kernel/sched/rt.c        |  3 ++-
 kernel/sched/sched.h     |  4 +++-
 kernel/sched/stop_task.c |  2 +-
 kernel/sched/syscalls.c  |  9 ++++++++-
 10 files changed, 58 insertions(+), 8 deletions(-)

-- 
2.51.0.rc0.155.g4a0f42376b-goog



* [RFC PATCH 1/3] sched: add bool return value to sched_class::yield_task()
  2025-08-08 20:02 [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention Kuba Piecuch
@ 2025-08-08 20:02 ` Kuba Piecuch
  2025-08-08 20:02 ` [RFC PATCH 2/3] sched/fair: don't schedule() in yield if nr_running == 1 Kuba Piecuch
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Kuba Piecuch @ 2025-08-08 20:02 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot, dietmar.eggemann, joshdon
  Cc: linux-kernel, Kuba Piecuch

The return value controls whether do_sched_yield() ends up calling
schedule(). This can be used e.g. as an optimization when we're certain
that the yield won't result in another task being scheduled.

This patch does not change the current behavior, i.e. every call to
do_sched_yield() will still go through to schedule().

Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
---
 kernel/sched/deadline.c  | 4 +++-
 kernel/sched/ext.c       | 4 +++-
 kernel/sched/fair.c      | 6 ++++--
 kernel/sched/rt.c        | 3 ++-
 kernel/sched/sched.h     | 2 +-
 kernel/sched/stop_task.c | 2 +-
 kernel/sched/syscalls.c  | 9 ++++++++-
 7 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 89019a1408264..d8dcb73bd3433 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2194,7 +2194,7 @@ static bool dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
  *   yield_task_dl will indicate that some spare budget
  *   is available for other task instances to use it.
  */
-static void yield_task_dl(struct rq *rq)
+static bool yield_task_dl(struct rq *rq)
 {
 	/*
 	 * We make the task go to sleep until its current deadline by
@@ -2212,6 +2212,8 @@ static void yield_task_dl(struct rq *rq)
 	 * and double the fastpath cost.
 	 */
 	rq_clock_skip_update(rq);
+
+	return true;
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7dd5cbcb7a069..dd0a0b6b7aa05 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2502,7 +2502,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
 	return true;
 }
 
-static void yield_task_scx(struct rq *rq)
+static bool yield_task_scx(struct rq *rq)
 {
 	struct scx_sched *sch = scx_root;
 	struct task_struct *p = rq->curr;
@@ -2511,6 +2511,8 @@ static void yield_task_scx(struct rq *rq)
 		SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, p, NULL);
 	else
 		p->scx.slice = 0;
+
+	return true;
 }
 
 static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb2..c06a2f8290822 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9010,7 +9010,7 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
 /*
  * sched_yield() is very simple
  */
-static void yield_task_fair(struct rq *rq)
+static bool yield_task_fair(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
@@ -9020,7 +9020,7 @@ static void yield_task_fair(struct rq *rq)
 	 * Are we the only task in the tree?
 	 */
 	if (unlikely(rq->nr_running == 1))
-		return;
+		return true;
 
 	clear_buddies(cfs_rq, se);
 
@@ -9037,6 +9037,8 @@ static void yield_task_fair(struct rq *rq)
 	rq_clock_skip_update(rq);
 
 	se->deadline += calc_delta_fair(se->slice, se);
+
+	return true;
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c370335..1fa535457cc40 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1533,9 +1533,10 @@ static void requeue_task_rt(struct rq *rq, struct task_struct *p, int head)
 	}
 }
 
-static void yield_task_rt(struct rq *rq)
+static bool yield_task_rt(struct rq *rq)
 {
 	requeue_task_rt(rq, rq->curr, 0);
+	return true;
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83e3aa9171429..8b2cd54a09942 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2395,7 +2395,7 @@ struct sched_class {
 
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	bool (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
-	void (*yield_task)   (struct rq *rq);
+	bool (*yield_task)   (struct rq *rq);
 	bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
 
 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 058dd42e3d9b5..c6da16a39e08d 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -54,7 +54,7 @@ dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 	return true;
 }
 
-static void yield_task_stop(struct rq *rq)
+static bool yield_task_stop(struct rq *rq)
 {
 	BUG(); /* the stop task should never yield, its pointless. */
 }
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 547c1f05b667e..e708a31c8e313 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1354,11 +1354,18 @@ static void do_sched_yield(void)
 {
 	struct rq_flags rf;
 	struct rq *rq;
+	bool should_resched;
 
 	rq = this_rq_lock_irq(&rf);
 
+	should_resched = current->sched_class->yield_task(rq);
+
+	if (unlikely(!should_resched)) {
+		rq_unlock_irq(rq, &rf);
+		return;
+	}
+
 	schedstat_inc(rq->yld_count);
-	current->sched_class->yield_task(rq);
 
 	preempt_disable();
 	rq_unlock_irq(rq, &rf);
-- 
2.51.0.rc0.155.g4a0f42376b-goog



* [RFC PATCH 2/3] sched/fair: don't schedule() in yield if nr_running == 1
  2025-08-08 20:02 [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention Kuba Piecuch
  2025-08-08 20:02 ` [RFC PATCH 1/3] sched: add bool return value to sched_class::yield_task() Kuba Piecuch
@ 2025-08-08 20:02 ` Kuba Piecuch
  2025-08-08 20:02 ` [RFC PATCH 3/3] sched/fair: add debugfs knob for yield throttling Kuba Piecuch
  2025-08-11  8:36 ` [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: Kuba Piecuch @ 2025-08-08 20:02 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot, dietmar.eggemann, joshdon
  Cc: linux-kernel, Kuba Piecuch

There's no need to schedule() if we know that there are no other tasks
to pick.

Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c06a2f8290822..3f9bfc64e0bc5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9020,7 +9020,7 @@ static bool yield_task_fair(struct rq *rq)
 	 * Are we the only task in the tree?
 	 */
 	if (unlikely(rq->nr_running == 1))
-		return true;
+		return false;
 
 	clear_buddies(cfs_rq, se);
 
-- 
2.51.0.rc0.155.g4a0f42376b-goog



* [RFC PATCH 3/3] sched/fair: add debugfs knob for yield throttling
  2025-08-08 20:02 [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention Kuba Piecuch
  2025-08-08 20:02 ` [RFC PATCH 1/3] sched: add bool return value to sched_class::yield_task() Kuba Piecuch
  2025-08-08 20:02 ` [RFC PATCH 2/3] sched/fair: don't schedule() in yield if nr_running == 1 Kuba Piecuch
@ 2025-08-08 20:02 ` Kuba Piecuch
  2025-08-11  8:36 ` [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: Kuba Piecuch @ 2025-08-08 20:02 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot, dietmar.eggemann, joshdon
  Cc: linux-kernel, Kuba Piecuch

yield_interval_ns specifies the interval within which any given thread
is allowed to yield at most once. Subsequent calls to sched_yield()
within the interval simply return without calling schedule().

Allowing unlimited calls to sched_yield() enables DoS-like behavior:
threads can continually call into schedule(), which results in various
types of contention.

For example, if a process has a profiling timer enabled, every call to
update_curr() results in an atomic add to a shared process-wide variable
p->signal->cputimer->cputime_atomic.sum_exec_runtime, performed in
account_group_exec_runtime().

In a synthetic benchmark consisting of 80 threads (2 per core) calling
sched_yield() in a busy loop with a profiling timer enabled, we have
observed that ~80% of CPU time is spent in the single atomic add
instruction. Setting yield_interval_ns to 10000 lowers that percentage
to 1-2%, at the cost of decreasing the total number of yields that end up
calling schedule() by ~60%. The benchmark was run on an Intel Emerald
Rapids CPU with 60 physical cores.

Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  1 +
 kernel/sched/debug.c  |  2 ++
 kernel/sched/fair.c   | 29 +++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  2 ++
 5 files changed, 36 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa9c5be7a6325..c637025792fc6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -946,6 +946,8 @@ struct task_struct {
 
 	struct sched_info		sched_info;
 
+	ktime_t last_yield;
+
 	struct list_head		tasks;
 #ifdef CONFIG_SMP
 	struct plist_node		pushable_tasks;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df17..acc87c9ff5681 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4493,6 +4493,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	p->on_rq			= 0;
 
+	p->last_yield			= ktime_set(0, 0);
 	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
 	p->se.sum_exec_runtime		= 0;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e0..93d2c988d491d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -512,6 +512,8 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
 
+	debugfs_create_u32("yield_interval_ns", 0644, debugfs_sched, &sysctl_sched_yield_interval);
+
 #ifdef CONFIG_SMP
 	debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
 	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f9bfc64e0bc5..39ca52128f502 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -81,6 +81,18 @@ static unsigned int normalized_sysctl_sched_base_slice	= 700000ULL;
 
 __read_mostly unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * This interval controls how often a given CFS thread can yield.
+ * A given thread can only yield once within this interval.
+ * The throttling is accomplished by making calls to sched_yield() return
+ * without actually calling schedule().
+ * A value of 0 means yields are not throttled.
+ *
+ * (default: 0, units: nanoseconds)
+ */
+__read_mostly unsigned int sysctl_sched_yield_interval;
+
+
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
 	pr_warn("Ignoring the deprecated sched_thermal_decay_shift= option\n");
@@ -9015,6 +9027,7 @@ static bool yield_task_fair(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se;
+	ktime_t now, throttle_end_time;
 
 	/*
 	 * Are we the only task in the tree?
@@ -9024,6 +9037,22 @@ static bool yield_task_fair(struct rq *rq)
 
 	clear_buddies(cfs_rq, se);
 
+	if (unlikely(sysctl_sched_yield_interval)) {
+		/*
+		 * Limit how often a given thread can call schedule() via
+		 * sched_yield() to once every sysctl_sched_yield_interval
+		 * nanoseconds.
+		 */
+		now = ktime_get();
+		throttle_end_time = ktime_add_ns(curr->last_yield,
+						 sysctl_sched_yield_interval);
+
+		if (unlikely(ktime_before(now, throttle_end_time)))
+			return false;
+
+		curr->last_yield = now;
+	}
+
 	update_rq_clock(rq);
 	/*
 	 * Update run-time statistics of the 'current'.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8b2cd54a09942..14e3d90b0df0e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2827,6 +2827,8 @@ extern __read_mostly unsigned int sysctl_sched_migration_cost;
 
 extern unsigned int sysctl_sched_base_slice;
 
+extern __read_mostly unsigned int sysctl_sched_yield_interval;
+
 extern int sysctl_resched_latency_warn_ms;
 extern int sysctl_resched_latency_warn_once;
 
-- 
2.51.0.rc0.155.g4a0f42376b-goog



* Re: [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
  2025-08-08 20:02 [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention Kuba Piecuch
                   ` (2 preceding siblings ...)
  2025-08-08 20:02 ` [RFC PATCH 3/3] sched/fair: add debugfs knob for yield throttling Kuba Piecuch
@ 2025-08-11  8:36 ` Peter Zijlstra
  2025-08-11 13:35   ` Kuba Piecuch
  3 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2025-08-11  8:36 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: mingo, vincent.guittot, dietmar.eggemann, joshdon, linux-kernel

On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> Problem statement
> =================
> 
> Calls to sched_yield() can touch data shared with other threads.
> Because of this, userspace threads could generate high levels of contention
> by calling sched_yield() in a tight loop from multiple cores.
> 
> For example, if cputimer is enabled for a process (e.g. through
> setitimer(ITIMER_PROF, ...)), all threads of that process
> will do an atomic add on the per-process field
> p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
> account_group_exec_runtime(), which is called inside update_curr().
> 
> Currently, calling sched_yield() will always call update_curr() at least
> once in schedule(), and potentially one more time in yield_task_fair().
> Thus, userspace threads can generate quite a lot of contention for the
> cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
> a process call sched_yield() in a tight loop.
> 
> At Google, we suspect that this contention led to a full machine lockup in
> at least one instance, with ~50% of CPU cycles spent in the atomic add
> inside account_group_exec_runtime() according to
> `perf record -a -e cycles`.

I've gotta ask, WTH is your userspace calling yield() so much?


* Re: [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
  2025-08-11  8:36 ` [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention Peter Zijlstra
@ 2025-08-11 13:35   ` Kuba Piecuch
  2025-08-14 14:53     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Kuba Piecuch @ 2025-08-11 13:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, joshdon, linux-kernel

On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> > Problem statement
> > =================
> >
> > Calls to sched_yield() can touch data shared with other threads.
> > Because of this, userspace threads could generate high levels of contention
> > by calling sched_yield() in a tight loop from multiple cores.
> >
> > For example, if cputimer is enabled for a process (e.g. through
> > setitimer(ITIMER_PROF, ...)), all threads of that process
> > will do an atomic add on the per-process field
> > p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
> > account_group_exec_runtime(), which is called inside update_curr().
> >
> > Currently, calling sched_yield() will always call update_curr() at least
> > once in schedule(), and potentially one more time in yield_task_fair().
> > Thus, userspace threads can generate quite a lot of contention for the
> > cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
> > a process call sched_yield() in a tight loop.
> >
> > At Google, we suspect that this contention led to a full machine lockup in
> > at least one instance, with ~50% of CPU cycles spent in the atomic add
> > inside account_group_exec_runtime() according to
> > `perf record -a -e cycles`.
>
> I've gotta ask, WTH is your userspace calling yield() so much?

The code calling sched_yield() was in the wait loop for a spinlock. It
would repeatedly yield until the compare-and-swap instruction succeeded
in acquiring the lock. This code runs in the SIGPROF handler.
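
A minimal reconstruction of that pattern (not the actual code) is:

#include <sched.h>
#include <stdatomic.h>

static atomic_int lock;		/* 0 = unlocked, 1 = locked */

/* Reconstruction of the wait loop described above: spin on a
 * compare-and-swap and yield the CPU between attempts.  Running this
 * from the SIGPROF handler funnels every profiling tick into
 * sched_yield(), and each of those calls goes through schedule() and
 * update_curr(). */
static void spin_lock_yield(void)
{
	int expected = 0;

	while (!atomic_compare_exchange_weak(&lock, &expected, 1)) {
		expected = 0;
		sched_yield();
	}
}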


* Re: [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
  2025-08-11 13:35   ` Kuba Piecuch
@ 2025-08-14 14:53     ` Peter Zijlstra
  2025-08-17 14:29       ` David Laight
  2025-08-19 14:08       ` Kuba Piecuch
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2025-08-14 14:53 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: mingo, vincent.guittot, dietmar.eggemann, joshdon, linux-kernel

On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> > > Problem statement
> > > =================
> > >
> > > Calls to sched_yield() can touch data shared with other threads.
> > > Because of this, userspace threads could generate high levels of contention
> > > by calling sched_yield() in a tight loop from multiple cores.
> > >
> > > For example, if cputimer is enabled for a process (e.g. through
> > > setitimer(ITIMER_PROF, ...)), all threads of that process
> > > will do an atomic add on the per-process field
> > > p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
> > > account_group_exec_runtime(), which is called inside update_curr().
> > >
> > > Currently, calling sched_yield() will always call update_curr() at least
> > > once in schedule(), and potentially one more time in yield_task_fair().
> > > Thus, userspace threads can generate quite a lot of contention for the
> > > cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
> > > a process call sched_yield() in a tight loop.
> > >
> > > At Google, we suspect that this contention led to a full machine lockup in
> > > at least one instance, with ~50% of CPU cycles spent in the atomic add
> > > inside account_group_exec_runtime() according to
> > > `perf record -a -e cycles`.
> >
> > I've gotta ask, WTH is your userspace calling yield() so much?
> 
> The code calling sched_yield() was in the wait loop for a spinlock. It
> would repeatedly yield until the compare-and-swap instruction succeeded
> in acquiring the lock. This code runs in the SIGPROF handler.

Well, then don't do that... userspace spinlocks are terrible, and
bashing yield like that isn't helpful either.

Throttling yield seems like entirely the wrong thing to do. Yes, yield()
is poorly defined (strictly speaking UB for anything not FIFO/RR) but
making it actively worse doesn't seem helpful.

The whole itimer thing is not scalable -- blaming that on yield seems
hardly fair.

Why not use timer_create(), with CLOCK_THREAD_CPUTIME_ID and
SIGEV_SIGNAL instead?
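
For concreteness, a sketch of that alternative, which each thread would
arm for itself (error handling trimmed):

#include <signal.h>
#include <time.h>

/* Sketch: a per-thread CPU-time timer instead of the process-wide
 * ITIMER_PROF, so the expiry tracking does not funnel every thread's
 * update_curr() into one shared atomic counter. */
static int arm_thread_cputime_timer(timer_t *out, long interval_ns)
{
	struct sigevent sev = {
		.sigev_notify = SIGEV_SIGNAL,
		.sigev_signo  = SIGPROF,
	};
	struct itimerspec its = {
		.it_interval.tv_nsec = interval_ns,
		.it_value.tv_nsec    = interval_ns,
	};

	if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, out))
		return -1;
	return timer_settime(*out, 0, &its, NULL);
}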


* Re: [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
  2025-08-14 14:53     ` Peter Zijlstra
@ 2025-08-17 14:29       ` David Laight
  2025-08-19 14:08       ` Kuba Piecuch
  1 sibling, 0 replies; 10+ messages in thread
From: David Laight @ 2025-08-17 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kuba Piecuch, mingo, vincent.guittot, dietmar.eggemann, joshdon,
	linux-kernel

On Thu, 14 Aug 2025 16:53:08 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> > On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:  
...
> > The code calling sched_yield() was in the wait loop for a spinlock. It
> > would repeatedly yield until the compare-and-swap instruction succeeded
> > in acquiring the lock. This code runs in the SIGPROF handler.  
> 
> Well, then don't do that... userspace spinlocks are terrible, and
> bashing yield like that isn't helpful either.

All it takes is for the kernel to take a hardware interrupt while your
'spin lock' is held, and any other thread trying to acquire the
lock will sit at 100% cpu until all the interrupt work finishes.
A typical ethernet interrupt will schedule more work from a softirq
context, and with non-threaded napi you have to wait for that to finish
too. That can all take milliseconds.

The same is true for a futex based lock - but at least the waiting
threads sleep.
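
(Sketch of the waiting side of such a futex-based lock, using the raw
syscall; the unlock/wake path and contended-state tracking are omitted:)

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;	/* 0 = free, 1 = held */

/* Waiters sleep in the kernel until the holder wakes them, instead of
 * burning CPU in a yield loop. */
static void futex_lock(void)
{
	int expected = 0;

	while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
		/* Sleep as long as the word still reads "held". */
		syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
		expected = 0;
	}
}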

Pretty much the only solution is to replace the userspace locks with
atomic operations (and hope the atomics make progress).

I'm pretty sure it only makes sense to have spin locks that disable
interrupts.

	David


* Re: [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
  2025-08-14 14:53     ` Peter Zijlstra
  2025-08-17 14:29       ` David Laight
@ 2025-08-19 14:08       ` Kuba Piecuch
  2025-08-20 15:49         ` Kuba Piecuch
  1 sibling, 1 reply; 10+ messages in thread
From: Kuba Piecuch @ 2025-08-19 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, joshdon, linux-kernel,
	david.laight.linux

On Thu, Aug 14, 2025 at 4:53 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> > On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> > > > Problem statement
> > > > =================
> > > >
> > > > Calls to sched_yield() can touch data shared with other threads.
> > > > Because of this, userspace threads could generate high levels of contention
> > > > by calling sched_yield() in a tight loop from multiple cores.
> > > >
> > > > For example, if cputimer is enabled for a process (e.g. through
> > > > setitimer(ITIMER_PROF, ...)), all threads of that process
> > > > will do an atomic add on the per-process field
> > > > p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
> > > > account_group_exec_runtime(), which is called inside update_curr().
> > > >
> > > > Currently, calling sched_yield() will always call update_curr() at least
> > > > once in schedule(), and potentially one more time in yield_task_fair().
> > > > Thus, userspace threads can generate quite a lot of contention for the
> > > > cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
> > > > a process call sched_yield() in a tight loop.
> > > >
> > > > At Google, we suspect that this contention led to a full machine lockup in
> > > > at least one instance, with ~50% of CPU cycles spent in the atomic add
> > > > inside account_group_exec_runtime() according to
> > > > `perf record -a -e cycles`.
> > >
> > > I've gotta ask, WTH is your userspace calling yield() so much?
> >
> > The code calling sched_yield() was in the wait loop for a spinlock. It
> > would repeatedly yield until the compare-and-swap instruction succeeded
> > in acquiring the lock. This code runs in the SIGPROF handler.
>
> Well, then don't do that... userspace spinlocks are terrible, and
> bashing yield like that isn't helpful either.
>
> Throttling yield seems like entirely the wrong thing to do. Yes, yield()
> is poorly defined (strictly speaking UB for anything not FIFO/RR) but
> making it actively worse doesn't seem helpful.
>
> The whole itimer thing is not scalable -- blaming that on yield seems
> hardly fair.
>
> Why not use timer_create(), with CLOCK_THREAD_CPUTIME_ID and
> SIGEV_SIGNAL instead?

I agree that there are userspace changes we can make to reduce contention
and prevent future lockups. What that doesn't address is the potential for
userspace to trigger kernel lockups, maliciously or unintentionally, via
spamming yield(). This patch series introduces a way to reduce contention
and risk of userspace-induced lockups regardless of userspace behavior
-- that's the value proposition.


* Re: [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
  2025-08-19 14:08       ` Kuba Piecuch
@ 2025-08-20 15:49         ` Kuba Piecuch
  0 siblings, 0 replies; 10+ messages in thread
From: Kuba Piecuch @ 2025-08-20 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, joshdon, linux-kernel,
	david.laight.linux

On Tue, Aug 19, 2025 at 4:08 PM Kuba Piecuch <jpiecuch@google.com> wrote:
>
> On Thu, Aug 14, 2025 at 4:53 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> > > On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> > > > > Problem statement
> > > > > =================
> > > > >
> > > > > Calls to sched_yield() can touch data shared with other threads.
> > > > > Because of this, userspace threads could generate high levels of contention
> > > > > by calling sched_yield() in a tight loop from multiple cores.
> > > > >
> > > > > For example, if cputimer is enabled for a process (e.g. through
> > > > > setitimer(ITIMER_PROF, ...)), all threads of that process
> > > > > will do an atomic add on the per-process field
> > > > > p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
> > > > > account_group_exec_runtime(), which is called inside update_curr().
> > > > >
> > > > > Currently, calling sched_yield() will always call update_curr() at least
> > > > > once in schedule(), and potentially one more time in yield_task_fair().
> > > > > Thus, userspace threads can generate quite a lot of contention for the
> > > > > cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
> > > > > a process call sched_yield() in a tight loop.
> > > > >
> > > > > At Google, we suspect that this contention led to a full machine lockup in
> > > > > at least one instance, with ~50% of CPU cycles spent in the atomic add
> > > > > inside account_group_exec_runtime() according to
> > > > > `perf record -a -e cycles`.
> > > >
> > > > I've gotta ask, WTH is your userspace calling yield() so much?
> > >
> > > The code calling sched_yield() was in the wait loop for a spinlock. It
> > > would repeatedly yield until the compare-and-swap instruction succeeded
> > > in acquiring the lock. This code runs in the SIGPROF handler.
> >
> > Well, then don't do that... userspace spinlocks are terrible, and
> > bashing yield like that isn't helpful either.
> >
> > Throttling yield seems like entirely the wrong thing to do. Yes, yield()
> > is poorly defined (strictly speaking UB for anything not FIFO/RR) but
> > making it actively worse doesn't seem helpful.
> >
> > The whole itimer thing is not scalable -- blaming that on yield seems
> > hardly fair.
> >
> > Why not use timer_create(), with CLOCK_THREAD_CPUTIME_ID and
> > SIGEV_SIGNAL instead?
>
> I agree that there are userspace changes we can make to reduce contention
> and prevent future lockups. What that doesn't address is the potential for
> userspace to trigger kernel lockups, maliciously or unintentionally, via
> spamming yield(). This patch series introduces a way to reduce contention
> and risk of userspace-induced lockups regardless of userspace behavior
> -- that's the value proposition.

At a more basic level, we need to agree that there's a kernel issue here
that should be resolved: userspace potentially being able to trigger a hard
lockup via suboptimal/inappropriate use of syscalls.

Not long ago, there was a similar issue involving getrusage() [1]: a
process with many threads was causing hard lockups when the threads were
calling getrusage() too frequently.  You could've said "don't call
getrusage() so much", but that would be addressing a symptom, not the
cause.

Granted, the fix in that case [2] was more elegant and less hacky than
what I'm proposing here, but there are alternative approaches that we can
pursue. We just need to agree that there's a problem in the kernel that
needs to be solved.

[1]: https://lore.kernel.org/all/20240117192534.1327608-1-dylanbhatch@google.com/
[2]: https://lore.kernel.org/all/20240122155023.GA26169@redhat.com/
