* [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing
@ 2024-07-10 9:02 K Prateek Nayak
2024-07-10 9:02 ` [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-10 9:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy, K Prateek Nayak
Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG can
be pulled out of idle by setting TIF_NEED_RESCHED instead of sending an
actual IPI. This affects at least three scenarios, described below:
o A need_resched() check within a call function does not necessarily
indicate a task wakeup since a CPU intending to send an IPI to an
idle target in TIF_POLLING_NRFLAG mode can simply queue the
SMP-call-function and set the TIF_NEED_RESCHED flag to pull the
polling target out of idle. The SMP-call-function will be executed by
flush_smp_call_function_queue() on the idle-exit path. On x86, where
mwait_idle_with_hints() sets TIF_POLLING_NRFLAG for long idling,
this leads to the idle load balancer bailing out early since the
need_resched() check in nohz_csd_func() returns true in most
instances.
o A TIF_POLLING_NRFLAG idling CPU woken up to process an IPI will end
up calling schedule() even in cases where the call function does not
wake up a new task on the idle CPU, thus delaying the idle re-entry.
o Julia Lawall reported a case where a softirq raised from an
SMP-call-function on an idle CPU will wake up ksoftirqd since
flush_smp_call_function_queue() executes in the idle thread's context.
This can throw off the idle load balancer by making the idle CPU
appear busy since ksoftirqd just woke on the said CPU [1].
The three patches address each of the above issues individually, the
first one by removing the need_resched() check in nohz_csd_func() with
a proper justification, the second by introducing a fast-path in
__schedule() to speed up idle re-entry in case TIF_NEED_RESCHED was set
simply to process an IPI that did not perform a wakeup, and the third by
notifying raise_softirq() that the softirq was raised from an
SMP-call-function executed by the idle or migration thread in
flush_smp_call_function_queue(), and waking ksoftirqd is unnecessary
since a call to do_softirq_post_smp_call_flush() will follow soon.
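As a rough illustration of the mechanism all three scenarios share, the
following userspace sketch models how a sender can skip the interrupt
for a TIF_POLLING_NRFLAG target. This is a simplified model of
set_nr_if_polling()/send_call_function_single_ipi(); the flag values,
struct and counter here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TIF_NEED_RESCHED   (1U << 0)
#define TIF_POLLING_NRFLAG (1U << 1)

/* Illustrative stand-in for the target CPU's thread_info */
struct cpu_state {
	atomic_uint thread_flags;
	int ipis_sent;	/* actual interrupts delivered */
};

/* Modeled after set_nr_if_polling(): atomically set NEED_RESCHED only
 * if the target is polling on its thread flags, so the sender may skip
 * the interrupt entirely. */
static bool set_nr_if_polling(struct cpu_state *cpu)
{
	unsigned int val = atomic_load(&cpu->thread_flags);

	do {
		if (!(val & TIF_POLLING_NRFLAG))
			return false;	/* not polling: a real IPI is needed */
	} while (!atomic_compare_exchange_weak(&cpu->thread_flags, &val,
					       val | TIF_NEED_RESCHED));
	return true;
}

static void send_call_function_single_ipi(struct cpu_state *cpu)
{
	if (!set_nr_if_polling(cpu))
		cpu->ipis_sent++;	/* arch IPI would be sent here */
	/* else: the queued csd runs from flush_smp_call_function_queue()
	 * on the target's idle-exit path */
}
```

A polling target is pulled out of idle without any interrupt, and
need_resched() on the target then returns true even though no task was
woken, which is exactly what trips up the checks addressed by this
series.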
Previous attempts to solve these problems involved introducing a new
TIF_NOTIFY_IPI flag to notify a TIF_POLLING_NRFLAG CPU of a pending IPI
and skipping the call to __schedule() in such cases, but it involved
atomic ops which could have performance implications [2]. Instead, Peter
suggested the approach outlined in the first two patches of the series.
The third one is an RFC that (hopefully) solves the problem Julia was
chasing down, related to idle load balancing.
[1] https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/
[2] https://lore.kernel.org/lkml/20240615014256.GQ8774@noisy.programming.kicks-ass.net/
This series is based on tip:sched/core at commit c793a62823d1
("sched/core: Drop spinlocks on contention iff kernel is preemptible")
--
K Prateek Nayak (2):
sched/core: Remove the unnecessary need_resched() check in
nohz_csd_func()
softirq: Avoid waking up ksoftirqd from
flush_smp_call_function_queue()
Peter Zijlstra (1):
sched/core: Introduce SM_IDLE and an idle re-entry fast-path in
__schedule()
kernel/sched/core.c | 40 ++++++++++++++++++++--------------------
kernel/sched/smp.h | 2 ++
kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
kernel/softirq.c | 10 +++++++++-
4 files changed, 63 insertions(+), 21 deletions(-)
base-commit: c793a62823d1ce8f70d9cfc7803e3ea436277cda
--
2.34.1
* [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
2024-07-10 9:02 [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing K Prateek Nayak
@ 2024-07-10 9:02 ` K Prateek Nayak
2024-07-10 14:53 ` Peter Zijlstra
2024-07-23 6:46 ` K Prateek Nayak
2024-07-10 9:02 ` [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() K Prateek Nayak
` (2 subsequent siblings)
3 siblings, 2 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-10 9:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy, K Prateek Nayak
The need_resched() check currently in nohz_csd_func() can be traced
back to its addition in scheduler_ipi() in 2011 via commit
ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance").
Since then, it has travelled quite a bit but it seems like an idle_cpu()
check is currently sufficient to detect the need to bail out of idle
load balancing. To justify this removal, consider the following cases
where idle load balancing could race with a task wakeup:
o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
on wakelist if wakee cpu is idle") a target perceived to be idle
(target_rq->nr_running == 0) will return true for
ttwu_queue_cond(target) which will offload the task wakeup to the idle
target via an IPI.
In all such cases target_rq->ttwu_pending will be set to 1 before
queuing the wake function.
If an idle load balance races here, following scenarios are possible:
- The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
IPI is sent to the CPU to wake it out of idle. If the
nohz_csd_func() queues before sched_ttwu_pending(), the idle load
balance will bail out since idle_cpu(target) returns 0 since
target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
sched_ttwu_pending() it should see rq->nr_running to be non-zero and
bail out of idle load balancing.
- The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
the sender will simply set TIF_NEED_RESCHED for the target to put it
out of idle and flush_smp_call_function_queue() in do_idle() will
execute the call function. Depending on the ordering of the queuing
of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
target_rq->nr_running to be non-zero if there is a genuine task
wakeup racing with the idle load balance kick.
o The waker CPU perceives the target CPU to be busy
(target_rq->nr_running != 0) but the CPU is in fact going idle and due
to a series of unfortunate events, the system reaches a case where the
waker CPU decides to perform the wakeup by itself in ttwu_queue() on
the target CPU but target is concurrently selected for idle load
balance (Can this happen? I'm not sure, but we'll consider its
possibility to estimate the worst case scenario).
ttwu_do_activate() calls enqueue_task() which would increment
"rq->nr_running" post which it calls wakeup_preempt() which is
responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
thing to note in this case is that rq->nr_running is already non-zero
in case of a wakeup before TIF_NEED_RESCHED is set which would
lead to idle_cpu() check returning false.
In all cases, it seems that the need_resched() check is unnecessary
when checking idle_cpu() first since an impending wakeup racing with the
idle load balancer will either set "rq->ttwu_pending" or indicate a
newly woken task via "rq->nr_running".
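The argument can be condensed into a toy model (field names here are
simplified; the real idle_cpu() in kernel/sched/core.c compares rq->curr
against rq->idle and also consults nr_running and ttwu_pending):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the fields idle_cpu() consults */
struct rq {
	unsigned int nr_running;
	int ttwu_pending;
	bool curr_is_idle;	/* stands in for rq->curr == rq->idle */
};

/* Toy idle_cpu(): any racing wakeup is visible through one of these
 * fields, making an additional need_resched() check redundant. */
static bool idle_cpu(const struct rq *rq)
{
	if (!rq->curr_is_idle)
		return false;
	if (rq->nr_running)
		return false;
	if (rq->ttwu_pending)
		return false;
	return true;
}
```

Whether the racing wakeup is queued (ttwu_pending) or enqueued directly
(nr_running), the toy check bails out of the idle load balance on its
own.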
Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the first iteration of Suresh's
patch from 2011 [1], where the condition to raise SCHED_SOFTIRQ was:
sched_ttwu_do_pending(list);
if (unlikely((rq->idle == current) &&
rq->nohz_balance_kick &&
!need_resched()))
raise_softirq_irqoff(SCHED_SOFTIRQ);
However, since this was preceded by sched_ttwu_do_pending(), which is
the equivalent of sched_ttwu_pending() in the current upstream kernel,
the need_resched() check was necessary to catch a newly queued task. Peter
suggested modifying it to:
if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
raise_softirq_irqoff(SCHED_SOFTIRQ);
where idle_cpu() seems to have replaced the "rq->idle == current" check.
However, even back then, the idle_cpu() check would have been sufficient
to have caught the enqueue of a new task and since commit b2a02fc43a1f
("smp: Optimize send_call_function_single_ipi()") overloads the
interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove
the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.
Link: https://lore.kernel.org/all/1317670590.20367.38.camel@twins/ [1]
Link: https://lore.kernel.org/lkml/20240615014521.GR8774@noisy.programming.kicks-ass.net/
Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..1e0c77eac65a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1205,7 +1205,7 @@ static void nohz_csd_func(void *info)
WARN_ON(!(flags & NOHZ_KICK_MASK));
rq->idle_balance = idle_cpu(cpu);
- if (rq->idle_balance && !need_resched()) {
+ if (rq->idle_balance) {
rq->nohz_idle_balance = flags;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}
--
2.34.1
* [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-07-10 9:02 [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing K Prateek Nayak
2024-07-10 9:02 ` [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
@ 2024-07-10 9:02 ` K Prateek Nayak
2024-07-11 8:00 ` Vincent Guittot
2024-07-30 16:13 ` Chen Yu
2024-07-10 9:02 ` [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue() K Prateek Nayak
2024-07-29 2:42 ` [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing Chen Yu
3 siblings, 2 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-10 9:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy, K Prateek Nayak
From: Peter Zijlstra <peterz@infradead.org>
Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
can be pulled out of idle by setting the TIF_NEED_RESCHED flag to service an
IPI without actually sending an interrupt. Even in cases where the IPI
handler does not queue a task on the idle CPU, do_idle() will call
__schedule() since need_resched() returns true in these cases.
Introduce and use SM_IDLE to identify calls to __schedule() from
schedule_idle() and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the idle
task.
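A small sketch of the mode handling this introduces (constants as in
the diff; the helper names are illustrative, the real logic is inlined
in __schedule()): with SM_IDLE defined as -1, a plain "sched_mode > 0"
comparison yields the preemption boolean handed to schedule_debug(),
rcu_note_context_switch() and trace_sched_switch(), while SM_IDLE
itself selects the fast-path when the runqueue is empty:

```c
#include <assert.h>
#include <stdbool.h>

/* Mode constants as introduced by the patch */
#define SM_IDLE		(-1)
#define SM_NONE		0
#define SM_PREEMPT	1
#define SM_RTLOCK_WAIT	2

/* "preempt" boolean derived in __schedule(): true only for the
 * positive (preemption) modes */
static bool sm_preempt(int sched_mode)
{
	return sched_mode > 0;
}

/* Fast-path condition: entered from schedule_idle() and no task was
 * enqueued, so pick_next_task() can be skipped and next = prev */
static bool sm_idle_fast_path(int sched_mode, unsigned int nr_running)
{
	return sched_mode == SM_IDLE && nr_running == 0;
}
```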
With the SM_IDLE fast-path, the time taken to complete a fixed set of
IPIs using ipistorm improves significantly. Following are the numbers
from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
C2 disabled) running ipistorm between CPU8 and CPU16:
cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
==================================================================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.25 [75.11%]
[ kprateek: Commit log and testing ]
Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/core.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e0c77eac65a..417d3ebbdf60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6343,19 +6343,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* Constants for the sched_mode argument of __schedule().
*
* The mode argument allows RT enabled kernels to differentiate a
- * preemption from blocking on an 'sleeping' spin/rwlock. Note that
- * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
- * optimize the AND operation out and just check for zero.
+ * preemption from blocking on an 'sleeping' spin/rwlock.
*/
-#define SM_NONE 0x0
-#define SM_PREEMPT 0x1
-#define SM_RTLOCK_WAIT 0x2
-
-#ifndef CONFIG_PREEMPT_RT
-# define SM_MASK_PREEMPT (~0U)
-#else
-# define SM_MASK_PREEMPT SM_PREEMPT
-#endif
+#define SM_IDLE (-1)
+#define SM_NONE 0
+#define SM_PREEMPT 1
+#define SM_RTLOCK_WAIT 2
/*
* __schedule() is the main scheduler function.
@@ -6396,11 +6389,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*
* WARNING: must be called with preemption disabled!
*/
-static void __sched notrace __schedule(unsigned int sched_mode)
+static void __sched notrace __schedule(int sched_mode)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
unsigned long prev_state;
+ bool preempt = sched_mode > 0;
struct rq_flags rf;
struct rq *rq;
int cpu;
@@ -6409,13 +6403,13 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq = cpu_rq(cpu);
prev = rq->curr;
- schedule_debug(prev, !!sched_mode);
+ schedule_debug(prev, preempt);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
hrtick_clear(rq);
local_irq_disable();
- rcu_note_context_switch(!!sched_mode);
+ rcu_note_context_switch(preempt);
/*
* Make sure that signal_pending_state()->signal_pending() below
@@ -6449,7 +6443,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
* that we form a control dependency vs deactivate_task() below.
*/
prev_state = READ_ONCE(prev->__state);
- if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
+ if (sched_mode == SM_IDLE) {
+ if (!rq->nr_running) {
+ next = prev;
+ goto picked;
+ }
+ } else if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
WRITE_ONCE(prev->__state, TASK_RUNNING);
} else {
@@ -6483,6 +6482,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}
next = pick_next_task(rq, prev, &rf);
+picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
@@ -6523,7 +6523,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
migrate_disable_switch(rq, prev);
psi_sched_switch(prev, next, !task_on_rq_queued(prev));
- trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
+ trace_sched_switch(preempt, prev, next, prev_state);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
@@ -6599,7 +6599,7 @@ static void sched_update_worker(struct task_struct *tsk)
}
}
-static __always_inline void __schedule_loop(unsigned int sched_mode)
+static __always_inline void __schedule_loop(int sched_mode)
{
do {
preempt_disable();
@@ -6644,7 +6644,7 @@ void __sched schedule_idle(void)
*/
WARN_ON_ONCE(current->__state);
do {
- __schedule(SM_NONE);
+ __schedule(SM_IDLE);
} while (need_resched());
}
--
2.34.1
* [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue()
2024-07-10 9:02 [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing K Prateek Nayak
2024-07-10 9:02 ` [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
2024-07-10 9:02 ` [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() K Prateek Nayak
@ 2024-07-10 9:02 ` K Prateek Nayak
2024-07-10 15:05 ` Peter Zijlstra
2024-07-29 2:42 ` [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing Chen Yu
3 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-10 9:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy, K Prateek Nayak
Since commit b2a02fc43a1f4 ("smp: Optimize
send_call_function_single_ipi()"), sending an actual interrupt to an
idle CPU in TIF_POLLING_NRFLAG mode can be avoided by queuing the SMP
call function on the call function queue of the CPU and setting the
TIF_NEED_RESCHED bit in the idle task's thread info. The call function is
handled in the idle exit path when do_idle() calls
flush_smp_call_function_queue().
However, since flush_smp_call_function_queue() is executed in the idle
thread's context, an in_interrupt() check within a call function will
return false. raise_softirq() uses this check to decide whether to wake
ksoftirqd, since a softirq raised from an interrupt context will be
handled at irq exit. In all other cases, raise_softirq() wakes up
ksoftirqd to handle the softirq on !PREEMPT_RT kernels.
Since flush_smp_call_function_queue() calls
do_softirq_post_smp_call_flush(), waking up ksoftirqd is unnecessary:
the softirqs raised by the call functions will be handled soon after the
call function queue is flushed. Bracket
__flush_smp_call_function_queue() within flush_smp_call_function_queue()
with "will_do_softirq_post_flush" and use "do_softirq_pending()" to
notify raise_softirq() of an impending call to do_softirq(), avoiding
the ksoftirqd wakeup.
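The handshake can be modeled in userspace as follows (a single-CPU
simplification; the RFC uses a per-CPU variable, and the real
should_wake_ksoftirqd() lives in kernel/softirq.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Single-CPU model of the RFC's handshake */
static bool will_do_softirq_post_flush;
static unsigned int softirq_pending;
static int ksoftirqd_wakeups;

static bool should_wake_ksoftirqd(void)
{
	/* Skip the wakeup when do_softirq_post_smp_call_flush() is
	 * known to follow shortly */
	return !will_do_softirq_post_flush;
}

static void raise_softirq(unsigned int nr)
{
	softirq_pending |= 1U << nr;
	if (should_wake_ksoftirqd())
		ksoftirqd_wakeups++;
}

static void flush_smp_call_function_queue(void (*csd_func)(void))
{
	will_do_softirq_post_flush = true;
	csd_func();		/* may raise a softirq, e.g. nohz_csd_func() */
	will_do_softirq_post_flush = false;

	if (softirq_pending)
		softirq_pending = 0;	/* do_softirq_post_smp_call_flush() */
}
```

A softirq raised inside the flush is handled immediately afterwards
without a ksoftirqd wakeup, while a softirq raised anywhere else still
wakes ksoftirqd as before.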
Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for the SCHED_SOFTIRQ vector alone) helps observe the
current behavior:
<idle>-0 [000] dN.1. nohz_csd_func: Raise SCHED_SOFTIRQ for idle balance
<idle>-0 [000] dN.4. sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
<idle>-0 [000] .Ns1. softirq_entry: vec=7 [action=SCHED]
<idle>-0 [000] d..2. sched_switch: prev_comm=swapper/0 ==> next_comm=ksoftirqd/0
ksoftirqd/0-16 [000] d..2. sched_switch: prev_comm=ksoftirqd/0 ==> next_comm=swapper/0
ksoftirqd is woken up before the idle thread calls
do_softirq_post_smp_call_flush(), which can make the runqueue appear
busy and prevent the idle load balancer from pulling a task from an
overloaded runqueue towards itself [1]. Following are the observations
with the changes when enabling the same set of events:
<idle>-0 [000] dN.1. 106.134226: nohz_csd_func: Raise SCHED_SOFTIRQ for idle balance
<idle>-0 [000] .Ns1. 106.134227: softirq_entry: vec=7 [action=SCHED]
...
No unnecessary ksoftirqd wakeups are seen from the idle task's context
to service the softirq. During performance testing, it was noticed that
the per-CPU "will_do_softirq_post_flush" variable needs to be defined as
cacheline aligned to minimize the performance overhead of the writes in
flush_smp_call_function_queue(). Following is the IPI throughput
measured using a modified version of ipistorm that performs a fixed set
of IPIs between two CPUs on a dual socket 3rd Generation EPYC system
(2 x 64C/128T) (boost on, C2 disabled) by running ipistorm between CPU8
and CPU16:
cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
==================================================================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.25 [75.11%]
tip:sched/core + SM_IDLE + unaligned var 0.47 [53.74%] *
tip:sched/core + SM_IDLE + aligned var 0.25 [75.04%]
* The version where "will_do_softirq_post_flush" was not cacheline
aligned takes twice as long as the cacheline aligned version to
perform a fixed set of IPIs.
Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/smp.h | 2 ++
kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
kernel/softirq.c | 10 +++++++++-
3 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/smp.h b/kernel/sched/smp.h
index 21ac44428bb0..3731e79fe19b 100644
--- a/kernel/sched/smp.h
+++ b/kernel/sched/smp.h
@@ -9,7 +9,9 @@ extern void sched_ttwu_pending(void *arg);
extern bool call_function_single_prep_ipi(int cpu);
#ifdef CONFIG_SMP
+extern bool do_softirq_pending(void);
extern void flush_smp_call_function_queue(void);
#else
+static inline bool do_softirq_pending(void) { return false; }
static inline void flush_smp_call_function_queue(void) { }
#endif
diff --git a/kernel/smp.c b/kernel/smp.c
index f085ebcdf9e7..2eab5e1d5cef 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -559,6 +559,36 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
}
}
+/* Indicate an impending call to do_softirq_post_smp_call_flush() */
+static DEFINE_PER_CPU_ALIGNED(bool, will_do_softirq_post_flush);
+
+static __always_inline void __set_will_do_softirq_post_flush(void)
+{
+ this_cpu_write(will_do_softirq_post_flush, true);
+}
+
+static __always_inline void __clr_will_do_softirq_post_flush(void)
+{
+ this_cpu_write(will_do_softirq_post_flush, false);
+}
+
+/**
+ * do_softirq_pending - Check if do_softirq_post_smp_call_flush() will
+ * be called after the invocation of
+ * __flush_smp_call_function_queue()
+ *
+ * When flush_smp_call_function_queue() executes in the context of idle,
+ * migration thread, a softirq raised from the smp-call-function ends up
+ * waking ksoftirqd despite an impending softirq processing via
+ * do_softirq_post_smp_call_flush().
+ *
+ * Indicate an impending do_softirq() to should_wake_ksoftirqd() despite
+ * not being in an interrupt context.
+ */
+__always_inline bool do_softirq_pending(void)
+{
+ return this_cpu_read(will_do_softirq_post_flush);
+}
/**
* flush_smp_call_function_queue - Flush pending smp-call-function callbacks
@@ -583,7 +613,9 @@ void flush_smp_call_function_queue(void)
local_irq_save(flags);
/* Get the already pending soft interrupts for RT enabled kernels */
was_pending = local_softirq_pending();
+ __set_will_do_softirq_post_flush();
__flush_smp_call_function_queue(true);
+ __clr_will_do_softirq_post_flush();
if (local_softirq_pending())
do_softirq_post_smp_call_flush(was_pending);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 02582017759a..b39eeed03042 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -34,6 +34,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
+#include "sched/smp.h"
+
/*
- No shared variables, all the data are CPU local.
- If a softirq needs serialization, let it serialize itself
@@ -413,7 +415,13 @@ static inline void ksoftirqd_run_end(void)
static inline bool should_wake_ksoftirqd(void)
{
- return true;
+ /*
+ * Avoid waking up ksoftirqd when a softirq is raised from a
+ * call-function executed by flush_smp_call_function_queue()
+ * in idle, migration thread's context since it'll soon call
+ * do_softirq_post_smp_call_flush().
+ */
+ return !do_softirq_pending();
}
static inline void invoke_softirq(void)
--
2.34.1
* Re: [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
2024-07-10 9:02 ` [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
@ 2024-07-10 14:53 ` Peter Zijlstra
2024-07-10 17:57 ` K Prateek Nayak
2024-07-23 6:46 ` K Prateek Nayak
1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2024-07-10 14:53 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
On Wed, Jul 10, 2024 at 09:02:08AM +0000, K Prateek Nayak wrote:
> The need_resched() check currently in nohz_csd_func() can be traced
> back to its addition in scheduler_ipi() in 2011 via commit
> ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance").
>
> Since then, it has travelled quite a bit but it seems like an idle_cpu()
> check is currently sufficient to detect the need to bail out of idle
> load balancing. To justify this removal, consider the following cases
> where idle load balancing could race with a task wakeup:
>
> o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
> on wakelist if wakee cpu is idle") a target perceived to be idle
> (target_rq->nr_running == 0) will return true for
> ttwu_queue_cond(target) which will offload the task wakeup to the idle
> target via an IPI.
>
> In all such cases target_rq->ttwu_pending will be set to 1 before
> queuing the wake function.
>
> If an idle load balance races here, following scenarios are possible:
>
> - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
> IPI is sent to the CPU to wake it out of idle. If the
> nohz_csd_func() queues before sched_ttwu_pending(), the idle load
> balance will bail out since idle_cpu(target) returns 0 since
> target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
> sched_ttwu_pending() it should see rq->nr_running to be non-zero and
> bail out of idle load balancing.
>
> - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
> the sender will simply set TIF_NEED_RESCHED for the target to put it
> out of idle and flush_smp_call_function_queue() in do_idle() will
> execute the call function. Depending on the ordering of the queuing
> of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
> nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
> target_rq->nr_running to be non-zero if there is a genuine task
> wakeup racing with the idle load balance kick.
For completion sake, we should also consider the !TTWU_QUEUE case, this
configuration is default for PREEMPT_RT, where the wake_list is a source
of non-determinism.
In quick reading I think that case should be fine, since we directly
enqueue remotely and ->nr_running adjusts accordingly, but it is late in
the day and I'm easily mistaken.
> o The waker CPU perceives the target CPU to be busy
> (target_rq->nr_running != 0) but the CPU is in fact going idle and due
> to a series of unfortunate events, the system reaches a case where the
> waker CPU decides to perform the wakeup by itself in ttwu_queue() on
> the target CPU but target is concurrently selected for idle load
> balance (Can this happen? I'm not sure, but we'll consider its
> possibility to estimate the worst case scenario).
>
> ttwu_do_activate() calls enqueue_task() which would increment
> "rq->nr_running" post which it calls wakeup_preempt() which is
> responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
> setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
> thing to note in this case is that rq->nr_running is already non-zero
> in case of a wakeup before TIF_NEED_RESCHED is set which would
> lead to idle_cpu() check returning false.
>
> In all cases, it seems that the need_resched() check is unnecessary
> when checking idle_cpu() first since an impending wakeup racing with the
> idle load balancer will either set "rq->ttwu_pending" or indicate a
> newly woken task via "rq->nr_running".
Right.
> Chasing the reason why this check might have existed in the first place,
> I came across Peter's suggestion on the first iteration of Suresh's
> patch from 2011 [1], where the condition to raise SCHED_SOFTIRQ was:
>
> sched_ttwu_do_pending(list);
>
> if (unlikely((rq->idle == current) &&
> rq->nohz_balance_kick &&
> !need_resched()))
> raise_softirq_irqoff(SCHED_SOFTIRQ);
>
> However, since this was preceded by sched_ttwu_do_pending(), which is
> the equivalent of sched_ttwu_pending() in the current upstream kernel,
> the need_resched() check was necessary to catch a newly queued task. Peter
> suggested modifying it to:
>
> if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
> raise_softirq_irqoff(SCHED_SOFTIRQ);
>
> where idle_cpu() seems to have replaced the "rq->idle == current" check.
> However, even back then, the idle_cpu() check would have been sufficient
> to have caught the enqueue of a new task and since commit b2a02fc43a1f
> ("smp: Optimize send_call_function_single_ipi()") overloads the
> interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove
> the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
> on Peter's suggestion.
... sooo many years ago :-)
> Link: https://lore.kernel.org/all/1317670590.20367.38.camel@twins/ [1]
> Link: https://lore.kernel.org/lkml/20240615014521.GR8774@noisy.programming.kicks-ass.net/
> Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..1e0c77eac65a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1205,7 +1205,7 @@ static void nohz_csd_func(void *info)
> WARN_ON(!(flags & NOHZ_KICK_MASK));
>
> rq->idle_balance = idle_cpu(cpu);
> - if (rq->idle_balance && !need_resched()) {
> + if (rq->idle_balance) {
> rq->nohz_idle_balance = flags;
> raise_softirq_irqoff(SCHED_SOFTIRQ);
> }
> --
> 2.34.1
>
* Re: [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue()
2024-07-10 9:02 ` [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue() K Prateek Nayak
@ 2024-07-10 15:05 ` Peter Zijlstra
2024-07-10 18:20 ` K Prateek Nayak
0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2024-07-10 15:05 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
On Wed, Jul 10, 2024 at 09:02:10AM +0000, K Prateek Nayak wrote:
> diff --git a/kernel/sched/smp.h b/kernel/sched/smp.h
> index 21ac44428bb0..3731e79fe19b 100644
> --- a/kernel/sched/smp.h
> +++ b/kernel/sched/smp.h
> @@ -9,7 +9,9 @@ extern void sched_ttwu_pending(void *arg);
> extern bool call_function_single_prep_ipi(int cpu);
>
> #ifdef CONFIG_SMP
> +extern bool do_softirq_pending(void);
> extern void flush_smp_call_function_queue(void);
> #else
> +static inline bool do_softirq_pending(void) { return false; }
> static inline void flush_smp_call_function_queue(void) { }
> #endif
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f085ebcdf9e7..2eab5e1d5cef 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -559,6 +559,36 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> }
> }
>
> +/* Indicate an impending call to do_softirq_post_smp_call_flush() */
> +static DEFINE_PER_CPU_ALIGNED(bool, will_do_softirq_post_flush);
> +
> +static __always_inline void __set_will_do_softirq_post_flush(void)
> +{
> + this_cpu_write(will_do_softirq_post_flush, true);
> +}
> +
> +static __always_inline void __clr_will_do_softirq_post_flush(void)
> +{
> + this_cpu_write(will_do_softirq_post_flush, false);
> +}
> +
> +/**
> + * do_softirq_pending - Check if do_softirq_post_smp_call_flush() will
> + * be called after the invocation of
> + * __flush_smp_call_function_queue()
> + *
> + * When flush_smp_call_function_queue() executes in the context of idle,
> + * migration thread, a softirq raised from the smp-call-function ends up
> + * waking ksoftirqd despite an impending softirq processing via
> + * do_softirq_post_smp_call_flush().
> + *
> + * Indicate an impending do_softirq() to should_wake_ksoftirqd() despite
> + * not being in an interrupt context.
> + */
> +__always_inline bool do_softirq_pending(void)
> +{
> + return this_cpu_read(will_do_softirq_post_flush);
> +}
>
> /**
> * flush_smp_call_function_queue - Flush pending smp-call-function callbacks
> @@ -583,7 +613,9 @@ void flush_smp_call_function_queue(void)
> local_irq_save(flags);
> /* Get the already pending soft interrupts for RT enabled kernels */
> was_pending = local_softirq_pending();
> + __set_will_do_softirq_post_flush();
> __flush_smp_call_function_queue(true);
> + __clr_will_do_softirq_post_flush();
> if (local_softirq_pending())
> do_softirq_post_smp_call_flush(was_pending);
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 02582017759a..b39eeed03042 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -34,6 +34,8 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/irq.h>
>
> +#include "sched/smp.h"
> +
> /*
> - No shared variables, all the data are CPU local.
> - If a softirq needs serialization, let it serialize itself
> @@ -413,7 +415,13 @@ static inline void ksoftirqd_run_end(void)
>
> static inline bool should_wake_ksoftirqd(void)
> {
> - return true;
> + /*
> + * Avoid waking up ksoftirqd when a softirq is raised from a
> + * call-function executed by flush_smp_call_function_queue()
> + * in idle, migration thread's context since it'll soon call
> + * do_softirq_post_smp_call_flush().
> + */
> + return !do_softirq_pending();
> }
On first reading I wonder why you've not re-used and hooked into the
PREEMPT_RT variant of should_wake_ksoftirqd(). That already has a per
CPU variable to do exactly this.
* Re: [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
2024-07-10 14:53 ` Peter Zijlstra
@ 2024-07-10 17:57 ` K Prateek Nayak
0 siblings, 0 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-10 17:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
Hello Peter,
On 7/10/2024 8:23 PM, Peter Zijlstra wrote:
> On Wed, Jul 10, 2024 at 09:02:08AM +0000, K Prateek Nayak wrote:
>> The need_resched() check currently in nohz_csd_func() can be tracked
>> to have been added in scheduler_ipi() back in 2011 via commit
>> ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")
>>
>> Since then, it has travelled quite a bit but it seems like an idle_cpu()
>> check currently is sufficient to detect the need to bail out from an
>> idle load balancing. To justify this removal, consider all the following
>> case where an idle load balancing could race with a task wakeup:
>>
>> o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
>> on wakelist if wakee cpu is idle") a target perceived to be idle
>> (target_rq->nr_running == 0) will return true for
>> ttwu_queue_cond(target) which will offload the task wakeup to the idle
>> target via an IPI.
>>
>> In all such cases target_rq->ttwu_pending will be set to 1 before
>> queuing the wake function.
>>
>> If an idle load balance races here, following scenarios are possible:
>>
>> - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
>> IPI is sent to the CPU to wake it out of idle. If the
>> nohz_csd_func() queues before sched_ttwu_pending(), the idle load
>> balance will bail out since idle_cpu(target) returns 0 since
>> target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
>> sched_ttwu_pending() it should see rq->nr_running to be non-zero and
>> bail out of idle load balancing.
>>
>> - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
>> the sender will simply set TIF_NEED_RESCHED for the target to put it
>> out of idle and flush_smp_call_function_queue() in do_idle() will
>> execute the call function. Depending on the ordering of the queuing
>> of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
>> nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
>> target_rq->nr_running to be non-zero if there is a genuine task
>> wakeup racing with the idle load balance kick.
>
> For completion sake, we should also consider the !TTWU_QUEUE case, this
> configuration is default for PREEMPT_RT, where the wake_list is a source
> of non-determinism.
>
> In quick reading I think that case should be fine, since we directly
> enqueue remotely and ->nr_running adjusts accordingly, but it is late in
> the day and I'm easily mistaken.
From what I've seen, an enqueue will always update "rq->nr_running"
before setting the "NEED_RESCHED" flag, but I'll go confirm that again
and report back in case that turns out to be false.
>
>> o The waker CPU perceives the target CPU to be busy
>> (targer_rq->nr_running != 0) but the CPU is in fact going idle and due
>> to a series of unfortunate events, the system reaches a case where the
>> waker CPU decides to perform the wakeup by itself in ttwu_queue() on
>> the target CPU but target is concurrently selected for idle load
>> balance (Can this happen? I'm not sure, but we'll consider its
>> possibility to estimate the worst case scenario).
>>
>> ttwu_do_activate() calls enqueue_task() which would increment
>> "rq->nr_running" post which it calls wakeup_preempt() which is
>> responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
>> setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
>> thing to note in this case is that rq->nr_running is already non-zero
>> in case of a wakeup before TIF_NEED_RESCHED is set which would
>> lead to idle_cpu() check returning false.
>>
>> In all cases, it seems that need_resched() check is unnecessary when
>> checking for idle_cpu() first since an impending wakeup racing with idle
>> load balancer will either set the "rq->ttwu_pending" or indicate a newly
>> woken task via "rq->nr_running".
>
> Right.
>
>> [..snip..]
--
Thanks and Regards,
Prateek
* Re: [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue()
2024-07-10 15:05 ` Peter Zijlstra
@ 2024-07-10 18:20 ` K Prateek Nayak
2024-07-23 4:50 ` K Prateek Nayak
0 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-10 18:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
Hello Peter,
Thank you for the feedback.
On 7/10/2024 8:35 PM, Peter Zijlstra wrote:
> On Wed, Jul 10, 2024 at 09:02:10AM +0000, K Prateek Nayak wrote:
>
>> [..snip..]
>
> On first reading I wonder why you've not re-used and hooked into the
> PREEMPT_RT variant of should_wake_ksoftirqd(). That already has a per
> CPU variable to do exactly this.
With this RFC, I intended to check if everyone was onboard with the idea
and of the use-case. One caveat with re-using the existing
"softirq_ctrl.cnt" hook that PREEMPT_RT uses is that we'll need to
expose the functions that increment and decrement it, for it to be used
in kernel/smp.c. I'll make those changes in v2 and we can see if there
are sufficient WARN_ON() to catch any incorrect usage in !PREEMPT_RT
case.
--
Thanks and Regards,
Prateek
* Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-07-10 9:02 ` [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() K Prateek Nayak
@ 2024-07-11 8:00 ` Vincent Guittot
2024-07-11 9:19 ` Peter Zijlstra
2024-07-30 16:13 ` Chen Yu
1 sibling, 1 reply; 18+ messages in thread
From: Vincent Guittot @ 2024-07-11 8:00 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
On Wed, 10 Jul 2024 at 11:03, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> From: Peter Zijlstra <peterz@infradead.org>
>
> Since commit b2a02fc43a1f ("smp: Optimize
> send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
> can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
> IPI without actually sending an interrupt. Even in cases where the IPI
> handler does not queue a task on the idle CPU, do_idle() will call
> __schedule() since need_resched() returns true in these cases.
>
> Introduce and use SM_IDLE to identify call to __schedule() from
> schedule_idle() and shorten the idle re-entry time by skipping
> pick_next_task() when nr_running is 0 and the previous task is the idle
> task.
>
> With the SM_IDLE fast-path, the time taken to complete a fixed set of
> IPIs using ipistorm improves significantly. Following are the numbers
> from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
> C2 disabled) running ipistorm between CPU8 and CPU16:
>
> cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
>
> ==================================================================
> Test : ipistorm (modified)
> Units : Normalized runtime
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> kernel: time [pct imp]
> tip:sched/core 1.00 [baseline]
> tip:sched/core + SM_IDLE 0.25 [75.11%]
>
> [ kprateek: Commit log and testing ]
>
> Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/core.c | 38 +++++++++++++++++++-------------------
> 1 file changed, 19 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1e0c77eac65a..417d3ebbdf60 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6343,19 +6343,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> * Constants for the sched_mode argument of __schedule().
> *
> * The mode argument allows RT enabled kernels to differentiate a
> - * preemption from blocking on an 'sleeping' spin/rwlock. Note that
> - * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
> - * optimize the AND operation out and just check for zero.
> + * preemption from blocking on an 'sleeping' spin/rwlock.
> */
> -#define SM_NONE 0x0
> -#define SM_PREEMPT 0x1
> -#define SM_RTLOCK_WAIT 0x2
> -
> -#ifndef CONFIG_PREEMPT_RT
> -# define SM_MASK_PREEMPT (~0U)
> -#else
> -# define SM_MASK_PREEMPT SM_PREEMPT
> -#endif
> +#define SM_IDLE (-1)
> +#define SM_NONE 0
> +#define SM_PREEMPT 1
> +#define SM_RTLOCK_WAIT 2
>
> /*
> * __schedule() is the main scheduler function.
> @@ -6396,11 +6389,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> *
> * WARNING: must be called with preemption disabled!
> */
> -static void __sched notrace __schedule(unsigned int sched_mode)
> +static void __sched notrace __schedule(int sched_mode)
> {
> struct task_struct *prev, *next;
> unsigned long *switch_count;
> unsigned long prev_state;
> + bool preempt = sched_mode > 0;
> struct rq_flags rf;
> struct rq *rq;
> int cpu;
> @@ -6409,13 +6403,13 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> rq = cpu_rq(cpu);
> prev = rq->curr;
>
> - schedule_debug(prev, !!sched_mode);
> + schedule_debug(prev, preempt);
>
> if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
> hrtick_clear(rq);
>
> local_irq_disable();
> - rcu_note_context_switch(!!sched_mode);
> + rcu_note_context_switch(preempt);
>
> /*
> * Make sure that signal_pending_state()->signal_pending() below
> @@ -6449,7 +6443,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> * that we form a control dependency vs deactivate_task() below.
> */
> prev_state = READ_ONCE(prev->__state);
> - if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
> + if (sched_mode == SM_IDLE) {
> + if (!rq->nr_running) {
> + next = prev;
> + goto picked;
> + }
> + } else if (!preempt && prev_state) {
With CONFIG_PREEMPT_RT, it was only for SM_PREEMPT but not for SM_RTLOCK_WAIT
> if (signal_pending_state(prev_state, prev)) {
> WRITE_ONCE(prev->__state, TASK_RUNNING);
> } else {
> @@ -6483,6 +6482,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> }
>
> next = pick_next_task(rq, prev, &rf);
> +picked:
> clear_tsk_need_resched(prev);
> clear_preempt_need_resched();
> #ifdef CONFIG_SCHED_DEBUG
> @@ -6523,7 +6523,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> migrate_disable_switch(rq, prev);
> psi_sched_switch(prev, next, !task_on_rq_queued(prev));
>
> - trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
> + trace_sched_switch(preempt, prev, next, prev_state);
>
> /* Also unlocks the rq: */
> rq = context_switch(rq, prev, next, &rf);
> @@ -6599,7 +6599,7 @@ static void sched_update_worker(struct task_struct *tsk)
> }
> }
>
> -static __always_inline void __schedule_loop(unsigned int sched_mode)
> +static __always_inline void __schedule_loop(int sched_mode)
> {
> do {
> preempt_disable();
> @@ -6644,7 +6644,7 @@ void __sched schedule_idle(void)
> */
> WARN_ON_ONCE(current->__state);
> do {
> - __schedule(SM_NONE);
> + __schedule(SM_IDLE);
> } while (need_resched());
> }
>
> --
> 2.34.1
>
* Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-07-11 8:00 ` Vincent Guittot
@ 2024-07-11 9:19 ` Peter Zijlstra
2024-07-11 13:14 ` Vincent Guittot
0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2024-07-11 9:19 UTC (permalink / raw)
To: Vincent Guittot
Cc: K Prateek Nayak, Ingo Molnar, Juri Lelli, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
On Thu, Jul 11, 2024 at 10:00:15AM +0200, Vincent Guittot wrote:
> On Wed, 10 Jul 2024 at 11:03, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1e0c77eac65a..417d3ebbdf60 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6343,19 +6343,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > * Constants for the sched_mode argument of __schedule().
> > *
> > * The mode argument allows RT enabled kernels to differentiate a
> > - * preemption from blocking on an 'sleeping' spin/rwlock. Note that
> > - * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
> > - * optimize the AND operation out and just check for zero.
> > + * preemption from blocking on an 'sleeping' spin/rwlock.
> > */
> > -#define SM_NONE 0x0
> > -#define SM_PREEMPT 0x1
> > -#define SM_RTLOCK_WAIT 0x2
> > -
> > -#ifndef CONFIG_PREEMPT_RT
> > -# define SM_MASK_PREEMPT (~0U)
> > -#else
> > -# define SM_MASK_PREEMPT SM_PREEMPT
> > -#endif
> > +#define SM_IDLE (-1)
> > +#define SM_NONE 0
> > +#define SM_PREEMPT 1
> > +#define SM_RTLOCK_WAIT 2
> >
> > /*
> > * __schedule() is the main scheduler function.
> > @@ -6396,11 +6389,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > *
> > * WARNING: must be called with preemption disabled!
> > */
> > -static void __sched notrace __schedule(unsigned int sched_mode)
> > +static void __sched notrace __schedule(int sched_mode)
> > {
> > struct task_struct *prev, *next;
> > unsigned long *switch_count;
> > unsigned long prev_state;
> > + bool preempt = sched_mode > 0;
> > struct rq_flags rf;
> > struct rq *rq;
> > int cpu;
> > @@ -6409,13 +6403,13 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> > rq = cpu_rq(cpu);
> > prev = rq->curr;
> >
> > - schedule_debug(prev, !!sched_mode);
> > + schedule_debug(prev, preempt);
> >
> > if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
> > hrtick_clear(rq);
> >
> > local_irq_disable();
> > - rcu_note_context_switch(!!sched_mode);
> > + rcu_note_context_switch(preempt);
> >
> > /*
> > * Make sure that signal_pending_state()->signal_pending() below
> > @@ -6449,7 +6443,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> > * that we form a control dependency vs deactivate_task() below.
> > */
> > prev_state = READ_ONCE(prev->__state);
> > - if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
> > + if (sched_mode == SM_IDLE) {
> > + if (!rq->nr_running) {
> > + next = prev;
> > + goto picked;
> > + }
> > + } else if (!preempt && prev_state) {
>
> With CONFIG_PREEMPT_RT, it was only for SM_PREEMPT but not for SM_RTLOCK_WAIT
Bah, yes. But then schedule_debug() and rcu_note_context_switch() have
an argument that is called 'preempt' but is set for SM_RTLOCK_WAIT.
Now, I think the RCU thing is actually correct here; it doesn't want to
consider SM_RTLOCK_WAIT as a voluntary schedule point, because spinlocks
don't either. But it is confusing as heck.
We can either write things like:
} else if (sched_mode != SM_PREEMPT && prev_state) {
or do silly things like:
#define SM_IDLE (-16)
keep the SM_MASK_PREEMPT trickery and do:
} else if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
Not sure that is actually going to matter at this point though.
* Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-07-11 9:19 ` Peter Zijlstra
@ 2024-07-11 13:14 ` Vincent Guittot
2024-07-12 6:40 ` K Prateek Nayak
0 siblings, 1 reply; 18+ messages in thread
From: Vincent Guittot @ 2024-07-11 13:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: K Prateek Nayak, Ingo Molnar, Juri Lelli, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
On Thu, 11 Jul 2024 at 11:19, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jul 11, 2024 at 10:00:15AM +0200, Vincent Guittot wrote:
> > On Wed, 10 Jul 2024 at 11:03, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 1e0c77eac65a..417d3ebbdf60 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -6343,19 +6343,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > * Constants for the sched_mode argument of __schedule().
> > > *
> > > * The mode argument allows RT enabled kernels to differentiate a
> > > - * preemption from blocking on an 'sleeping' spin/rwlock. Note that
> > > - * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
> > > - * optimize the AND operation out and just check for zero.
> > > + * preemption from blocking on an 'sleeping' spin/rwlock.
> > > */
> > > -#define SM_NONE 0x0
> > > -#define SM_PREEMPT 0x1
> > > -#define SM_RTLOCK_WAIT 0x2
> > > -
> > > -#ifndef CONFIG_PREEMPT_RT
> > > -# define SM_MASK_PREEMPT (~0U)
> > > -#else
> > > -# define SM_MASK_PREEMPT SM_PREEMPT
> > > -#endif
> > > +#define SM_IDLE (-1)
> > > +#define SM_NONE 0
> > > +#define SM_PREEMPT 1
> > > +#define SM_RTLOCK_WAIT 2
> > >
> > > /*
> > > * __schedule() is the main scheduler function.
> > > @@ -6396,11 +6389,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > *
> > > * WARNING: must be called with preemption disabled!
> > > */
> > > -static void __sched notrace __schedule(unsigned int sched_mode)
> > > +static void __sched notrace __schedule(int sched_mode)
> > > {
> > > struct task_struct *prev, *next;
> > > unsigned long *switch_count;
> > > unsigned long prev_state;
> > > + bool preempt = sched_mode > 0;
> > > struct rq_flags rf;
> > > struct rq *rq;
> > > int cpu;
> > > @@ -6409,13 +6403,13 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> > > rq = cpu_rq(cpu);
> > > prev = rq->curr;
> > >
> > > - schedule_debug(prev, !!sched_mode);
> > > + schedule_debug(prev, preempt);
> > >
> > > if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
> > > hrtick_clear(rq);
> > >
> > > local_irq_disable();
> > > - rcu_note_context_switch(!!sched_mode);
> > > + rcu_note_context_switch(preempt);
> > >
> > > /*
> > > * Make sure that signal_pending_state()->signal_pending() below
> > > @@ -6449,7 +6443,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> > > * that we form a control dependency vs deactivate_task() below.
> > > */
> > > prev_state = READ_ONCE(prev->__state);
> > > - if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
> > > + if (sched_mode == SM_IDLE) {
> > > + if (!rq->nr_running) {
> > > + next = prev;
> > > + goto picked;
> > > + }
> > > + } else if (!preempt && prev_state) {
> >
> > With CONFIG_PREEMPT_RT, it was only for SM_PREEMPT but not for SM_RTLOCK_WAIT
>
> Bah, yes. But then schedule_debug() and rcu_note_context_switch() have
> an argument that is called 'preempt' but is set for SM_RTLOCK_WAIT.
>
> Now, I think the RCU thing is actually correct here; it doesn't want to
> consider SM_RTLOCK_WAIT as a voluntary schedule point, because spinlocks
> don't either. But it is confusing as heck.
>
> We can either write things like:
>
> } else if (sched_mode != SM_PREEMPT && prev_state) {
this would work with something like below
#ifdef CONFIG_PREEMPT_RT
# define SM_RTLOCK_WAIT 2
#else
# define SM_RTLOCK_WAIT SM_PREEMPT
#endif
>
> or do silly things like:
>
> #define SM_IDLE (-16)
>
> keep the SM_MASK_PREEMPT trickery and do:
>
> } else if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
>
> Not sure that is actually going to matter at this point though.
* Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-07-11 13:14 ` Vincent Guittot
@ 2024-07-12 6:40 ` K Prateek Nayak
0 siblings, 0 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-12 6:40 UTC (permalink / raw)
To: Vincent Guittot, Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, linux-kernel, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
Hello Vincent, Peter,
On 7/11/2024 6:44 PM, Vincent Guittot wrote:
> On Thu, 11 Jul 2024 at 11:19, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Jul 11, 2024 at 10:00:15AM +0200, Vincent Guittot wrote:
>>> On Wed, 10 Jul 2024 at 11:03, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 1e0c77eac65a..417d3ebbdf60 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -6343,19 +6343,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>>> * Constants for the sched_mode argument of __schedule().
>>>> *
>>>> * The mode argument allows RT enabled kernels to differentiate a
>>>> - * preemption from blocking on an 'sleeping' spin/rwlock. Note that
>>>> - * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
>>>> - * optimize the AND operation out and just check for zero.
>>>> + * preemption from blocking on an 'sleeping' spin/rwlock.
>>>> */
>>>> -#define SM_NONE 0x0
>>>> -#define SM_PREEMPT 0x1
>>>> -#define SM_RTLOCK_WAIT 0x2
>>>> -
>>>> -#ifndef CONFIG_PREEMPT_RT
>>>> -# define SM_MASK_PREEMPT (~0U)
>>>> -#else
>>>> -# define SM_MASK_PREEMPT SM_PREEMPT
>>>> -#endif
>>>> +#define SM_IDLE (-1)
>>>> +#define SM_NONE 0
>>>> +#define SM_PREEMPT 1
>>>> +#define SM_RTLOCK_WAIT 2
>>>>
>>>> /*
>>>> * __schedule() is the main scheduler function.
>>>> @@ -6396,11 +6389,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>>> *
>>>> * WARNING: must be called with preemption disabled!
>>>> */
>>>> -static void __sched notrace __schedule(unsigned int sched_mode)
>>>> +static void __sched notrace __schedule(int sched_mode)
>>>> {
>>>> struct task_struct *prev, *next;
>>>> unsigned long *switch_count;
>>>> unsigned long prev_state;
>>>> + bool preempt = sched_mode > 0;
>>>> struct rq_flags rf;
>>>> struct rq *rq;
>>>> int cpu;
>>>> @@ -6409,13 +6403,13 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>>>> rq = cpu_rq(cpu);
>>>> prev = rq->curr;
>>>>
>>>> - schedule_debug(prev, !!sched_mode);
>>>> + schedule_debug(prev, preempt);
>>>>
>>>> if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
>>>> hrtick_clear(rq);
>>>>
>>>> local_irq_disable();
>>>> - rcu_note_context_switch(!!sched_mode);
>>>> + rcu_note_context_switch(preempt);
>>>>
>>>> /*
>>>> * Make sure that signal_pending_state()->signal_pending() below
>>>> @@ -6449,7 +6443,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>>>> * that we form a control dependency vs deactivate_task() below.
>>>> */
>>>> prev_state = READ_ONCE(prev->__state);
>>>> - if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
>>>> + if (sched_mode == SM_IDLE) {
>>>> + if (!rq->nr_running) {
>>>> + next = prev;
>>>> + goto picked;
>>>> + }
>>>> + } else if (!preempt && prev_state) {
>>>
>>> With CONFIG_PREEMPT_RT, it was only for SM_PREEMPT but not for SM_RTLOCK_WAIT
>>
>> Bah, yes. But then schedule_debug() and rcu_note_context_switch() have
>> an argument that is called 'preempt' but is set for SM_RTLOCK_WAIT.
>>
>> Now, I think the RCU thing is actually correct here; it doesn't want to
>> consider SM_RTLOCK_WAIT as a voluntary schedule point, because spinlocks
>> don't either. But it is confusing as heck.
>>
>> We can either write things like:
>>
>> } else if (sched_mode != SM_PREEMPT && prev_state) {
>
> this would work with something like below
>
> #ifdef CONFIG_PREEMPT_RT
> # define SM_RTLOCK_WAIT 2
> #else
> # define SM_RTLOCK_WAIT SM_PREEMPT
> #endif
Since "SM_RTLOCK_WAIT" is only used by "schedule_rtlock()" which is only
defined for PREEMPT_RT kernels (from a quick grep on linux-6.10.y-rt),
it should just work (famous last words) and we can perhaps skip the else
part too?
With this patch, we need to have the following view of what "preempt"
should be for the components in __schedule() looking at "sched_mode":
                     schedule_debug()/            SM_MASK_PREEMPT check/
                     rcu_note_context_switch()    trace_sched_switch()

  SM_IDLE            F                            F
  SM_NONE            F                            F
  SM_PREEMPT         T                            T
  SM_RTLOCK_WAIT *   T                            F

  * SM_RTLOCK_WAIT is only used in PREEMPT_RT
>
>>
>> or do silly things like:
... and since we are talking about silly ideas, here is one:
(only build tested on tip:sched/core and linux-6.10.y-rt)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 417d3ebbdf60..d9273af69f9e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6345,10 +6345,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* The mode argument allows RT enabled kernels to differentiate a
* preemption from blocking on an 'sleeping' spin/rwlock.
*/
-#define SM_IDLE (-1)
-#define SM_NONE 0
-#define SM_PREEMPT 1
-#define SM_RTLOCK_WAIT 2
+#ifdef CONFIG_PREEMPT_RT
+#define SM_RTLOCK_WAIT (-2)
+#endif
+#define SM_IDLE 0
+#define SM_NONE 1
+#define SM_PREEMPT 2
/*
* __schedule() is the main scheduler function.
@@ -6391,10 +6393,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
static void __sched notrace __schedule(int sched_mode)
{
+ /*
+ * For PREEMPT_RT kernel, SM_RTLOCK_WAIT is considered as
+ * preemption by schedule_debug() and
+ * rcu_note_context_switch().
+ */
+ bool preempt = (unsigned int) sched_mode > SM_NONE;
struct task_struct *prev, *next;
unsigned long *switch_count;
unsigned long prev_state;
- bool preempt = sched_mode > 0;
struct rq_flags rf;
struct rq *rq;
int cpu;
@@ -6438,6 +6445,14 @@ static void __sched notrace __schedule(int sched_mode)
switch_count = &prev->nivcsw;
+#ifdef CONFIG_PREEMPT_RT
+ /*
+ * PREEMPT_RT kernel do not consider SM_RTLOCK_WAIT as
+ * preemption when looking at prev->state.
+ */
+ preempt = sched_mode > SM_NONE;
+#endif
+
/*
* We must load prev->state once (task_struct::state is volatile), such
* that we form a control dependency vs deactivate_task() below.
--
>>
>> #define SM_IDLE (-16)
>>
>> keep the SM_MASK_PREEMPT trickery and do:
>>
>> } else if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
>>
>> Not sure that is actually going to matter at this point though.
--
Thanks and Regards,
Prateek
* Re: [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue()
2024-07-10 18:20 ` K Prateek Nayak
@ 2024-07-23 4:50 ` K Prateek Nayak
0 siblings, 0 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-23 4:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy
Hello Peter,
On 7/10/2024 11:50 PM, K Prateek Nayak wrote:
> Hello Peter,
>
> Thank you for the feedback.
>
> On 7/10/2024 8:35 PM, Peter Zijlstra wrote:
>> On Wed, Jul 10, 2024 at 09:02:10AM +0000, K Prateek Nayak wrote:
>>
>>> [..snip..]
>>
>> On first reading I wonder why you've not re-used and hooked into the
>> PREEMPT_RT variant of should_wake_ksoftirqd(). That already has a per
>> CPU variable to do exactly this.
>
> With this RFC, I intended to check if everyone was on board with the
> idea and the use-case. One caveat with re-using the existing
> "softirq_ctrl.cnt" hook that PREEMPT_RT uses is that we'll need to
> expose the functions that increment and decrement it for it to be used
> in kernel/smp.c. I'll make those changes in v2 and we can see if there
> are sufficient WARN_ON()s to catch any incorrect usage in the
> !PREEMPT_RT case.
>
>
Reusing the PREEMPT_RT bits, I was able to come up with the approach
below. Following are some nuances with this approach:
- Although I don't believe "set_do_softirq_pending()" can be nested,
  since it is only used from flush_smp_call_function_queue() which is
  called with IRQs disabled, I'm still using an inc/dec pair since I'm
  not sure whether it can nest inside a local_bh_{disable,enable}()
  section or whether those can be called from an SMP-call-function.
- I used the same modified version of ipistorm to test the changes on
top of v6.10-rc6-rt11 with LOCKDEP enabled and did not see any splats
during the run of the benchmark. If there are better tests that
stress this part on RT kernels, I'm all ears.
- I've abandoned any micro-optimization since I see different numbers
  on different machines, and am sticking to the simplest approach as it
  works and is an improvement over the baseline.
I'll leave the diff below:
diff --git a/kernel/sched/smp.h b/kernel/sched/smp.h
index 21ac44428bb0..83f70626ff1e 100644
--- a/kernel/sched/smp.h
+++ b/kernel/sched/smp.h
@@ -9,7 +9,16 @@ extern void sched_ttwu_pending(void *arg);
extern bool call_function_single_prep_ipi(int cpu);
#ifdef CONFIG_SMP
+/*
+ * Used to indicate a pending call to do_softirq() from
+ * flush_smp_call_function_queue()
+ */
+extern void set_do_softirq_pending(void);
+extern void clr_do_softirq_pending(void);
+
extern void flush_smp_call_function_queue(void);
#else
+static inline void set_do_softirq_pending(void) { }
+static inline void clr_do_softirq_pending(void) { }
static inline void flush_smp_call_function_queue(void) { }
#endif
diff --git a/kernel/smp.c b/kernel/smp.c
index f085ebcdf9e7..c191bad912f6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -583,7 +583,9 @@ void flush_smp_call_function_queue(void)
local_irq_save(flags);
/* Get the already pending soft interrupts for RT enabled kernels */
was_pending = local_softirq_pending();
+ set_do_softirq_pending();
__flush_smp_call_function_queue(true);
+ clr_do_softirq_pending();
if (local_softirq_pending())
do_softirq_post_smp_call_flush(was_pending);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 00e32e279fa9..8308687fc7b9 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -88,22 +88,7 @@ EXPORT_PER_CPU_SYMBOL_GPL(hardirqs_enabled);
EXPORT_PER_CPU_SYMBOL_GPL(hardirq_context);
#endif
-/*
- * SOFTIRQ_OFFSET usage:
- *
- * On !RT kernels 'count' is the preempt counter, on RT kernels this applies
- * to a per CPU counter and to task::softirqs_disabled_cnt.
- *
- * - count is changed by SOFTIRQ_OFFSET on entering or leaving softirq
- * processing.
- *
- * - count is changed by SOFTIRQ_DISABLE_OFFSET (= 2 * SOFTIRQ_OFFSET)
- * on local_bh_disable or local_bh_enable.
- *
- * This lets us distinguish between whether we are currently processing
- * softirq and whether we just have bh disabled.
- */
-#ifdef CONFIG_PREEMPT_RT
+#define DO_SOFTIRQ_PENDING_MASK GENMASK(SOFTIRQ_SHIFT - 1, 0)
/*
* RT accounts for BH disabled sections in task::softirqs_disabled_cnt and
@@ -116,16 +101,56 @@ EXPORT_PER_CPU_SYMBOL_GPL(hardirq_context);
*
* The per CPU counter prevents pointless wakeups of ksoftirqd in case that
* the task which is in a softirq disabled section is preempted or blocks.
+ *
+ * The bottom bits of softirq_ctrl::cnt is used to indicate an impending call
+ * to do_softirq() to prevent pointless wakeups of ksoftirqd since the CPU
+ * promises to handle softirqs soon.
*/
struct softirq_ctrl {
+#ifdef CONFIG_PREEMPT_RT
local_lock_t lock;
+#endif
int cnt;
};
static DEFINE_PER_CPU(struct softirq_ctrl, softirq_ctrl) = {
+#ifdef CONFIG_PREEMPT_RT
.lock = INIT_LOCAL_LOCK(softirq_ctrl.lock),
+#endif
};
+inline void set_do_softirq_pending(void)
+{
+ __this_cpu_inc(softirq_ctrl.cnt);
+}
+
+inline void clr_do_softirq_pending(void)
+{
+ __this_cpu_dec(softirq_ctrl.cnt);
+}
+
+static inline bool should_wake_ksoftirqd(void)
+{
+ return !this_cpu_read(softirq_ctrl.cnt);
+}
+
+/*
+ * SOFTIRQ_OFFSET usage:
+ *
+ * On !RT kernels 'count' is the preempt counter, on RT kernels this applies
+ * to a per CPU counter and to task::softirqs_disabled_cnt.
+ *
+ * - count is changed by SOFTIRQ_OFFSET on entering or leaving softirq
+ * processing.
+ *
+ * - count is changed by SOFTIRQ_DISABLE_OFFSET (= 2 * SOFTIRQ_OFFSET)
+ * on local_bh_disable or local_bh_enable.
+ *
+ * This lets us distinguish between whether we are currently processing
+ * softirq and whether we just have bh disabled.
+ */
+#ifdef CONFIG_PREEMPT_RT
+
/**
* local_bh_blocked() - Check for idle whether BH processing is blocked
*
@@ -138,7 +163,7 @@ static DEFINE_PER_CPU(struct softirq_ctrl, softirq_ctrl) = {
*/
bool local_bh_blocked(void)
{
- return __this_cpu_read(softirq_ctrl.cnt) != 0;
+ return (__this_cpu_read(softirq_ctrl.cnt) & SOFTIRQ_MASK) != 0;
}
void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
@@ -155,7 +180,8 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
/* Required to meet the RCU bottomhalf requirements. */
rcu_read_lock();
} else {
- DEBUG_LOCKS_WARN_ON(this_cpu_read(softirq_ctrl.cnt));
+ DEBUG_LOCKS_WARN_ON(this_cpu_read(softirq_ctrl.cnt) &
+ SOFTIRQ_MASK);
}
}
@@ -163,7 +189,7 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
* Track the per CPU softirq disabled state. On RT this is per CPU
* state to allow preemption of bottom half disabled sections.
*/
- newcnt = __this_cpu_add_return(softirq_ctrl.cnt, cnt);
+ newcnt = __this_cpu_add_return(softirq_ctrl.cnt, cnt) & SOFTIRQ_MASK;
/*
* Reflect the result in the task state to prevent recursion on the
* local lock and to make softirq_count() & al work.
@@ -184,7 +210,7 @@ static void __local_bh_enable(unsigned int cnt, bool unlock)
int newcnt;
DEBUG_LOCKS_WARN_ON(current->softirq_disable_cnt !=
- this_cpu_read(softirq_ctrl.cnt));
+ (this_cpu_read(softirq_ctrl.cnt) & SOFTIRQ_MASK));
if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS) && softirq_count() == cnt) {
raw_local_irq_save(flags);
@@ -192,7 +218,7 @@ static void __local_bh_enable(unsigned int cnt, bool unlock)
raw_local_irq_restore(flags);
}
- newcnt = __this_cpu_sub_return(softirq_ctrl.cnt, cnt);
+ newcnt = __this_cpu_sub_return(softirq_ctrl.cnt, cnt) & SOFTIRQ_MASK;
current->softirq_disable_cnt = newcnt;
if (!newcnt && unlock) {
@@ -212,7 +238,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
lockdep_assert_irqs_enabled();
local_irq_save(flags);
- curcnt = __this_cpu_read(softirq_ctrl.cnt);
+ curcnt = __this_cpu_read(softirq_ctrl.cnt) & SOFTIRQ_MASK;
/*
* If this is not reenabling soft interrupts, no point in trying to
@@ -253,7 +279,7 @@ void softirq_preempt(void)
if (WARN_ON_ONCE(!preemptible()))
return;
- if (WARN_ON_ONCE(__this_cpu_read(softirq_ctrl.cnt) != SOFTIRQ_OFFSET))
+ if (WARN_ON_ONCE((__this_cpu_read(softirq_ctrl.cnt) & SOFTIRQ_MASK) != SOFTIRQ_OFFSET))
return;
__local_bh_enable(SOFTIRQ_OFFSET, true);
@@ -282,11 +308,6 @@ static inline void ksoftirqd_run_end(void)
static inline void softirq_handle_begin(void) { }
static inline void softirq_handle_end(void) { }
-static inline bool should_wake_ksoftirqd(void)
-{
- return !this_cpu_read(softirq_ctrl.cnt);
-}
-
static inline void invoke_softirq(void)
{
if (should_wake_ksoftirqd())
@@ -424,11 +445,6 @@ static inline void ksoftirqd_run_end(void)
local_irq_enable();
}
-static inline bool should_wake_ksoftirqd(void)
-{
- return true;
-}
-
static inline void invoke_softirq(void)
{
if (!force_irqthreads() || !__this_cpu_read(ksoftirqd)) {
--
Some of the (softirq_ctrl.cnt & SOFTIRQ_MASK) masking might be
unnecessary but I'm not familiar enough with all these bits to make a
sound call. Any and all comments are appreciated :)
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
2024-07-10 9:02 ` [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
2024-07-10 14:53 ` Peter Zijlstra
@ 2024-07-23 6:46 ` K Prateek Nayak
1 sibling, 0 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-07-23 6:46 UTC (permalink / raw)
To: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior,
Christoph Hellwig, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Paul E. McKenney,
Imran Khan, Leonardo Bras, Guo Ren, Rik van Riel, Tejun Heo,
Cruz Zhao, Lai Jiangshan, Joel Fernandes, Zqiang, Julia Lawall,
Gautham R. Shenoy, Ingo Molnar, Juri Lelli, Vincent Guittot
(+ Thomas, Sebastian, Christoph)
Hello everyone,
Adding folks who were cc'd on
https://lore.kernel.org/all/20220413133024.356509586@linutronix.de/
On 7/10/2024 2:32 PM, K Prateek Nayak wrote:
> The need_resched() check currently in nohz_csd_func() can be traced
> back to the one added in scheduler_ipi() in 2011 via commit
> ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance").
>
> Since then, it has travelled quite a bit, but it seems that an
> idle_cpu() check is now sufficient to detect the need to bail out of
> an idle load balance. To justify this removal, consider all the
> following cases where an idle load balance could race with a task
> wakeup:
>
> o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
> on wakelist if wakee cpu is idle") a target perceived to be idle
> (target_rq->nr_running == 0) will return true for
> ttwu_queue_cond(target) which will offload the task wakeup to the idle
> target via an IPI.
>
> In all such cases target_rq->ttwu_pending will be set to 1 before
> queuing the wake function.
>
> If an idle load balance races here, following scenarios are possible:
>
> - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
> IPI is sent to the CPU to wake it out of idle. If the
> nohz_csd_func() queues before sched_ttwu_pending(), the idle load
> balance will bail out because idle_cpu(target) returns 0 since
> target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
> sched_ttwu_pending() it should see rq->nr_running to be non-zero and
> bail out of idle load balancing.
>
> - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
> the sender will simply set TIF_NEED_RESCHED for the target to put it
> out of idle and flush_smp_call_function_queue() in do_idle() will
> execute the call function. Depending on the ordering of the queuing
> of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
> nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
> target_rq->nr_running to be non-zero if there is a genuine task
> wakeup racing with the idle load balance kick.
>
> o The waker CPU perceives the target CPU to be busy
> (target_rq->nr_running != 0) but the CPU is in fact going idle and due
> to a series of unfortunate events, the system reaches a case where the
> waker CPU decides to perform the wakeup by itself in ttwu_queue() on
> the target CPU but target is concurrently selected for idle load
> balance (Can this happen? I'm not sure, but we'll consider its
> possibility to estimate the worst case scenario).
>
> ttwu_do_activate() calls enqueue_task() which would increment
> "rq->nr_running" post which it calls wakeup_preempt() which is
> responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
> setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
> thing to note in this case is that rq->nr_running is already non-zero
> in case of a wakeup before TIF_NEED_RESCHED is set which would
> lead to idle_cpu() check returning false.
>
> In all cases, it seems that need_resched() check is unnecessary when
> checking for idle_cpu() first since an impending wakeup racing with idle
> load balancer will either set the "rq->ttwu_pending" or indicate a newly
> woken task via "rq->nr_running".
>
> Chasing the reason why this check might have existed in the first place,
> I came across Peter's suggestion on the first iteration of Suresh's
> patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:
>
> sched_ttwu_do_pending(list);
>
> if (unlikely((rq->idle == current) &&
> rq->nohz_balance_kick &&
> !need_resched()))
> raise_softirq_irqoff(SCHED_SOFTIRQ);
>
> However, since this was preceded by sched_ttwu_do_pending(), the
> equivalent of sched_ttwu_pending() in the current upstream kernel, the
> need_resched() check was necessary to catch a newly queued task. Peter
> suggested modifying it to:
>
> if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
> raise_softirq_irqoff(SCHED_SOFTIRQ);
>
> where idle_cpu() seems to have replaced "rq->idle == current" check.
> However, even back then, the idle_cpu() check would have been sufficient
> to have caught the enqueue of a new task and since commit b2a02fc43a1f
> ("smp: Optimize send_call_function_single_ipi()") overloads the
> interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove
> the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
> on Peter's suggestion.
>
> Link: https://lore.kernel.org/all/1317670590.20367.38.camel@twins/ [1]
> Link: https://lore.kernel.org/lkml/20240615014521.GR8774@noisy.programming.kicks-ass.net/
> Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Turns out the above commit, combined with commit 1a90bfd22020 ("smp:
Make softirq handling RT safe in flush_smp_call_function_queue()"),
will trigger the WARN_ON_ONCE() in do_softirq_post_smp_call_flush()
on RT kernels after this change, since nohz_csd_func() will now raise
a SCHED_SOFTIRQ to trigger the idle balance while being executed from
flush_smp_call_function_queue() in do_idle().
I noticed the following splat early in the boot during my testing of
the series:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 0 at kernel/softirq.c:326 do_softirq_post_smp_call_flush+0x1a/0x40
Modules linked in:
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.10.0-rc6-rt11-test-rt+ #1160
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:do_softirq_post_smp_call_flush+0x1a/0x40
Code: ...
RSP: 0018:ffffb3ae003a7eb8 EFLAGS: 00010002
RAX: 0000000000000080 RBX: 0000000000000282 RCX: 0000000000000007
RDX: 0000000000000000 RSI: ffff9fc3fb4492e0 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
R10: 000000000000009b R11: ffff9f8586e2d4d0 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff9fc3fb400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000807d470001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x88/0x180
? do_softirq_post_smp_call_flush+0x1a/0x40
? report_bug+0x18e/0x1a0
? handle_bug+0x42/0x70
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? do_softirq_post_smp_call_flush+0x1a/0x40
? srso_alias_return_thunk+0x5/0xfbef5
flush_smp_call_function_queue+0x7a/0x90
do_idle+0x15f/0x270
cpu_startup_entry+0x29/0x30
start_secondary+0x12b/0x160
common_startup_64+0x13e/0x141
</TASK>
---[ end trace 0000000000000000 ]---
which points to:
WARN_ON_ONCE(was_pending != local_softirq_pending())
Since MWAIT based idling on x86 sets the TIF_POLLING_NRFLAG, IPIs to
an idle CPU are optimized out by commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") and the logic instead relies on
flush_smp_call_function_queue() in the idle exit path to execute the
SMP-call-function. This previously went undetected since the sender of
the IPI sets the TIF_NEED_RESCHED bit, which would have tripped the
need_resched() check in nohz_csd_func() and prevented it from raising
the softirq.
Would it be okay to allow raising a SCHED_SOFTIRQ from
flush_smp_call_function_queue() on PREEMPT_RT kernels? Something like:
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 8308687fc7b9..d8ce76e6e318 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -314,17 +314,24 @@ static inline void invoke_softirq(void)
wakeup_softirqd();
}
+#define SCHED_SOFTIRQ_MASK BIT(SCHED_SOFTIRQ)
+
/*
* flush_smp_call_function_queue() can raise a soft interrupt in a function
- * call. On RT kernels this is undesired and the only known functionality
- * in the block layer which does this is disabled on RT. If soft interrupts
- * get raised which haven't been raised before the flush, warn so it can be
+ * call. On RT kernels this is undesired and the only known functionalities
+ * are in the block layer which is disabled on RT, and in the scheduler for
+ * idle load balancing. If soft interrupts get raised which haven't been
+ * raised before the flush, warn if it is not a SCHED_SOFTIRQ so it can be
* investigated.
*/
void do_softirq_post_smp_call_flush(unsigned int was_pending)
{
- if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
+ unsigned int is_pending = local_softirq_pending();
+
+ if (unlikely(was_pending != is_pending)) {
+ WARN_ON_ONCE(was_pending != (is_pending & ~SCHED_SOFTIRQ_MASK));
invoke_softirq();
+ }
}
#else /* CONFIG_PREEMPT_RT */
--
With the above diff, I no longer see the splat I was seeing initially.
If there are no strong objections, I can fold the above diff into v2.
--
Thanks and Regards,
Prateek
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..1e0c77eac65a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1205,7 +1205,7 @@ static void nohz_csd_func(void *info)
> WARN_ON(!(flags & NOHZ_KICK_MASK));
>
> rq->idle_balance = idle_cpu(cpu);
> - if (rq->idle_balance && !need_resched()) {
> + if (rq->idle_balance) {
> rq->nohz_idle_balance = flags;
> raise_softirq_irqoff(SCHED_SOFTIRQ);
> }
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing
2024-07-10 9:02 [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing K Prateek Nayak
` (2 preceding siblings ...)
2024-07-10 9:02 ` [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue() K Prateek Nayak
@ 2024-07-29 2:42 ` Chen Yu
3 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2024-07-29 2:42 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
Paul E. McKenney, Imran Khan, Leonardo Bras, Guo Ren,
Rik van Riel, Tejun Heo, Cruz Zhao, Lai Jiangshan, Joel Fernandes,
Zqiang, Julia Lawall, Gautham R. Shenoy, Matt Fleming, Yujie Liu
Hi Prateek,
On 2024-07-10 at 09:02:07 +0000, K Prateek Nayak wrote:
> Since commit b2a02fc43a1f ("smp: Optimize
> send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG can
> be pulled out of idle by setting TIF_NEED_RESCHED instead of sending an
> actual IPI. This affects at least three scenarios that have been
> described below:
>
> o A need_resched() check within a call function does not necessarily
> indicate a task wakeup since a CPU intending to send an IPI to an
> idle target in TIF_POLLING_NRFLAG mode can simply queue the
> SMP-call-function and set the TIF_NEED_RESCHED flag to pull the
> polling target out of idle. The SMP-call-function will be executed by
> flush_smp_call_function_queue() on the idle-exit path. On x86, where
> mwait_idle_with_hints() sets TIF_POLLING_NRFLAG for long idling,
> this leads to idle load balancer bailing out early since
> need_resched() check in nohz_csd_func() returns true in most
> instances.
>
> o A TIF_POLLING_NRFLAG idling CPU woken up to process an IPI will end
> up calling schedule() even in cases where the call function does not
> wake up a new task on the idle CPU, thus delaying the idle re-entry.
>
> o Julia Lawall reported a case where a softirq raised from a
> SMP-call-function on an idle CPU will wake up ksoftirqd since
> flush_smp_call_function_queue() executes in the idle thread's context.
> This can throw off the idle load balancer by making the idle CPU
> appear busy since ksoftirqd just woke on the said CPU [1].
>
> The three patches address each of the above issue individually, the
> first one by removing the need_resched() check in nohz_csd_func() with
> a proper justification, the second by introducing a fast-path in
> __schedule() to speed up idle re-entry in case TIF_NEED_RESCHED was set
> simply to process an IPI that did not perform a wakeup, and the third by
> notifying raise_softirq() that the softirq was raised from a
> SMP-call-function executed by the idle or migration thread in
> flush_smp_call_function_queue(), and waking ksoftirqd is unnecessary
> since a call to do_softirq_post_smp_call_flush() will follow soon.
>
> Previous attempts to solve these problems involved introducing a new
> TIF_NOTIFY_IPI flag to notify a TIF_POLLING_NRFLAG CPU of a pending IPI
> and skip calling __schedule() in such cases but it involved using atomic
> ops which could have performance implications [2]. Instead, Peter
> suggested the approach outlined in the first two patches of the series.
> The third one is an RFC to that (hopefully) solves the problem Julia was
> chasing down related to idle load balancing.
>
> [1] https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/
> [2] https://lore.kernel.org/lkml/20240615014256.GQ8774@noisy.programming.kicks-ass.net/
>
> This patch is based on tip:sched/core at commit c793a62823d1
> ("sched/core: Drop spinlocks on contention iff kernel is preemptible")
>
As discussed in another thread[1], I did a simple test to stress the newidle
balance with this patch applied, to see if there is any impact on newidle balance
cost.
This synthetic test script (shown below) launches NR_CPU processes (on my
platform that is 240). Each process loops on nanosleep(1 us), which is
supposed to trigger newidle balance as often as possible.
Before the 3 patches are applied, on commit c793a62823d1:
3.67% 0.33% [kernel.kallsyms] [k] sched_balance_newidle
2.85% 2.16% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
After 3 patches are applied:
3.51% 0.32% [kernel.kallsyms] [k] sched_balance_newidle
2.71% 2.06% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
It seems that there is not much difference in the percentage of newidle
balance. My understanding is that the current patch set mainly deals
with the false positives of need_resched() caused by IPIs, and thus
avoids newidle balance when the wakeup is only an IPI. In this
synthetic test every wakeup is a real one, so it does not make much
difference. I think an IPI-intensive workload would benefit most from
this change, so I'm planning to use ipistorm to double-check.
Given the scenario of this synthetic test, there seems to be no need to
launch the newidle balance at all, because the sleeping task will be
woken up soon and there is no need to pull another task to the CPU.
Besides, calculating the statistics is not linearly scalable and
remains the bottleneck of newidle balance. I'm thinking of doing some
evaluation based on your fix.
The test code:
i=1;while [ $i -le "240" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	int response, sleep_ns;
	struct timespec remaining, request = { 0, 100 };

	if (argc != 2) {
		printf("please specify the nanosleep duration in ns\n");
		return -1;
	}

	sleep_ns = atoi(argv[1]);
	while (1) {
		request.tv_sec = 0;
		request.tv_nsec = sleep_ns;
		response = nanosleep(&request, &remaining);
	}

	return 0;
}
[1] https://lore.kernel.org/lkml/20240717121745.GF26750@noisy.programming.kicks-ass.net/
thanks,
Chenyu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-07-10 9:02 ` [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() K Prateek Nayak
2024-07-11 8:00 ` Vincent Guittot
@ 2024-07-30 16:13 ` Chen Yu
2024-08-04 4:05 ` Chen Yu
1 sibling, 1 reply; 18+ messages in thread
From: Chen Yu @ 2024-07-30 16:13 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
Paul E. McKenney, Imran Khan, Leonardo Bras, Guo Ren,
Rik van Riel, Tejun Heo, Cruz Zhao, Lai Jiangshan, Joel Fernandes,
Zqiang, Julia Lawall, Gautham R. Shenoy
On 2024-07-10 at 09:02:09 +0000, K Prateek Nayak wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Since commit b2a02fc43a1f ("smp: Optimize
> send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
> can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
> IPI without actually sending an interrupt. Even in cases where the IPI
> handler does not queue a task on the idle CPU, do_idle() will call
> __schedule() since need_resched() returns true in these cases.
>
> Introduce and use SM_IDLE to identify call to __schedule() from
> schedule_idle() and shorten the idle re-entry time by skipping
> pick_next_task() when nr_running is 0 and the previous task is the idle
> task.
>
> With the SM_IDLE fast-path, the time taken to complete a fixed set of
> IPIs using ipistorm improves significantly. Following are the numbers
> from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
> C2 disabled) running ipistorm between CPU8 and CPU16:
>
> cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
>
> ==================================================================
> Test : ipistorm (modified)
> Units : Normalized runtime
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> kernel: time [pct imp]
> tip:sched/core 1.00 [baseline]
> tip:sched/core + SM_IDLE 0.25 [75.11%]
>
> [ kprateek: Commit log and testing ]
>
> Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
Only with current patch applied on top of sched/core commit c793a62823d1,
a significant throughput/run-to-run variance improvement is observed
on an Intel 240 CPUs/ 2 Nodes server. C-states >= C1E are disabled,
CPU frequency governor is set to performance and turbo-boost disabled.
Without the patch(lower the better):
158490995
113086433
737869191
302454894
731262790
677283357
729767478
830949261
399824606
743681976
(Amean): 542467098
(Std): 257011706
With the patch(lower the better):
128060992
115646768
132734621
150330954
113143538
169875051
145010400
151589193
162165800
159963320
(Amean): 142852063
(Std): 18646313
I've launched full tests for schbench/hackbench/netperf/tbench
to see if there is any difference.
thanks,
Chenyu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-07-30 16:13 ` Chen Yu
@ 2024-08-04 4:05 ` Chen Yu
2024-08-05 4:03 ` K Prateek Nayak
0 siblings, 1 reply; 18+ messages in thread
From: Chen Yu @ 2024-08-04 4:05 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
Paul E. McKenney, Imran Khan, Leonardo Bras, Guo Ren,
Rik van Riel, Tejun Heo, Cruz Zhao, Lai Jiangshan, Joel Fernandes,
Zqiang, Julia Lawall, Gautham R. Shenoy
On 2024-07-31 at 00:13:40 +0800, Chen Yu wrote:
> On 2024-07-10 at 09:02:09 +0000, K Prateek Nayak wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > Since commit b2a02fc43a1f ("smp: Optimize
> > send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
> > can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
> > IPI without actually sending an interrupt. Even in cases where the IPI
> > handler does not queue a task on the idle CPU, do_idle() will call
> > __schedule() since need_resched() returns true in these cases.
> >
> > Introduce and use SM_IDLE to identify call to __schedule() from
> > schedule_idle() and shorten the idle re-entry time by skipping
> > pick_next_task() when nr_running is 0 and the previous task is the idle
> > task.
> >
> > With the SM_IDLE fast-path, the time taken to complete a fixed set of
> > IPIs using ipistorm improves significantly. Following are the numbers
> > from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
> > C2 disabled) running ipistorm between CPU8 and CPU16:
> >
> > cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
> >
> > ==================================================================
> > Test : ipistorm (modified)
> > Units : Normalized runtime
> > Interpretation: Lower is better
> > Statistic : AMean
> > ==================================================================
> > kernel: time [pct imp]
> > tip:sched/core 1.00 [baseline]
> > tip:sched/core + SM_IDLE 0.25 [75.11%]
> >
> > [ kprateek: Commit log and testing ]
> >
> > Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
> > Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >
>
> Only with current patch applied on top of sched/core commit c793a62823d1,
> a significant throughput/run-to-run variance improvement is observed
> on an Intel 240 CPUs/ 2 Nodes server. C-states >= C1E are disabled,
> CPU frequency governor is set to performance and turbo-boost disabled.
>
> Without the patch(lower the better):
>
> 158490995
> 113086433
> 737869191
> 302454894
> 731262790
> 677283357
> 729767478
> 830949261
> 399824606
> 743681976
>
> (Amean): 542467098
> (Std): 257011706
>
>
> With the patch(lower the better):
> 128060992
> 115646768
> 132734621
> 150330954
> 113143538
> 169875051
> 145010400
> 151589193
> 162165800
> 159963320
>
> (Amean): 142852063
> (Std): 18646313
>
> I've launched full tests for schbench/hackbench/netperf/tbench
> to see if there is any difference.
>
Tested without CONFIG_PREEMPT_RT, so the SM_RTLOCK_WAIT issue mentioned
by Vincent should not have had any impact. No obvious difference
(regression) was detected in the tests run in the 0day environment.
Overall this patch looks good to me. Once you send a refreshed version
out I'll re-launch the tests.
Tested on a Xeon server with 128 CPUs, 4 NUMA nodes, under different
load levels:
baseline with-SM_IDLE
hackbench
load level (25% ~ 100%)
hackbench-pipe-process.throughput
%25:
846099 -0.3% 843217
%50:
972015 +0.0% 972185
%100:
1395650 -1.3% 1376963
hackbench-pipe-threads.throughput
%25:
746629 -0.0% 746345
%50:
885165 -0.4% 881602
%100:
1227790 +1.3% 1243757
hackbench-socket-process.throughput
%25:
395784 +1.2% 400717
%50:
441312 +0.3% 442783
%100:
324283 ± 2% +6.0% 343826
hackbench-socket-threads.throughput
%25:
379700 -0.8% 376642
%50:
425315 -0.4% 423749
%100:
311937 ± 2% +0.9% 314892
baseline with-SM_IDLE
schbench.request_latency_90%_us
1-mthread-1-worker:
4562 -0.0% 4560
1-mthread-16-workers:
4564 -0.0% 4563
12.5%-mthread-1:
4565 +0.0% 4567
12.5%-mthread-16-workers:
39204 +0.1% 39248
25%-mthread-1-worker:
4574 +0.0% 4574
25%-mthread-16-workers:
161944 +0.1% 162053
50%-mthread-1-workers:
4784 ± 5% +0.1% 4789 ± 5%
50%-mthread-16-workers:
659156 +0.4% 661679
100%-mthread-1-workers:
9328 +0.0% 9329
100%-mthread-16-workers:
2489753 -0.7% 2472140
baseline with-SM_IDLE
netperf.Throughput:
25%-TCP_RR:
2449875 +0.0% 2450622 netperf.Throughput_total_tps
25%-UDP_RR:
2746806 +0.1% 2748935 netperf.Throughput_total_tps
25%-TCP_STREAM:
1352061 +0.7% 1361497 netperf.Throughput_total_Mbps
25%-UDP_STREAM:
1815205 +0.1% 1816202 netperf.Throughput_total_Mbps
50%-TCP_RR:
3981514 -0.3% 3970327 netperf.Throughput_total_tps
50%-UDP_RR:
4496584 -1.3% 4438363 netperf.Throughput_total_tps
50%-TCP_STREAM:
1478872 +0.4% 1484196 netperf.Throughput_total_Mbps
50%-UDP_STREAM:
1739540 +0.3% 1744074 netperf.Throughput_total_Mbps
75%-TCP_RR:
3696607 -0.5% 3677044 netperf.Throughput_total_tps
75%-UDP_RR:
4161206 +1.3% 4217274 ± 2% netperf.Throughput_total_tps
75%-TCP_STREAM:
895874 +5.7% 946546 ± 5% netperf.Throughput_total_Mbps
75%-UDP_STREAM:
4100019 -0.3% 4088367 netperf.Throughput_total_Mbps
100%-TCP_RR:
6724456 -1.7% 6610976 netperf.Throughput_total_tps
100%-UDP_RR:
7329959 -0.5% 7294653 netperf.Throughput_total_tps
100%-TCP_STREAM:
808165 +0.3% 810360 netperf.Throughput_total_Mbps
100%-UDP_STREAM:
5562651 +0.0% 5564106 netperf.Throughput_total_Mbps
thanks,
Chenyu
* Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
2024-08-04 4:05 ` Chen Yu
@ 2024-08-05 4:03 ` K Prateek Nayak
0 siblings, 0 replies; 18+ messages in thread
From: K Prateek Nayak @ 2024-08-05 4:03 UTC (permalink / raw)
To: Chen Yu
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
Paul E. McKenney, Imran Khan, Leonardo Bras, Guo Ren,
Rik van Riel, Tejun Heo, Cruz Zhao, Lai Jiangshan, Joel Fernandes,
Zqiang, Julia Lawall, Gautham R. Shenoy
Hello Chenyu,
Thank you for testing the series. I'll have a second version out soon.
On 8/4/2024 9:35 AM, Chen Yu wrote:
> On 2024-07-31 at 00:13:40 +0800, Chen Yu wrote:
>> On 2024-07-10 at 09:02:09 +0000, K Prateek Nayak wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>>
>>> Since commit b2a02fc43a1f ("smp: Optimize
>>> send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
>>> can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
>>> IPI without actually sending an interrupt. Even in cases where the IPI
>>> handler does not queue a task on the idle CPU, do_idle() will call
>>> __schedule() since need_resched() returns true in these cases.
>>>
>>> Introduce and use SM_IDLE to identify call to __schedule() from
>>> schedule_idle() and shorten the idle re-entry time by skipping
>>> pick_next_task() when nr_running is 0 and the previous task is the idle
>>> task.
>>>
>>> With the SM_IDLE fast-path, the time taken to complete a fixed set of
>>> IPIs using ipistorm improves significantly. Following are the numbers
>>> from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
>>> C2 disabled) running ipistorm between CPU8 and CPU16:
>>>
>>> cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
>>>
>>> ==================================================================
>>> Test : ipistorm (modified)
>>> Units : Normalized runtime
>>> Interpretation: Lower is better
>>> Statistic : AMean
>>> ==================================================================
>>> kernel: time [pct imp]
>>> tip:sched/core 1.00 [baseline]
>>> tip:sched/core + SM_IDLE 0.25 [75.11%]
>>>
>>> [ kprateek: Commit log and testing ]
>>>
>>> Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
>>> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>>
>>
>> Only with current patch applied on top of sched/core commit c793a62823d1,
>> a significant throughput/run-to-run variance improvement is observed
>> on an Intel 240 CPUs/ 2 Nodes server. C-states >= C1E are disabled,
>> CPU frequency governor is set to performance and turbo-boost disabled.
>>
>> Without the patch (lower is better):
>>
>> 158490995
>> 113086433
>> 737869191
>> 302454894
>> 731262790
>> 677283357
>> 729767478
>> 830949261
>> 399824606
>> 743681976
>>
>> (Amean): 542467098
>> (Std): 257011706
>>
>>
>> With the patch (lower is better):
>> 128060992
>> 115646768
>> 132734621
>> 150330954
>> 113143538
>> 169875051
>> 145010400
>> 151589193
>> 162165800
>> 159963320
>>
>> (Amean): 142852063
>> (Std): 18646313
>>
>> I've launched full tests for schbench/hackbench/netperf/tbench
>> to see if there is any difference.
>>
>
> Tested without CONFIG_PREEMPT_RT, so the SM_RTLOCK_WAIT issue mentioned
> by Vincent might not have any impact here. No obvious difference
> (regression) was detected by the tests in the 0day environment. Overall
> this patch looks good to me. Once you send a refreshed version out I'll
> re-launch the test.
Since SM_RTLOCK_WAIT is only used by schedule_rtlock(), which is only
defined for PREEMPT_RT kernels, non-RT builds should have no issue. I
could spot at least one case in rtlock_slowlock_locked() where
prev->__state is set to TASK_RTLOCK_WAIT before schedule_rtlock() is
called. With this patch, that call would pass the "sched_mode > SM_NONE"
check and be treated as an involuntary context switch, whereas on tip,
(preempt & SM_MASK_PREEMPT) would evaluate to false and
deactivate_task() would eventually dequeue the waiting task, so this
does need fixing.
From a brief look, all calls to schedule() with SM_RTLOCK_WAIT already
set task->__state to a non-zero value. I'll look into this further
after the respin to see if there is some optimization possible there,
but for the time being, I'll respin this with the condition changed
to:

	...
	} else if (preempt != SM_PREEMPT && prev_state) {
	...

just to keep it explicit.
Thank you again for testing this version.
--
Thanks and Regards,
Prateek
>
> Tested on Xeon server with 128 CPUs, 4 NUMA nodes, under different
> load levels (25% ~ 100%).
>
> baseline with-SM_IDLE
>
> hackbench
>
> hackbench-pipe-process.throughput
> %25:
> 846099 -0.3% 843217
> %50:
> 972015 +0.0% 972185
> %100:
> 1395650 -1.3% 1376963
>
> hackbench-pipe-threads.throughput
> %25:
> 746629 -0.0% 746345
> %50:
> 885165 -0.4% 881602
> %100:
> 1227790 +1.3% 1243757
>
> hackbench-socket-process.throughput
> %25:
> 395784 +1.2% 400717
> %50:
> 441312 +0.3% 442783
> %100:
> 324283 ± 2% +6.0% 343826
>
> hackbench-socket-threads.throughput
> %25:
> 379700 -0.8% 376642
> %50:
> 425315 -0.4% 423749
> %100:
> 311937 ± 2% +0.9% 314892
>
>
>
> baseline with-SM_IDLE
>
> schbench.request_latency_90%_us
>
> 1-mthread-1-worker:
> 4562 -0.0% 4560
> 1-mthread-16-workers:
> 4564 -0.0% 4563
> 12.5%-mthread-1:
> 4565 +0.0% 4567
> 12.5%-mthread-16-workers:
> 39204 +0.1% 39248
> 25%-mthread-1-worker:
> 4574 +0.0% 4574
> 25%-mthread-16-workers:
> 161944 +0.1% 162053
> 50%-mthread-1-workers:
> 4784 ± 5% +0.1% 4789 ± 5%
> 50%-mthread-16-workers:
> 659156 +0.4% 661679
> 100%-mthread-1-workers:
> 9328 +0.0% 9329
> 100%-mthread-16-workers:
> 2489753 -0.7% 2472140
>
>
> baseline with-SM_IDLE
>
> netperf.Throughput:
>
> 25%-TCP_RR:
> 2449875 +0.0% 2450622 netperf.Throughput_total_tps
> 25%-UDP_RR:
> 2746806 +0.1% 2748935 netperf.Throughput_total_tps
> 25%-TCP_STREAM:
> 1352061 +0.7% 1361497 netperf.Throughput_total_Mbps
> 25%-UDP_STREAM:
> 1815205 +0.1% 1816202 netperf.Throughput_total_Mbps
> 50%-TCP_RR:
> 3981514 -0.3% 3970327 netperf.Throughput_total_tps
> 50%-UDP_RR:
> 4496584 -1.3% 4438363 netperf.Throughput_total_tps
> 50%-TCP_STREAM:
> 1478872 +0.4% 1484196 netperf.Throughput_total_Mbps
> 50%-UDP_STREAM:
> 1739540 +0.3% 1744074 netperf.Throughput_total_Mbps
> 75%-TCP_RR:
> 3696607 -0.5% 3677044 netperf.Throughput_total_tps
> 75%-UDP_RR:
> 4161206 +1.3% 4217274 ± 2% netperf.Throughput_total_tps
> 75%-TCP_STREAM:
> 895874 +5.7% 946546 ± 5% netperf.Throughput_total_Mbps
> 75%-UDP_STREAM:
> 4100019 -0.3% 4088367 netperf.Throughput_total_Mbps
> 100%-TCP_RR:
> 6724456 -1.7% 6610976 netperf.Throughput_total_tps
> 100%-UDP_RR:
> 7329959 -0.5% 7294653 netperf.Throughput_total_tps
> 100%-TCP_STREAM:
> 808165 +0.3% 810360 netperf.Throughput_total_Mbps
> 100%-UDP_STREAM:
> 5562651 +0.0% 5564106 netperf.Throughput_total_Mbps
>
> thanks,
> Chenyu
end of thread, other threads: [~2024-08-05  4:03 UTC | newest]

Thread overview: 18+ messages
2024-07-10 9:02 [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing K Prateek Nayak
2024-07-10 9:02 ` [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
2024-07-10 14:53 ` Peter Zijlstra
2024-07-10 17:57 ` K Prateek Nayak
2024-07-23 6:46 ` K Prateek Nayak
2024-07-10 9:02 ` [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() K Prateek Nayak
2024-07-11 8:00 ` Vincent Guittot
2024-07-11 9:19 ` Peter Zijlstra
2024-07-11 13:14 ` Vincent Guittot
2024-07-12 6:40 ` K Prateek Nayak
2024-07-30 16:13 ` Chen Yu
2024-08-04 4:05 ` Chen Yu
2024-08-05 4:03 ` K Prateek Nayak
2024-07-10 9:02 ` [RFC PATCH 3/3] softirq: Avoid waking up ksoftirqd from flush_smp_call_function_queue() K Prateek Nayak
2024-07-10 15:05 ` Peter Zijlstra
2024-07-10 18:20 ` K Prateek Nayak
2024-07-23 4:50 ` K Prateek Nayak
2024-07-29 2:42 ` [PATCH 0/3] sched/core: Fixes and enhancements around spurious need_resched() and idle load balancing Chen Yu