public inbox for linux-kernel@vger.kernel.org
* [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs
@ 2024-11-19  5:44 K Prateek Nayak
  2024-11-19  5:44 ` [PATCH v5 1/4] softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel K Prateek Nayak
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: K Prateek Nayak @ 2024-11-19  5:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-kernel, linux-rt-devel
  Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Tejun Heo, NeilBrown, Jens Axboe,
	Frederic Weisbecker, Zqiang, Caleb Sander Mateos,
	Gautham R . Shenoy, Chen Yu, Julia Lawall, K Prateek Nayak

Hello everyone,

This is the fifth version, with one more patch added since the last
version, and some updated benchmarking data down below. Any and all
feedback is highly appreciated.

This series is based on:

    git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core

at commit f9ed1f7c2e26 ("genirq/proc: Use seq_put_decimal_ull_width()
for decimal values")

Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG can
be pulled out of idle by setting TIF_NEED_RESCHED instead of sending an
actual IPI. This affects at least the three scenarios described below:

1. A need_resched() check within a call function does not necessarily
   indicate a task wakeup since a CPU intending to send an IPI to an
   idle target in TIF_POLLING_NRFLAG mode can simply queue the
   SMP-call-function and set the TIF_NEED_RESCHED flag to pull the
   polling target out of idle. The SMP-call-function will be executed by
   flush_smp_call_function_queue() on the idle-exit path. On x86, where
   mwait_idle_with_hints() sets TIF_POLLING_NRFLAG for long idling,
   this leads to the idle load balancer bailing out early since the
   need_resched() check in nohz_csd_func() returns true in most
   instances. This is also true for softirqs handled from
   do_softirq_post_smp_call_flush(), which again are processed before
   __schedule() clears the TIF_NEED_RESCHED flag.

2. A TIF_POLLING_NRFLAG idling CPU woken up to process an IPI will end
   up calling schedule() even in cases where the call function does not
   wake up a new task on the idle CPU, thus delaying the idle re-entry.

3. Julia Lawall reported a case where a softirq raised from a
   SMP-call-function on an idle CPU will wake up ksoftirqd since
   flush_smp_call_function_queue() executes in the idle thread's
   context. This can throw off the idle load balancer by making the idle
   CPU appear busy, since ksoftirqd has just woken on said CPU [1].

The solution to (2.) was sent independently in [2] since it does not
depend on the changes enclosed in this series, which reworks some
PREEMPT_RT-specific bits.

(1.) was solved by dropping the need_resched() check in nohz_csd_func()
(please refer to Patch 2 and Patch 3 for the full explanation), which
led to a splat on PREEMPT_RT kernels [3].

Since flush_smp_call_function_queue() and the following
do_softirq_post_smp_call_flush() run with preemption disabled, it is not
ideal for the IRQ handlers to raise a SOFTIRQ, prolonging the preempt
disabled section, especially on PREEMPT_RT kernels. Patch 1 works around
this by allowing processing of SCHED_SOFTIRQ without raising a warning,
since the idle load balancer has checkpoints to detect the wakeup of
tasks on the idle CPU, in which case it immediately bails out of the
load balancing process and proceeds to call schedule().

With the above solution, the problem discussed in (3.) becomes even more
prominent, with idle load balancing unnecessarily waking up ksoftirqd.
Previous versions attempted to solve this by introducing a per-CPU
variable to keep track of an impending call to do_softirq(). Sebastian
suggested using __raise_softirq_irqoff() for SCHED_SOFTIRQ since it
avoids the wakeup of ksoftirqd, and since SCHED_SOFTIRQ is raised from
the IPI handler, it is guaranteed to be serviced on IRQ exit or via
do_softirq_post_smp_call_flush().

Sebastian also raised concerns about threadirqs possibly interfering
with load balancing. However, in the case of idle load balancing, the
CPU performing the balance on behalf of other nohz idle CPUs only
performs a load balance for itself at the very end. If the idle load
balancing was offloaded to ksoftirqd, and the CPU bails out from idle
load balancing, the scheduler will perform a newidle balance when
ksoftirqd goes back to sleep without having pulled any task, since
ksoftirqd will be the last fair task on the runqueue; this should handle
any remaining imbalance. These details have been added in the commit log
of Patch 3 for the record.

Chen Yu had reported a regression when running a modified version of
ipistorm that performs a fixed set of IPIs between two CPUs on his
setup with a previous version of this series, which updated certain
per-CPU variables at the flush_smp_call_function_queue() boundary to
prevent ksoftirqd wakeups. Since these updates are now gone, courtesy of
the new approach suggested by Sebastian, this regression should no
longer be seen. The results from testing on the patched kernel are
identical to the base kernel.

Julia Lawall reported a reduction in the number of load balancing
attempts at the NUMA level on v6.11-based tip:sched/core. The issue was
root-caused
to commit 3dcac251b066 ("sched/core: Introduce SM_IDLE and an idle
re-entry fast-path in __schedule()") which would skip the
newidle_balance() if schedule_idle() was called without any task wakeups
on the idle CPUs to favor a faster idle re-entry. To rule out any
surprises from this series in particular, I tested the bt.B.x benchmark
where she originally observed this behavior. Following are the
numbers from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T):

   ==================================================================
   Test          : bt.B.x (OMP variant)
   Units         : % improvement over base kernel in Mop/s throughput
   Interpretation: Higher is better
   ======================= Intel Ice Lake Xeon ======================
   kernel:						[pct imp]
   performance gov, boost en	  			  0.89%
   performance gov, boost dis	  			  0.26%
   ==================================================================

I did not see any discernible difference with this one over the base
kernel.

[1] https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/
[2] https://lore.kernel.org/lkml/20240809092240.6921-1-kprateek.nayak@amd.com/
[3] https://lore.kernel.org/lkml/225e6d74-ed43-51dd-d1aa-c75c86dd58eb@amd.com/
[4] https://lore.kernel.org/lkml/20240710150557.GB27299@noisy.programming.kicks-ass.net/
---
v4..v5:

o Added Patch 3 which modifies another need_resched() check in the idle
  load balancer's SOFTIRQ handler path that I had missed previously.

o Collected Reviewed-by from Sebastian on Patch 4.

o Reworded the commit messages for Patch 1, and Patch 4.

o Added more clarifications around load balancing in the cover letter
  for threadirqs case.

v4: https://lore.kernel.org/lkml/20241030071557.1422-1-kprateek.nayak@amd.com/

v3..v4:

o Use __raise_softirq_irqoff() for SCHED_SOFTIRQ. (Sebastian)

o Updated benchmark numbers and reworded the cover letter.

v3: https://lore.kernel.org/lkml/20241014090339.2478-1-kprateek.nayak@amd.com/

v2..v3:

o Removed ifdefs around local_lock_t. (Peter)

o Reworded Patch 1 to add more details on raising SCHED_SOFTIRQ from
  flush_smp_call_function_queue() and why it should be okay on
  PREEMPT_RT.

o Updated the trace data in Patch 5.

o More benchmarking.

v2: https://lore.kernel.org/lkml/20240904111223.1035-1-kprateek.nayak@amd.com/

v1..v2:

o Broke the PREEMPT_RT unification and idle load balance fixes into a
  separate series (this one) and posted the SM_IDLE fast-path
  enhancements separately.

o Worked around the splat on PREEMPT_RT kernels caused by raising
  SCHED_SOFTIRQ from nohz_csd_func() in the context of
  flush_smp_call_function_queue(), which is undesirable on PREEMPT_RT
  kernels. (Please refer to commit 1a90bfd22020 ("smp: Make softirq
  handling RT safe in flush_smp_call_function_queue()").)

o Reuse softirq_ctrl::cnt from PREEMPT_RT to prevent unnecessary
  wakeups of ksoftirqd. (Peter)
  This unifies should_wakeup_ksoftirqd() and adds an interface to
  indicate an impending call to do_softirq (set_do_softirq_pending())
  and clear it just before fulfilling the promise
  (clr_do_softirq_pending()).

o More benchmarking.

v1: https://lore.kernel.org/lkml/20240710090210.41856-1-kprateek.nayak@amd.com/
--


K Prateek Nayak (4):
  softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT
    kernel
  sched/core: Remove the unnecessary need_resched() check in
    nohz_csd_func()
  sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU
    turning busy
  sched/core: Prevent wakeup of ksoftirqd during idle load balance

 kernel/sched/core.c |  4 ++--
 kernel/sched/fair.c |  2 +-
 kernel/softirq.c    | 15 +++++++++++----
 3 files changed, 14 insertions(+), 7 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5 1/4] softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel
  2024-11-19  5:44 [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs K Prateek Nayak
@ 2024-11-19  5:44 ` K Prateek Nayak
  2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2024-11-19  5:44 ` [PATCH v5 2/4] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: K Prateek Nayak @ 2024-11-19  5:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-kernel, linux-rt-devel
  Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Tejun Heo, NeilBrown, Jens Axboe,
	Frederic Weisbecker, Zqiang, Caleb Sander Mateos,
	Gautham R . Shenoy, Chen Yu, Julia Lawall, K Prateek Nayak

do_softirq_post_smp_call_flush() on PREEMPT_RT kernels carries a
WARN_ON_ONCE() for any SOFTIRQ being raised from an SMP-call-function.
Since do_softirq_post_smp_call_flush() is called with preempt disabled,
raising a SOFTIRQ during flush_smp_call_function_queue() can lead to
longer preempt disabled sections.

Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") IPIs to an idle CPU in
TIF_POLLING_NRFLAG mode can be optimized out by instead setting
TIF_NEED_RESCHED bit in idle task's thread_info and relying on the
flush_smp_call_function_queue() in the idle-exit path to run the
SMP-call-function.

To trigger idle load balancing, the scheduler queues nohz_csd_func(),
responsible for triggering the idle load balance, on a target nohz idle
CPU and sends an IPI. Now, this IPI is optimized out and the
SMP-call-function is executed from flush_smp_call_function_queue() in
do_idle(), which can raise a SCHED_SOFTIRQ to trigger the balancing.

So far, this went undetected since the need_resched() check in
nohz_csd_func() would make it bail out of idle load balancing early,
as the idle thread does not clear TIF_POLLING_NRFLAG before calling
flush_smp_call_function_queue(). The need_resched() check was added with
the intent to catch a new task wakeup; however, it has recently been
discovered to be unnecessary and will be removed in the subsequent
commit, after which nohz_csd_func() can raise a SCHED_SOFTIRQ from
flush_smp_call_function_queue() to trigger an idle load balance on an
idle target in TIF_POLLING_NRFLAG mode.

nohz_csd_func() bails out early if the idle_cpu() check for the target
CPU fails, and does not lock the target CPU's rq until the very end,
once it has found tasks to run on the CPU; thus it will not inhibit the
wakeup, or the running, of a newly woken higher-priority task. Account
for this and prevent a WARN_ON_ONCE() when SCHED_SOFTIRQ is raised from
flush_smp_call_function_queue().

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v4..v5:

- More clarification in the commit log around idle load balancing (last
  paragraph)
---
 kernel/softirq.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7b525c904462..03248ca887b5 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -280,17 +280,24 @@ static inline void invoke_softirq(void)
 		wakeup_softirqd();
 }
 
+#define SCHED_SOFTIRQ_MASK	BIT(SCHED_SOFTIRQ)
+
 /*
  * flush_smp_call_function_queue() can raise a soft interrupt in a function
- * call. On RT kernels this is undesired and the only known functionality
- * in the block layer which does this is disabled on RT. If soft interrupts
- * get raised which haven't been raised before the flush, warn so it can be
+ * call. On RT kernels this is undesired and the only known functionalities
+ * are in the block layer which is disabled on RT, and in the scheduler for
+ * idle load balancing. If soft interrupts get raised which haven't been
+ * raised before the flush, warn if it is not a SCHED_SOFTIRQ so it can be
  * investigated.
  */
 void do_softirq_post_smp_call_flush(unsigned int was_pending)
 {
-	if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
+	unsigned int is_pending = local_softirq_pending();
+
+	if (unlikely(was_pending != is_pending)) {
+		WARN_ON_ONCE(was_pending != (is_pending & ~SCHED_SOFTIRQ_MASK));
 		invoke_softirq();
+	}
 }
 
 #else /* CONFIG_PREEMPT_RT */
-- 
2.34.1



* [PATCH v5 2/4] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
  2024-11-19  5:44 [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs K Prateek Nayak
  2024-11-19  5:44 ` [PATCH v5 1/4] softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel K Prateek Nayak
@ 2024-11-19  5:44 ` K Prateek Nayak
  2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2024-11-19  5:44 ` [PATCH v5 3/4] sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy K Prateek Nayak
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: K Prateek Nayak @ 2024-11-19  5:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-kernel, linux-rt-devel
  Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Tejun Heo, NeilBrown, Jens Axboe,
	Frederic Weisbecker, Zqiang, Caleb Sander Mateos,
	Gautham R . Shenoy, Chen Yu, Julia Lawall, K Prateek Nayak

The need_resched() check currently in nohz_csd_func() can be traced
back to scheduler_ipi() in 2011, via commit ca38062e57e9 ("sched: Use
resched IPI to kick off the nohz idle balance").

Since then, it has travelled quite a bit, but it seems that an
idle_cpu() check is now sufficient to detect the need to bail out of
idle load balancing. To justify this removal, consider all the following
cases where idle load balancing could race with a task wakeup:

o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
  on wakelist if wakee cpu is idle") a target perceived to be idle
  (target_rq->nr_running == 0) will return true for
  ttwu_queue_cond(target) which will offload the task wakeup to the idle
  target via an IPI.

  In all such cases target_rq->ttwu_pending will be set to 1 before
  queuing the wake function.

  If an idle load balance races here, following scenarios are possible:

  - The CPU is not in TIF_POLLING_NRFLAG mode, in which case an actual
    IPI is sent to the CPU to wake it out of idle. If nohz_csd_func()
    queues before sched_ttwu_pending(), the idle load balance will bail
    out since idle_cpu(target) returns 0 because target_rq->ttwu_pending
    is 1. If nohz_csd_func() is queued after sched_ttwu_pending(), it
    should see rq->nr_running to be non-zero and bail out of idle load
    balancing.

  - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
    the sender will simply set TIF_NEED_RESCHED for the target to put it
    out of idle and flush_smp_call_function_queue() in do_idle() will
    execute the call function. Depending on the ordering of the queuing
    of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
    nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
    target_rq->nr_running to be non-zero if there is a genuine task
    wakeup racing with the idle load balance kick.

o The waker CPU perceives the target CPU to be busy
  (target_rq->nr_running != 0) but the CPU is in fact going idle, and
  due to a series of unfortunate events, the system reaches a case where
  the waker CPU decides to perform the wakeup by itself in ttwu_queue()
  on the target CPU, but the target is concurrently selected for idle
  load balance (XXX: Can this happen? I'm not sure, but we'll consider
  the mother of all coincidences to estimate the worst case scenario).

  ttwu_do_activate() calls enqueue_task(), which increments
  "rq->nr_running", post which it calls wakeup_preempt(), which is
  responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
  setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
  thing to note in this case is that rq->nr_running is already non-zero
  in the case of a wakeup before TIF_NEED_RESCHED is set, which would
  lead to the idle_cpu() check returning false.

In all cases, it seems that the need_resched() check is unnecessary when
checking idle_cpu() first, since an impending wakeup racing with the
idle load balancer will either set "rq->ttwu_pending" or indicate a
newly woken task via "rq->nr_running".

Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the first iteration of Suresh's
patch from 2011 [1], where the condition to raise the SCHED_SOFTIRQ was:

	sched_ttwu_do_pending(list);

	if (unlikely((rq->idle == current) &&
	    rq->nohz_balance_kick &&
	    !need_resched()))
		raise_softirq_irqoff(SCHED_SOFTIRQ);

Since the condition to raise the SCHED_SOFTIRQ was preceded by
sched_ttwu_do_pending() (the equivalent of sched_ttwu_pending() in the
current upstream kernel), the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:

	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
		raise_softirq_irqoff(SCHED_SOFTIRQ);

where idle_cpu() seems to have replaced "rq->idle == current" check.

Even back then, the idle_cpu() check would have been sufficient to catch
a new task being enqueued. Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.

Link: https://lore.kernel.org/all/1317670590.20367.38.camel@twins/ [1]
Link: https://lore.kernel.org/lkml/20240615014521.GR8774@noisy.programming.kicks-ass.net/
Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v4..v5:

- No changes.
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..424c652a9ddc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1237,7 +1237,7 @@ static void nohz_csd_func(void *info)
 	WARN_ON(!(flags & NOHZ_KICK_MASK));
 
 	rq->idle_balance = idle_cpu(cpu);
-	if (rq->idle_balance && !need_resched()) {
+	if (rq->idle_balance) {
 		rq->nohz_idle_balance = flags;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
-- 
2.34.1



* [PATCH v5 3/4] sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
  2024-11-19  5:44 [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs K Prateek Nayak
  2024-11-19  5:44 ` [PATCH v5 1/4] softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel K Prateek Nayak
  2024-11-19  5:44 ` [PATCH v5 2/4] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
@ 2024-11-19  5:44 ` K Prateek Nayak
  2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2024-11-19  5:44 ` [PATCH v5 4/4] sched/core: Prevent wakeup of ksoftirqd during idle load balance K Prateek Nayak
  2024-11-19 11:15 ` [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs Peter Zijlstra
  4 siblings, 1 reply; 10+ messages in thread
From: K Prateek Nayak @ 2024-11-19  5:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-kernel, linux-rt-devel
  Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Tejun Heo, NeilBrown, Jens Axboe,
	Frederic Weisbecker, Zqiang, Caleb Sander Mateos,
	Gautham R . Shenoy, Chen Yu, Julia Lawall, K Prateek Nayak

Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the
TIF_NEED_RESCHED flag in idle task's thread info and relying on
flush_smp_call_function_queue() in idle exit path to run the
call-function. A softirq raised by the call-function is handled shortly
after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag
remains set and is only cleared later when schedule_idle() calls
__schedule().

The need_resched() check in _nohz_idle_balance() exists to bail out of
load balancing if another task has woken up on the CPU currently in
charge of idle load balancing, which is processed in SCHED_SOFTIRQ
context. Since the optimization mentioned above overloads the
interpretation of TIF_NEED_RESCHED, check idle_cpu() before the existing
need_resched() check. This can catch a genuine task wakeup on an idle
CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as
well as the case where ksoftirqd needs to be preempted as a result of a
new task wakeup or slice expiry.

In the case of PREEMPT_RT or threadirqs, although the idle load
balancing may be inhibited in some cases on the ilb CPU, the fact that
ksoftirqd is the only fair task going back to sleep will trigger a
newidle balance on the CPU, which will alleviate any remaining imbalance
if the idle balance failed to do so.

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v4..v5:

- New patch. Added some details on the implications of threadirqs on
  idle load balancing and how newidle balance helps those cases.
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..fe6db479a855 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12564,7 +12564,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 		 * work being done for other CPUs. Next load
 		 * balancing owner will pick it up.
 		 */
-		if (need_resched()) {
+		if (!idle_cpu(this_cpu) && need_resched()) {
 			if (flags & NOHZ_STATS_KICK)
 				has_blocked_load = true;
 			if (flags & NOHZ_NEXT_KICK)
-- 
2.34.1



* [PATCH v5 4/4] sched/core: Prevent wakeup of ksoftirqd during idle load balance
  2024-11-19  5:44 [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs K Prateek Nayak
                   ` (2 preceding siblings ...)
  2024-11-19  5:44 ` [PATCH v5 3/4] sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy K Prateek Nayak
@ 2024-11-19  5:44 ` K Prateek Nayak
  2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2024-11-19 11:15 ` [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs Peter Zijlstra
  4 siblings, 1 reply; 10+ messages in thread
From: K Prateek Nayak @ 2024-11-19  5:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-kernel, linux-rt-devel
  Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Tejun Heo, NeilBrown, Jens Axboe,
	Frederic Weisbecker, Zqiang, Caleb Sander Mateos,
	Gautham R . Shenoy, Chen Yu, Julia Lawall, K Prateek Nayak

The scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event
from the IPI handler on the idle CPU. If the SMP function is invoked
from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ
flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd
because soft interrupts are handled before ksoftirqd gets on the CPU.

Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:

       <idle>-0   [000] dN.1.:  nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
       <idle>-0   [000] dN.4.:  sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
       <idle>-0   [000] .Ns1.:  softirq_entry: vec=7 [action=SCHED]
       <idle>-0   [000] .Ns1.:  softirq_exit: vec=7  [action=SCHED]
       <idle>-0   [000] d..2.:  sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
  ksoftirqd/0-16  [000] d..2.:  sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
       ...

Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.

Following are the observations with the changes when enabling the same
set of events:

       <idle>-0       [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
       <idle>-0       [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
       <idle>-0       [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]

No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v4..v5:

- Reworded the commit message as per the suggestion and dropped the
  comment above __raise_softirq_irqoff(). (Sebastian)

- Collected the Reviewed-by from Sebastian.
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 424c652a9ddc..e45f922e5829 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1239,7 +1239,7 @@ static void nohz_csd_func(void *info)
 	rq->idle_balance = idle_cpu(cpu);
 	if (rq->idle_balance) {
 		rq->nohz_idle_balance = flags;
-		raise_softirq_irqoff(SCHED_SOFTIRQ);
+		__raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
 }
 
-- 
2.34.1



* Re: [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs
  2024-11-19  5:44 [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs K Prateek Nayak
                   ` (3 preceding siblings ...)
  2024-11-19  5:44 ` [PATCH v5 4/4] sched/core: Prevent wakeup of ksoftirqd during idle load balance K Prateek Nayak
@ 2024-11-19 11:15 ` Peter Zijlstra
  4 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2024-11-19 11:15 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-kernel, linux-rt-devel, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Thomas Gleixner, Tejun Heo,
	NeilBrown, Jens Axboe, Frederic Weisbecker, Zqiang,
	Caleb Sander Mateos, Gautham R . Shenoy, Chen Yu, Julia Lawall

On Tue, Nov 19, 2024 at 05:44:28AM +0000, K Prateek Nayak wrote:
> K Prateek Nayak (4):
>   softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT
>     kernel
>   sched/core: Remove the unnecessary need_resched() check in
>     nohz_csd_func()
>   sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU
>     turning busy
>   sched/core: Prevent wakeup of ksoftirqd during idle load balance
> 

Thanks!, I'll stick them in a tree post -rc1.


* [tip: sched/core] sched/core: Prevent wakeup of ksoftirqd during idle load balance
  2024-11-19  5:44 ` [PATCH v5 4/4] sched/core: Prevent wakeup of ksoftirqd during idle load balance K Prateek Nayak
@ 2024-12-02 11:14   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2024-12-02 11:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Julia Lawall, Sebastian Andrzej Siewior, K Prateek Nayak,
	Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e932c4ab38f072ce5894b2851fea8bc5754bb8e5
Gitweb:        https://git.kernel.org/tip/e932c4ab38f072ce5894b2851fea8bc5754bb8e5
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Tue, 19 Nov 2024 05:44:32 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 02 Dec 2024 12:01:28 +01:00

sched/core: Prevent wakeup of ksoftirqd during idle load balance

The scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event
from the IPI handler on the idle CPU. If the SMP function is invoked
from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ
flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd
because soft interrupts are handled before ksoftirqd gets on the CPU.

Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:

       <idle>-0   [000] dN.1.:  nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
       <idle>-0   [000] dN.4.:  sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
       <idle>-0   [000] .Ns1.:  softirq_entry: vec=7 [action=SCHED]
       <idle>-0   [000] .Ns1.:  softirq_exit: vec=7  [action=SCHED]
       <idle>-0   [000] d..2.:  sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
  ksoftirqd/0-16  [000] d..2.:  sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
       ...

Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.

Following are the observations with the changes when enabling the same
set of events:

       <idle>-0       [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
       <idle>-0       [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
       <idle>-0       [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]

No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 803b238..c6d8232 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1285,7 +1285,7 @@ static void nohz_csd_func(void *info)
 	rq->idle_balance = idle_cpu(cpu);
 	if (rq->idle_balance) {
 		rq->nohz_idle_balance = flags;
-		raise_softirq_irqoff(SCHED_SOFTIRQ);
+		__raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: sched/core] sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
  2024-11-19  5:44 ` [PATCH v5 3/4] sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy K Prateek Nayak
@ 2024-12-02 11:14   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2024-12-02 11:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ff47a0acfcce309cf9e175149c75614491953c8f
Gitweb:        https://git.kernel.org/tip/ff47a0acfcce309cf9e175149c75614491953c8f
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Tue, 19 Nov 2024 05:44:31 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 02 Dec 2024 12:01:28 +01:00

sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy

Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the
TIF_NEED_RESCHED flag in idle task's thread info and relying on
flush_smp_call_function_queue() in idle exit path to run the
call-function. A softirq raised by the call-function is handled shortly
after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag
remains set and is only cleared later when schedule_idle() calls
__schedule().

The need_resched() check in _nohz_idle_balance() exists to bail out of
load balancing if another task has woken up on the CPU currently in
charge of idle load balancing, which is being processed in SCHED_SOFTIRQ
context. Since the optimization mentioned above overloads the
interpretation of TIF_NEED_RESCHED, check for idle_cpu() before the
existing need_resched() check, which can catch a genuine task wakeup on
an idle CPU processing SCHED_SOFTIRQ from
do_softirq_post_smp_call_flush(), as well as the case where ksoftirqd
needs to be preempted as a result of a new task wakeup or slice expiry.

In case of PREEMPT_RT or threadirqs, although the idle load balancing
may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd
is the only fair task going back to sleep will trigger a newidle balance
on the CPU, which can alleviate any remaining imbalance if the idle
load balance failed to do so.

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89..05b8f1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12568,7 +12568,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 		 * work being done for other CPUs. Next load
 		 * balancing owner will pick it up.
 		 */
-		if (need_resched()) {
+		if (!idle_cpu(this_cpu) && need_resched()) {
 			if (flags & NOHZ_STATS_KICK)
 				has_blocked_load = true;
 			if (flags & NOHZ_NEXT_KICK)

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: sched/core] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
  2024-11-19  5:44 ` [PATCH v5 2/4] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
@ 2024-12-02 11:14   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2024-12-02 11:14 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra, K Prateek Nayak, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ea9cffc0a154124821531991d5afdd7e8b20d7aa
Gitweb:        https://git.kernel.org/tip/ea9cffc0a154124821531991d5afdd7e8b20d7aa
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Tue, 19 Nov 2024 05:44:30 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 02 Dec 2024 12:01:28 +01:00

sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()

The need_resched() check currently in nohz_csd_func() can be traced
back to scheduler_ipi(), where it was added in 2011 via commit
ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")

Since then, it has travelled quite a bit, but it seems an idle_cpu()
check is now sufficient to detect the need to bail out of an idle load
balance. To justify this removal, consider all the following cases
where an idle load balance could race with a task wakeup:

o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
  on wakelist if wakee cpu is idle") a target perceived to be idle
  (target_rq->nr_running == 0) will return true for
  ttwu_queue_cond(target) which will offload the task wakeup to the idle
  target via an IPI.

  In all such cases target_rq->ttwu_pending will be set to 1 before
  queuing the wake function.

  If an idle load balance races here, following scenarios are possible:

  - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
    IPI is sent to the CPU to wake it out of idle. If nohz_csd_func()
    runs before sched_ttwu_pending(), the idle load balance will bail
    out because idle_cpu(target) returns 0 while
    target_rq->ttwu_pending is 1. If nohz_csd_func() runs after
    sched_ttwu_pending(), it will see rq->nr_running as non-zero and
    bail out of idle load balancing.

  - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
    the sender will simply set TIF_NEED_RESCHED for the target to put it
    out of idle and flush_smp_call_function_queue() in do_idle() will
    execute the call function. Depending on the ordering of the queuing
    of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
    nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
    target_rq->nr_running to be non-zero if there is a genuine task
    wakeup racing with the idle load balance kick.

o The waker CPU perceives the target CPU to be busy
  (target_rq->nr_running != 0) but the CPU is in fact going idle, and
  due to a series of unfortunate events, the system reaches a case
  where the waker CPU decides to perform the wakeup by itself in
  ttwu_queue() on the target CPU, but the target is concurrently
  selected for idle load balance (XXX: Can this happen? I'm not sure,
  but we'll consider the mother of all coincidences to estimate the
  worst case scenario).

  ttwu_do_activate() calls enqueue_task() which increments
  "rq->nr_running", post which it calls wakeup_preempt() which is
  responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
  setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
  thing to note in this case is that rq->nr_running is already non-zero
  in case of a wakeup before TIF_NEED_RESCHED is set, which leads to
  the idle_cpu() check returning false.

In all cases, the need_resched() check is unnecessary when checking
idle_cpu() first, since an impending wakeup racing with the idle load
balancer will either set "rq->ttwu_pending" or indicate a newly woken
task via "rq->nr_running".

Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the first iteration of Suresh's
patch from 2011 [1], where the condition to raise SCHED_SOFTIRQ was:

	sched_ttwu_do_pending(list);

	if (unlikely((rq->idle == current) &&
	    rq->nohz_balance_kick &&
	    !need_resched()))
		raise_softirq_irqoff(SCHED_SOFTIRQ);

Since the condition to raise SCHED_SOFTIRQ was preceded by
sched_ttwu_do_pending() (the equivalent of sched_ttwu_pending() in the
current upstream kernel), the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:

	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
		raise_softirq_irqoff(SCHED_SOFTIRQ);

where idle_cpu() seems to have replaced "rq->idle == current" check.

Even back then, the idle_cpu() check would have been sufficient to catch
a newly enqueued task. Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() when raising SCHED_SOFTIRQ,
based on Peter's suggestion.

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e4089..803b238 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1283,7 +1283,7 @@ static void nohz_csd_func(void *info)
 	WARN_ON(!(flags & NOHZ_KICK_MASK));
 
 	rq->idle_balance = idle_cpu(cpu);
-	if (rq->idle_balance && !need_resched()) {
+	if (rq->idle_balance) {
 		rq->nohz_idle_balance = flags;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: sched/core] softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel
  2024-11-19  5:44 ` [PATCH v5 1/4] softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel K Prateek Nayak
@ 2024-12-02 11:14   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2024-12-02 11:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     6675ce20046d149e1e1ffe7e9577947dee17aad5
Gitweb:        https://git.kernel.org/tip/6675ce20046d149e1e1ffe7e9577947dee17aad5
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Tue, 19 Nov 2024 05:44:29 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 02 Dec 2024 12:01:27 +01:00

softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel

do_softirq_post_smp_call_flush() on PREEMPT_RT kernels carries a
WARN_ON_ONCE() for any SOFTIRQ being raised from an SMP-call-function.
Since do_softirq_post_smp_call_flush() is called with preempt disabled,
raising a SOFTIRQ during flush_smp_call_function_queue() can lead to
longer preempt disabled sections.

Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") IPIs to an idle CPU in
TIF_POLLING_NRFLAG mode can be optimized out by instead setting
TIF_NEED_RESCHED bit in idle task's thread_info and relying on the
flush_smp_call_function_queue() in the idle-exit path to run the
SMP-call-function.

To trigger an idle load balance, the scheduler queues
nohz_csd_function() on a target nohz-idle CPU and sends an IPI. Only
now, this IPI is optimized out and the SMP-call-function is executed
from flush_smp_call_function_queue() in do_idle(), which can raise a
SCHED_SOFTIRQ to trigger the balancing.

So far, this went undetected since the need_resched() check in
nohz_csd_function() would make it bail out of idle load balancing early,
as the idle thread does not clear TIF_POLLING_NRFLAG before calling
flush_smp_call_function_queue(). The need_resched() check was added with
the intent to catch a new task wakeup; however, it has recently been
discovered to be unnecessary and will be removed in a subsequent
commit, after which nohz_csd_function() can raise a SCHED_SOFTIRQ from
flush_smp_call_function_queue() to trigger an idle load balance on an
idle target in TIF_POLLING_NRFLAG mode.

nohz_csd_function() bails out early if the idle_cpu() check for the
target CPU fails, and does not lock the target CPU's rq until the very
end, once it has found tasks to run on the CPU, so it will not inhibit
the wakeup, or the running, of a newly woken higher-priority task.
Account for this and prevent a WARN_ON_ONCE() when SCHED_SOFTIRQ is
raised from flush_smp_call_function_queue().

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-2-kprateek.nayak@amd.com
---
 kernel/softirq.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 8b41bd1..4dae6ac 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -280,17 +280,24 @@ static inline void invoke_softirq(void)
 		wakeup_softirqd();
 }
 
+#define SCHED_SOFTIRQ_MASK	BIT(SCHED_SOFTIRQ)
+
 /*
  * flush_smp_call_function_queue() can raise a soft interrupt in a function
- * call. On RT kernels this is undesired and the only known functionality
- * in the block layer which does this is disabled on RT. If soft interrupts
- * get raised which haven't been raised before the flush, warn so it can be
+ * call. On RT kernels this is undesired and the only known functionalities
+ * are in the block layer which is disabled on RT, and in the scheduler for
+ * idle load balancing. If soft interrupts get raised which haven't been
+ * raised before the flush, warn if it is not a SCHED_SOFTIRQ so it can be
  * investigated.
  */
 void do_softirq_post_smp_call_flush(unsigned int was_pending)
 {
-	if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
+	unsigned int is_pending = local_softirq_pending();
+
+	if (unlikely(was_pending != is_pending)) {
+		WARN_ON_ONCE(was_pending != (is_pending & ~SCHED_SOFTIRQ_MASK));
 		invoke_softirq();
+	}
 }
 
 #else /* CONFIG_PREEMPT_RT */

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-12-02 11:14 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-19  5:44 [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs K Prateek Nayak
2024-11-19  5:44 ` [PATCH v5 1/4] softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel K Prateek Nayak
2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2024-11-19  5:44 ` [PATCH v5 2/4] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() K Prateek Nayak
2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2024-11-19  5:44 ` [PATCH v5 3/4] sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy K Prateek Nayak
2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2024-11-19  5:44 ` [PATCH v5 4/4] sched/core: Prevent wakeup of ksoftirqd during idle load balance K Prateek Nayak
2024-12-02 11:14   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2024-11-19 11:15 ` [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox