* [PATCH 0/3] softirq: uncontroversial change
@ 2022-12-22 22:12 Jakub Kicinski
2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw)
To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski
Catching up on LWN I ran across the article about softirq
changes, and then I noticed fresh patches in Peter's tree.
So probably wise for me to throw these out there.
My (can I say Meta's?) problem is the opposite to what the RT
sensitive people complain about. In the current scheme once
ksoftirqd is woken no network processing happens until it runs.
When networking genuinely gets overloaded - that's probably fair; the problem
is that we confuse latency tweaks with overload protection. We have
a need_resched() check in the loop condition (which is a latency tweak).
Most often we defer to ksoftirqd because we're trying to be nice
and let user space respond quickly, not because there is an
overload. But user space may not be nice, and may sit on the CPU
for 10ms+. Also the sirq's "work allowance" is 2ms, which is
uncomfortably close to the timer tick, but that's another story.
We have a sirq latency tracker in our prod kernel which catches
8ms+ stalls of net Tx (packets queued to the NIC but there is
no NAPI cleanup within 8ms) and with these patches applied
on 5.19 fully loaded web machine sees a drop in stalls from
1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing
TCP retransmissions and ~10% drop in non-TLP incoming ones.
This is not a network-heavy workload so most of the rtx are
due to scheduling artifacts.
The network latency in a datacenter (around 10us) is a neat
1000x lower than the scheduling granularity.
These patches (patch 2 is "the meat") change what we recognize
as overload. Instead of just checking whether "ksoftirqd is woken",
they also cap how long we consider ourselves to be in overload,
with a time limit that differs based on whether we yield due
to real resource exhaustion vs just hitting that need_resched().
I hope the core concept is not entirely idiotic. It'd be great
if we could get this in or fold an equivalent concept into ongoing
work from others, because due to various "scheduler improvements"
every time we upgrade the production kernel this problem is getting
worse :(
Jakub Kicinski (3):
softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle()
softirq: avoid spurious stalls due to need_resched()
softirq: don't yield if only expedited handlers are pending
kernel/softirq.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
--
2.38.1
^ permalink raw reply [flat|nested] 38+ messages in thread* [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski @ 2022-12-22 22:12 ` Jakub Kicinski 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski ` (2 subsequent siblings) 3 siblings, 0 replies; 38+ messages in thread From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski ksoftirqd_running() takes the high priority softirqs into consideration, so ksoftirqd_should_handle() seems like a better name. Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- kernel/softirq.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index c8a6913c067d..00b838d566c1 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -86,7 +86,7 @@ static void wakeup_softirqd(void) * unless we're doing some of the synchronous softirqs. 
*/ #define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ)) -static bool ksoftirqd_running(unsigned long pending) +static bool ksoftirqd_should_handle(unsigned long pending) { struct task_struct *tsk = __this_cpu_read(ksoftirqd); @@ -236,7 +236,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt) goto out; pending = local_softirq_pending(); - if (!pending || ksoftirqd_running(pending)) + if (!pending || ksoftirqd_should_handle(pending)) goto out; /* @@ -432,7 +432,7 @@ static inline bool should_wake_ksoftirqd(void) static inline void invoke_softirq(void) { - if (ksoftirqd_running(local_softirq_pending())) + if (ksoftirqd_should_handle(local_softirq_pending())) return; if (!force_irqthreads() || !__this_cpu_read(ksoftirqd)) { @@ -468,7 +468,7 @@ asmlinkage __visible void do_softirq(void) pending = local_softirq_pending(); - if (pending && !ksoftirqd_running(pending)) + if (pending && !ksoftirqd_should_handle(pending)) do_softirq_own_stack(); local_irq_restore(flags); -- 2.38.1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski 2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski @ 2022-12-22 22:12 ` Jakub Kicinski 2023-01-31 22:32 ` Jakub Kicinski 2023-03-03 13:30 ` Thomas Gleixner 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni 3 siblings, 2 replies; 38+ messages in thread From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski need_resched() added in commit c10d73671ad3 ("softirq: reduce latencies") does improve latency for real workloads (for example memcache). Unfortunately it triggers quite often even for non-network-heavy apps (~900 times a second on a loaded webserver), and in a small fraction of cases whatever the scheduler decided to run will hold onto the CPU for the entire time slice. 10ms+ stalls on a machine which is not actually under overload cause erratic network behavior and spurious TCP retransmits. Typical end-to-end latency in a datacenter is < 200us so it's common to set TCP timeout to 10ms or less. The intent of the need_resched() is to let a low latency application respond quickly and yield (to ksoftirqd). Put a time limit on this dance. Ignore the fact that ksoftirqd is RUNNING if we were trying to be nice and the application did not yield quickly. On a webserver loaded at 90% CPU this change reduces the number of 8ms+ stalls the network softirq processing sees by around 10x (2/sec -> 0.2/sec). It also seems to reduce retransmissions by ~10% but the data is quite noisy.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- kernel/softirq.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index 00b838d566c1..ad200d386ec1 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -59,6 +59,7 @@ EXPORT_PER_CPU_SYMBOL(irq_stat); static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp; DEFINE_PER_CPU(struct task_struct *, ksoftirqd); +static DEFINE_PER_CPU(unsigned long, overload_limit); const char * const softirq_to_name[NR_SOFTIRQS] = { "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL", @@ -89,10 +90,15 @@ static void wakeup_softirqd(void) static bool ksoftirqd_should_handle(unsigned long pending) { struct task_struct *tsk = __this_cpu_read(ksoftirqd); + unsigned long ov_limit; if (pending & SOFTIRQ_NOW_MASK) return false; - return tsk && task_is_running(tsk) && !__kthread_should_park(tsk); + if (likely(!tsk || !task_is_running(tsk) || __kthread_should_park(tsk))) + return false; + + ov_limit = __this_cpu_read(overload_limit); + return time_is_after_jiffies(ov_limit); } #ifdef CONFIG_TRACE_IRQFLAGS @@ -492,6 +498,9 @@ asmlinkage __visible void do_softirq(void) #define MAX_SOFTIRQ_TIME msecs_to_jiffies(2) #define MAX_SOFTIRQ_RESTART 10 +#define SOFTIRQ_OVERLOAD_TIME msecs_to_jiffies(100) +#define SOFTIRQ_DEFER_TIME msecs_to_jiffies(2) + #ifdef CONFIG_TRACE_IRQFLAGS /* * When we run softirqs from irq_exit() and thus on the hardirq stack we need @@ -588,10 +597,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) pending = local_softirq_pending(); if (pending) { - if (time_before(jiffies, end) && !need_resched() && - --max_restart) + unsigned long limit; + + if (time_is_before_eq_jiffies(end) || !--max_restart) + limit = SOFTIRQ_OVERLOAD_TIME; + else if (need_resched()) + limit = SOFTIRQ_DEFER_TIME; + else goto restart; + __this_cpu_write(overload_limit, jiffies + limit); wakeup_softirqd(); } -- 2.38.1 ^ permalink raw reply 
related [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski @ 2023-01-31 22:32 ` Jakub Kicinski 2023-03-03 13:30 ` Thomas Gleixner 1 sibling, 0 replies; 38+ messages in thread From: Jakub Kicinski @ 2023-01-31 22:32 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel On Thu, 22 Dec 2022 14:12:43 -0800 Jakub Kicinski wrote: > need_resched() added in commit c10d73671ad3 ("softirq: reduce latencies") > does improve latency for real workloads (for example memcache). > Unfortunately it triggers quite often even for non-network-heavy apps > (~900 times a second on a loaded webserver), and in small fraction of > cases whatever the scheduler decided to run will hold onto the CPU > for the entire time slice. > > 10ms+ stalls on a machine which is not actually under overload cause > erratic network behavior and spurious TCP retransmits. Typical end-to-end > latency in a datacenter is < 200us so its common to set TCP timeout > to 10ms or less. > > The intent of the need_resched() is to let a low latency application > respond quickly and yield (to ksoftirqd). Put a time limit on this dance. > Ignore the fact that ksoftirqd is RUNNING if we were trying to be nice > and the application did not yield quickly. > > On a webserver loaded at 90% CPU this change reduces the numer of 8ms+ > stalls the network softirq processing sees by around 10x (2/sec -> 0.2/sec). > It also seems to reduce retransmissions by ~10% but the data is quite > noisy. Peter, is there a chance you could fold this patch into your ongoing softirq rework? We can't both work on softirq in parallel, unfortunately and this improvement is really key to counter balance whatever heuristics CFS accumulated between 5.12 and 5.19 :( Not to use the "r-word". I can spin a version of this on top of your core/softirq branch, would that work? 
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski 2023-01-31 22:32 ` Jakub Kicinski @ 2023-03-03 13:30 ` Thomas Gleixner 2023-03-03 15:18 ` Thomas Gleixner 2023-03-03 21:31 ` Jakub Kicinski 1 sibling, 2 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 13:30 UTC (permalink / raw) To: Jakub Kicinski, peterz Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski Jakub! On Thu, Dec 22 2022 at 14:12, Jakub Kicinski wrote: > DEFINE_PER_CPU(struct task_struct *, ksoftirqd); > +static DEFINE_PER_CPU(unsigned long, overload_limit); > > const char * const softirq_to_name[NR_SOFTIRQS] = { > "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL", > @@ -89,10 +90,15 @@ static void wakeup_softirqd(void) > static bool ksoftirqd_should_handle(unsigned long pending) > { > struct task_struct *tsk = __this_cpu_read(ksoftirqd); > + unsigned long ov_limit; > > if (pending & SOFTIRQ_NOW_MASK) > return false; > - return tsk && task_is_running(tsk) && !__kthread_should_park(tsk); > + if (likely(!tsk || !task_is_running(tsk) || __kthread_should_park(tsk))) > + return false; > + > + ov_limit = __this_cpu_read(overload_limit); > + return time_is_after_jiffies(ov_limit); return time_is_after_jiffies(__this_cpu_read(overload_limit)); Plus a comment explaining the magic, please. 
> } > > #ifdef CONFIG_TRACE_IRQFLAGS > @@ -492,6 +498,9 @@ asmlinkage __visible void do_softirq(void) > #define MAX_SOFTIRQ_TIME msecs_to_jiffies(2) > #define MAX_SOFTIRQ_RESTART 10 > > +#define SOFTIRQ_OVERLOAD_TIME msecs_to_jiffies(100) > +#define SOFTIRQ_DEFER_TIME msecs_to_jiffies(2) > + > #ifdef CONFIG_TRACE_IRQFLAGS > /* > * When we run softirqs from irq_exit() and thus on the hardirq stack we need > @@ -588,10 +597,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) > > pending = local_softirq_pending(); > if (pending) { > - if (time_before(jiffies, end) && !need_resched() && > - --max_restart) > + unsigned long limit; > + > + if (time_is_before_eq_jiffies(end) || !--max_restart) > + limit = SOFTIRQ_OVERLOAD_TIME; > + else if (need_resched()) > + limit = SOFTIRQ_DEFER_TIME; > + else > goto restart; > > + __this_cpu_write(overload_limit, jiffies + limit); The logic of all this is non-obvious and I had to reread it 5 times to conclude that it is matching the intent. Please add comments. While I'm not a big fan of heuristical duct tape, this looks harmless enough to not end up in an endless stream of tweaking. Famous last words... But without the sched_clock() changes the actual defer time depends on HZ and the point in time where limit is set. That means it ranges from 0 to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in the worst case, which perhaps explains the 8ms+ stalls you are still observing. Can you test with that sched_clock change applied, i.e. the first two commits from git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq 59be25c466d9 ("softirq: Use sched_clock() based timeout") bd5a5bd77009 ("softirq: Rewrite softirq processing loop") whether that makes a difference? Those two can be applied with some minor polishing. The rest of that series is broken by f10020c97f4c ("softirq: Allow early break"). There is another issue with this overload limit. 
Assume max_restart or timeout triggered and limit was set to now + 100ms. ksoftirqd runs and gets the issue resolved after 10ms. So for the remaining 90ms any invocation of raise_softirq() outside of (soft)interrupt context, which wakes ksoftirqd again, prevents processing on return from interrupt until ksoftirqd gets on the CPU and goes back to sleep, because task_is_running() == true and the stale limit is not after jiffies. Probably not a big issue, but someone will notice on some weird workload sooner than later and the tweaking will start nevertheless. :) So maybe we fix it right away. :) Thanks, tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 13:30 ` Thomas Gleixner @ 2023-03-03 15:18 ` Thomas Gleixner 2023-03-03 21:31 ` Jakub Kicinski 1 sibling, 0 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 15:18 UTC (permalink / raw) To: Jakub Kicinski, peterz Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski Jakub! On Fri, Mar 03 2023 at 14:30, Thomas Gleixner wrote: > On Thu, Dec 22 2022 at 14:12, Jakub Kicinski wrote: > But without the sched_clock() changes the actual defer time depends on > HZ and the point in time where limit is set. That means it ranges from 0 > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > the worst case, which perhaps explains the 8ms+ stalls you are still > observing. Can you test with that sched_clock change applied, i.e. the > first two commits from > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > whether that makes a difference? Those two can be applied with some > minor polishing. The rest of that series is broken by f10020c97f4c > ("softirq: Allow early break"). WHile staring I noticed that the current jiffies based time limit handling has the exact same problem. For HZ=100 and HZ=250 MAX_SOFTIRQ_TIME resolves to 1 jiffy. So the window is between 0 and 1/HZ. Not really useful. Thanks, tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 13:30 ` Thomas Gleixner 2023-03-03 15:18 ` Thomas Gleixner @ 2023-03-03 21:31 ` Jakub Kicinski 2023-03-03 22:37 ` Paul E. McKenney 2023-03-05 20:43 ` Thomas Gleixner 1 sibling, 2 replies; 38+ messages in thread From: Jakub Kicinski @ 2023-03-03 21:31 UTC (permalink / raw) To: Thomas Gleixner Cc: peterz, jstultz, edumazet, netdev, linux-kernel, Paul E. McKenney On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > - if (time_before(jiffies, end) && !need_resched() && > > - --max_restart) > > + unsigned long limit; > > + > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > + limit = SOFTIRQ_OVERLOAD_TIME; > > + else if (need_resched()) > > + limit = SOFTIRQ_DEFER_TIME; > > + else > > goto restart; > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > The logic of all this is non-obvious and I had to reread it 5 times to > conclude that it is matching the intent. Please add comments. > > While I'm not a big fan of heuristical duct tape, this looks harmless > enough to not end up in an endless stream of tweaking. Famous last > words... Would it all be more readable if I named the "overload_limit" "overloaded_until" instead? Naming.. I'll add comments, too. > But without the sched_clock() changes the actual defer time depends on > HZ and the point in time where limit is set. That means it ranges from 0 > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > the worst case, which perhaps explains the 8ms+ stalls you are still > observing. Can you test with that sched_clock change applied, i.e. 
the > first two commits from > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") Those will help, but I spent some time digging into the jiffies related warts with kprobes - while annoying they weren't a major source of wake ups. (FWIW the jiffies noise on our workloads is due to cgroup stats disabling IRQs for multiple ms on the timekeeping CPU). Here are fresh stats on why we wake up ksoftirqd on our Web workload (collected over 100 sec): Time exceeded: 484 Loop max run out: 6525 need_resched(): 10219 (control: 17226 - number of times wakeup_process called for ksirqd) As you can see need_resched() dominates. Zooming into the time exceeded - we can count nanoseconds between __do_softirq starting and the check. This is the histogram of actual usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): [256, 512) 1 | | [512, 1K) 0 | | [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| So yes, we can probably save ourselves ~200 wakeup with a better clock but that's just 1.3% of the total wake ups :( Now - now about the max loop count. I ORed the pending softirqs every time we get to the end of the loop. Looks like vast majority of the loop counter wake ups are exclusively due to RCU: @looped[512]: 5516 Where 512 is the ORed pending mask over all iterations 512 == 1 << RCU_SOFTIRQ. And they usually take less than 100us to consume the 10 iterations. Histogram of usecs consumed when we run out of loop iterations: [16, 32) 3 | | [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64, 128) 871 |@@@@@@@@@ | [128, 256) 34 | | [256, 512) 9 | | [512, 1K) 262 |@@ | [1K, 2K) 35 | | [2K, 4K) 1 | | Paul, is this expected? Is RCU not trying too hard to be nice? 
# cat /sys/module/rcutree/parameters/blimit 10 Or should we perhaps just raise the loop limit? Breaking after less than 100usec seems excessive :( > whether that makes a difference? Those two can be applied with some > minor polishing. The rest of that series is broken by f10020c97f4c > ("softirq: Allow early break"). > > There is another issue with this overload limit. Assume max_restart or > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > gets the issue resolved after 10ms. > > So for the remaining 90ms any invocation of raise_softirq() outside of > (soft)interrupt context, which wakes ksoftirqd again, prevents > processing on return from interrupt until ksoftirqd gets on the CPU and > goes back to sleep, because task_is_running() == true and the stale > limit is not after jiffies. > > Probably not a big issue, but someone will notice on some weird workload > sooner than later and the tweaking will start nevertheless. :) So maybe > we fix it right away. :) Hm, Paolo raised this point as well, but the overload time is strictly to stop paying attention to the fact ksoftirqd is running. IOW current kernels behave as if they had overload_limit of infinity. The current code already prevents processing until ksoftirqd schedules in, after raise_softirq() from a funky context. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 21:31 ` Jakub Kicinski @ 2023-03-03 22:37 ` Paul E. McKenney 2023-03-03 23:25 ` Dave Taht 2023-03-03 23:36 ` Paul E. McKenney 2023-03-05 20:43 ` Thomas Gleixner 1 sibling, 2 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-03 22:37 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > - if (time_before(jiffies, end) && !need_resched() && > > > - --max_restart) > > > + unsigned long limit; > > > + > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > + else if (need_resched()) > > > + limit = SOFTIRQ_DEFER_TIME; > > > + else > > > goto restart; > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > conclude that it is matching the intent. Please add comments. > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > enough to not end up in an endless stream of tweaking. Famous last > > words... > > Would it all be more readable if I named the "overload_limit" > "overloaded_until" instead? Naming.. > I'll add comments, too. > > > But without the sched_clock() changes the actual defer time depends on > > HZ and the point in time where limit is set. That means it ranges from 0 > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > the worst case, which perhaps explains the 8ms+ stalls you are still > > observing. Can you test with that sched_clock change applied, i.e. 
the > > first two commits from > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > Those will help, but I spent some time digging into the jiffies related > warts with kprobes - while annoying they weren't a major source of wake > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > disabling IRQs for multiple ms on the timekeeping CPU). > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > (collected over 100 sec): > > Time exceeded: 484 > Loop max run out: 6525 > need_resched(): 10219 > (control: 17226 - number of times wakeup_process called for ksirqd) > > As you can see need_resched() dominates. > > Zooming into the time exceeded - we can count nanoseconds between > __do_softirq starting and the check. This is the histogram of actual > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > [256, 512) 1 | | > [512, 1K) 0 | | > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > So yes, we can probably save ourselves ~200 wakeup with a better clock > but that's just 1.3% of the total wake ups :( > > > Now - now about the max loop count. I ORed the pending softirqs every > time we get to the end of the loop. Looks like vast majority of the > loop counter wake ups are exclusively due to RCU: > > @looped[512]: 5516 > > Where 512 is the ORed pending mask over all iterations > 512 == 1 << RCU_SOFTIRQ. > > And they usually take less than 100us to consume the 10 iterations. > Histogram of usecs consumed when we run out of loop iterations: > > [16, 32) 3 | | > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [64, 128) 871 |@@@@@@@@@ | > [128, 256) 34 | | > [256, 512) 9 | | > [512, 1K) 262 |@@ | > [1K, 2K) 35 | | > [2K, 4K) 1 | | > > Paul, is this expected? 
Is RCU not trying too hard to be nice? This is from way back in the day, so it is quite possible that better tuning and/or better heuristics should be applied. On the other hand, 100 microseconds is a good long time from an CONFIG_PREEMPT_RT=y perspective! > # cat /sys/module/rcutree/parameters/blimit > 10 > > Or should we perhaps just raise the loop limit? Breaking after less > than 100usec seems excessive :( But note that RCU also has rcutree.rcu_divisor, which defaults to 7. And an rcutree.rcu_resched_ns, which defaults to three milliseconds (3,000,000 nanoseconds). This means that RCU will do: o All the callbacks if there are less than ten. o Ten callbacks or 1/128th of them, whichever is larger. o Unless the larger of them is more than 100 callbacks, in which case there is an additional limit of three milliseconds worth of them. Except that if a given CPU ends up with more than 10,000 callbacks (rcutree.qhimark), that CPU's blimit is set to 10,000. So there is much opportunity to tune the existing heuristics and also much opportunity to tweak the heuristics themselves. But let's see a good use case before tweaking, please. ;-) Thanx, Paul > > whether that makes a difference? Those two can be applied with some > > minor polishing. The rest of that series is broken by f10020c97f4c > > ("softirq: Allow early break"). > > > > There is another issue with this overload limit. Assume max_restart or > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > > gets the issue resolved after 10ms. > > > > So for the remaining 90ms any invocation of raise_softirq() outside of > > (soft)interrupt context, which wakes ksoftirqd again, prevents > > processing on return from interrupt until ksoftirqd gets on the CPU and > > goes back to sleep, because task_is_running() == true and the stale > > limit is not after jiffies. 
> > > > Probably not a big issue, but someone will notice on some weird workload > > sooner than later and the tweaking will start nevertheless. :) So maybe > > we fix it right away. :) > > Hm, Paolo raised this point as well, but the overload time is strictly > to stop paying attention to the fact ksoftirqd is running. > IOW current kernels behave as if they had overload_limit of infinity. > > The current code already prevents processing until ksoftirqd schedules > in, after raise_softirq() from a funky context. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 22:37 ` Paul E. McKenney @ 2023-03-03 23:25 ` Dave Taht 2023-03-04 1:14 ` Paul E. McKenney 2023-03-03 23:36 ` Paul E. McKenney 1 sibling, 1 reply; 38+ messages in thread From: Dave Taht @ 2023-03-03 23:25 UTC (permalink / raw) To: paulmck Cc: Jakub Kicinski, Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 3, 2023 at 2:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > > - if (time_before(jiffies, end) && !need_resched() && > > > > - --max_restart) > > > > + unsigned long limit; > > > > + > > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > > + else if (need_resched()) > > > > + limit = SOFTIRQ_DEFER_TIME; > > > > + else > > > > goto restart; > > > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > > conclude that it is matching the intent. Please add comments. > > > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > > enough to not end up in an endless stream of tweaking. Famous last > > > words... > > > > Would it all be more readable if I named the "overload_limit" > > "overloaded_until" instead? Naming.. > > I'll add comments, too. > > > > > But without the sched_clock() changes the actual defer time depends on > > > HZ and the point in time where limit is set. That means it ranges from 0 > > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > > the worst case, which perhaps explains the 8ms+ stalls you are still > > > observing. Can you test with that sched_clock change applied, i.e. 
the > > > first two commits from > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > > > Those will help, but I spent some time digging into the jiffies related > > warts with kprobes - while annoying they weren't a major source of wake > > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > > disabling IRQs for multiple ms on the timekeeping CPU). > > > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > > (collected over 100 sec): > > > > Time exceeded: 484 > > Loop max run out: 6525 > > need_resched(): 10219 > > (control: 17226 - number of times wakeup_process called for ksirqd) > > > > As you can see need_resched() dominates. > > > > Zooming into the time exceeded - we can count nanoseconds between > > __do_softirq starting and the check. This is the histogram of actual > > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > > > [256, 512) 1 | | > > [512, 1K) 0 | | > > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > So yes, we can probably save ourselves ~200 wakeup with a better clock > > but that's just 1.3% of the total wake ups :( > > > > > > Now - now about the max loop count. I ORed the pending softirqs every > > time we get to the end of the loop. Looks like vast majority of the > > loop counter wake ups are exclusively due to RCU: > > > > @looped[512]: 5516 > > > > Where 512 is the ORed pending mask over all iterations > > 512 == 1 << RCU_SOFTIRQ. > > > > And they usually take less than 100us to consume the 10 iterations. 
> > Histogram of usecs consumed when we run out of loop iterations: > > > > [16, 32) 3 | | > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > [64, 128) 871 |@@@@@@@@@ | > > [128, 256) 34 | | > > [256, 512) 9 | | > > [512, 1K) 262 |@@ | > > [1K, 2K) 35 | | > > [2K, 4K) 1 | | > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > This is from way back in the day, so it is quite possible that better > tuning and/or better heuristics should be applied. > > On the other hand, 100 microseconds is a good long time from an > CONFIG_PREEMPT_RT=y perspective! All I have to add to this conversation is the observation that sampling things at the nyquist rate helps to observe problems like these. So if you care about sub 8ms response time, a sub 4ms sampling rate is needed. > > # cat /sys/module/rcutree/parameters/blimit > > 10 > > > > Or should we perhaps just raise the loop limit? Breaking after less > > than 100usec seems excessive :( > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > (3,000,000 nanoseconds). This means that RCU will do: > > o All the callbacks if there are less than ten. > > o Ten callbacks or 1/128th of them, whichever is larger. > > o Unless the larger of them is more than 100 callbacks, in which > case there is an additional limit of three milliseconds worth > of them. > > Except that if a given CPU ends up with more than 10,000 callbacks > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > So there is much opportunity to tune the existing heuristics and also > much opportunity to tweak the heuristics themselves. This I did not know, and to best observe rcu in action nyquist is 1.5ms... Something with less constants and more curves seems in order. > > But let's see a good use case before tweaking, please. ;-) > > Thanx, Paul > > > > whether that makes a difference? 
Those two can be applied with some > > > minor polishing. The rest of that series is broken by f10020c97f4c > > > ("softirq: Allow early break"). > > > > > > There is another issue with this overload limit. Assume max_restart or > > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > > > gets the issue resolved after 10ms. > > > > > > So for the remaining 90ms any invocation of raise_softirq() outside of > > > (soft)interrupt context, which wakes ksoftirqd again, prevents > > > processing on return from interrupt until ksoftirqd gets on the CPU and > > > goes back to sleep, because task_is_running() == true and the stale > > > limit is not after jiffies. > > > > > > Probably not a big issue, but someone will notice on some weird workload > > > sooner than later and the tweaking will start nevertheless. :) So maybe > > > we fix it right away. :) > > > > Hm, Paolo raised this point as well, but the overload time is strictly > > to stop paying attention to the fact ksoftirqd is running. > > IOW current kernels behave as if they had overload_limit of infinity. > > > > The current code already prevents processing until ksoftirqd schedules > > in, after raise_softirq() from a funky context. -- A pithy note on VOQs vs SQM: https://blog.cerowrt.org/post/juniper/ Dave Täht CEO, TekLibre, LLC ^ permalink raw reply [flat|nested] 38+ messages in thread
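For readers decoding the BPF stats quoted above: the ORed pending mask `@looped[512]` is bit 9, the RCU softirq. A minimal sketch of the vector numbering, following include/linux/interrupt.h:

```c
#include <assert.h>

/* Softirq vector numbering as in include/linux/interrupt.h; the ORed
 * pending mask @looped[512] therefore decodes to 1 << RCU_SOFTIRQ. */
enum softirq_nr {
	HI_SOFTIRQ = 0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	IRQ_POLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,	/* bit 9: 1 << 9 == 512 */
	NR_SOFTIRQS
};

/* Return nonzero if the given vector is the only one pending in mask. */
static inline int only_pending(unsigned int mask, enum softirq_nr nr)
{
	return mask == (1u << nr);
}
```

So a mask of 512 over all iterations means RCU was the sole vector still pending each time the loop limit was hit.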
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 23:25 ` Dave Taht @ 2023-03-04 1:14 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 1:14 UTC (permalink / raw) To: Dave Taht Cc: Jakub Kicinski, Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 03:25:32PM -0800, Dave Taht wrote: > On Fri, Mar 3, 2023 at 2:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > > > - if (time_before(jiffies, end) && !need_resched() && > > > > > - --max_restart) > > > > > + unsigned long limit; > > > > > + > > > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > > > + else if (need_resched()) > > > > > + limit = SOFTIRQ_DEFER_TIME; > > > > > + else > > > > > goto restart; > > > > > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > > > conclude that it is matching the intent. Please add comments. > > > > > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > > > enough to not end up in an endless stream of tweaking. Famous last > > > > words... > > > > > > Would it all be more readable if I named the "overload_limit" > > > "overloaded_until" instead? Naming.. > > > I'll add comments, too. > > > > > > > But without the sched_clock() changes the actual defer time depends on > > > > HZ and the point in time where limit is set. That means it ranges from 0 > > > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > > > the worst case, which perhaps explains the 8ms+ stalls you are still > > > > observing. Can you test with that sched_clock change applied, i.e. 
the > > > > first two commits from > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > > > > > Those will help, but I spent some time digging into the jiffies related > > > warts with kprobes - while annoying they weren't a major source of wake > > > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > > > disabling IRQs for multiple ms on the timekeeping CPU). > > > > > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > > > (collected over 100 sec): > > > > > > Time exceeded: 484 > > > Loop max run out: 6525 > > > need_resched(): 10219 > > > (control: 17226 - number of times wakeup_process called for ksirqd) > > > > > > As you can see need_resched() dominates. > > > > > > Zooming into the time exceeded - we can count nanoseconds between > > > __do_softirq starting and the check. This is the histogram of actual > > > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > > > > > [256, 512) 1 | | > > > [512, 1K) 0 | | > > > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > > > So yes, we can probably save ourselves ~200 wakeup with a better clock > > > but that's just 1.3% of the total wake ups :( > > > > > > > > > Now - now about the max loop count. I ORed the pending softirqs every > > > time we get to the end of the loop. Looks like vast majority of the > > > loop counter wake ups are exclusively due to RCU: > > > > > > @looped[512]: 5516 > > > > > > Where 512 is the ORed pending mask over all iterations > > > 512 == 1 << RCU_SOFTIRQ. > > > > > > And they usually take less than 100us to consume the 10 iterations. 
> > > Histogram of usecs consumed when we run out of loop iterations: > > > > > > [16, 32) 3 | | > > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > [64, 128) 871 |@@@@@@@@@ | > > > [128, 256) 34 | | > > > [256, 512) 9 | | > > > [512, 1K) 262 |@@ | > > > [1K, 2K) 35 | | > > > [2K, 4K) 1 | | > > > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > > > This is from way back in the day, so it is quite possible that better > > tuning and/or better heuristics should be applied. > > > > On the other hand, 100 microseconds is a good long time from an > > CONFIG_PREEMPT_RT=y perspective! > > All I have to add to this conversation is the observation that > sampling things at the > nyquist rate helps to observe problems like these. > > So if you care about sub 8ms response time, a sub 4ms sampling rate is needed. My guess is that Jakub is side-stepping Nyquist by sampling every call to and return from the rcu_do_batch() function. > > > # cat /sys/module/rcutree/parameters/blimit > > > 10 > > > > > > Or should we perhaps just raise the loop limit? Breaking after less > > > than 100usec seems excessive :( > > > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > > (3,000,000 nanoseconds). This means that RCU will do: > > > > o All the callbacks if there are less than ten. > > > > o Ten callbacks or 1/128th of them, whichever is larger. > > > > o Unless the larger of them is more than 100 callbacks, in which > > case there is an additional limit of three milliseconds worth > > of them. > > > > Except that if a given CPU ends up with more than 10,000 callbacks > > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > > > So there is much opportunity to tune the existing heuristics and also > > much opportunity to tweak the heuristics themselves. > > This I did not know, and to best observe rcu in action nyquist is 1.5ms... 
This is not an oscillator, and because this all happens within a single
system, you cannot hang your hat on speed-of-light delays.  In addition,
an application can dump thousands of callbacks down RCU's throat in a
very short time, which changes RCU's timing.

Also, the time constants for expedited grace periods are typically in
the tens of microseconds.  Something about prioritizing survivability
over measurability.  ;-)

But that is OK because ftrace and BPF can provide fine-grained
measurements quite cheaply.

> Something with less constants and more curves seems in order.

In the immortal words of MS-DOS, are you sure?

							Thanx, Paul

> > But let's see a good use case before tweaking, please.  ;-)
> >
> > 							Thanx, Paul
> >
> > > > whether that makes a difference? Those two can be applied with some
> > > > minor polishing. The rest of that series is broken by f10020c97f4c
> > > > ("softirq: Allow early break").
> > > >
> > > > There is another issue with this overload limit. Assume max_restart or
> > > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and
> > > > gets the issue resolved after 10ms.
> > > >
> > > > So for the remaining 90ms any invocation of raise_softirq() outside of
> > > > (soft)interrupt context, which wakes ksoftirqd again, prevents
> > > > processing on return from interrupt until ksoftirqd gets on the CPU and
> > > > goes back to sleep, because task_is_running() == true and the stale
> > > > limit is not after jiffies.
> > > >
> > > > Probably not a big issue, but someone will notice on some weird workload
> > > > sooner than later and the tweaking will start nevertheless. :) So maybe
> > > > we fix it right away. :)
> > >
> > > Hm, Paolo raised this point as well, but the overload time is strictly
> > > to stop paying attention to the fact ksoftirqd is running.
> > > IOW current kernels behave as if they had overload_limit of infinity.
> > > The current code already prevents processing until ksoftirqd schedules
> > > in, after raise_softirq() from a funky context.
>
> --
> A pithy note on VOQs vs SQM: https://blog.cerowrt.org/post/juniper/
> Dave Täht CEO, TekLibre, LLC
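The end-of-batch decision from the patch quoted in this message can be sketched as a standalone model. The millisecond clock and the 100ms/2ms constants here are illustrative stand-ins for jiffies and the SOFTIRQ_OVERLOAD_TIME / SOFTIRQ_DEFER_TIME values discussed in the thread; the real code uses `time_is_before_eq_jiffies()` and a per-CPU `overload_limit` variable.

```c
#include <assert.h>
#include <stdbool.h>

#define OVERLOAD_MS 100	/* assumed: long "stop trusting ksoftirqd" window */
#define DEFER_MS      2	/* assumed: short defer for need_resched() */

enum batch_end { RESTART, DEFER, OVERLOAD };

/* Decide what to do at the end of one __do_softirq() batch.
 * budget_exceeded: the time budget ran out
 * max_restart:     remaining loop iterations (starts at 10)
 * resched:         a task is waiting for this CPU (need_resched()) */
static enum batch_end
softirq_batch_end(bool budget_exceeded, int *max_restart, bool resched,
		  unsigned long now_ms, unsigned long *overloaded_until)
{
	if (budget_exceeded || !--(*max_restart)) {
		/* Genuine overload: back off for a long time. */
		*overloaded_until = now_ms + OVERLOAD_MS;
		return OVERLOAD;
	}
	if (resched) {
		/* Latency tweak, not overload: defer only briefly. */
		*overloaded_until = now_ms + DEFER_MS;
		return DEFER;
	}
	return RESTART;		/* keep processing softirqs inline */
}
```

The asymmetry is the point of the patch: `need_resched()` only buys user space a short window, while a genuinely exhausted budget cedes the CPU to ksoftirqd for much longer.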
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 22:37 ` Paul E. McKenney 2023-03-03 23:25 ` Dave Taht @ 2023-03-03 23:36 ` Paul E. McKenney 2023-03-03 23:44 ` Jakub Kicinski 1 sibling, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-03 23:36 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 02:37:39PM -0800, Paul E. McKenney wrote: > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > > - if (time_before(jiffies, end) && !need_resched() && > > > > - --max_restart) > > > > + unsigned long limit; > > > > + > > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > > + else if (need_resched()) > > > > + limit = SOFTIRQ_DEFER_TIME; > > > > + else > > > > goto restart; > > > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > > conclude that it is matching the intent. Please add comments. > > > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > > enough to not end up in an endless stream of tweaking. Famous last > > > words... > > > > Would it all be more readable if I named the "overload_limit" > > "overloaded_until" instead? Naming.. > > I'll add comments, too. > > > > > But without the sched_clock() changes the actual defer time depends on > > > HZ and the point in time where limit is set. That means it ranges from 0 > > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > > the worst case, which perhaps explains the 8ms+ stalls you are still > > > observing. Can you test with that sched_clock change applied, i.e. 
the > > > first two commits from > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > > > Those will help, but I spent some time digging into the jiffies related > > warts with kprobes - while annoying they weren't a major source of wake > > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > > disabling IRQs for multiple ms on the timekeeping CPU). > > > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > > (collected over 100 sec): > > > > Time exceeded: 484 > > Loop max run out: 6525 > > need_resched(): 10219 > > (control: 17226 - number of times wakeup_process called for ksirqd) > > > > As you can see need_resched() dominates. > > > > Zooming into the time exceeded - we can count nanoseconds between > > __do_softirq starting and the check. This is the histogram of actual > > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > > > [256, 512) 1 | | > > [512, 1K) 0 | | > > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > So yes, we can probably save ourselves ~200 wakeup with a better clock > > but that's just 1.3% of the total wake ups :( > > > > > > Now - now about the max loop count. I ORed the pending softirqs every > > time we get to the end of the loop. Looks like vast majority of the > > loop counter wake ups are exclusively due to RCU: > > > > @looped[512]: 5516 > > > > Where 512 is the ORed pending mask over all iterations > > 512 == 1 << RCU_SOFTIRQ. > > > > And they usually take less than 100us to consume the 10 iterations. 
> > Histogram of usecs consumed when we run out of loop iterations: > > > > [16, 32) 3 | | > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > [64, 128) 871 |@@@@@@@@@ | > > [128, 256) 34 | | > > [256, 512) 9 | | > > [512, 1K) 262 |@@ | > > [1K, 2K) 35 | | > > [2K, 4K) 1 | | > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > This is from way back in the day, so it is quite possible that better > tuning and/or better heuristics should be applied. > > On the other hand, 100 microseconds is a good long time from an > CONFIG_PREEMPT_RT=y perspective! > > > # cat /sys/module/rcutree/parameters/blimit > > 10 > > > > Or should we perhaps just raise the loop limit? Breaking after less > > than 100usec seems excessive :( > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > (3,000,000 nanoseconds). This means that RCU will do: > > o All the callbacks if there are less than ten. > > o Ten callbacks or 1/128th of them, whichever is larger. > > o Unless the larger of them is more than 100 callbacks, in which > case there is an additional limit of three milliseconds worth > of them. > > Except that if a given CPU ends up with more than 10,000 callbacks > (rcutree.qhimark), that CPU's blimit is set to 10,000. Also, if in the context of a softirq handler (as opposed to ksoftirqd) that interrupted the idle task with no pending task, the count of callbacks is ignored and only the 3-millisecond limit counts. In the context of ksoftirq, the only limit is that which the scheduler chooses to impose. But it sure seems like the ksoftirqd case should also pay attention to that 3-millisecond limit. I will queue a patch to that effect, and maybe Eric Dumazet will show me the error of my ways. > So there is much opportunity to tune the existing heuristics and also > much opportunity to tweak the heuristics themselves. 
> > But let's see a good use case before tweaking, please. ;-) Thanx, Paul > > > whether that makes a difference? Those two can be applied with some > > > minor polishing. The rest of that series is broken by f10020c97f4c > > > ("softirq: Allow early break"). > > > > > > There is another issue with this overload limit. Assume max_restart or > > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > > > gets the issue resolved after 10ms. > > > > > > So for the remaining 90ms any invocation of raise_softirq() outside of > > > (soft)interrupt context, which wakes ksoftirqd again, prevents > > > processing on return from interrupt until ksoftirqd gets on the CPU and > > > goes back to sleep, because task_is_running() == true and the stale > > > limit is not after jiffies. > > > > > > Probably not a big issue, but someone will notice on some weird workload > > > sooner than later and the tweaking will start nevertheless. :) So maybe > > > we fix it right away. :) > > > > Hm, Paolo raised this point as well, but the overload time is strictly > > to stop paying attention to the fact ksoftirqd is running. > > IOW current kernels behave as if they had overload_limit of infinity. > > > > The current code already prevents processing until ksoftirqd schedules > > in, after raise_softirq() from a funky context. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 23:36 ` Paul E. McKenney @ 2023-03-03 23:44 ` Jakub Kicinski 2023-03-04 1:25 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Jakub Kicinski @ 2023-03-03 23:44 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, 3 Mar 2023 15:36:27 -0800 Paul E. McKenney wrote: > On Fri, Mar 03, 2023 at 02:37:39PM -0800, Paul E. McKenney wrote: > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > > Now - now about the max loop count. I ORed the pending softirqs every > > > time we get to the end of the loop. Looks like vast majority of the > > > loop counter wake ups are exclusively due to RCU: > > > > > > @looped[512]: 5516 > > > > > > Where 512 is the ORed pending mask over all iterations > > > 512 == 1 << RCU_SOFTIRQ. > > > > > > And they usually take less than 100us to consume the 10 iterations. > > > Histogram of usecs consumed when we run out of loop iterations: > > > > > > [16, 32) 3 | | > > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > [64, 128) 871 |@@@@@@@@@ | > > > [128, 256) 34 | | > > > [256, 512) 9 | | > > > [512, 1K) 262 |@@ | > > > [1K, 2K) 35 | | > > > [2K, 4K) 1 | | > > > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > > > This is from way back in the day, so it is quite possible that better > > tuning and/or better heuristics should be applied. > > > > On the other hand, 100 microseconds is a good long time from an > > CONFIG_PREEMPT_RT=y perspective! > > > > > # cat /sys/module/rcutree/parameters/blimit > > > 10 > > > > > > Or should we perhaps just raise the loop limit? Breaking after less > > > than 100usec seems excessive :( > > > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > > (3,000,000 nanoseconds). 
This means that RCU will do: > > > > o All the callbacks if there are less than ten. > > > > o Ten callbacks or 1/128th of them, whichever is larger. > > > > o Unless the larger of them is more than 100 callbacks, in which > > case there is an additional limit of three milliseconds worth > > of them. > > > > Except that if a given CPU ends up with more than 10,000 callbacks > > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > Also, if in the context of a softirq handler (as opposed to ksoftirqd) > that interrupted the idle task with no pending task, the count of > callbacks is ignored and only the 3-millisecond limit counts. In the > context of ksoftirq, the only limit is that which the scheduler chooses > to impose. > > But it sure seems like the ksoftirqd case should also pay attention to > that 3-millisecond limit. I will queue a patch to that effect, and maybe > Eric Dumazet will show me the error of my ways. Just to be sure - have you seen Peter's patches? git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq I think it feeds the time limit to the callback from softirq, so the local 3ms is no more? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 23:44 ` Jakub Kicinski @ 2023-03-04 1:25 ` Paul E. McKenney 2023-03-04 1:39 ` Jakub Kicinski 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 1:25 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 03:44:13PM -0800, Jakub Kicinski wrote: > On Fri, 3 Mar 2023 15:36:27 -0800 Paul E. McKenney wrote: > > On Fri, Mar 03, 2023 at 02:37:39PM -0800, Paul E. McKenney wrote: > > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > > > Now - now about the max loop count. I ORed the pending softirqs every > > > > time we get to the end of the loop. Looks like vast majority of the > > > > loop counter wake ups are exclusively due to RCU: > > > > > > > > @looped[512]: 5516 > > > > > > > > Where 512 is the ORed pending mask over all iterations > > > > 512 == 1 << RCU_SOFTIRQ. > > > > > > > > And they usually take less than 100us to consume the 10 iterations. > > > > Histogram of usecs consumed when we run out of loop iterations: > > > > > > > > [16, 32) 3 | | > > > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > [64, 128) 871 |@@@@@@@@@ | > > > > [128, 256) 34 | | > > > > [256, 512) 9 | | > > > > [512, 1K) 262 |@@ | > > > > [1K, 2K) 35 | | > > > > [2K, 4K) 1 | | > > > > > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > > > > > This is from way back in the day, so it is quite possible that better > > > tuning and/or better heuristics should be applied. > > > > > > On the other hand, 100 microseconds is a good long time from an > > > CONFIG_PREEMPT_RT=y perspective! > > > > > > > # cat /sys/module/rcutree/parameters/blimit > > > > 10 > > > > > > > > Or should we perhaps just raise the loop limit? 
Breaking after less > > > > than 100usec seems excessive :( > > > > > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > > > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > > > (3,000,000 nanoseconds). This means that RCU will do: > > > > > > o All the callbacks if there are less than ten. > > > > > > o Ten callbacks or 1/128th of them, whichever is larger. > > > > > > o Unless the larger of them is more than 100 callbacks, in which > > > case there is an additional limit of three milliseconds worth > > > of them. > > > > > > Except that if a given CPU ends up with more than 10,000 callbacks > > > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > > > Also, if in the context of a softirq handler (as opposed to ksoftirqd) > > that interrupted the idle task with no pending task, the count of > > callbacks is ignored and only the 3-millisecond limit counts. In the > > context of ksoftirq, the only limit is that which the scheduler chooses > > to impose. > > > > But it sure seems like the ksoftirqd case should also pay attention to > > that 3-millisecond limit. I will queue a patch to that effect, and maybe > > Eric Dumazet will show me the error of my ways. > > Just to be sure - have you seen Peter's patches? > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > I think it feeds the time limit to the callback from softirq, > so the local 3ms is no more? I might or might not have back in September of 2020. ;-) But either way, the question remains: Should RCU_SOFTIRQ do time checking in ksoftirqd context? Seems like the answer should be "yes", independently of Peter's patches. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
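Paul's description of the RCU batch heuristics condenses into a small model. This is a simplification: the real `rcu_do_batch()` applies the 3ms `rcu_resched_ns` wall-clock cap as a runtime check while the batch runs (once past 100 callbacks), and the qhimark adjustment is stateful per-CPU rather than the stateless approximation sketched here.

```c
#include <assert.h>

/* Stateless model of RCU's per-invocation callback budget, using the
 * defaults quoted above: blimit=10, rcu_divisor=7 (i.e. 1/128th),
 * qhimark=10000.  The separate rcu_resched_ns (3ms) wall-clock limit
 * is checked during the batch, so it is not modeled here. */
static long rcu_batch_budget(long n_callbacks)
{
	long blimit = 10;
	long frac = n_callbacks >> 7;	/* 1/128th of the queue */

	if (n_callbacks > 10000)	/* qhimark exceeded */
		blimit = 10000;
	if (n_callbacks < 10)
		return n_callbacks;	/* fewer than ten: do them all */
	return frac > blimit ? frac : blimit;	/* larger of the two */
}
```

With a short queue the budget is 10, which is why a CPU with a steady trickle of callbacks keeps burning one softirq loop iteration per tiny batch.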
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-04  1:25             ` Paul E. McKenney
@ 2023-03-04  1:39               ` Jakub Kicinski
  2023-03-04  3:11                 ` Paul E. McKenney
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2023-03-04  1:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel

On Fri, 3 Mar 2023 17:25:35 -0800 Paul E. McKenney wrote:
> > Just to be sure - have you seen Peter's patches?
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq
> >
> > I think it feeds the time limit to the callback from softirq,
> > so the local 3ms is no more?
>
> I might or might not have back in September of 2020. ;-)
>
> But either way, the question remains: Should RCU_SOFTIRQ do time checking
> in ksoftirqd context? Seems like the answer should be "yes", independently
> of Peter's patches.

:-o I didn't notice, I thought that's from Dec 22, LWN was writing
about Peter's rework at that point. I'm not sure what the story is :(
And when / if any of these changes are coming downstream.
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-04 1:39 ` Jakub Kicinski @ 2023-03-04 3:11 ` Paul E. McKenney 2023-03-04 20:48 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 3:11 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 05:39:21PM -0800, Jakub Kicinski wrote: > On Fri, 3 Mar 2023 17:25:35 -0800 Paul E. McKenney wrote: > > > Just to be sure - have you seen Peter's patches? > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > I think it feeds the time limit to the callback from softirq, > > > so the local 3ms is no more? > > > > I might or might not have back in September of 2020. ;-) > > > > But either way, the question remains: Should RCU_SOFTIRQ do time checking > > in ksoftirqd context? Seems like the answer should be "yes", independently > > of Peter's patches. > > :-o I didn't notice, I thought that's from Dec 22, LWN was writing > about Peter's rework at that point. I'm not sure what the story is :( > And when / if any of these changes are coming downstream. Not a problem either way, as the compiler would complain bitterly about the resulting merge conflict and it is easy to fix. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-04 3:11 ` Paul E. McKenney @ 2023-03-04 20:48 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 20:48 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 07:11:09PM -0800, Paul E. McKenney wrote: > On Fri, Mar 03, 2023 at 05:39:21PM -0800, Jakub Kicinski wrote: > > On Fri, 3 Mar 2023 17:25:35 -0800 Paul E. McKenney wrote: > > > > Just to be sure - have you seen Peter's patches? > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > > > I think it feeds the time limit to the callback from softirq, > > > > so the local 3ms is no more? > > > > > > I might or might not have back in September of 2020. ;-) > > > > > > But either way, the question remains: Should RCU_SOFTIRQ do time checking > > > in ksoftirqd context? Seems like the answer should be "yes", independently > > > of Peter's patches. > > > > :-o I didn't notice, I thought that's from Dec 22, LWN was writing > > about Peter's rework at that point. I'm not sure what the story is :( > > And when / if any of these changes are coming downstream. > > Not a problem either way, as the compiler would complain bitterly about > the resulting merge conflict and it is easy to fix. ;-) And even more not a problem because in_serving_softirq() covers both the softirq environment as well as ksoftirqd. So that "else" clause is for rcuoc kthreads, which do not block other softirq vectors. So I am adding a comment instead... Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 21:31 ` Jakub Kicinski 2023-03-03 22:37 ` Paul E. McKenney @ 2023-03-05 20:43 ` Thomas Gleixner 2023-03-05 22:42 ` Paul E. McKenney ` (3 more replies) 1 sibling, 4 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-05 20:43 UTC (permalink / raw) To: Jakub Kicinski Cc: peterz, jstultz, edumazet, netdev, linux-kernel, Paul E. McKenney, Frederic Weisbecker Jakub! On Fri, Mar 03 2023 at 13:31, Jakub Kicinski wrote: > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: >> > + __this_cpu_write(overload_limit, jiffies + limit); >> >> The logic of all this is non-obvious and I had to reread it 5 times to >> conclude that it is matching the intent. Please add comments. >> >> While I'm not a big fan of heuristical duct tape, this looks harmless >> enough to not end up in an endless stream of tweaking. Famous last >> words... > > Would it all be more readable if I named the "overload_limit" > "overloaded_until" instead? Naming.. While naming matters it wont change the 'heuristical duct tape' property of this, right? > I'll add comments, too. They are definitely appreciated, but I'd prefer to have code which is self explanatory and does at least have a notion of a halfways scientific approach to the overall issue of softirqs. The point is that softirqs are just the proliferation of an at least 50 years old OS design paradigm. Back then everyhting which run in an interrupt handler was "important" and more or less allowed to hog the CPU at will. That obviously caused problems because it prevented other interrupt handlers from being served. This was attempted to work around in hardware by providing interrupt priority levels. No general purpose OS utilized that ever because there is no way to get this right. Not even on UP, unless you build a designed for the purpose "OS". Soft interrupts are not any better. 
They avoid the problem of stalling interrupts by moving the problem one
level down to the scheduler. Granted they are a cute hack, but at the
very end they are still evading the resource control mechanisms of the
OS by defining their own rules:

  - NET RX defaults to 2ms with the ability to override via /proc

  - RCU defaults to 3ms with the ability to override via /sysfs

while the "overload detection" in the core defines a hardcoded limit of
2ms. The above alone does not sum up to the core limit, and most of the
other soft interrupt handlers do not even have a notion of limits. That
clearly does not even remotely allow proper coordinated resource
management.

Not to mention the silliness of the jiffy-based timeouts, which result
in a randomized granularity of 0...1/HZ as mentioned before. I'm well
aware that consulting a high resolution hardware clock frequently can
be slow and hurt performance, but there are well understood
workarounds, aka batching, which mitigate that.

There is another aspect to softirqs which makes them a horror show:
while they are conceptually separate, at the very end they are all
lumped together, and especially the network code has implicit
assumptions about that. It's simply impossible to separate the
processing of the various soft interrupt incarnations. IOW, resource
control by developer preference and coincidence of events. That truly
makes for an understandable and reliable OS.

We had separate softirq threads and per-softirq serialization (except
NET_RX/TX, which shared) in the early days of preempt RT, which gave
fine-grained control. Back then the interaction between different
softirqs was halfway understandable, and the handful of interaction
points which relied on the per-CPU global BH disable were fixable with
local serialization. That lasted a year or two until we stopped
maintaining it, because the interaction between softirqs was becoming a
whack-a-mole game.
So we gave up and now enjoy the full glory of a per-CPU global lock,
because that's what local BH disable actually is.

I completely understand that ***GB networking is a challenge, but ***GB
networking does not work without applications which use it. Those
applications are unfortunately^Wrightfully subject to the scheduler,
aka resource control.

IMO evading resource control is the worst of all approaches, and the
amount of heuristics you can apply to mitigate that is never going to
cover even a subset of the overall application space. Just look at the
memcache vs. webserver use case vs. need_resched() and then the
requirements coming from the low latency audio folks.

I know the usual approach to that is to add some more heuristics which
are by nature supposed to fail, or to add yet another 'knob'. We
already have too many knobs which are not comprehensible on their own.
But even if a particular knob is comprehensible, there is close to zero
documentation and, I claim, close to zero understanding of the
interaction of knobs.

Just for the record: some of our engineers are working on TSN-based
real-time networking, which is all about latency and accuracy. Guess
how well that works with the current overall design. That's not an
esoteric niche use case, as low-latency TSN is not restricted to the
automation space. There are quite some use cases which go there even in
the high end networking space.

>> But without the sched_clock() changes the actual defer time depends on
>> HZ and the point in time where limit is set. That means it ranges from 0
>> to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in
>> the worst case, which perhaps explains the 8ms+ stalls you are still
>> observing. Can you test with that sched_clock change applied, i.e.
>> the first two commits from
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq
>>
>>   59be25c466d9 ("softirq: Use sched_clock() based timeout")
>>   bd5a5bd77009 ("softirq: Rewrite softirq processing loop")
>
> Those will help, but I spent some time digging into the jiffies related
> warts with kprobes - while annoying they weren't a major source of wake
> ups. (FWIW the jiffies noise on our workloads is due to cgroup stats
> disabling IRQs for multiple ms on the timekeeping CPU).

What? That's completely insane and needs to be fixed.

> Here are fresh stats on why we wake up ksoftirqd on our Web workload
> (collected over 100 sec):
>
>   Time exceeded:       484
>   Loop max run out:   6525
>   need_resched():    10219
>   (control: 17226 - number of times wakeup_process called for ksirqd)
>
> As you can see need_resched() dominates.
>
> Zooming into the time exceeded - we can count nanoseconds between
> __do_softirq starting and the check. This is the histogram of actual
> usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000):
>
> [256, 512)     1 |                                                    |
> [512, 1K)      0 |                                                    |
> [1K, 2K)     217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [2K, 4K)     266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> So yes, we can probably save ourselves ~200 wakeups with a better clock
> but that's just 1.3% of the total wake ups :(

Fair enough. Though that does not make our time limit handling any more
consistent, and we need to fix that too to handle the other issues.

> Now - on to the max loop count. I ORed the pending softirqs every
> time we get to the end of the loop. Looks like the vast majority of the
> loop counter wake ups are exclusively due to RCU:
>
>   @looped[512]: 5516

If the loop counter breaks without consuming the time budget that's
silly.

> Where 512 is the ORed pending mask over all iterations
> 512 == 1 << RCU_SOFTIRQ.
>
> And they usually take less than 100us to consume the 10 iterations.
> Histogram of usecs consumed when we run out of loop iterations:
>
> [16, 32)        3 |                                                    |
> [32, 64)     4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [64, 128)     871 |@@@@@@@@@                                           |
> [128, 256)     34 |                                                    |
> [256, 512)      9 |                                                    |
> [512, 1K)     262 |@@                                                  |
> [1K, 2K)       35 |                                                    |
> [2K, 4K)        1 |                                                    |
>
> Paul, is this expected? Is RCU not trying too hard to be nice?
>
>   # cat /sys/module/rcutree/parameters/blimit
>   10
>
> Or should we perhaps just raise the loop limit? Breaking after less
> than 100usec seems excessive :(

No. Can we please stop twiddling a parameter here and there and go and
fix this whole problem space properly. Increasing the loop count for
RCU might work for your particular use case and cause issues in other
scenarios.

Btw, RCU seems to be a perfect candidate to delegate batches from
softirq into a separate scheduler-controllable entity.

>> So for the remaining 90ms any invocation of raise_softirq() outside of
>> (soft)interrupt context, which wakes ksoftirqd again, prevents
>> processing on return from interrupt until ksoftirqd gets on the CPU and
>> goes back to sleep, because task_is_running() == true and the stale
>> limit is not after jiffies.
>>
>> Probably not a big issue, but someone will notice on some weird workload
>> sooner than later and the tweaking will start nevertheless. :) So maybe
>> we fix it right away. :)
>
> Hm, Paolo raised this point as well, but the overload time is strictly
> to stop paying attention to the fact ksoftirqd is running.
> IOW current kernels behave as if they had overload_limit of infinity.
>
> The current code already prevents processing until ksoftirqd schedules
> in, after raise_softirq() from a funky context.

Correct, and it does so because we are just applying duct tape over and
over.

That said, I have no brilliant solution for that off the top of my
head, but I'm not comfortable with applying more ad hoc solutions which
are contrary to the efforts of e.g. the audio folks.
I have some vague ideas how to approach that, but I'm traveling all of
next week, so I will neither be reading much email, nor will I have
time to think deeply about softirqs. I'll resume when I'm back.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43       ` Thomas Gleixner
@ 2023-03-05 22:42         ` Paul E. McKenney
  2023-03-05 23:00           ` Frederic Weisbecker
  2023-03-06  9:13           ` David Laight
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2023-03-05 22:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel,
	Frederic Weisbecker

On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 03 2023 at 13:31, Jakub Kicinski wrote:
> > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote:

[ . . . ]

> > Where 512 is the ORed pending mask over all iterations
> > 512 == 1 << RCU_SOFTIRQ.
> >
> > And they usually take less than 100us to consume the 10 iterations.
> > Histogram of usecs consumed when we run out of loop iterations:
> >
> > [16, 32)        3 |                                                    |
> > [32, 64)     4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)     871 |@@@@@@@@@                                           |
> > [128, 256)     34 |                                                    |
> > [256, 512)      9 |                                                    |
> > [512, 1K)     262 |@@                                                  |
> > [1K, 2K)       35 |                                                    |
> > [2K, 4K)        1 |                                                    |
> >
> > Paul, is this expected? Is RCU not trying too hard to be nice?
> >
> >   # cat /sys/module/rcutree/parameters/blimit
> >   10
> >
> > Or should we perhaps just raise the loop limit? Breaking after less
> > than 100usec seems excessive :(
>
> No. Can we please stop twiddling a parameter here and there and go and
> fix this whole problem space properly. Increasing the loop count for
> RCU might work for your particular use case and cause issues in other
> scenarios.
>
> Btw, RCU seems to be a perfect candidate to delegate batches from
> softirq into a separate scheduler-controllable entity.
Indeed, as you well know: combine CONFIG_RCU_NOCB_CPU=y with the
rcutree.use_softirq kernel boot parameter and with either the nohz_full
or rcu_nocbs kernel boot parameter, and the callbacks are invoked
within separate kthreads so that the scheduler has full control. In
addition, this dispenses with all of the heuristics that are otherwise
necessary to avoid invoking too many callbacks in one shot.

Back in the day, I tried making this the default (with an eye towards
making it the sole callback-execution scheme), but this resulted in
some ugly performance regressions. This was in part due to the extra
synchronization required to queue a callback and in part due to the
higher average cost of a wakeup compared to a raise_softirq().

So I changed to the current non-default arrangement.

And of course, you can do it halfway by booting a kernel built with
CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot
parameter. But then the callback-invocation-limit heuristics are still
used, this time to prevent callback invocation from preventing the CPU
from reporting quiescent states. If this were the only case, simpler
heuristics would suffice.

In short, it is not hard to make RCU avoid using softirq, but doing so
is not without side effects. ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 38+ messages in thread
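As a concrete, illustrative example of the offloaded configuration Paul
describes (the CPU list is an assumption for an 8-CPU machine, not from
the thread), the boot command line might contain:

```
rcutree.use_softirq=0 rcu_nocbs=2-7
```

Here rcutree.use_softirq=0 moves RCU core processing out of RCU_SOFTIRQ
into per-CPU rcuc kthreads, and rcu_nocbs=2-7 (with
CONFIG_RCU_NOCB_CPU=y) has those CPUs' callbacks invoked from rcuo
kthreads under scheduler control.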
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-05 22:42 ` Paul E. McKenney @ 2023-03-05 23:00 ` Frederic Weisbecker 2023-03-06 4:30 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Frederic Weisbecker @ 2023-03-05 23:00 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Sun, Mar 05, 2023 at 02:42:11PM -0800, Paul E. McKenney wrote: > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > Indeed, as you well know, CONFIG_RCU_NOCB_CPU=y in combination with the > rcutree.use_softirq kernel boot parameter in combination with either the > nohz_full or rcu_nocbs kernel boot parameter and then the callbacks are > invoked within separate kthreads so that the scheduler has full control. > In addition, this dispenses with all of the heuristics that are otherwise > necessary to avoid invoking too many callbacks in one shot. > > Back in the day, I tried making this the default (with an eye towards > making it the sole callback-execution scheme), but this resulted in > some ugly performance regressions. This was in part due to the extra > synchronization required to queue a callback and in part due to the > higher average cost of a wakeup compared to a raise_softirq(). > > So I changed to the current non-default arrangement. > > And of course, you can do it halfway by booting kernel built with > CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot parameter. > But then the callback-invocation-limit heuristics are still used, but > this time to prevent callback invocation from preventing the CPU from > reporting quiescent states. But if this was the only case, simpler > heuristics would suffice. > > In short, it is not hard to make RCU avoid using softirq, but doing so > is not without side effects. 
> ;-)

Right, but note that, threaded or not, callback invocation happens
within a local_bh_disable() section, preventing other softirqs from
running.

So this is still subject to the softirq per-CPU BKL.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-05 23:00 ` Frederic Weisbecker @ 2023-03-06 4:30 ` Paul E. McKenney 2023-03-06 11:22 ` Frederic Weisbecker 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-06 4:30 UTC (permalink / raw) To: Frederic Weisbecker Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Mon, Mar 06, 2023 at 12:00:24AM +0100, Frederic Weisbecker wrote: > On Sun, Mar 05, 2023 at 02:42:11PM -0800, Paul E. McKenney wrote: > > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > > Indeed, as you well know, CONFIG_RCU_NOCB_CPU=y in combination with the > > rcutree.use_softirq kernel boot parameter in combination with either the > > nohz_full or rcu_nocbs kernel boot parameter and then the callbacks are > > invoked within separate kthreads so that the scheduler has full control. > > In addition, this dispenses with all of the heuristics that are otherwise > > necessary to avoid invoking too many callbacks in one shot. > > > > Back in the day, I tried making this the default (with an eye towards > > making it the sole callback-execution scheme), but this resulted in > > some ugly performance regressions. This was in part due to the extra > > synchronization required to queue a callback and in part due to the > > higher average cost of a wakeup compared to a raise_softirq(). > > > > So I changed to the current non-default arrangement. > > > > And of course, you can do it halfway by booting kernel built with > > CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot parameter. > > But then the callback-invocation-limit heuristics are still used, but > > this time to prevent callback invocation from preventing the CPU from > > reporting quiescent states. But if this was the only case, simpler > > heuristics would suffice. > > > > In short, it is not hard to make RCU avoid using softirq, but doing so > > is not without side effects. 
> > ;-)
> >
> > Right, but note that, threaded or not, callback invocation happens
> > within a local_bh_disable() section, preventing other softirqs from
> > running.
> >
> > So this is still subject to the softirq per-CPU BKL.

True enough! But it momentarily enables BH after invoking each
callback, so the other softirq vectors should be able to get a word in.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-06 4:30 ` Paul E. McKenney @ 2023-03-06 11:22 ` Frederic Weisbecker 0 siblings, 0 replies; 38+ messages in thread From: Frederic Weisbecker @ 2023-03-06 11:22 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Sun, Mar 05, 2023 at 08:30:33PM -0800, Paul E. McKenney wrote: > On Mon, Mar 06, 2023 at 12:00:24AM +0100, Frederic Weisbecker wrote: > > On Sun, Mar 05, 2023 at 02:42:11PM -0800, Paul E. McKenney wrote: > > > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > > > Indeed, as you well know, CONFIG_RCU_NOCB_CPU=y in combination with the > > > rcutree.use_softirq kernel boot parameter in combination with either the > > > nohz_full or rcu_nocbs kernel boot parameter and then the callbacks are > > > invoked within separate kthreads so that the scheduler has full control. > > > In addition, this dispenses with all of the heuristics that are otherwise > > > necessary to avoid invoking too many callbacks in one shot. > > > > > > Back in the day, I tried making this the default (with an eye towards > > > making it the sole callback-execution scheme), but this resulted in > > > some ugly performance regressions. This was in part due to the extra > > > synchronization required to queue a callback and in part due to the > > > higher average cost of a wakeup compared to a raise_softirq(). > > > > > > So I changed to the current non-default arrangement. > > > > > > And of course, you can do it halfway by booting kernel built with > > > CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot parameter. > > > But then the callback-invocation-limit heuristics are still used, but > > > this time to prevent callback invocation from preventing the CPU from > > > reporting quiescent states. But if this was the only case, simpler > > > heuristics would suffice. 
> > > In short, it is not hard to make RCU avoid using softirq, but doing so
> > > is not without side effects. ;-)
> >
> > Right, but note that, threaded or not, callback invocation happens
> > within a local_bh_disable() section, preventing other softirqs from
> > running.
> >
> > So this is still subject to the softirq per-CPU BKL.
>
> True enough! But it momentarily enables BH after invoking each
> callback, so the other softirq vectors should be able to get a word in.

Indeed, it's still less bad than having it in softirqs.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* RE: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43     ` Thomas Gleixner
  2023-03-05 22:42       ` Paul E. McKenney
@ 2023-03-06  9:13       ` David Laight
  2023-03-06 11:57         ` Frederic Weisbecker
  2023-03-07  0:51         ` Jakub Kicinski
  3 siblings, 0 replies; 38+ messages in thread
From: David Laight @ 2023-03-06 9:13 UTC (permalink / raw)
To: 'Thomas Gleixner', Jakub Kicinski
Cc: peterz@infradead.org, jstultz@google.com, edumazet@google.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Paul E. McKenney, Frederic Weisbecker

From: Thomas Gleixner
> Sent: 05 March 2023 20:43
...
> The point is that softirqs are just the proliferation of an at least 50
> years old OS design paradigm. Back then everything which ran in an
> interrupt handler was "important" and more or less allowed to hog the
> CPU at will.
>
> That obviously caused problems because it prevented other interrupt
> handlers from being served.
>
> This was attempted to be worked around in hardware by providing
> interrupt priority levels. No general purpose OS ever utilized that,
> because there is no way to get it right. Not even on UP, unless you
> build a designed-for-the-purpose "OS".
>
> Soft interrupts are not any better. They avoid the problem of stalling
> interrupts by moving the problem one level down to the scheduler.
>
> Granted they are a cute hack, but at the very end they are still evading
> the resource control mechanisms of the OS by defining their own rules:

From some measurements I've done, while softints seem like a good idea,
they are almost pointless. What usually happens is that a hardware
interrupt occurs, does some of the required work, schedules a softint
and returns. Immediately a softint happens (at the same instruction)
and does all the rest of the work. The work has to be done, but you've
added the cost of the extra scheduling and interrupt - so overall it is
slower.
The massive batching up of some operations (like ethernet transmit
clearing and rx setup, and things being freed after RCU) doesn't help
latency. Without the batching the softint would finish faster and cause
less of a latency 'problem' to whatever was interrupted.

Now softints do help interrupt latency, but that is only relevant if
you have critical interrupts (like pulling data out of a hardware
fifo). Most modern hardware doesn't have anything that critical.

Now there is code that can decide to drop softint processing to a
normal thread. If that ever happens you probably lose 'big time'.
Normal softint processing is higher priority than any process code, but
the kernel thread runs at the priority of a normal user thread - pretty
much the lowest of the low. So all this 'high priority' interrupt
related processing that really does have to happen to keep the system
running just doesn't get scheduled.

I think it was Eric who had problems with ethernet packets being
dropped and changed the logic (of dropping to a thread) to make it much
less likely - but that got reverted (well, more code was added that
effectively reverted it) not long after.

Try (as I was) to run a test that requires you to receive ALL of the
500000 ethernet packets being sent to an interface every second, while
also doing enough processing on the packets to make the system (say)
90% busy (real time UDP audio processing), and you soon find the
defaults are entirely hopeless. Even the interrupt 'mitigation' options
on the ethernet controller don't actually work - packets get dropped at
the low level. (That will fail on an otherwise idle system.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43     ` Thomas Gleixner
  2023-03-05 22:42       ` Paul E. McKenney
  2023-03-06  9:13       ` David Laight
@ 2023-03-06 11:57       ` Frederic Weisbecker
  2023-03-06 14:57         ` Paul E. McKenney
  2023-03-07  0:51       ` Jakub Kicinski
  3 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2023-03-06 11:57 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel,
	Paul E. McKenney

On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote:
> That said, I have no brilliant solution for that off the top of my
> head, but I'm not comfortable with applying more ad hoc solutions
> which are contrary to the efforts of e.g. the audio folks.
>
> I have some vague ideas how to approach that, but I'm traveling all of
> next week, so I will neither be reading much email, nor will I have
> time to think deeply about softirqs. I'll resume when I'm back.

IIUC, the problem is that some (rare?) softirq vector callbacks rely on
the fact that they can not be interrupted by other local vectors, and
they rely on that to protect against concurrent per-CPU state access,
right?

And there is no automatic way to detect those cases, otherwise we would
have fixed them all with spinlocks already.

So I fear the only (in-)sane idea I could think of is to do it the same
way we did with the BKL. Some sort of pushdown: vector callbacks known
to have no such subtle interactions can re-enable softirqs.

For example, known-safe timers (either because they have no such
interactions or because they handle them correctly via spinlocks) could
carry a TIMER_SOFTIRQ_SAFE flag to tell about that. And RCU callbacks
something alike.

Of course this is going to be a tremendous amount of work, but it has
the advantage of being iterative, and it will pay off in the long run.
Also I'm confident that the hottest places will be handled quickly, and
most of them are likely to be in core networking code.
Because I fear no hack will ever fix that otherwise, and we have tried a lot. Thanks. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-06 11:57 ` Frederic Weisbecker @ 2023-03-06 14:57 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-06 14:57 UTC (permalink / raw) To: Frederic Weisbecker Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Mon, Mar 06, 2023 at 12:57:11PM +0100, Frederic Weisbecker wrote: > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > > That said, I have no brilliant solution for that off the top of my head, > > but I'm not comfortable with applying more adhoc solutions which are > > contrary to the efforts of e.g. the audio folks. > > > > I have some vague ideas how to approach that, but I'm traveling all of > > next week, so I neither will be reading much email, nor will I have time > > to think deeply about softirqs. I'll resume when I'm back. > > IIUC: the problem is that some (rare?) softirq vector callbacks rely on the > fact they can not be interrupted by other local vectors and they rely on > that to protect against concurrent per-cpu state access, right? > > And there is no automatic way to detect those cases otherwise we would have > fixed them all with spinlocks already. > > So I fear the only (in-)sane idea I could think of is to do it the same way > we did with the BKL. Some sort of pushdown: vector callbacks known for having > no such subtle interaction can re-enable softirqs. > > For example known safe timers (either because they have no such interactions > or because they handle them correctly via spinlocks) can carry a > TIMER_SOFTIRQ_SAFE flag to tell about that. And RCU callbacks something alike. When a given RCU callback causes latency problems, the usual quick fix is to have them instead spawn a workqueue, either from the callback or via queue_rcu_work(). But yes, this is one of the reasons that jiffies are so popular. 
Eric batched something like 30 RCU callbacks per costly time check, and
you would quite possibly need similar batching to attain efficiency for
lightly loaded softirq vectors. But 30 long-running softirq handlers
would be too many. One option is to check the expensive time when
either a batch of (say) 30 completes or when jiffies says too much time
has elapsed.

> Of course this is going to be a tremendous amount of work, but it has
> the advantage of being iterative, and it will pay off in the long run.
> Also I'm confident that the hottest places will be handled quickly, and
> most of them are likely to be in core networking code.
>
> Because I fear no hack will ever fix that otherwise, and we have tried
> a lot.

Indeed, if it were easy within the current overall code structure, we
would have already fixed it.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43 ` Thomas Gleixner
                     ` (2 preceding siblings ...)
  2023-03-06 11:57 ` Frederic Weisbecker
@ 2023-03-07  0:51 ` Jakub Kicinski
  3 siblings, 0 replies; 38+ messages in thread
From: Jakub Kicinski @ 2023-03-07 0:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, jstultz, edumazet, netdev, linux-kernel,
	Paul E. McKenney, Frederic Weisbecker

On Sun, 05 Mar 2023 21:43:23 +0100 Thomas Gleixner wrote:
> > Would it all be more readable if I named the "overload_limit"
> > "overloaded_until" instead? Naming..
>
> While naming matters it wont change the 'heuristical duct tape'
> property of this, right?

I also hate heuristics, I hope we are on the same page there.
The way I see it we allowed 2 heuristics already into the kernel:

 - ksoftirqd running means overload
 - need_resched() means we should stop immediately (and wake ksoftirqd)

Those two are clearly at odds with each other. And the latter is as
weak / hacky as it gets :|

See at the end for work in progress/"real solutions", but for this
patch - can I replace the time limit with a simple per core
"bool wa_for_yield" and change the overload check to:

  if (ksoftirqd_running() && !wa_for_yield)

? That's not a heuristic, right? No magic values, predictable,
repeatable behavior.

> > I'll add comments, too.
>
> They are definitely appreciated, but I'd prefer to have code which is
> self explanatory and does at least have a notion of a halfway
> scientific approach to the overall issue of softirqs.
>
> The point is that softirqs are just the proliferation of an at least 50
> years old OS design paradigm. Back then everything which ran in an
> interrupt handler was "important" and more or less allowed to hog the
> CPU at will.
>
> That obviously caused problems because it prevented other interrupt
> handlers from being served.
>
> This was attempted to be worked around in hardware by providing
> interrupt priority levels.
> No general purpose OS ever utilized that, because there is no way to
> get it right. Not even on UP, unless you build a
> designed-for-the-purpose "OS".
>
> Soft interrupts are not any better. They avoid the problem of stalling
> interrupts by moving the problem one level down to the scheduler.
>
> Granted they are a cute hack, but at the very end they are still
> evading the resource control mechanisms of the OS by defining their
> own rules:
>
>  - NET RX defaults to 2ms with the ability to override via /proc
>  - RCU defaults to 3ms with the ability to override via /sysfs
>
> while the "overload detection" in the core defines a hardcoded limit
> of 2ms. The above alone does not sum up to the core limit, and most of
> the other soft interrupt handlers do not even have the notion of
> limits.
>
> That clearly does not even remotely allow proper coordinated resource
> management.

FWIW happy to delete all the procfs knobs we have in net. Anyone who
feels like they need to tweak those should try to use/work on a real
solution.

> Not to talk about the silliness of the jiffy based timeouts, which
> result in a randomized granularity of 0...1/HZ as mentioned before.
>
> I'm well aware of the fact that consulting a high resolution hardware
> clock frequently can be slow and hurt performance, but there are well
> understood workarounds, aka batching, which mitigate that.
>
> There is another aspect to softirqs which makes them a horror show:
>
> While they are conceptually separate, at the very end they are all
> lumped together, and especially the network code has implicit
> assumptions about that. It's simply impossible to separate the
> processing of the various soft interrupt incarnations.

Was that just about running in threads or making them preemptible?
Running Rx in threads is "mostly solved", see at the end.

> IOW, resource control by developer preference and coincidence of
> events. That truly makes for an understandable and dependable OS.
> We had separate softirq threads and per softirq serialization (except
> NET_RX/TX, which shared) in the early days of preempt RT, which gave
> fine grained control. Back then the interaction between different
> softirqs was halfway understandable, and the handful of interaction
> points which relied on the per CPU global BH disable were fixable with
> local serializations. That lasted a year or two until we stopped
> maintaining it, because the interaction between softirqs was becoming
> a whack-a-mole game. So we gave up and enjoy the full glory of a per
> CPU global lock, because that's what local BH disable actually is.
>
> I completely understand that ***GB networking is a challenge, but
> ***GB networking does not work without applications which use it.
> Those applications are unfortunately^Wrightfully subject to the
> scheduler, aka. resource control.
>
> IMO evading resource control is the worst of all approaches, and the
> amount of heuristics you can apply to mitigate that is never going to
> cover even a subset of the overall application space.
>
> Just look at the memcache vs. webserver use case vs. need_resched()
> and then the requirements coming from the low latency audio folks.

Let me clarify that we only need the default to not be silly for
applications which are _not_ doing a lot of networking. The webserver
in my test is running a website (PHP?), not serving static content.
It's doing maybe a few Gbps on a 25/50 Gbps NIC.

> I know the usual approach to that is to add some more heuristics which
> are by nature supposed to fail, or to add yet another 'knob'. We
> already have too many knobs which are not comprehensible on their own.
> But even if a particular knob is comprehensible, there is close to
> zero documentation and, I claim, close to zero understanding of the
> interaction of knobs.
>
> Just for the record: some of our engineers are working on TSN based
> real-time networking, which is all about latency and accuracy.
> Guess how well that works with the current overall design. That's not
> an esoteric niche use case, as low-latency TSN is not restricted to
> the automation space. There are quite some use cases which go there
> even in the high end networking space.

TSN + ksoftirqd is definitely a bad idea :S

> > Those will help, but I spent some time digging into the jiffies related
> > warts with kprobes - while annoying they weren't a major source of wake
> > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats
> > disabling IRQs for multiple ms on the timekeeping CPU).
>
> What? That's completely insane and needs to be fixed.

Agreed, I made the right people aware..

> > Here are fresh stats on why we wake up ksoftirqd on our Web workload
> > (collected over 100 sec):
> >
> >   Time exceeded:       484
> >   Loop max run out:   6525
> >   need_resched():    10219
> >   (control: 17226 - number of times wakeup_process called for ksirqd)
> >
> > As you can see need_resched() dominates.
> >
> > Zooming into the time exceeded - we can count nanoseconds between
> > __do_softirq starting and the check. This is the histogram of actual
> > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000):
> >
> > [256, 512)     1 |                                                    |
> > [512, 1K)      0 |                                                    |
> > [1K, 2K)     217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> > [2K, 4K)     266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > So yes, we can probably save ourselves ~200 wakeups with a better clock
> > but that's just 1.3% of the total wake ups :(
>
> Fair enough. Though that does not make our time limit handling any more
> consistent, and we need to fix that too to handle the other issues.

> > Now - on to the max loop count. I ORed the pending softirqs every
> > time we get to the end of the loop. Looks like the vast majority of
> > the loop counter wake ups are exclusively due to RCU:
> >
> >   @looped[512]: 5516
>
> If the loop counter breaks without consuming the time budget that's
> silly.
FWIW my initial reaction was to read the jiffies from the _local_ core,
because if we're running softirq they can't have IRQs masked. So the
local clock will be ticking. But I'm insufficiently competent to code
that up, and you'd presumably have done this already if it was a good
idea.

> > Where 512 is the ORed pending mask over all iterations
> > 512 == 1 << RCU_SOFTIRQ.
> >
> > And they usually take less than 100us to consume the 10 iterations.
> > Histogram of usecs consumed when we run out of loop iterations:
> >
> > [16, 32)        3 |                                                    |
> > [32, 64)     4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)     871 |@@@@@@@@@                                           |
> > [128, 256)     34 |                                                    |
> > [256, 512)      9 |                                                    |
> > [512, 1K)     262 |@@                                                  |
> > [1K, 2K)       35 |                                                    |
> > [2K, 4K)        1 |                                                    |
> >
> > Paul, is this expected? Is RCU not trying too hard to be nice?
> >
> >   # cat /sys/module/rcutree/parameters/blimit
> >   10
> >
> > Or should we perhaps just raise the loop limit? Breaking after less
> > than 100usec seems excessive :(
>
> No. Can we please stop twiddling a parameter here and there and go and
> fix this whole problem space properly. Increasing the loop count for
> RCU might work for your particular use case and cause issues in other
> scenarios.
>
> Btw, RCU seems to be a perfect candidate to delegate batches from
> softirq into a separate scheduler-controllable entity.

Indeed, it knows how many callbacks it has; I wish we knew how many
packets had arrived :)

> > Hm, Paolo raised this point as well, but the overload time is strictly
> > to stop paying attention to the fact ksoftirqd is running.
> > IOW current kernels behave as if they had overload_limit of infinity.
> >
> > The current code already prevents processing until ksoftirqd schedules
> > in, after raise_softirq() from a funky context.
>
> Correct, and it does so because we are just applying duct tape over
> and over.
> That said, I have no brilliant solution for that off the top of my
> head, but I'm not comfortable with applying more ad-hoc solutions
> which are contrary to the efforts of e.g. the audio folks.

We are trying: threaded NAPI was added in Feb 2021. We can create a
thread per NAPI instance, so all Rx runs in dedicated kernel threads.
Unfortunately, for the workloads I tested it negatively impacts RPS and
latency. If the workload doesn't run the CPUs too hot it works well,
but otherwise the scheduler gets in the way.

I have a vague recollection that Google proposed patches to the
scheduler at some point to support isosync processing, but they were
rejected? I'm very likely misremembering. But I do feel like we'll
need better scheduler support to move network processing to threads.
Maybe the upcoming BPF scheduler patches can help us with that. We'll
see.

The other way to go is to let the application take charge. We added
support for applications pledging that they will "busy poll" a given
queue. This turns off IRQs and expects the application to periodically
call into NAPI. I think that's the future for high-RPS applications,
but real-life results are scarce (other than pure forwarding
workloads, I guess, which are trivial).

Various folks in netdev have also experimented with using workqueues
and kthreads which are not mapped statically to NAPI instances.

So we are trying, and will continue to try. But there are unknowns
which make me think it's worth addressing the obvious silliness in
ksoftirqd behavior. One - I'm not sure whether we can get to a
paradigm which is as fast and as easy to use for 100% of use cases as
softirq. Two - the solution may need tuning and infrastructure,
leaving smaller users behind. And the ksoftirqd experience is getting
worse, which is why I posted the patches (I'm guessing scheduler
changes, I don't even want to know who changed their heuristics).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski 2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski @ 2022-12-22 22:12 ` Jakub Kicinski 2023-01-09 9:44 ` Peter Zijlstra 2023-03-03 14:17 ` Thomas Gleixner 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni 3 siblings, 2 replies; 38+ messages in thread From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski In networking we try to keep Tx packet queues small, so we limit how many bytes a socket may packetize and queue up. Tx completions (from NAPI) notify the sockets when packets have left the system (NIC Tx completion) and the socket schedules a tasklet to queue the next batch of frames. This leads to a situation where we go thru the softirq loop twice. First round we have pending = NET (from the NIC IRQ/NAPI), and the second iteration has pending = TASKLET (the socket tasklet). On two web workloads I looked at this condition accounts for 10% and 23% of all ksoftirqd wake ups respectively. We run NAPI which wakes some process up, we hit need_resched() and wake up ksoftirqd just to run the TSQ (TCP small queues) tasklet. Tweak the need_resched() condition to be ignored if all pending softIRQs are "non-deferred". The tasklet would run relatively soon, anyway, but once ksoftirqd is woken we're risking stalls. I did not see any negative impact on the latency in an RR test on a loaded machine with this change applied. 
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 kernel/softirq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index ad200d386ec1..4ac59ffb0d55 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -601,7 +601,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 
 	if (time_is_before_eq_jiffies(end) || !--max_restart)
 		limit = SOFTIRQ_OVERLOAD_TIME;
-	else if (need_resched())
+	else if (need_resched() && pending & ~SOFTIRQ_NOW_MASK)
 		limit = SOFTIRQ_DEFER_TIME;
 	else
 		goto restart;
-- 
2.38.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski @ 2023-01-09 9:44 ` Peter Zijlstra 2023-01-09 10:16 ` Eric Dumazet 2023-03-03 11:41 ` Thomas Gleixner 2023-03-03 14:17 ` Thomas Gleixner 1 sibling, 2 replies; 38+ messages in thread From: Peter Zijlstra @ 2023-01-09 9:44 UTC (permalink / raw) To: Jakub Kicinski; +Cc: tglx, jstultz, edumazet, netdev, linux-kernel On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: > In networking we try to keep Tx packet queues small, so we limit > how many bytes a socket may packetize and queue up. Tx completions > (from NAPI) notify the sockets when packets have left the system > (NIC Tx completion) and the socket schedules a tasklet to queue > the next batch of frames. > > This leads to a situation where we go thru the softirq loop twice. > First round we have pending = NET (from the NIC IRQ/NAPI), and > the second iteration has pending = TASKLET (the socket tasklet). So to me that sounds like you want to fix the network code to not do this then. Why can't the NAPI thing directly queue the next batch; why do you have to do a softirq roundtrip like this? > On two web workloads I looked at this condition accounts for 10% > and 23% of all ksoftirqd wake ups respectively. We run NAPI > which wakes some process up, we hit need_resched() and wake up > ksoftirqd just to run the TSQ (TCP small queues) tasklet. > > Tweak the need_resched() condition to be ignored if all pending > softIRQs are "non-deferred". The tasklet would run relatively > soon, anyway, but once ksoftirqd is woken we're risking stalls. > > I did not see any negative impact on the latency in an RR test > on a loaded machine with this change applied. Ignoring need_resched() will get you in trouble with RT people real fast. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2023-01-09 9:44 ` Peter Zijlstra @ 2023-01-09 10:16 ` Eric Dumazet 2023-01-09 19:12 ` Jakub Kicinski 2023-03-03 11:41 ` Thomas Gleixner 1 sibling, 1 reply; 38+ messages in thread From: Eric Dumazet @ 2023-01-09 10:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Jakub Kicinski, tglx, jstultz, netdev, linux-kernel On Mon, Jan 9, 2023 at 10:44 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: > > In networking we try to keep Tx packet queues small, so we limit > > how many bytes a socket may packetize and queue up. Tx completions > > (from NAPI) notify the sockets when packets have left the system > > (NIC Tx completion) and the socket schedules a tasklet to queue > > the next batch of frames. > > > > This leads to a situation where we go thru the softirq loop twice. > > First round we have pending = NET (from the NIC IRQ/NAPI), and > > the second iteration has pending = TASKLET (the socket tasklet). > > So to me that sounds like you want to fix the network code to not do > this then. Why can't the NAPI thing directly queue the next batch; why > do you have to do a softirq roundtrip like this? I think Jakub refers to tcp_wfree() code, which can be called from arbitrary contexts, including non NAPI ones, and with the socket locked (by this thread or another) or not locked at all (say if skb is freed from a TX completion handler or a qdisc drop) > > > On two web workloads I looked at this condition accounts for 10% > > and 23% of all ksoftirqd wake ups respectively. We run NAPI > > which wakes some process up, we hit need_resched() and wake up > > ksoftirqd just to run the TSQ (TCP small queues) tasklet. > > > > Tweak the need_resched() condition to be ignored if all pending > > softIRQs are "non-deferred". The tasklet would run relatively > > soon, anyway, but once ksoftirqd is woken we're risking stalls. 
> > > > I did not see any negative impact on the latency in an RR test > > on a loaded machine with this change applied. > > Ignoring need_resched() will get you in trouble with RT people real > fast. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2023-01-09 10:16 ` Eric Dumazet @ 2023-01-09 19:12 ` Jakub Kicinski 0 siblings, 0 replies; 38+ messages in thread From: Jakub Kicinski @ 2023-01-09 19:12 UTC (permalink / raw) To: Eric Dumazet, Peter Zijlstra; +Cc: tglx, jstultz, netdev, linux-kernel On Mon, 9 Jan 2023 11:16:45 +0100 Eric Dumazet wrote: > > On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: > > > In networking we try to keep Tx packet queues small, so we limit > > > how many bytes a socket may packetize and queue up. Tx completions > > > (from NAPI) notify the sockets when packets have left the system > > > (NIC Tx completion) and the socket schedules a tasklet to queue > > > the next batch of frames. > > > > > > This leads to a situation where we go thru the softirq loop twice. > > > First round we have pending = NET (from the NIC IRQ/NAPI), and > > > the second iteration has pending = TASKLET (the socket tasklet). > > > > So to me that sounds like you want to fix the network code to not do > > this then. Why can't the NAPI thing directly queue the next batch; why > > do you have to do a softirq roundtrip like this? > > I think Jakub refers to tcp_wfree() code, which can be called from > arbitrary contexts, > including non NAPI ones, and with the socket locked (by this thread or > another) or not locked at all > (say if skb is freed from a TX completion handler or a qdisc drop) Yes, fwiw. > > > On two web workloads I looked at this condition accounts for 10% > > > and 23% of all ksoftirqd wake ups respectively. We run NAPI > > > which wakes some process up, we hit need_resched() and wake up > > > ksoftirqd just to run the TSQ (TCP small queues) tasklet. > > > > > > Tweak the need_resched() condition to be ignored if all pending > > > softIRQs are "non-deferred". The tasklet would run relatively > > > soon, anyway, but once ksoftirqd is woken we're risking stalls. 
> > > > > > I did not see any negative impact on the latency in an RR test > > > on a loaded machine with this change applied. > > > > Ignoring need_resched() will get you in trouble with RT people real > > fast. Ah, you're right :/ Is it good enough if we throw || force_irqthreads() into the condition? Otherwise we can just postpone this optimization, the overload time horizon / limit patch is much more important. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2023-01-09 9:44 ` Peter Zijlstra 2023-01-09 10:16 ` Eric Dumazet @ 2023-03-03 11:41 ` Thomas Gleixner 1 sibling, 0 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 11:41 UTC (permalink / raw) To: Peter Zijlstra, Jakub Kicinski; +Cc: jstultz, edumazet, netdev, linux-kernel On Mon, Jan 09 2023 at 10:44, Peter Zijlstra wrote: > On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: >> Tweak the need_resched() condition to be ignored if all pending >> softIRQs are "non-deferred". The tasklet would run relatively >> soon, anyway, but once ksoftirqd is woken we're risking stalls. >> >> I did not see any negative impact on the latency in an RR test >> on a loaded machine with this change applied. > > Ignoring need_resched() will get you in trouble with RT people real > fast. In this case not really. softirq processing is preemptible in RT, but it's still a major pain ... Thanks, tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski 2023-01-09 9:44 ` Peter Zijlstra @ 2023-03-03 14:17 ` Thomas Gleixner 1 sibling, 0 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 14:17 UTC (permalink / raw) To: Jakub Kicinski, peterz Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski Jakub! On Thu, Dec 22 2022 at 14:12, Jakub Kicinski wrote: > This leads to a situation where we go thru the softirq loop twice. > First round we have pending = NET (from the NIC IRQ/NAPI), and > the second iteration has pending = TASKLET (the socket tasklet). ... > diff --git a/kernel/softirq.c b/kernel/softirq.c > index ad200d386ec1..4ac59ffb0d55 100644 > --- a/kernel/softirq.c > +++ b/kernel/softirq.c > @@ -601,7 +601,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) > > if (time_is_before_eq_jiffies(end) || !--max_restart) > limit = SOFTIRQ_OVERLOAD_TIME; > - else if (need_resched()) > + else if (need_resched() && pending & ~SOFTIRQ_NOW_MASK) > limit = SOFTIRQ_DEFER_TIME; > else > goto restart; While this is the least of my softirq worries on PREEMPT_RT, Peter is right about real-time tasks being deferred on a PREEMPT_RT=n kernel. That's a real issue for low-latency audio which John Stultz is trying to resolve. Especially as the above check can go in circles. I fear we need to go back to the drawing board and come up with a real solution which takes these contradicting aspects into account. Let me stare at Peters and Johns patches for a while. Thanks tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski ` (2 preceding siblings ...) 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski @ 2023-04-20 17:24 ` Paolo Abeni 2023-04-20 17:41 ` Eric Dumazet ` (2 more replies) 3 siblings, 3 replies; 38+ messages in thread From: Paolo Abeni @ 2023-04-20 17:24 UTC (permalink / raw) To: Jakub Kicinski, peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel Hi all, On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote: > Catching up on LWN I run across the article about softirq > changes, and then I noticed fresh patches in Peter's tree. > So probably wise for me to throw these out there. > > My (can I say Meta's?) problem is the opposite to what the RT > sensitive people complain about. In the current scheme once > ksoftirqd is woken no network processing happens until it runs. > > When networking gets overloaded - that's probably fair, the problem > is that we confuse latency tweaks with overload protection. We have > a needs_resched() in the loop condition (which is a latency tweak) > Most often we defer to ksoftirqd because we're trying to be nice > and let user space respond quickly, not because there is an > overload. But the user space may not be nice, and sit on the CPU > for 10ms+. Also the sirq's "work allowance" is 2ms, which is > uncomfortably close to the timer tick, but that's another story. > > We have a sirq latency tracker in our prod kernel which catches > 8ms+ stalls of net Tx (packets queued to the NIC but there is > no NAPI cleanup within 8ms) and with these patches applied > on 5.19 fully loaded web machine sees a drop in stalls from > 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing > TCP retransmissions and ~10% drop in non-TLP incoming ones. > This is not a network-heavy workload so most of the rtx are > due to scheduling artifacts. 
>
> The network latency in a datacenter is somewhere around a neat
> 1000x lower than scheduling granularity (around 10us).
>
> These patches (patch 2 is "the meat") change what we recognize
> as overload. Instead of just checking if "ksoftirqd is woken"
> it also caps how long we consider ourselves to be in overload,
> a time limit which is different based on whether we yield due
> to real resource exhaustion vs just hitting that needs_resched().
>
> I hope the core concept is not entirely idiotic. It'd be great
> if we could get this in or fold an equivalent concept into ongoing
> work from others, because due to various "scheduler improvements"
> every time we upgrade the production kernel this problem is getting
> worse :(

Please allow me to revive this old thread.

My understanding is that we want to avoid adding more heuristics here,
preferring a consistent refactor.

I would like to propose a revert of:

4cd13c21b207 softirq: Let ksoftirqd do its job

and its follow-ups:

3c53776e29f8 Mark HI and TASKLET softirq synchronous
0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking

The problem originally addressed by 4cd13c21b207 can now be tackled
with threaded NAPI, available since:

29863d41bb6e net: implement threaded-able napi poll loop support

Reverting the mentioned commit should address the latency issues
mentioned by Jakub - I verified it solves a somewhat related problem
in my setup - and it reduces the layering of heuristics in this area.

A refactor introducing uniform overload detection and proper resource
control would be better, but I admit it's beyond me, and anyway it
could still land afterwards.

Any opinion more than welcome!

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni @ 2023-04-20 17:41 ` Eric Dumazet 2023-04-20 20:23 ` Paolo Abeni 2023-04-21 2:48 ` Jason Xing 2023-05-09 19:56 ` [tip: irq/core] Revert "softirq: Let ksoftirqd do its job" tip-bot2 for Paolo Abeni 2 siblings, 1 reply; 38+ messages in thread From: Eric Dumazet @ 2023-04-20 17:41 UTC (permalink / raw) To: Paolo Abeni; +Cc: Jakub Kicinski, peterz, tglx, jstultz, netdev, linux-kernel On Thu, Apr 20, 2023 at 7:24 PM Paolo Abeni <pabeni@redhat.com> wrote: > > Hi all, > On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote: > > Catching up on LWN I run across the article about softirq > > changes, and then I noticed fresh patches in Peter's tree. > > So probably wise for me to throw these out there. > > > > My (can I say Meta's?) problem is the opposite to what the RT > > sensitive people complain about. In the current scheme once > > ksoftirqd is woken no network processing happens until it runs. > > > > When networking gets overloaded - that's probably fair, the problem > > is that we confuse latency tweaks with overload protection. We have > > a needs_resched() in the loop condition (which is a latency tweak) > > Most often we defer to ksoftirqd because we're trying to be nice > > and let user space respond quickly, not because there is an > > overload. But the user space may not be nice, and sit on the CPU > > for 10ms+. Also the sirq's "work allowance" is 2ms, which is > > uncomfortably close to the timer tick, but that's another story. > > > > We have a sirq latency tracker in our prod kernel which catches > > 8ms+ stalls of net Tx (packets queued to the NIC but there is > > no NAPI cleanup within 8ms) and with these patches applied > > on 5.19 fully loaded web machine sees a drop in stalls from > > 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing > > TCP retransmissions and ~10% drop in non-TLP incoming ones. 
> > This is not a network-heavy workload so most of the rtx are > > due to scheduling artifacts. > > > > The network latency in a datacenter is somewhere around neat > > 1000x lower than scheduling granularity (around 10us). > > > > These patches (patch 2 is "the meat") change what we recognize > > as overload. Instead of just checking if "ksoftirqd is woken" > > it also caps how long we consider ourselves to be in overload, > > a time limit which is different based on whether we yield due > > to real resource exhaustion vs just hitting that needs_resched(). > > > > I hope the core concept is not entirely idiotic. It'd be great > > if we could get this in or fold an equivalent concept into ongoing > > work from others, because due to various "scheduler improvements" > > every time we upgrade the production kernel this problem is getting > > worse :( > > Please allow me to revive this old thread. > > My understanding is that we want to avoid adding more heuristics here, > preferring a consistent refactor. > > I would like to propose a revert of: > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > the its follow-ups: > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking > > The problem originally addressed by 4cd13c21b207 can now be tackled > with the threaded napi, available since: > > 29863d41bb6e net: implement threaded-able napi poll loop support > > Reverting the mentioned commit should address the latency issues > mentioned by Jakub - I verified it solves a somewhat related problem in > my setup - and reduces the layering of heuristics in this area. > > A refactor introducing uniform overload detection and proper resource > control will be better, but I admit it's beyond me and anyway it could > still land afterwards. > > Any opinion more then welcome! 
Seems fine, but I think few things need to be fixed first in napi_threaded_poll() to enable some important features that are currently in net_rx_action() only. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-20 17:41 ` Eric Dumazet @ 2023-04-20 20:23 ` Paolo Abeni 0 siblings, 0 replies; 38+ messages in thread From: Paolo Abeni @ 2023-04-20 20:23 UTC (permalink / raw) To: Eric Dumazet; +Cc: Jakub Kicinski, peterz, tglx, jstultz, netdev, linux-kernel On Thu, 2023-04-20 at 19:41 +0200, Eric Dumazet wrote: > On Thu, Apr 20, 2023 at 7:24 PM Paolo Abeni <pabeni@redhat.com> wrote: > > I would like to propose a revert of: > > > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > > > the its follow-ups: > > > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking > > > > The problem originally addressed by 4cd13c21b207 can now be tackled > > with the threaded napi, available since: > > > > 29863d41bb6e net: implement threaded-able napi poll loop support > > > > Reverting the mentioned commit should address the latency issues > > mentioned by Jakub - I verified it solves a somewhat related problem in > > my setup - and reduces the layering of heuristics in this area. > > > > A refactor introducing uniform overload detection and proper resource > > control will be better, but I admit it's beyond me and anyway it could > > still land afterwards. > > > > Any opinion more then welcome! > > Seems fine, but I think few things need to be fixed first in > napi_threaded_poll() > to enable some important features that are currently in net_rx_action() only. Thanks for the feedback. I fear I'll miss some relevant bits. On top of my head I think about RPS and skb_defer_free. Both should work even when napi threaded is enabled - with an additional softirq ;) Do you think we should be able to handle both inside the napi thread? Or do you refer to other features? Thanks! Paolo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni 2023-04-20 17:41 ` Eric Dumazet @ 2023-04-21 2:48 ` Jason Xing 2023-04-21 9:33 ` Paolo Abeni 2023-05-09 19:56 ` [tip: irq/core] Revert "softirq: Let ksoftirqd do its job" tip-bot2 for Paolo Abeni 2 siblings, 1 reply; 38+ messages in thread From: Jason Xing @ 2023-04-21 2:48 UTC (permalink / raw) To: Paolo Abeni Cc: Jakub Kicinski, peterz, tglx, jstultz, edumazet, netdev, linux-kernel On Fri, Apr 21, 2023 at 1:34 AM Paolo Abeni <pabeni@redhat.com> wrote: > > Hi all, > On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote: > > Catching up on LWN I run across the article about softirq > > changes, and then I noticed fresh patches in Peter's tree. > > So probably wise for me to throw these out there. > > > > My (can I say Meta's?) problem is the opposite to what the RT > > sensitive people complain about. In the current scheme once > > ksoftirqd is woken no network processing happens until it runs. > > > > When networking gets overloaded - that's probably fair, the problem > > is that we confuse latency tweaks with overload protection. We have > > a needs_resched() in the loop condition (which is a latency tweak) > > Most often we defer to ksoftirqd because we're trying to be nice > > and let user space respond quickly, not because there is an > > overload. But the user space may not be nice, and sit on the CPU > > for 10ms+. Also the sirq's "work allowance" is 2ms, which is > > uncomfortably close to the timer tick, but that's another story. > > > > We have a sirq latency tracker in our prod kernel which catches > > 8ms+ stalls of net Tx (packets queued to the NIC but there is > > no NAPI cleanup within 8ms) and with these patches applied > > on 5.19 fully loaded web machine sees a drop in stalls from > > 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing > > TCP retransmissions and ~10% drop in non-TLP incoming ones. 
> > This is not a network-heavy workload so most of the rtx are > > due to scheduling artifacts. > > > > The network latency in a datacenter is somewhere around neat > > 1000x lower than scheduling granularity (around 10us). > > > > These patches (patch 2 is "the meat") change what we recognize > > as overload. Instead of just checking if "ksoftirqd is woken" > > it also caps how long we consider ourselves to be in overload, > > a time limit which is different based on whether we yield due > > to real resource exhaustion vs just hitting that needs_resched(). > > > > I hope the core concept is not entirely idiotic. It'd be great > > if we could get this in or fold an equivalent concept into ongoing > > work from others, because due to various "scheduler improvements" > > every time we upgrade the production kernel this problem is getting > > worse :( > [...] > Please allow me to revive this old thread. Hi Paolo, So good to hear this :) > > My understanding is that we want to avoid adding more heuristics here, > preferring a consistent refactor. 
> > I would like to propose a revert of: > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > the its follow-ups: > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking More than this, I list some related patches mentioned in the above commit 3c53776e29f8: 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred") 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do not get to run") 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") > > The problem originally addressed by 4cd13c21b207 can now be tackled > with the threaded napi, available since: > > 29863d41bb6e net: implement threaded-able napi poll loop support > > Reverting the mentioned commit should address the latency issues > mentioned by Jakub - I verified it solves a somewhat related problem in > my setup - and reduces the layering of heuristics in this area. Sure, it is. I also can verify its usefulness in the real workload. Some days ago I also sent a heuristics patch [1] that can bypass the ksoftirqd if the user chooses to mask some type of softirq. Let the user decide it. But I observed that if we mask some softirqs, or we can say, completely revert the commit 4cd13c21b207, the load would go higher and the kernel itself may occupy/consume more time than before. They were tested under the similar workload launched by our applications. [1]: https://lore.kernel.org/all/20230410023041.49857-1-kerneljasonxing@gmail.com/ > > A refactor introducing uniform overload detection and proper resource > control will be better, but I admit it's beyond me and anyway it could > still land afterwards. +1 Thanks, Jason > > Any opinion more then welcome! > > Thanks, > > Paolo > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-21 2:48 ` Jason Xing @ 2023-04-21 9:33 ` Paolo Abeni 2023-04-21 9:46 ` Jason Xing 0 siblings, 1 reply; 38+ messages in thread From: Paolo Abeni @ 2023-04-21 9:33 UTC (permalink / raw) To: Jason Xing Cc: Jakub Kicinski, peterz, tglx, jstultz, edumazet, netdev, linux-kernel On Fri, 2023-04-21 at 10:48 +0800, Jason Xing wrote: > > > My understanding is that we want to avoid adding more heuristics here, > > preferring a consistent refactor. > > > > I would like to propose a revert of: > > > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > > > the its follow-ups: > > > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking > > More than this, I list some related patches mentioned in the above > commit 3c53776e29f8: > 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred") > 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do > not get to run") > 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") The first 2 changes replace plain timers with HR ones, could possibly be reverted, too, but it should not be a big deal either way. I think instead we want to keep the third commit above, as it should be useful when napi threaded is enabled. Generally speaking I would keep the initial revert to the bare minimum. > > The problem originally addressed by 4cd13c21b207 can now be tackled > > with the threaded napi, available since: > > > > 29863d41bb6e net: implement threaded-able napi poll loop support > > > > Reverting the mentioned commit should address the latency issues > > mentioned by Jakub - I verified it solves a somewhat related problem in > > my setup - and reduces the layering of heuristics in this area. > > Sure, it is. I also can verify its usefulness in the real workload. 
> Some days ago I also sent a heuristics patch [1] that can bypass the > ksoftirqd if the user chooses to mask some type of softirq. Let the > user decide it. > > But I observed that if we mask some softirqs, or we can say, > completely revert the commit 4cd13c21b207, the load would go higher > and the kernel itself may occupy/consume more time than before. They > were tested under the similar workload launched by our applications. > > [1]: https://lore.kernel.org/all/20230410023041.49857-1-kerneljasonxing@gmail.com/ Thanks for the reference, I would have missed that patch otherwise. My understanding is that adding more knobs here is in the opposite direction of what Thomas is suggesting, and IMHO the 'now mask' should not be exposed to user-space. > Thanks for the feedback, Paolo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change
  2023-04-21  9:33 ` Paolo Abeni
@ 2023-04-21  9:46 ` Jason Xing
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Xing @ 2023-04-21 9:46 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jakub Kicinski, peterz, tglx, jstultz, edumazet, netdev, linux-kernel

On Fri, Apr 21, 2023 at 5:33 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2023-04-21 at 10:48 +0800, Jason Xing wrote:
> >
> > > My understanding is that we want to avoid adding more heuristics here,
> > > preferring a consistent refactor.
> > >
> > > I would like to propose a revert of:
> > >
> > > 4cd13c21b207 softirq: Let ksoftirqd do its job
> > >
> > > and its follow-ups:
> > >
> > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous
> > > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking
> >
> > More than this, I list some related patches mentioned in the above
> > commit 3c53776e29f8:
> > 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred")
> > 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do
> > not get to run")
> > 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()")
>
> [...]
> The first 2 changes replace plain timers with HR ones, could possibly
> be reverted, too, but it should not be a big deal either way.
>
> I think instead we want to keep the third commit above, as it should be
> useful when napi threaded is enabled.
>
> Generally speaking I would keep the initial revert to the bare minimum.

I agree with you :)

> > > The problem originally addressed by 4cd13c21b207 can now be tackled
> > > with the threaded napi, available since:
> > >
> > > 29863d41bb6e net: implement threaded-able napi poll loop support
> > >
> > > Reverting the mentioned commit should address the latency issues
> > > mentioned by Jakub - I verified it solves a somewhat related problem in
> > > my setup - and reduces the layering of heuristics in this area.
> >
> > Sure, it is. I can also verify its usefulness in the real workload.
> >
> > Some days ago I also sent a heuristics patch [1] that can bypass the
> > ksoftirqd if the user chooses to mask some type of softirq. Let the
> > user decide it.
> >
> > But I observed that if we mask some softirqs, or we can say,
> > completely revert the commit 4cd13c21b207, the load would go higher
> > and the kernel itself may occupy/consume more time than before. They
> > were tested under the similar workload launched by our applications.
> >
> > [1]: https://lore.kernel.org/all/20230410023041.49857-1-kerneljasonxing@gmail.com/
>
> Thanks for the reference, I would have missed that patch otherwise.
>
> My understanding is that adding more knobs here is in the opposite
> direction of what Thomas is suggesting, and IMHO the 'now mask' should
> not be exposed to user-space.

Could you please share the link about what Thomas is suggesting? I missed it.

At the beginning, I didn't have the guts to revert the commit directly.
Instead I wrote a compromise patch that is not as elegant, as you said.
Anyway, the idea is common, but reverting the whole commit may involve
more work. I will spend some time digging into this part. More
suggestions are also welcome :)

Thanks,
Jason

> Thanks for the feedback,
>
> Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread
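[Editor's note: for readers who want to try the replacement mechanism the thread keeps referring to, threaded NAPI is toggled per network device through sysfs. This is a sketch only: `eth0` is a placeholder interface name, the knob ships with the 29863d41bb6e series (v5.12 and later), and whether it helps depends on the driver's NAPI usage.]

```shell
# Read the current mode: 0 = classic softirq/NAPI, 1 = threaded NAPI.
cat /sys/class/net/eth0/threaded

# Move this device's NAPI processing into dedicated kernel threads
# (visible in ps as napi/eth0-<n>), which the scheduler can then
# place and prioritize like any other task.
echo 1 > /sys/class/net/eth0/threaded
```

With threaded NAPI enabled, network processing competes for CPU under normal scheduler control instead of relying on the ksoftirqd deferral heuristic being reverted below.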
* [tip: irq/core] Revert "softirq: Let ksoftirqd do its job"
  2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni
  2023-04-20 17:41 ` Eric Dumazet
  2023-04-21  2:48 ` Jason Xing
@ 2023-05-09 19:56 ` tip-bot2 for Paolo Abeni
  2 siblings, 0 replies; 38+ messages in thread
From: tip-bot2 for Paolo Abeni @ 2023-05-09 19:56 UTC (permalink / raw)
To: linux-tip-commits
Cc: Paolo Abeni, Thomas Gleixner, Jason Xing, Jakub Kicinski,
	Eric Dumazet, Sebastian Andrzej Siewior, Paul E. McKenney,
	Peter Zijlstra, netdev, x86, linux-kernel, maz

The following commit has been merged into the irq/core branch of tip:

Commit-ID:     d15121be7485655129101f3960ae6add40204463
Gitweb:        https://git.kernel.org/tip/d15121be7485655129101f3960ae6add40204463
Author:        Paolo Abeni <pabeni@redhat.com>
AuthorDate:    Mon, 08 May 2023 08:17:44 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 09 May 2023 21:50:27 +02:00

Revert "softirq: Let ksoftirqd do its job"

This reverts the following commits:

  4cd13c21b207 ("softirq: Let ksoftirqd do its job")
  3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")
  1342d8080f61 ("softirq: Don't skip softirq execution when softirq
                 thread is parking")

in a single change to avoid known bad intermediate states introduced by
a patch series reverting them individually.

Due to the mentioned commit, when the ksoftirqd threads take charge of
softirq processing, the system can experience high latencies.

In the past a few workarounds have been implemented for specific
side-effects of the initial ksoftirqd enforcement commit:

commit 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is
not deferred")

commit 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs
do not get to run")

commit 217f69743681 ("net: busy-poll: allow preemption in
sk_busy_loop()")

commit 3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")

But the latency problem still exists in real-life workloads, see the
link below.

The reverted commit intended to solve a live-lock scenario that can now
be addressed with the NAPI threaded mode, introduced with commit
29863d41bb6e ("net: implement threaded-able napi poll loop support"),
which is nowadays in a pretty stable status.

While a complete solution to put softirq processing under nice resource
control would be preferable, that has proven to be a very hard task. In
the short term, remove the main pain point, and also simplify a bit the
current softirq implementation.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/netdev/305d7742212cbe98621b16be782b0562f1012cb6.camel@redhat.com
Link: https://lore.kernel.org/r/57e66b364f1b6f09c9bc0316742c3b14f4ce83bd.1683526542.git.pabeni@redhat.com

---
 kernel/softirq.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 1b72551..807b34c 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -80,21 +80,6 @@ static void wakeup_softirqd(void)
 		wake_up_process(tsk);
 }
 
-/*
- * If ksoftirqd is scheduled, we do not want to process pending softirqs
- * right now. Let ksoftirqd handle this at its own rate, to get fairness,
- * unless we're doing some of the synchronous softirqs.
- */
-#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
-static bool ksoftirqd_running(unsigned long pending)
-{
-	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
-
-	if (pending & SOFTIRQ_NOW_MASK)
-		return false;
-	return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
-}
-
 #ifdef CONFIG_TRACE_IRQFLAGS
 DEFINE_PER_CPU(int, hardirqs_enabled);
 DEFINE_PER_CPU(int, hardirq_context);
@@ -236,7 +221,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 		goto out;
 
 	pending = local_softirq_pending();
-	if (!pending || ksoftirqd_running(pending))
+	if (!pending)
 		goto out;
 
 	/*
@@ -432,9 +417,6 @@ static inline bool should_wake_ksoftirqd(void)
 
 static inline void invoke_softirq(void)
 {
-	if (ksoftirqd_running(local_softirq_pending()))
-		return;
-
 	if (!force_irqthreads() || !__this_cpu_read(ksoftirqd)) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
@@ -468,7 +450,7 @@ asmlinkage __visible void do_softirq(void)
 
 	pending = local_softirq_pending();
 
-	if (pending && !ksoftirqd_running(pending))
+	if (pending)
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);

^ permalink raw reply related	[flat|nested] 38+ messages in thread
end of thread, other threads:[~2023-05-09 19:56 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski
2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski
2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski
2023-01-31 22:32   ` Jakub Kicinski
2023-03-03 13:30   ` Thomas Gleixner
2023-03-03 15:18     ` Thomas Gleixner
2023-03-03 21:31     ` Jakub Kicinski
2023-03-03 22:37       ` Paul E. McKenney
2023-03-03 23:25         ` Dave Taht
2023-03-04  1:14           ` Paul E. McKenney
2023-03-03 23:36         ` Paul E. McKenney
2023-03-03 23:44           ` Jakub Kicinski
2023-03-04  1:25             ` Paul E. McKenney
2023-03-04  1:39               ` Jakub Kicinski
2023-03-04  3:11                 ` Paul E. McKenney
2023-03-04 20:48                   ` Paul E. McKenney
2023-03-05 20:43                     ` Thomas Gleixner
2023-03-05 22:42                       ` Paul E. McKenney
2023-03-05 23:00                         ` Frederic Weisbecker
2023-03-06  4:30                           ` Paul E. McKenney
2023-03-06 11:22                             ` Frederic Weisbecker
2023-03-06  9:13                         ` David Laight
2023-03-06 11:57                           ` Frederic Weisbecker
2023-03-06 14:57                             ` Paul E. McKenney
2023-03-07  0:51                               ` Jakub Kicinski
2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski
2023-01-09  9:44   ` Peter Zijlstra
2023-01-09 10:16     ` Eric Dumazet
2023-01-09 19:12       ` Jakub Kicinski
2023-03-03 11:41 ` Thomas Gleixner
2023-03-03 14:17   ` Thomas Gleixner
2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni
2023-04-20 17:41   ` Eric Dumazet
2023-04-20 20:23     ` Paolo Abeni
2023-04-21  2:48   ` Jason Xing
2023-04-21  9:33     ` Paolo Abeni
2023-04-21  9:46       ` Jason Xing
2023-05-09 19:56 ` [tip: irq/core] Revert "softirq: Let ksoftirqd do its job" tip-bot2 for Paolo Abeni