* [PATCH 0/3] softirq: uncontroversial change
@ 2022-12-22 22:12 Jakub Kicinski
2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw)
To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski
Catching up on LWN I ran across the article about softirq
changes, and then I noticed fresh patches in Peter's tree.
So probably wise for me to throw these out there.
My (can I say Meta's?) problem is the opposite to what the RT
sensitive people complain about. In the current scheme once
ksoftirqd is woken no network processing happens until it runs.
When networking genuinely gets overloaded - that's probably fair; the problem
is that we confuse latency tweaks with overload protection. We have
a need_resched() check in the loop condition (which is a latency tweak).
Most often we defer to ksoftirqd because we're trying to be nice
and let user space respond quickly, not because there is an
overload. But user space may not be nice, and may sit on the CPU
for 10ms+. Also the sirq's "work allowance" is 2ms, which is
uncomfortably close to the timer tick, but that's another story.
We have a sirq latency tracker in our prod kernel which catches
8ms+ stalls of net Tx (packets queued to the NIC but there is
no NAPI cleanup within 8ms) and with these patches applied
on 5.19 fully loaded web machine sees a drop in stalls from
1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing
TCP retransmissions and ~10% drop in non-TLP incoming ones.
This is not a network-heavy workload so most of the rtx are
due to scheduling artifacts.
The network latency in a datacenter (around 10us) is a neat
1000x lower than the scheduling granularity.
These patches (patch 2 is "the meat") change what we recognize
as overload. Instead of just checking whether "ksoftirqd is woken",
they also cap how long we consider ourselves to be in overload,
with a time limit that differs based on whether we yield due
to real resource exhaustion vs just hitting that need_resched().
I hope the core concept is not entirely idiotic. It'd be great
if we could get this in or fold an equivalent concept into ongoing
work from others, because due to various "scheduler improvements"
every time we upgrade the production kernel this problem is getting
worse :(
Jakub Kicinski (3):
softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle()
softirq: avoid spurious stalls due to need_resched()
softirq: don't yield if only expedited handlers are pending
kernel/softirq.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
--
2.38.1
^ permalink raw reply [flat|nested] 38+ messages in thread* [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski @ 2022-12-22 22:12 ` Jakub Kicinski 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski ` (2 subsequent siblings) 3 siblings, 0 replies; 38+ messages in thread From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski ksoftirqd_running() takes the high priority softirqs into consideration, so ksoftirqd_should_handle() seems like a better name. Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- kernel/softirq.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index c8a6913c067d..00b838d566c1 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -86,7 +86,7 @@ static void wakeup_softirqd(void) * unless we're doing some of the synchronous softirqs. 
*/ #define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ)) -static bool ksoftirqd_running(unsigned long pending) +static bool ksoftirqd_should_handle(unsigned long pending) { struct task_struct *tsk = __this_cpu_read(ksoftirqd); @@ -236,7 +236,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt) goto out; pending = local_softirq_pending(); - if (!pending || ksoftirqd_running(pending)) + if (!pending || ksoftirqd_should_handle(pending)) goto out; /* @@ -432,7 +432,7 @@ static inline bool should_wake_ksoftirqd(void) static inline void invoke_softirq(void) { - if (ksoftirqd_running(local_softirq_pending())) + if (ksoftirqd_should_handle(local_softirq_pending())) return; if (!force_irqthreads() || !__this_cpu_read(ksoftirqd)) { @@ -468,7 +468,7 @@ asmlinkage __visible void do_softirq(void) pending = local_softirq_pending(); - if (pending && !ksoftirqd_running(pending)) + if (pending && !ksoftirqd_should_handle(pending)) do_softirq_own_stack(); local_irq_restore(flags); -- 2.38.1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski 2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski @ 2022-12-22 22:12 ` Jakub Kicinski 2023-01-31 22:32 ` Jakub Kicinski 2023-03-03 13:30 ` Thomas Gleixner 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni 3 siblings, 2 replies; 38+ messages in thread From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski need_resched() added in commit c10d73671ad3 ("softirq: reduce latencies") does improve latency for real workloads (for example memcache). Unfortunately it triggers quite often even for non-network-heavy apps (~900 times a second on a loaded webserver), and in a small fraction of cases whatever the scheduler decided to run will hold onto the CPU for the entire time slice. 10ms+ stalls on a machine which is not actually under overload cause erratic network behavior and spurious TCP retransmits. Typical end-to-end latency in a datacenter is < 200us so it's common to set TCP timeout to 10ms or less. The intent of the need_resched() is to let a low latency application respond quickly and yield (to ksoftirqd). Put a time limit on this dance. Ignore the fact that ksoftirqd is RUNNING if we were trying to be nice and the application did not yield quickly. On a webserver loaded at 90% CPU this change reduces the number of 8ms+ stalls the network softirq processing sees by around 10x (2/sec -> 0.2/sec). It also seems to reduce retransmissions by ~10% but the data is quite noisy.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- kernel/softirq.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index 00b838d566c1..ad200d386ec1 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -59,6 +59,7 @@ EXPORT_PER_CPU_SYMBOL(irq_stat); static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp; DEFINE_PER_CPU(struct task_struct *, ksoftirqd); +static DEFINE_PER_CPU(unsigned long, overload_limit); const char * const softirq_to_name[NR_SOFTIRQS] = { "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL", @@ -89,10 +90,15 @@ static void wakeup_softirqd(void) static bool ksoftirqd_should_handle(unsigned long pending) { struct task_struct *tsk = __this_cpu_read(ksoftirqd); + unsigned long ov_limit; if (pending & SOFTIRQ_NOW_MASK) return false; - return tsk && task_is_running(tsk) && !__kthread_should_park(tsk); + if (likely(!tsk || !task_is_running(tsk) || __kthread_should_park(tsk))) + return false; + + ov_limit = __this_cpu_read(overload_limit); + return time_is_after_jiffies(ov_limit); } #ifdef CONFIG_TRACE_IRQFLAGS @@ -492,6 +498,9 @@ asmlinkage __visible void do_softirq(void) #define MAX_SOFTIRQ_TIME msecs_to_jiffies(2) #define MAX_SOFTIRQ_RESTART 10 +#define SOFTIRQ_OVERLOAD_TIME msecs_to_jiffies(100) +#define SOFTIRQ_DEFER_TIME msecs_to_jiffies(2) + #ifdef CONFIG_TRACE_IRQFLAGS /* * When we run softirqs from irq_exit() and thus on the hardirq stack we need @@ -588,10 +597,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) pending = local_softirq_pending(); if (pending) { - if (time_before(jiffies, end) && !need_resched() && - --max_restart) + unsigned long limit; + + if (time_is_before_eq_jiffies(end) || !--max_restart) + limit = SOFTIRQ_OVERLOAD_TIME; + else if (need_resched()) + limit = SOFTIRQ_DEFER_TIME; + else goto restart; + __this_cpu_write(overload_limit, jiffies + limit); wakeup_softirqd(); } -- 2.38.1 ^ permalink raw reply 
related [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski @ 2023-01-31 22:32 ` Jakub Kicinski 2023-03-03 13:30 ` Thomas Gleixner 1 sibling, 0 replies; 38+ messages in thread From: Jakub Kicinski @ 2023-01-31 22:32 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel On Thu, 22 Dec 2022 14:12:43 -0800 Jakub Kicinski wrote: > need_resched() added in commit c10d73671ad3 ("softirq: reduce latencies") > does improve latency for real workloads (for example memcache). > Unfortunately it triggers quite often even for non-network-heavy apps > (~900 times a second on a loaded webserver), and in small fraction of > cases whatever the scheduler decided to run will hold onto the CPU > for the entire time slice. > > 10ms+ stalls on a machine which is not actually under overload cause > erratic network behavior and spurious TCP retransmits. Typical end-to-end > latency in a datacenter is < 200us so its common to set TCP timeout > to 10ms or less. > > The intent of the need_resched() is to let a low latency application > respond quickly and yield (to ksoftirqd). Put a time limit on this dance. > Ignore the fact that ksoftirqd is RUNNING if we were trying to be nice > and the application did not yield quickly. > > On a webserver loaded at 90% CPU this change reduces the numer of 8ms+ > stalls the network softirq processing sees by around 10x (2/sec -> 0.2/sec). > It also seems to reduce retransmissions by ~10% but the data is quite > noisy. Peter, is there a chance you could fold this patch into your ongoing softirq rework? We can't both work on softirq in parallel, unfortunately and this improvement is really key to counter balance whatever heuristics CFS accumulated between 5.12 and 5.19 :( Not to use the "r-word". I can spin a version of this on top of your core/softirq branch, would that work? 
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski 2023-01-31 22:32 ` Jakub Kicinski @ 2023-03-03 13:30 ` Thomas Gleixner 2023-03-03 15:18 ` Thomas Gleixner 2023-03-03 21:31 ` Jakub Kicinski 1 sibling, 2 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 13:30 UTC (permalink / raw) To: Jakub Kicinski, peterz Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski Jakub! On Thu, Dec 22 2022 at 14:12, Jakub Kicinski wrote: > DEFINE_PER_CPU(struct task_struct *, ksoftirqd); > +static DEFINE_PER_CPU(unsigned long, overload_limit); > > const char * const softirq_to_name[NR_SOFTIRQS] = { > "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL", > @@ -89,10 +90,15 @@ static void wakeup_softirqd(void) > static bool ksoftirqd_should_handle(unsigned long pending) > { > struct task_struct *tsk = __this_cpu_read(ksoftirqd); > + unsigned long ov_limit; > > if (pending & SOFTIRQ_NOW_MASK) > return false; > - return tsk && task_is_running(tsk) && !__kthread_should_park(tsk); > + if (likely(!tsk || !task_is_running(tsk) || __kthread_should_park(tsk))) > + return false; > + > + ov_limit = __this_cpu_read(overload_limit); > + return time_is_after_jiffies(ov_limit); return time_is_after_jiffies(__this_cpu_read(overload_limit)); Plus a comment explaining the magic, please. 
> } > > #ifdef CONFIG_TRACE_IRQFLAGS > @@ -492,6 +498,9 @@ asmlinkage __visible void do_softirq(void) > #define MAX_SOFTIRQ_TIME msecs_to_jiffies(2) > #define MAX_SOFTIRQ_RESTART 10 > > +#define SOFTIRQ_OVERLOAD_TIME msecs_to_jiffies(100) > +#define SOFTIRQ_DEFER_TIME msecs_to_jiffies(2) > + > #ifdef CONFIG_TRACE_IRQFLAGS > /* > * When we run softirqs from irq_exit() and thus on the hardirq stack we need > @@ -588,10 +597,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) > > pending = local_softirq_pending(); > if (pending) { > - if (time_before(jiffies, end) && !need_resched() && > - --max_restart) > + unsigned long limit; > + > + if (time_is_before_eq_jiffies(end) || !--max_restart) > + limit = SOFTIRQ_OVERLOAD_TIME; > + else if (need_resched()) > + limit = SOFTIRQ_DEFER_TIME; > + else > goto restart; > > + __this_cpu_write(overload_limit, jiffies + limit); The logic of all this is non-obvious and I had to reread it 5 times to conclude that it is matching the intent. Please add comments. While I'm not a big fan of heuristical duct tape, this looks harmless enough to not end up in an endless stream of tweaking. Famous last words... But without the sched_clock() changes the actual defer time depends on HZ and the point in time where limit is set. That means it ranges from 0 to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in the worst case, which perhaps explains the 8ms+ stalls you are still observing. Can you test with that sched_clock change applied, i.e. the first two commits from git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq 59be25c466d9 ("softirq: Use sched_clock() based timeout") bd5a5bd77009 ("softirq: Rewrite softirq processing loop") whether that makes a difference? Those two can be applied with some minor polishing. The rest of that series is broken by f10020c97f4c ("softirq: Allow early break"). There is another issue with this overload limit. 
Assume max_restart or timeout triggered and limit was set to now + 100ms. ksoftirqd runs and gets the issue resolved after 10ms. So for the remaining 90ms any invocation of raise_softirq() outside of (soft)interrupt context, which wakes ksoftirqd again, prevents processing on return from interrupt until ksoftirqd gets on the CPU and goes back to sleep, because task_is_running() == true and the stale limit is not after jiffies. Probably not a big issue, but someone will notice on some weird workload sooner than later and the tweaking will start nevertheless. :) So maybe we fix it right away. :) Thanks, tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 13:30 ` Thomas Gleixner @ 2023-03-03 15:18 ` Thomas Gleixner 2023-03-03 21:31 ` Jakub Kicinski 1 sibling, 0 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 15:18 UTC (permalink / raw) To: Jakub Kicinski, peterz Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski Jakub! On Fri, Mar 03 2023 at 14:30, Thomas Gleixner wrote: > On Thu, Dec 22 2022 at 14:12, Jakub Kicinski wrote: > But without the sched_clock() changes the actual defer time depends on > HZ and the point in time where limit is set. That means it ranges from 0 > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > the worst case, which perhaps explains the 8ms+ stalls you are still > observing. Can you test with that sched_clock change applied, i.e. the > first two commits from > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > whether that makes a difference? Those two can be applied with some > minor polishing. The rest of that series is broken by f10020c97f4c > ("softirq: Allow early break"). WHile staring I noticed that the current jiffies based time limit handling has the exact same problem. For HZ=100 and HZ=250 MAX_SOFTIRQ_TIME resolves to 1 jiffy. So the window is between 0 and 1/HZ. Not really useful. Thanks, tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 13:30 ` Thomas Gleixner 2023-03-03 15:18 ` Thomas Gleixner @ 2023-03-03 21:31 ` Jakub Kicinski 2023-03-03 22:37 ` Paul E. McKenney 2023-03-05 20:43 ` Thomas Gleixner 1 sibling, 2 replies; 38+ messages in thread From: Jakub Kicinski @ 2023-03-03 21:31 UTC (permalink / raw) To: Thomas Gleixner Cc: peterz, jstultz, edumazet, netdev, linux-kernel, Paul E. McKenney On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > - if (time_before(jiffies, end) && !need_resched() && > > - --max_restart) > > + unsigned long limit; > > + > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > + limit = SOFTIRQ_OVERLOAD_TIME; > > + else if (need_resched()) > > + limit = SOFTIRQ_DEFER_TIME; > > + else > > goto restart; > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > The logic of all this is non-obvious and I had to reread it 5 times to > conclude that it is matching the intent. Please add comments. > > While I'm not a big fan of heuristical duct tape, this looks harmless > enough to not end up in an endless stream of tweaking. Famous last > words... Would it all be more readable if I named the "overload_limit" "overloaded_until" instead? Naming.. I'll add comments, too. > But without the sched_clock() changes the actual defer time depends on > HZ and the point in time where limit is set. That means it ranges from 0 > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > the worst case, which perhaps explains the 8ms+ stalls you are still > observing. Can you test with that sched_clock change applied, i.e. 
the > first two commits from > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") Those will help, but I spent some time digging into the jiffies related warts with kprobes - while annoying they weren't a major source of wake ups. (FWIW the jiffies noise on our workloads is due to cgroup stats disabling IRQs for multiple ms on the timekeeping CPU). Here are fresh stats on why we wake up ksoftirqd on our Web workload (collected over 100 sec): Time exceeded: 484 Loop max run out: 6525 need_resched(): 10219 (control: 17226 - number of times wakeup_process called for ksirqd) As you can see need_resched() dominates. Zooming into the time exceeded - we can count nanoseconds between __do_softirq starting and the check. This is the histogram of actual usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): [256, 512) 1 | | [512, 1K) 0 | | [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| So yes, we can probably save ourselves ~200 wakeup with a better clock but that's just 1.3% of the total wake ups :( Now - now about the max loop count. I ORed the pending softirqs every time we get to the end of the loop. Looks like vast majority of the loop counter wake ups are exclusively due to RCU: @looped[512]: 5516 Where 512 is the ORed pending mask over all iterations 512 == 1 << RCU_SOFTIRQ. And they usually take less than 100us to consume the 10 iterations. Histogram of usecs consumed when we run out of loop iterations: [16, 32) 3 | | [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64, 128) 871 |@@@@@@@@@ | [128, 256) 34 | | [256, 512) 9 | | [512, 1K) 262 |@@ | [1K, 2K) 35 | | [2K, 4K) 1 | | Paul, is this expected? Is RCU not trying too hard to be nice? 
# cat /sys/module/rcutree/parameters/blimit 10 Or should we perhaps just raise the loop limit? Breaking after less than 100usec seems excessive :( > whether that makes a difference? Those two can be applied with some > minor polishing. The rest of that series is broken by f10020c97f4c > ("softirq: Allow early break"). > > There is another issue with this overload limit. Assume max_restart or > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > gets the issue resolved after 10ms. > > So for the remaining 90ms any invocation of raise_softirq() outside of > (soft)interrupt context, which wakes ksoftirqd again, prevents > processing on return from interrupt until ksoftirqd gets on the CPU and > goes back to sleep, because task_is_running() == true and the stale > limit is not after jiffies. > > Probably not a big issue, but someone will notice on some weird workload > sooner than later and the tweaking will start nevertheless. :) So maybe > we fix it right away. :) Hm, Paolo raised this point as well, but the overload time is strictly to stop paying attention to the fact ksoftirqd is running. IOW current kernels behave as if they had overload_limit of infinity. The current code already prevents processing until ksoftirqd schedules in, after raise_softirq() from a funky context. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 21:31 ` Jakub Kicinski @ 2023-03-03 22:37 ` Paul E. McKenney 2023-03-03 23:25 ` Dave Taht 2023-03-03 23:36 ` Paul E. McKenney 2023-03-05 20:43 ` Thomas Gleixner 1 sibling, 2 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-03 22:37 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > - if (time_before(jiffies, end) && !need_resched() && > > > - --max_restart) > > > + unsigned long limit; > > > + > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > + else if (need_resched()) > > > + limit = SOFTIRQ_DEFER_TIME; > > > + else > > > goto restart; > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > conclude that it is matching the intent. Please add comments. > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > enough to not end up in an endless stream of tweaking. Famous last > > words... > > Would it all be more readable if I named the "overload_limit" > "overloaded_until" instead? Naming.. > I'll add comments, too. > > > But without the sched_clock() changes the actual defer time depends on > > HZ and the point in time where limit is set. That means it ranges from 0 > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > the worst case, which perhaps explains the 8ms+ stalls you are still > > observing. Can you test with that sched_clock change applied, i.e. 
the > > first two commits from > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > Those will help, but I spent some time digging into the jiffies related > warts with kprobes - while annoying they weren't a major source of wake > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > disabling IRQs for multiple ms on the timekeeping CPU). > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > (collected over 100 sec): > > Time exceeded: 484 > Loop max run out: 6525 > need_resched(): 10219 > (control: 17226 - number of times wakeup_process called for ksirqd) > > As you can see need_resched() dominates. > > Zooming into the time exceeded - we can count nanoseconds between > __do_softirq starting and the check. This is the histogram of actual > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > [256, 512) 1 | | > [512, 1K) 0 | | > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > So yes, we can probably save ourselves ~200 wakeup with a better clock > but that's just 1.3% of the total wake ups :( > > > Now - now about the max loop count. I ORed the pending softirqs every > time we get to the end of the loop. Looks like vast majority of the > loop counter wake ups are exclusively due to RCU: > > @looped[512]: 5516 > > Where 512 is the ORed pending mask over all iterations > 512 == 1 << RCU_SOFTIRQ. > > And they usually take less than 100us to consume the 10 iterations. > Histogram of usecs consumed when we run out of loop iterations: > > [16, 32) 3 | | > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [64, 128) 871 |@@@@@@@@@ | > [128, 256) 34 | | > [256, 512) 9 | | > [512, 1K) 262 |@@ | > [1K, 2K) 35 | | > [2K, 4K) 1 | | > > Paul, is this expected? 
Is RCU not trying too hard to be nice? This is from way back in the day, so it is quite possible that better tuning and/or better heuristics should be applied. On the other hand, 100 microseconds is a good long time from an CONFIG_PREEMPT_RT=y perspective! > # cat /sys/module/rcutree/parameters/blimit > 10 > > Or should we perhaps just raise the loop limit? Breaking after less > than 100usec seems excessive :( But note that RCU also has rcutree.rcu_divisor, which defaults to 7. And an rcutree.rcu_resched_ns, which defaults to three milliseconds (3,000,000 nanoseconds). This means that RCU will do: o All the callbacks if there are less than ten. o Ten callbacks or 1/128th of them, whichever is larger. o Unless the larger of them is more than 100 callbacks, in which case there is an additional limit of three milliseconds worth of them. Except that if a given CPU ends up with more than 10,000 callbacks (rcutree.qhimark), that CPU's blimit is set to 10,000. So there is much opportunity to tune the existing heuristics and also much opportunity to tweak the heuristics themselves. But let's see a good use case before tweaking, please. ;-) Thanx, Paul > > whether that makes a difference? Those two can be applied with some > > minor polishing. The rest of that series is broken by f10020c97f4c > > ("softirq: Allow early break"). > > > > There is another issue with this overload limit. Assume max_restart or > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > > gets the issue resolved after 10ms. > > > > So for the remaining 90ms any invocation of raise_softirq() outside of > > (soft)interrupt context, which wakes ksoftirqd again, prevents > > processing on return from interrupt until ksoftirqd gets on the CPU and > > goes back to sleep, because task_is_running() == true and the stale > > limit is not after jiffies. 
> > > > Probably not a big issue, but someone will notice on some weird workload > > sooner than later and the tweaking will start nevertheless. :) So maybe > > we fix it right away. :) > > Hm, Paolo raised this point as well, but the overload time is strictly > to stop paying attention to the fact ksoftirqd is running. > IOW current kernels behave as if they had overload_limit of infinity. > > The current code already prevents processing until ksoftirqd schedules > in, after raise_softirq() from a funky context. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 22:37 ` Paul E. McKenney @ 2023-03-03 23:25 ` Dave Taht 2023-03-04 1:14 ` Paul E. McKenney 2023-03-03 23:36 ` Paul E. McKenney 1 sibling, 1 reply; 38+ messages in thread From: Dave Taht @ 2023-03-03 23:25 UTC (permalink / raw) To: paulmck Cc: Jakub Kicinski, Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 3, 2023 at 2:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > > - if (time_before(jiffies, end) && !need_resched() && > > > > - --max_restart) > > > > + unsigned long limit; > > > > + > > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > > + else if (need_resched()) > > > > + limit = SOFTIRQ_DEFER_TIME; > > > > + else > > > > goto restart; > > > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > > conclude that it is matching the intent. Please add comments. > > > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > > enough to not end up in an endless stream of tweaking. Famous last > > > words... > > > > Would it all be more readable if I named the "overload_limit" > > "overloaded_until" instead? Naming.. > > I'll add comments, too. > > > > > But without the sched_clock() changes the actual defer time depends on > > > HZ and the point in time where limit is set. That means it ranges from 0 > > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > > the worst case, which perhaps explains the 8ms+ stalls you are still > > > observing. Can you test with that sched_clock change applied, i.e. 
the > > > first two commits from > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > > > Those will help, but I spent some time digging into the jiffies related > > warts with kprobes - while annoying they weren't a major source of wake > > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > > disabling IRQs for multiple ms on the timekeeping CPU). > > > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > > (collected over 100 sec): > > > > Time exceeded: 484 > > Loop max run out: 6525 > > need_resched(): 10219 > > (control: 17226 - number of times wakeup_process called for ksirqd) > > > > As you can see need_resched() dominates. > > > > Zooming into the time exceeded - we can count nanoseconds between > > __do_softirq starting and the check. This is the histogram of actual > > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > > > [256, 512) 1 | | > > [512, 1K) 0 | | > > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > So yes, we can probably save ourselves ~200 wakeup with a better clock > > but that's just 1.3% of the total wake ups :( > > > > > > Now - now about the max loop count. I ORed the pending softirqs every > > time we get to the end of the loop. Looks like vast majority of the > > loop counter wake ups are exclusively due to RCU: > > > > @looped[512]: 5516 > > > > Where 512 is the ORed pending mask over all iterations > > 512 == 1 << RCU_SOFTIRQ. > > > > And they usually take less than 100us to consume the 10 iterations. 
> > Histogram of usecs consumed when we run out of loop iterations: > > > > [16, 32) 3 | | > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > [64, 128) 871 |@@@@@@@@@ | > > [128, 256) 34 | | > > [256, 512) 9 | | > > [512, 1K) 262 |@@ | > > [1K, 2K) 35 | | > > [2K, 4K) 1 | | > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > This is from way back in the day, so it is quite possible that better > tuning and/or better heuristics should be applied. > > On the other hand, 100 microseconds is a good long time from an > CONFIG_PREEMPT_RT=y perspective! All I have to add to this conversation is the observation that sampling things at the nyquist rate helps to observe problems like these. So if you care about sub 8ms response time, a sub 4ms sampling rate is needed. > > # cat /sys/module/rcutree/parameters/blimit > > 10 > > > > Or should we perhaps just raise the loop limit? Breaking after less > > than 100usec seems excessive :( > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > (3,000,000 nanoseconds). This means that RCU will do: > > o All the callbacks if there are less than ten. > > o Ten callbacks or 1/128th of them, whichever is larger. > > o Unless the larger of them is more than 100 callbacks, in which > case there is an additional limit of three milliseconds worth > of them. > > Except that if a given CPU ends up with more than 10,000 callbacks > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > So there is much opportunity to tune the existing heuristics and also > much opportunity to tweak the heuristics themselves. This I did not know, and to best observe rcu in action nyquist is 1.5ms... Something with less constants and more curves seems in order. > > But let's see a good use case before tweaking, please. ;-) > > Thanx, Paul > > > > whether that makes a difference? 
Those two can be applied with some > > > minor polishing. The rest of that series is broken by f10020c97f4c > > > ("softirq: Allow early break"). > > > > > > There is another issue with this overload limit. Assume max_restart or > > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > > > gets the issue resolved after 10ms. > > > > > > So for the remaining 90ms any invocation of raise_softirq() outside of > > > (soft)interrupt context, which wakes ksoftirqd again, prevents > > > processing on return from interrupt until ksoftirqd gets on the CPU and > > > goes back to sleep, because task_is_running() == true and the stale > > > limit is not after jiffies. > > > > > > Probably not a big issue, but someone will notice on some weird workload > > > sooner than later and the tweaking will start nevertheless. :) So maybe > > > we fix it right away. :) > > > > Hm, Paolo raised this point as well, but the overload time is strictly > > to stop paying attention to the fact ksoftirqd is running. > > IOW current kernels behave as if they had overload_limit of infinity. > > > > The current code already prevents processing until ksoftirqd schedules > > in, after raise_softirq() from a funky context. -- A pithy note on VOQs vs SQM: https://blog.cerowrt.org/post/juniper/ Dave Täht CEO, TekLibre, LLC ^ permalink raw reply [flat|nested] 38+ messages in thread
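For readers decoding the BPF stats quoted above: the ORed pending mask `@looped[512]` is bit 9, the RCU softirq. A minimal sketch of the vector numbering, following include/linux/interrupt.h:

```c
#include <assert.h>

/* Softirq vector numbering as in include/linux/interrupt.h; the ORed
 * pending mask @looped[512] therefore decodes to 1 << RCU_SOFTIRQ. */
enum softirq_nr {
	HI_SOFTIRQ = 0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	IRQ_POLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,	/* bit 9: 1 << 9 == 512 */
	NR_SOFTIRQS
};

/* Return nonzero if the given vector is the only one pending in mask. */
static inline int only_pending(unsigned int mask, enum softirq_nr nr)
{
	return mask == (1u << nr);
}
```

So a mask of 512 over all iterations means RCU was the sole vector still pending each time the loop limit was hit.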
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 23:25 ` Dave Taht @ 2023-03-04 1:14 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 1:14 UTC (permalink / raw) To: Dave Taht Cc: Jakub Kicinski, Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 03:25:32PM -0800, Dave Taht wrote: > On Fri, Mar 3, 2023 at 2:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > > > - if (time_before(jiffies, end) && !need_resched() && > > > > > - --max_restart) > > > > > + unsigned long limit; > > > > > + > > > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > > > + else if (need_resched()) > > > > > + limit = SOFTIRQ_DEFER_TIME; > > > > > + else > > > > > goto restart; > > > > > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > > > conclude that it is matching the intent. Please add comments. > > > > > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > > > enough to not end up in an endless stream of tweaking. Famous last > > > > words... > > > > > > Would it all be more readable if I named the "overload_limit" > > > "overloaded_until" instead? Naming.. > > > I'll add comments, too. > > > > > > > But without the sched_clock() changes the actual defer time depends on > > > > HZ and the point in time where limit is set. That means it ranges from 0 > > > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > > > the worst case, which perhaps explains the 8ms+ stalls you are still > > > > observing. Can you test with that sched_clock change applied, i.e. 
the > > > > first two commits from > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > > > > > Those will help, but I spent some time digging into the jiffies related > > > warts with kprobes - while annoying they weren't a major source of wake > > > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > > > disabling IRQs for multiple ms on the timekeeping CPU). > > > > > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > > > (collected over 100 sec): > > > > > > Time exceeded: 484 > > > Loop max run out: 6525 > > > need_resched(): 10219 > > > (control: 17226 - number of times wakeup_process called for ksirqd) > > > > > > As you can see need_resched() dominates. > > > > > > Zooming into the time exceeded - we can count nanoseconds between > > > __do_softirq starting and the check. This is the histogram of actual > > > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > > > > > [256, 512) 1 | | > > > [512, 1K) 0 | | > > > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > > > So yes, we can probably save ourselves ~200 wakeup with a better clock > > > but that's just 1.3% of the total wake ups :( > > > > > > > > > Now - now about the max loop count. I ORed the pending softirqs every > > > time we get to the end of the loop. Looks like vast majority of the > > > loop counter wake ups are exclusively due to RCU: > > > > > > @looped[512]: 5516 > > > > > > Where 512 is the ORed pending mask over all iterations > > > 512 == 1 << RCU_SOFTIRQ. > > > > > > And they usually take less than 100us to consume the 10 iterations. 
> > > Histogram of usecs consumed when we run out of loop iterations: > > > > > > [16, 32) 3 | | > > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > [64, 128) 871 |@@@@@@@@@ | > > > [128, 256) 34 | | > > > [256, 512) 9 | | > > > [512, 1K) 262 |@@ | > > > [1K, 2K) 35 | | > > > [2K, 4K) 1 | | > > > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > > > This is from way back in the day, so it is quite possible that better > > tuning and/or better heuristics should be applied. > > > > On the other hand, 100 microseconds is a good long time from an > > CONFIG_PREEMPT_RT=y perspective! > > All I have to add to this conversation is the observation that > sampling things at the > nyquist rate helps to observe problems like these. > > So if you care about sub 8ms response time, a sub 4ms sampling rate is needed. My guess is that Jakub is side-stepping Nyquist by sampling every call to and return from the rcu_do_batch() function. > > > # cat /sys/module/rcutree/parameters/blimit > > > 10 > > > > > > Or should we perhaps just raise the loop limit? Breaking after less > > > than 100usec seems excessive :( > > > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > > (3,000,000 nanoseconds). This means that RCU will do: > > > > o All the callbacks if there are less than ten. > > > > o Ten callbacks or 1/128th of them, whichever is larger. > > > > o Unless the larger of them is more than 100 callbacks, in which > > case there is an additional limit of three milliseconds worth > > of them. > > > > Except that if a given CPU ends up with more than 10,000 callbacks > > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > > > So there is much opportunity to tune the existing heuristics and also > > much opportunity to tweak the heuristics themselves. > > This I did not know, and to best observe rcu in action nyquist is 1.5ms... 
This is not an oscillator, and because this all happens within a single
system, you cannot hang your hat on speed-of-light delays.  In addition,
an application can dump thousands of callbacks down RCU's throat in a
very short time, which changes RCU's timing.

Also, the time constants for expedited grace periods are typically in
the tens of microseconds.  Something about prioritizing survivability
over measurability.  ;-)

But that is OK because ftrace and BPF can provide fine-grained
measurements quite cheaply.

> Something with less constants and more curves seems in order.

In the immortal words of MS-DOS, are you sure?

							Thanx, Paul

> > But let's see a good use case before tweaking, please.  ;-)
> >
> > 							Thanx, Paul
> >
> > > > whether that makes a difference? Those two can be applied with some
> > > > minor polishing. The rest of that series is broken by f10020c97f4c
> > > > ("softirq: Allow early break").
> > > >
> > > > There is another issue with this overload limit. Assume max_restart or
> > > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and
> > > > gets the issue resolved after 10ms.
> > > >
> > > > So for the remaining 90ms any invocation of raise_softirq() outside of
> > > > (soft)interrupt context, which wakes ksoftirqd again, prevents
> > > > processing on return from interrupt until ksoftirqd gets on the CPU and
> > > > goes back to sleep, because task_is_running() == true and the stale
> > > > limit is not after jiffies.
> > > >
> > > > Probably not a big issue, but someone will notice on some weird workload
> > > > sooner than later and the tweaking will start nevertheless. :) So maybe
> > > > we fix it right away. :)
> > >
> > > Hm, Paolo raised this point as well, but the overload time is strictly
> > > to stop paying attention to the fact ksoftirqd is running.
> > > IOW current kernels behave as if they had overload_limit of infinity.
> > > The current code already prevents processing until ksoftirqd schedules
> > > in, after raise_softirq() from a funky context.
>
> --
> A pithy note on VOQs vs SQM: https://blog.cerowrt.org/post/juniper/
> Dave Täht CEO, TekLibre, LLC
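The end-of-batch decision from the patch quoted in this message can be sketched as a standalone model. The millisecond clock and the 100ms/2ms constants here are illustrative stand-ins for jiffies and the SOFTIRQ_OVERLOAD_TIME / SOFTIRQ_DEFER_TIME values discussed in the thread; the real code uses `time_is_before_eq_jiffies()` and a per-CPU `overload_limit` variable.

```c
#include <assert.h>
#include <stdbool.h>

#define OVERLOAD_MS 100	/* assumed: long "stop trusting ksoftirqd" window */
#define DEFER_MS      2	/* assumed: short defer for need_resched() */

enum batch_end { RESTART, DEFER, OVERLOAD };

/* Decide what to do at the end of one __do_softirq() batch.
 * budget_exceeded: the time budget ran out
 * max_restart:     remaining loop iterations (starts at 10)
 * resched:         a task is waiting for this CPU (need_resched()) */
static enum batch_end
softirq_batch_end(bool budget_exceeded, int *max_restart, bool resched,
		  unsigned long now_ms, unsigned long *overloaded_until)
{
	if (budget_exceeded || !--(*max_restart)) {
		/* Genuine overload: back off for a long time. */
		*overloaded_until = now_ms + OVERLOAD_MS;
		return OVERLOAD;
	}
	if (resched) {
		/* Latency tweak, not overload: defer only briefly. */
		*overloaded_until = now_ms + DEFER_MS;
		return DEFER;
	}
	return RESTART;		/* keep processing softirqs inline */
}
```

The asymmetry is the point of the patch: `need_resched()` only buys user space a short window, while a genuinely exhausted budget cedes the CPU to ksoftirqd for much longer.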
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 22:37 ` Paul E. McKenney 2023-03-03 23:25 ` Dave Taht @ 2023-03-03 23:36 ` Paul E. McKenney 2023-03-03 23:44 ` Jakub Kicinski 1 sibling, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-03 23:36 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 02:37:39PM -0800, Paul E. McKenney wrote: > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: > > > > - if (time_before(jiffies, end) && !need_resched() && > > > > - --max_restart) > > > > + unsigned long limit; > > > > + > > > > + if (time_is_before_eq_jiffies(end) || !--max_restart) > > > > + limit = SOFTIRQ_OVERLOAD_TIME; > > > > + else if (need_resched()) > > > > + limit = SOFTIRQ_DEFER_TIME; > > > > + else > > > > goto restart; > > > > > > > > + __this_cpu_write(overload_limit, jiffies + limit); > > > > > > The logic of all this is non-obvious and I had to reread it 5 times to > > > conclude that it is matching the intent. Please add comments. > > > > > > While I'm not a big fan of heuristical duct tape, this looks harmless > > > enough to not end up in an endless stream of tweaking. Famous last > > > words... > > > > Would it all be more readable if I named the "overload_limit" > > "overloaded_until" instead? Naming.. > > I'll add comments, too. > > > > > But without the sched_clock() changes the actual defer time depends on > > > HZ and the point in time where limit is set. That means it ranges from 0 > > > to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in > > > the worst case, which perhaps explains the 8ms+ stalls you are still > > > observing. Can you test with that sched_clock change applied, i.e. 
the > > > first two commits from > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout") > > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop") > > > > Those will help, but I spent some time digging into the jiffies related > > warts with kprobes - while annoying they weren't a major source of wake > > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats > > disabling IRQs for multiple ms on the timekeeping CPU). > > > > Here are fresh stats on why we wake up ksoftirqd on our Web workload > > (collected over 100 sec): > > > > Time exceeded: 484 > > Loop max run out: 6525 > > need_resched(): 10219 > > (control: 17226 - number of times wakeup_process called for ksirqd) > > > > As you can see need_resched() dominates. > > > > Zooming into the time exceeded - we can count nanoseconds between > > __do_softirq starting and the check. This is the histogram of actual > > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000): > > > > [256, 512) 1 | | > > [512, 1K) 0 | | > > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > So yes, we can probably save ourselves ~200 wakeup with a better clock > > but that's just 1.3% of the total wake ups :( > > > > > > Now - now about the max loop count. I ORed the pending softirqs every > > time we get to the end of the loop. Looks like vast majority of the > > loop counter wake ups are exclusively due to RCU: > > > > @looped[512]: 5516 > > > > Where 512 is the ORed pending mask over all iterations > > 512 == 1 << RCU_SOFTIRQ. > > > > And they usually take less than 100us to consume the 10 iterations. 
> > Histogram of usecs consumed when we run out of loop iterations: > > > > [16, 32) 3 | | > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > [64, 128) 871 |@@@@@@@@@ | > > [128, 256) 34 | | > > [256, 512) 9 | | > > [512, 1K) 262 |@@ | > > [1K, 2K) 35 | | > > [2K, 4K) 1 | | > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > This is from way back in the day, so it is quite possible that better > tuning and/or better heuristics should be applied. > > On the other hand, 100 microseconds is a good long time from an > CONFIG_PREEMPT_RT=y perspective! > > > # cat /sys/module/rcutree/parameters/blimit > > 10 > > > > Or should we perhaps just raise the loop limit? Breaking after less > > than 100usec seems excessive :( > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > (3,000,000 nanoseconds). This means that RCU will do: > > o All the callbacks if there are less than ten. > > o Ten callbacks or 1/128th of them, whichever is larger. > > o Unless the larger of them is more than 100 callbacks, in which > case there is an additional limit of three milliseconds worth > of them. > > Except that if a given CPU ends up with more than 10,000 callbacks > (rcutree.qhimark), that CPU's blimit is set to 10,000. Also, if in the context of a softirq handler (as opposed to ksoftirqd) that interrupted the idle task with no pending task, the count of callbacks is ignored and only the 3-millisecond limit counts. In the context of ksoftirq, the only limit is that which the scheduler chooses to impose. But it sure seems like the ksoftirqd case should also pay attention to that 3-millisecond limit. I will queue a patch to that effect, and maybe Eric Dumazet will show me the error of my ways. > So there is much opportunity to tune the existing heuristics and also > much opportunity to tweak the heuristics themselves. 
> > But let's see a good use case before tweaking, please. ;-) Thanx, Paul > > > whether that makes a difference? Those two can be applied with some > > > minor polishing. The rest of that series is broken by f10020c97f4c > > > ("softirq: Allow early break"). > > > > > > There is another issue with this overload limit. Assume max_restart or > > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and > > > gets the issue resolved after 10ms. > > > > > > So for the remaining 90ms any invocation of raise_softirq() outside of > > > (soft)interrupt context, which wakes ksoftirqd again, prevents > > > processing on return from interrupt until ksoftirqd gets on the CPU and > > > goes back to sleep, because task_is_running() == true and the stale > > > limit is not after jiffies. > > > > > > Probably not a big issue, but someone will notice on some weird workload > > > sooner than later and the tweaking will start nevertheless. :) So maybe > > > we fix it right away. :) > > > > Hm, Paolo raised this point as well, but the overload time is strictly > > to stop paying attention to the fact ksoftirqd is running. > > IOW current kernels behave as if they had overload_limit of infinity. > > > > The current code already prevents processing until ksoftirqd schedules > > in, after raise_softirq() from a funky context. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 23:36 ` Paul E. McKenney @ 2023-03-03 23:44 ` Jakub Kicinski 2023-03-04 1:25 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Jakub Kicinski @ 2023-03-03 23:44 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, 3 Mar 2023 15:36:27 -0800 Paul E. McKenney wrote: > On Fri, Mar 03, 2023 at 02:37:39PM -0800, Paul E. McKenney wrote: > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > > Now - now about the max loop count. I ORed the pending softirqs every > > > time we get to the end of the loop. Looks like vast majority of the > > > loop counter wake ups are exclusively due to RCU: > > > > > > @looped[512]: 5516 > > > > > > Where 512 is the ORed pending mask over all iterations > > > 512 == 1 << RCU_SOFTIRQ. > > > > > > And they usually take less than 100us to consume the 10 iterations. > > > Histogram of usecs consumed when we run out of loop iterations: > > > > > > [16, 32) 3 | | > > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > [64, 128) 871 |@@@@@@@@@ | > > > [128, 256) 34 | | > > > [256, 512) 9 | | > > > [512, 1K) 262 |@@ | > > > [1K, 2K) 35 | | > > > [2K, 4K) 1 | | > > > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > > > This is from way back in the day, so it is quite possible that better > > tuning and/or better heuristics should be applied. > > > > On the other hand, 100 microseconds is a good long time from an > > CONFIG_PREEMPT_RT=y perspective! > > > > > # cat /sys/module/rcutree/parameters/blimit > > > 10 > > > > > > Or should we perhaps just raise the loop limit? Breaking after less > > > than 100usec seems excessive :( > > > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > > (3,000,000 nanoseconds). 
This means that RCU will do: > > > > o All the callbacks if there are less than ten. > > > > o Ten callbacks or 1/128th of them, whichever is larger. > > > > o Unless the larger of them is more than 100 callbacks, in which > > case there is an additional limit of three milliseconds worth > > of them. > > > > Except that if a given CPU ends up with more than 10,000 callbacks > > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > Also, if in the context of a softirq handler (as opposed to ksoftirqd) > that interrupted the idle task with no pending task, the count of > callbacks is ignored and only the 3-millisecond limit counts. In the > context of ksoftirq, the only limit is that which the scheduler chooses > to impose. > > But it sure seems like the ksoftirqd case should also pay attention to > that 3-millisecond limit. I will queue a patch to that effect, and maybe > Eric Dumazet will show me the error of my ways. Just to be sure - have you seen Peter's patches? git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq I think it feeds the time limit to the callback from softirq, so the local 3ms is no more? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 23:44 ` Jakub Kicinski @ 2023-03-04 1:25 ` Paul E. McKenney 2023-03-04 1:39 ` Jakub Kicinski 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 1:25 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 03:44:13PM -0800, Jakub Kicinski wrote: > On Fri, 3 Mar 2023 15:36:27 -0800 Paul E. McKenney wrote: > > On Fri, Mar 03, 2023 at 02:37:39PM -0800, Paul E. McKenney wrote: > > > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote: > > > > Now - now about the max loop count. I ORed the pending softirqs every > > > > time we get to the end of the loop. Looks like vast majority of the > > > > loop counter wake ups are exclusively due to RCU: > > > > > > > > @looped[512]: 5516 > > > > > > > > Where 512 is the ORed pending mask over all iterations > > > > 512 == 1 << RCU_SOFTIRQ. > > > > > > > > And they usually take less than 100us to consume the 10 iterations. > > > > Histogram of usecs consumed when we run out of loop iterations: > > > > > > > > [16, 32) 3 | | > > > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > [64, 128) 871 |@@@@@@@@@ | > > > > [128, 256) 34 | | > > > > [256, 512) 9 | | > > > > [512, 1K) 262 |@@ | > > > > [1K, 2K) 35 | | > > > > [2K, 4K) 1 | | > > > > > > > > Paul, is this expected? Is RCU not trying too hard to be nice? > > > > > > This is from way back in the day, so it is quite possible that better > > > tuning and/or better heuristics should be applied. > > > > > > On the other hand, 100 microseconds is a good long time from an > > > CONFIG_PREEMPT_RT=y perspective! > > > > > > > # cat /sys/module/rcutree/parameters/blimit > > > > 10 > > > > > > > > Or should we perhaps just raise the loop limit? 
Breaking after less > > > > than 100usec seems excessive :( > > > > > > But note that RCU also has rcutree.rcu_divisor, which defaults to 7. > > > And an rcutree.rcu_resched_ns, which defaults to three milliseconds > > > (3,000,000 nanoseconds). This means that RCU will do: > > > > > > o All the callbacks if there are less than ten. > > > > > > o Ten callbacks or 1/128th of them, whichever is larger. > > > > > > o Unless the larger of them is more than 100 callbacks, in which > > > case there is an additional limit of three milliseconds worth > > > of them. > > > > > > Except that if a given CPU ends up with more than 10,000 callbacks > > > (rcutree.qhimark), that CPU's blimit is set to 10,000. > > > > Also, if in the context of a softirq handler (as opposed to ksoftirqd) > > that interrupted the idle task with no pending task, the count of > > callbacks is ignored and only the 3-millisecond limit counts. In the > > context of ksoftirq, the only limit is that which the scheduler chooses > > to impose. > > > > But it sure seems like the ksoftirqd case should also pay attention to > > that 3-millisecond limit. I will queue a patch to that effect, and maybe > > Eric Dumazet will show me the error of my ways. > > Just to be sure - have you seen Peter's patches? > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > I think it feeds the time limit to the callback from softirq, > so the local 3ms is no more? I might or might not have back in September of 2020. ;-) But either way, the question remains: Should RCU_SOFTIRQ do time checking in ksoftirqd context? Seems like the answer should be "yes", independently of Peter's patches. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
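Paul's description of the RCU batch heuristics condenses into a small model. This is a simplification: the real `rcu_do_batch()` applies the 3ms `rcu_resched_ns` wall-clock cap as a runtime check while the batch runs (once past 100 callbacks), and the qhimark adjustment is stateful per-CPU rather than the stateless approximation sketched here.

```c
#include <assert.h>

/* Stateless model of RCU's per-invocation callback budget, using the
 * defaults quoted above: blimit=10, rcu_divisor=7 (i.e. 1/128th),
 * qhimark=10000.  The separate rcu_resched_ns (3ms) wall-clock limit
 * is checked during the batch, so it is not modeled here. */
static long rcu_batch_budget(long n_callbacks)
{
	long blimit = 10;
	long frac = n_callbacks >> 7;	/* 1/128th of the queue */

	if (n_callbacks > 10000)	/* qhimark exceeded */
		blimit = 10000;
	if (n_callbacks < 10)
		return n_callbacks;	/* fewer than ten: do them all */
	return frac > blimit ? frac : blimit;	/* larger of the two */
}
```

With a short queue the budget is 10, which is why a CPU with a steady trickle of callbacks keeps burning one softirq loop iteration per tiny batch.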
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-04  1:25             ` Paul E. McKenney
@ 2023-03-04  1:39               ` Jakub Kicinski
  2023-03-04  3:11                 ` Paul E. McKenney
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2023-03-04  1:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel

On Fri, 3 Mar 2023 17:25:35 -0800 Paul E. McKenney wrote:
> > Just to be sure - have you seen Peter's patches?
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq
> >
> > I think it feeds the time limit to the callback from softirq,
> > so the local 3ms is no more?
>
> I might or might not have back in September of 2020. ;-)
>
> But either way, the question remains: Should RCU_SOFTIRQ do time checking
> in ksoftirqd context? Seems like the answer should be "yes", independently
> of Peter's patches.

:-o I didn't notice, I thought that's from Dec 22, LWN was writing
about Peter's rework at that point. I'm not sure what the story is :(
And when / if any of these changes are coming downstream.
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-04 1:39 ` Jakub Kicinski @ 2023-03-04 3:11 ` Paul E. McKenney 2023-03-04 20:48 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 3:11 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 05:39:21PM -0800, Jakub Kicinski wrote: > On Fri, 3 Mar 2023 17:25:35 -0800 Paul E. McKenney wrote: > > > Just to be sure - have you seen Peter's patches? > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > I think it feeds the time limit to the callback from softirq, > > > so the local 3ms is no more? > > > > I might or might not have back in September of 2020. ;-) > > > > But either way, the question remains: Should RCU_SOFTIRQ do time checking > > in ksoftirqd context? Seems like the answer should be "yes", independently > > of Peter's patches. > > :-o I didn't notice, I thought that's from Dec 22, LWN was writing > about Peter's rework at that point. I'm not sure what the story is :( > And when / if any of these changes are coming downstream. Not a problem either way, as the compiler would complain bitterly about the resulting merge conflict and it is easy to fix. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-04 3:11 ` Paul E. McKenney @ 2023-03-04 20:48 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-04 20:48 UTC (permalink / raw) To: Jakub Kicinski Cc: Thomas Gleixner, peterz, jstultz, edumazet, netdev, linux-kernel On Fri, Mar 03, 2023 at 07:11:09PM -0800, Paul E. McKenney wrote: > On Fri, Mar 03, 2023 at 05:39:21PM -0800, Jakub Kicinski wrote: > > On Fri, 3 Mar 2023 17:25:35 -0800 Paul E. McKenney wrote: > > > > Just to be sure - have you seen Peter's patches? > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq > > > > > > > > I think it feeds the time limit to the callback from softirq, > > > > so the local 3ms is no more? > > > > > > I might or might not have back in September of 2020. ;-) > > > > > > But either way, the question remains: Should RCU_SOFTIRQ do time checking > > > in ksoftirqd context? Seems like the answer should be "yes", independently > > > of Peter's patches. > > > > :-o I didn't notice, I thought that's from Dec 22, LWN was writing > > about Peter's rework at that point. I'm not sure what the story is :( > > And when / if any of these changes are coming downstream. > > Not a problem either way, as the compiler would complain bitterly about > the resulting merge conflict and it is easy to fix. ;-) And even more not a problem because in_serving_softirq() covers both the softirq environment as well as ksoftirqd. So that "else" clause is for rcuoc kthreads, which do not block other softirq vectors. So I am adding a comment instead... Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-03 21:31 ` Jakub Kicinski 2023-03-03 22:37 ` Paul E. McKenney @ 2023-03-05 20:43 ` Thomas Gleixner 2023-03-05 22:42 ` Paul E. McKenney ` (3 more replies) 1 sibling, 4 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-05 20:43 UTC (permalink / raw) To: Jakub Kicinski Cc: peterz, jstultz, edumazet, netdev, linux-kernel, Paul E. McKenney, Frederic Weisbecker Jakub! On Fri, Mar 03 2023 at 13:31, Jakub Kicinski wrote: > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote: >> > + __this_cpu_write(overload_limit, jiffies + limit); >> >> The logic of all this is non-obvious and I had to reread it 5 times to >> conclude that it is matching the intent. Please add comments. >> >> While I'm not a big fan of heuristical duct tape, this looks harmless >> enough to not end up in an endless stream of tweaking. Famous last >> words... > > Would it all be more readable if I named the "overload_limit" > "overloaded_until" instead? Naming.. While naming matters it wont change the 'heuristical duct tape' property of this, right? > I'll add comments, too. They are definitely appreciated, but I'd prefer to have code which is self explanatory and does at least have a notion of a halfways scientific approach to the overall issue of softirqs. The point is that softirqs are just the proliferation of an at least 50 years old OS design paradigm. Back then everyhting which run in an interrupt handler was "important" and more or less allowed to hog the CPU at will. That obviously caused problems because it prevented other interrupt handlers from being served. This was attempted to work around in hardware by providing interrupt priority levels. No general purpose OS utilized that ever because there is no way to get this right. Not even on UP, unless you build a designed for the purpose "OS". Soft interrupts are not any better. 
They avoid the problem of stalling interrupts by moving the problem one
level down to the scheduler. Granted they are a cute hack, but at the
very end they are still evading the resource control mechanisms of the
OS by defining their own rules:

  - NET RX defaults to 2ms with the ability to override via /proc

  - RCU defaults to 3ms with the ability to override via /sysfs

while the "overload detection" in the core defines a hardcoded limit of
2ms. The above alone does not sum up to the core limit, and most of the
other soft interrupt handlers do not even have a notion of limits. That
clearly does not even remotely allow proper coordinated resource
management.

Not to mention the silliness of the jiffy-based timeouts, which result
in a randomized granularity of 0...1/HZ as mentioned before. I'm well
aware that consulting a high resolution hardware clock frequently can
be slow and hurt performance, but there are well understood
workarounds, aka batching, which mitigate that.

There is another aspect to softirqs which makes them a horror show:
while they are conceptually separate, at the very end they are all
lumped together, and especially the network code has implicit
assumptions about that. It's simply impossible to separate the
processing of the various soft interrupt incarnations. IOW, resource
control by developer preference and coincidence of events. That truly
makes for an understandable and reliable OS.

We had separate softirq threads and per-softirq serialization (except
NET_RX/TX, which shared) in the early days of preempt RT, which gave
fine-grained control. Back then the interaction between different
softirqs was halfway understandable, and the handful of interaction
points which relied on the per-CPU global BH disable were fixable with
local serialization. That lasted a year or two until we stopped
maintaining it, because the interaction between softirqs was becoming a
whack-a-mole game.
So we gave up and now enjoy the full glory of a per-CPU global lock,
because that's what local BH disable actually is.

I completely understand that ***GB networking is a challenge, but ***GB
networking does not work without applications which use it. Those
applications are unfortunately^Wrightfully subject to the scheduler,
aka resource control.

IMO evading resource control is the worst of all approaches, and the
amount of heuristics you can apply to mitigate that is never going to
cover even a subset of the overall application space. Just look at the
memcache vs. webserver use case vs. need_resched() and then the
requirements coming from the low latency audio folks.

I know the usual approach to that is to add some more heuristics which
are by nature supposed to fail, or to add yet another 'knob'. We
already have too many knobs which are not comprehensible on their own.
But even if a particular knob is comprehensible, there is close to zero
documentation and, I claim, close to zero understanding of the
interaction of knobs.

Just for the record: some of our engineers are working on TSN-based
real-time networking, which is all about latency and accuracy. Guess
how well that works with the current overall design. That's not an
esoteric niche use case, as low-latency TSN is not restricted to the
automation space. There are quite some use cases which go there even in
the high end networking space.

>> But without the sched_clock() changes the actual defer time depends on
>> HZ and the point in time where limit is set. That means it ranges from 0
>> to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in
>> the worst case, which perhaps explains the 8ms+ stalls you are still
>> observing. Can you test with that sched_clock change applied, i.e.
>> the first two commits from
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq
>>
>>   59be25c466d9 ("softirq: Use sched_clock() based timeout")
>>   bd5a5bd77009 ("softirq: Rewrite softirq processing loop")
>
> Those will help, but I spent some time digging into the jiffies related
> warts with kprobes - while annoying they weren't a major source of wake
> ups. (FWIW the jiffies noise on our workloads is due to cgroup stats
> disabling IRQs for multiple ms on the timekeeping CPU).

What? That's completely insane and needs to be fixed.

> Here are fresh stats on why we wake up ksoftirqd on our Web workload
> (collected over 100 sec):
>
>   Time exceeded:       484
>   Loop max run out:   6525
>   need_resched():    10219
>   (control: 17226 - number of times wakeup_process called for ksirqd)
>
> As you can see need_resched() dominates.
>
> Zooming into the time exceeded - we can count nanoseconds between
> __do_softirq starting and the check. This is the histogram of actual
> usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000):
>
> [256, 512)     1 |                                                    |
> [512, 1K)      0 |                                                    |
> [1K, 2K)     217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [2K, 4K)     266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> So yes, we can probably save ourselves ~200 wakeups with a better clock
> but that's just 1.3% of the total wake ups :(

Fair enough. Though that does not make our time limit handling any more
consistent, and we need to fix that too to handle the other issues.

> Now - on to the max loop count. I ORed the pending softirqs every
> time we get to the end of the loop. Looks like the vast majority of the
> loop counter wake ups are exclusively due to RCU:
>
>   @looped[512]: 5516

If the loop counter breaks without consuming the time budget that's
silly.

> Where 512 is the ORed pending mask over all iterations
> 512 == 1 << RCU_SOFTIRQ.
>
> And they usually take less than 100us to consume the 10 iterations.
> Histogram of usecs consumed when we run out of loop iterations:
>
> [16, 32)        3 |                                                    |
> [32, 64)     4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [64, 128)     871 |@@@@@@@@@                                           |
> [128, 256)     34 |                                                    |
> [256, 512)      9 |                                                    |
> [512, 1K)     262 |@@                                                  |
> [1K, 2K)       35 |                                                    |
> [2K, 4K)        1 |                                                    |
>
> Paul, is this expected? Is RCU not trying too hard to be nice?
>
>   # cat /sys/module/rcutree/parameters/blimit
>   10
>
> Or should we perhaps just raise the loop limit? Breaking after less
> than 100usec seems excessive :(

No. Can we please stop twiddling a parameter here and there and go and
fix this whole problem space properly. Increasing the loop count for
RCU might work for your particular use case and cause issues in other
scenarios.

Btw, RCU seems to be a perfect candidate to delegate batches from
softirq into a separate scheduler-controllable entity.

>> So for the remaining 90ms any invocation of raise_softirq() outside of
>> (soft)interrupt context, which wakes ksoftirqd again, prevents
>> processing on return from interrupt until ksoftirqd gets on the CPU and
>> goes back to sleep, because task_is_running() == true and the stale
>> limit is not after jiffies.
>>
>> Probably not a big issue, but someone will notice on some weird workload
>> sooner than later and the tweaking will start nevertheless. :) So maybe
>> we fix it right away. :)
>
> Hm, Paolo raised this point as well, but the overload time is strictly
> to stop paying attention to the fact ksoftirqd is running.
> IOW current kernels behave as if they had overload_limit of infinity.
>
> The current code already prevents processing until ksoftirqd schedules
> in, after raise_softirq() from a funky context.

Correct, and it does so because we are just applying duct tape over and
over.

That said, I have no brilliant solution for that off the top of my
head, but I'm not comfortable with applying more ad hoc solutions which
are contrary to the efforts of e.g. the audio folks.
I have some vague ideas how to approach that, but I'm traveling all of
next week, so I will neither be reading much email, nor will I have
time to think deeply about softirqs. I'll resume when I'm back.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43       ` Thomas Gleixner
@ 2023-03-05 22:42         ` Paul E. McKenney
  2023-03-05 23:00           ` Frederic Weisbecker
  2023-03-06  9:13           ` David Laight
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2023-03-05 22:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel,
	Frederic Weisbecker

On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 03 2023 at 13:31, Jakub Kicinski wrote:
> > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote:

[ . . . ]

> > Where 512 is the ORed pending mask over all iterations
> > 512 == 1 << RCU_SOFTIRQ.
> >
> > And they usually take less than 100us to consume the 10 iterations.
> > Histogram of usecs consumed when we run out of loop iterations:
> >
> > [16, 32)        3 |                                                    |
> > [32, 64)     4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)     871 |@@@@@@@@@                                           |
> > [128, 256)     34 |                                                    |
> > [256, 512)      9 |                                                    |
> > [512, 1K)     262 |@@                                                  |
> > [1K, 2K)       35 |                                                    |
> > [2K, 4K)        1 |                                                    |
> >
> > Paul, is this expected? Is RCU not trying too hard to be nice?
> >
> >   # cat /sys/module/rcutree/parameters/blimit
> >   10
> >
> > Or should we perhaps just raise the loop limit? Breaking after less
> > than 100usec seems excessive :(
>
> No. Can we please stop twiddling a parameter here and there and go and
> fix this whole problem space properly. Increasing the loop count for
> RCU might work for your particular use case and cause issues in other
> scenarios.
>
> Btw, RCU seems to be a perfect candidate to delegate batches from
> softirq into a separate scheduler-controllable entity.
Indeed, as you well know: combine CONFIG_RCU_NOCB_CPU=y with the
rcutree.use_softirq kernel boot parameter and with either the nohz_full
or rcu_nocbs kernel boot parameter, and the callbacks are invoked
within separate kthreads so that the scheduler has full control. In
addition, this dispenses with all of the heuristics that are otherwise
necessary to avoid invoking too many callbacks in one shot.

Back in the day, I tried making this the default (with an eye towards
making it the sole callback-execution scheme), but this resulted in
some ugly performance regressions. This was in part due to the extra
synchronization required to queue a callback and in part due to the
higher average cost of a wakeup compared to a raise_softirq().

So I changed to the current non-default arrangement.

And of course, you can do it halfway by booting a kernel built with
CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot
parameter. But then the callback-invocation-limit heuristics are still
used, this time to prevent callback invocation from preventing the CPU
from reporting quiescent states. If this were the only case, simpler
heuristics would suffice.

In short, it is not hard to make RCU avoid using softirq, but doing so
is not without side effects. ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 38+ messages in thread
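As a concrete, illustrative example of the offloaded configuration Paul
describes (the CPU list is an assumption for an 8-CPU machine, not from
the thread), the boot command line might contain:

```
rcutree.use_softirq=0 rcu_nocbs=2-7
```

Here rcutree.use_softirq=0 moves RCU core processing out of RCU_SOFTIRQ
into per-CPU rcuc kthreads, and rcu_nocbs=2-7 (with
CONFIG_RCU_NOCB_CPU=y) has those CPUs' callbacks invoked from rcuo
kthreads under scheduler control.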
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-05 22:42 ` Paul E. McKenney @ 2023-03-05 23:00 ` Frederic Weisbecker 2023-03-06 4:30 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Frederic Weisbecker @ 2023-03-05 23:00 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Sun, Mar 05, 2023 at 02:42:11PM -0800, Paul E. McKenney wrote: > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > Indeed, as you well know, CONFIG_RCU_NOCB_CPU=y in combination with the > rcutree.use_softirq kernel boot parameter in combination with either the > nohz_full or rcu_nocbs kernel boot parameter and then the callbacks are > invoked within separate kthreads so that the scheduler has full control. > In addition, this dispenses with all of the heuristics that are otherwise > necessary to avoid invoking too many callbacks in one shot. > > Back in the day, I tried making this the default (with an eye towards > making it the sole callback-execution scheme), but this resulted in > some ugly performance regressions. This was in part due to the extra > synchronization required to queue a callback and in part due to the > higher average cost of a wakeup compared to a raise_softirq(). > > So I changed to the current non-default arrangement. > > And of course, you can do it halfway by booting kernel built with > CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot parameter. > But then the callback-invocation-limit heuristics are still used, but > this time to prevent callback invocation from preventing the CPU from > reporting quiescent states. But if this was the only case, simpler > heuristics would suffice. > > In short, it is not hard to make RCU avoid using softirq, but doing so > is not without side effects. 
> ;-)

Right, but note that, threaded or not, callback invocation happens
within a local_bh_disable() section, preventing other softirqs from
running.

So this is still subject to the softirq per-CPU BKL.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-05 23:00 ` Frederic Weisbecker @ 2023-03-06 4:30 ` Paul E. McKenney 2023-03-06 11:22 ` Frederic Weisbecker 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2023-03-06 4:30 UTC (permalink / raw) To: Frederic Weisbecker Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Mon, Mar 06, 2023 at 12:00:24AM +0100, Frederic Weisbecker wrote: > On Sun, Mar 05, 2023 at 02:42:11PM -0800, Paul E. McKenney wrote: > > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > > Indeed, as you well know, CONFIG_RCU_NOCB_CPU=y in combination with the > > rcutree.use_softirq kernel boot parameter in combination with either the > > nohz_full or rcu_nocbs kernel boot parameter and then the callbacks are > > invoked within separate kthreads so that the scheduler has full control. > > In addition, this dispenses with all of the heuristics that are otherwise > > necessary to avoid invoking too many callbacks in one shot. > > > > Back in the day, I tried making this the default (with an eye towards > > making it the sole callback-execution scheme), but this resulted in > > some ugly performance regressions. This was in part due to the extra > > synchronization required to queue a callback and in part due to the > > higher average cost of a wakeup compared to a raise_softirq(). > > > > So I changed to the current non-default arrangement. > > > > And of course, you can do it halfway by booting kernel built with > > CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot parameter. > > But then the callback-invocation-limit heuristics are still used, but > > this time to prevent callback invocation from preventing the CPU from > > reporting quiescent states. But if this was the only case, simpler > > heuristics would suffice. > > > > In short, it is not hard to make RCU avoid using softirq, but doing so > > is not without side effects. 
> > ;-)
> >
> > Right, but note that, threaded or not, callback invocation happens
> > within a local_bh_disable() section, preventing other softirqs from
> > running.
> >
> > So this is still subject to the softirq per-CPU BKL.

True enough! But it momentarily enables BH after invoking each
callback, so the other softirq vectors should be able to get a word in.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-06 4:30 ` Paul E. McKenney @ 2023-03-06 11:22 ` Frederic Weisbecker 0 siblings, 0 replies; 38+ messages in thread From: Frederic Weisbecker @ 2023-03-06 11:22 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Sun, Mar 05, 2023 at 08:30:33PM -0800, Paul E. McKenney wrote: > On Mon, Mar 06, 2023 at 12:00:24AM +0100, Frederic Weisbecker wrote: > > On Sun, Mar 05, 2023 at 02:42:11PM -0800, Paul E. McKenney wrote: > > > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > > > Indeed, as you well know, CONFIG_RCU_NOCB_CPU=y in combination with the > > > rcutree.use_softirq kernel boot parameter in combination with either the > > > nohz_full or rcu_nocbs kernel boot parameter and then the callbacks are > > > invoked within separate kthreads so that the scheduler has full control. > > > In addition, this dispenses with all of the heuristics that are otherwise > > > necessary to avoid invoking too many callbacks in one shot. > > > > > > Back in the day, I tried making this the default (with an eye towards > > > making it the sole callback-execution scheme), but this resulted in > > > some ugly performance regressions. This was in part due to the extra > > > synchronization required to queue a callback and in part due to the > > > higher average cost of a wakeup compared to a raise_softirq(). > > > > > > So I changed to the current non-default arrangement. > > > > > > And of course, you can do it halfway by booting kernel built with > > > CONFIG_RCU_NOCB_CPU=n with the rcutree.use_softirq kernel boot parameter. > > > But then the callback-invocation-limit heuristics are still used, but > > > this time to prevent callback invocation from preventing the CPU from > > > reporting quiescent states. But if this was the only case, simpler > > > heuristics would suffice. 
> > > In short, it is not hard to make RCU avoid using softirq, but doing so
> > > is not without side effects. ;-)
> >
> > Right, but note that, threaded or not, callback invocation happens
> > within a local_bh_disable() section, preventing other softirqs from
> > running.
> >
> > So this is still subject to the softirq per-CPU BKL.
>
> True enough! But it momentarily enables BH after invoking each
> callback, so the other softirq vectors should be able to get a word in.

Indeed, it's still less bad than having it in softirqs.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* RE: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43     ` Thomas Gleixner
  2023-03-05 22:42       ` Paul E. McKenney
@ 2023-03-06  9:13       ` David Laight
  2023-03-06 11:57         ` Frederic Weisbecker
  2023-03-07  0:51         ` Jakub Kicinski
  3 siblings, 0 replies; 38+ messages in thread
From: David Laight @ 2023-03-06 9:13 UTC (permalink / raw)
To: 'Thomas Gleixner', Jakub Kicinski
Cc: peterz@infradead.org, jstultz@google.com, edumazet@google.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Paul E. McKenney, Frederic Weisbecker

From: Thomas Gleixner
> Sent: 05 March 2023 20:43
...
> The point is that softirqs are just the proliferation of an at least 50
> years old OS design paradigm. Back then everything which ran in an
> interrupt handler was "important" and more or less allowed to hog the
> CPU at will.
>
> That obviously caused problems because it prevented other interrupt
> handlers from being served.
>
> This was attempted to be worked around in hardware by providing
> interrupt priority levels. No general purpose OS ever utilized that,
> because there is no way to get it right. Not even on UP, unless you
> build a designed-for-the-purpose "OS".
>
> Soft interrupts are not any better. They avoid the problem of stalling
> interrupts by moving the problem one level down to the scheduler.
>
> Granted they are a cute hack, but at the very end they are still evading
> the resource control mechanisms of the OS by defining their own rules:

From some measurements I've done, while softints seem like a good idea,
they are almost pointless. What usually happens is that a hardware
interrupt occurs, does some of the required work, schedules a softint
and returns. Immediately a softint happens (at the same instruction)
and does all the rest of the work. The work has to be done, but you've
added the cost of the extra scheduling and interrupt - so overall it is
slower.
The massive batching up of some operations (like ethernet transmit
clearing and rx setup, and things being freed after RCU) doesn't help
latency. Without the batching the softint would finish faster and cause
less of a latency 'problem' to whatever was interrupted.

Now softints do help interrupt latency, but that is only relevant if
you have critical interrupts (like pulling data out of a hardware
fifo). Most modern hardware doesn't have anything that critical.

Now there is code that can decide to drop softint processing to a
normal thread. If that ever happens you probably lose 'big time'.
Normal softint processing is higher priority than any process code, but
the kernel thread runs at the priority of a normal user thread - pretty
much the lowest of the low. So all this 'high priority' interrupt
related processing that really does have to happen to keep the system
running just doesn't get scheduled.

I think it was Eric who had problems with ethernet packets being
dropped and changed the logic (of dropping to a thread) to make it much
less likely - but that got reverted (well, more code was added that
effectively reverted it) not long after.

Try (as I was) to run a test that requires you to receive ALL of the
500000 ethernet packets being sent to an interface every second, while
also doing enough processing on the packets to make the system (say)
90% busy (real time UDP audio processing), and you soon find the
defaults are entirely hopeless. Even the interrupt 'mitigation' options
on the ethernet controller don't actually work - packets get dropped at
the low level. (That will fail on an otherwise idle system.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43     ` Thomas Gleixner
  2023-03-05 22:42       ` Paul E. McKenney
  2023-03-06  9:13       ` David Laight
@ 2023-03-06 11:57       ` Frederic Weisbecker
  2023-03-06 14:57         ` Paul E. McKenney
  2023-03-07  0:51       ` Jakub Kicinski
  3 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2023-03-06 11:57 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel,
	Paul E. McKenney

On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote:
> That said, I have no brilliant solution for that off the top of my
> head, but I'm not comfortable with applying more ad hoc solutions
> which are contrary to the efforts of e.g. the audio folks.
>
> I have some vague ideas how to approach that, but I'm traveling all of
> next week, so I will neither be reading much email, nor will I have
> time to think deeply about softirqs. I'll resume when I'm back.

IIUC, the problem is that some (rare?) softirq vector callbacks rely on
the fact that they can not be interrupted by other local vectors, and
they rely on that to protect against concurrent per-CPU state access,
right?

And there is no automatic way to detect those cases, otherwise we would
have fixed them all with spinlocks already.

So I fear the only (in-)sane idea I could think of is to do it the same
way we did with the BKL. Some sort of pushdown: vector callbacks known
to have no such subtle interactions can re-enable softirqs.

For example, known-safe timers (either because they have no such
interactions or because they handle them correctly via spinlocks) could
carry a TIMER_SOFTIRQ_SAFE flag to tell about that. And RCU callbacks
something alike.

Of course this is going to be a tremendous amount of work, but it has
the advantage of being iterative, and it will pay off in the long run.
Also I'm confident that the hottest places will be handled quickly, and
most of them are likely to be in core networking code.
Because I fear no hack will ever fix that otherwise, and we have tried a lot. Thanks. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() 2023-03-06 11:57 ` Frederic Weisbecker @ 2023-03-06 14:57 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2023-03-06 14:57 UTC (permalink / raw) To: Frederic Weisbecker Cc: Thomas Gleixner, Jakub Kicinski, peterz, jstultz, edumazet, netdev, linux-kernel On Mon, Mar 06, 2023 at 12:57:11PM +0100, Frederic Weisbecker wrote: > On Sun, Mar 05, 2023 at 09:43:23PM +0100, Thomas Gleixner wrote: > > That said, I have no brilliant solution for that off the top of my head, > > but I'm not comfortable with applying more adhoc solutions which are > > contrary to the efforts of e.g. the audio folks. > > > > I have some vague ideas how to approach that, but I'm traveling all of > > next week, so I neither will be reading much email, nor will I have time > > to think deeply about softirqs. I'll resume when I'm back. > > IIUC: the problem is that some (rare?) softirq vector callbacks rely on the > fact they can not be interrupted by other local vectors and they rely on > that to protect against concurrent per-cpu state access, right? > > And there is no automatic way to detect those cases otherwise we would have > fixed them all with spinlocks already. > > So I fear the only (in-)sane idea I could think of is to do it the same way > we did with the BKL. Some sort of pushdown: vector callbacks known for having > no such subtle interaction can re-enable softirqs. > > For example known safe timers (either because they have no such interactions > or because they handle them correctly via spinlocks) can carry a > TIMER_SOFTIRQ_SAFE flag to tell about that. And RCU callbacks something alike. When a given RCU callback causes latency problems, the usual quick fix is to have them instead spawn a workqueue, either from the callback or via queue_rcu_work(). But yes, this is one of the reasons that jiffies are so popular. 
Eric batched something like 30 RCU callbacks per costly time check, and
you would quite possibly need similar batching to attain efficiency for
lightly loaded softirq vectors. But 30 long-running softirq handlers
would be too many. One option is to check the expensive time when
either a batch of (say) 30 completes or when jiffies says too much time
has elapsed.

> Of course this is going to be a tremendous amount of work, but it has
> the advantage of being iterative, and it will pay off in the long run.
> Also I'm confident that the hottest places will be handled quickly, and
> most of them are likely to be in core networking code.
>
> Because I fear no hack will ever fix that otherwise, and we have tried
> a lot.

Indeed, if it were easy within the current overall code structure, we
would have already fixed it.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()
  2023-03-05 20:43 ` Thomas Gleixner
                     ` (2 preceding siblings ...)
  2023-03-06 11:57 ` Frederic Weisbecker
@ 2023-03-07  0:51 ` Jakub Kicinski
  3 siblings, 0 replies; 38+ messages in thread
From: Jakub Kicinski @ 2023-03-07 0:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, jstultz, edumazet, netdev, linux-kernel,
	Paul E. McKenney, Frederic Weisbecker

On Sun, 05 Mar 2023 21:43:23 +0100 Thomas Gleixner wrote:
> > Would it all be more readable if I named the "overload_limit"
> > "overloaded_until" instead? Naming..
>
> While naming matters it wont change the 'heuristical duct tape'
> property of this, right?

I also hate heuristics, I hope we are on the same page there.
The way I see it we allowed 2 heuristics already into the kernel:

 - ksoftirqd running means overload
 - need_resched() means we should stop immediately (and wake ksoftirqd)

Those two are clearly at odds with each other. And the latter is as
weak / hacky as it gets :|

See at the end for work in progress/"real solutions", but for this
patch - can I replace the time limit with a simple per core
"bool wa_for_yield" and change the overload check to:

  if (ksoftirqd_running() && !wa_for_yield)

? That's not a heuristic, right? No magic values, predictable,
repeatable behavior.

> > I'll add comments, too.
>
> They are definitely appreciated, but I'd prefer to have code which is
> self explanatory and does at least have a notion of a halfway
> scientific approach to the overall issue of softirqs.
>
> The point is that softirqs are just the proliferation of an at least 50
> years old OS design paradigm. Back then everything which ran in an
> interrupt handler was "important" and more or less allowed to hog the
> CPU at will.
>
> That obviously caused problems because it prevented other interrupt
> handlers from being served.
>
> This was attempted to be worked around in hardware by providing
> interrupt priority levels.
> No general purpose OS ever utilized that, because there is no way to
> get it right. Not even on UP, unless you build a
> designed-for-the-purpose "OS".
>
> Soft interrupts are not any better. They avoid the problem of stalling
> interrupts by moving the problem one level down to the scheduler.
>
> Granted they are a cute hack, but at the very end they are still
> evading the resource control mechanisms of the OS by defining their
> own rules:
>
>  - NET RX defaults to 2ms with the ability to override via /proc
>  - RCU defaults to 3ms with the ability to override via /sysfs
>
> while the "overload detection" in the core defines a hardcoded limit
> of 2ms. The above alone does not sum up to the core limit, and most of
> the other soft interrupt handlers do not even have the notion of
> limits.
>
> That clearly does not even remotely allow proper coordinated resource
> management.

FWIW happy to delete all the procfs knobs we have in net. Anyone who
feels like they need to tweak those should try to use/work on a real
solution.

> Not to talk about the silliness of the jiffy based timeouts, which
> result in a randomized granularity of 0...1/HZ as mentioned before.
>
> I'm well aware of the fact that consulting a high resolution hardware
> clock frequently can be slow and hurt performance, but there are well
> understood workarounds, aka batching, which mitigate that.
>
> There is another aspect to softirqs which makes them a horror show:
>
> While they are conceptually separate, at the very end they are all
> lumped together, and especially the network code has implicit
> assumptions about that. It's simply impossible to separate the
> processing of the various soft interrupt incarnations.

Was that just about running in threads or making them preemptible?
Running Rx in threads is "mostly solved", see at the end.

> IOW, resource control by developer preference and coincidence of
> events. That truly makes for an understandable and dependable OS.
> We had separate softirq threads and per softirq serialization (except
> NET_RX/TX, which shared) in the early days of preempt RT, which gave
> fine grained control. Back then the interaction between different
> softirqs was halfway understandable, and the handful of interaction
> points which relied on the per CPU global BH disable were fixable with
> local serializations. That lasted a year or two until we stopped
> maintaining it, because the interaction between softirqs was becoming
> a whack-a-mole game. So we gave up and enjoy the full glory of a per
> CPU global lock, because that's what local BH disable actually is.
>
> I completely understand that ***GB networking is a challenge, but
> ***GB networking does not work without applications which use it.
> Those applications are unfortunately^Wrightfully subject to the
> scheduler, aka. resource control.
>
> IMO evading resource control is the worst of all approaches, and the
> amount of heuristics you can apply to mitigate that is never going to
> cover even a subset of the overall application space.
>
> Just look at the memcache vs. webserver use case vs. need_resched()
> and then the requirements coming from the low latency audio folks.

Let me clarify that we only need the default to not be silly for
applications which are _not_ doing a lot of networking. The webserver
in my test is running a website (PHP?), not serving static content.
It's doing maybe a few Gbps on a 25/50 Gbps NIC.

> I know the usual approach to that is to add some more heuristics which
> are by nature supposed to fail, or to add yet another 'knob'. We
> already have too many knobs which are not comprehensible on their own.
> But even if a particular knob is comprehensible, there is close to
> zero documentation and, I claim, close to zero understanding of the
> interaction of knobs.
>
> Just for the record: some of our engineers are working on TSN based
> real-time networking, which is all about latency and accuracy.
> Guess how well that works with the current overall design. That's not
> an esoteric niche use case, as low-latency TSN is not restricted to
> the automation space. There are quite some use cases which go there
> even in the high end networking space.

TSN + ksoftirqd is definitely a bad idea :S

> > Those will help, but I spent some time digging into the jiffies related
> > warts with kprobes - while annoying they weren't a major source of wake
> > ups. (FWIW the jiffies noise on our workloads is due to cgroup stats
> > disabling IRQs for multiple ms on the timekeeping CPU).
>
> What? That's completely insane and needs to be fixed.

Agreed, I made the right people aware..

> > Here are fresh stats on why we wake up ksoftirqd on our Web workload
> > (collected over 100 sec):
> >
> >   Time exceeded:       484
> >   Loop max run out:   6525
> >   need_resched():    10219
> >   (control: 17226 - number of times wakeup_process called for ksirqd)
> >
> > As you can see need_resched() dominates.
> >
> > Zooming into the time exceeded - we can count nanoseconds between
> > __do_softirq starting and the check. This is the histogram of actual
> > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000):
> >
> > [256, 512)     1 |                                                    |
> > [512, 1K)      0 |                                                    |
> > [1K, 2K)     217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> > [2K, 4K)     266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > So yes, we can probably save ourselves ~200 wakeups with a better clock
> > but that's just 1.3% of the total wake ups :(
>
> Fair enough. Though that does not make our time limit handling any more
> consistent, and we need to fix that too to handle the other issues.

> > Now - on to the max loop count. I ORed the pending softirqs every
> > time we get to the end of the loop. Looks like the vast majority of
> > the loop counter wake ups are exclusively due to RCU:
> >
> >   @looped[512]: 5516
>
> If the loop counter breaks without consuming the time budget that's
> silly.
FWIW my initial reaction was to read the jiffies from the _local_ core,
because if we're running softirq they can't have IRQs masked. So the
local clock will be ticking. But I'm insufficiently competent to code
that up, and you'd presumably have done this already if it was a good
idea.

> > Where 512 is the ORed pending mask over all iterations
> > 512 == 1 << RCU_SOFTIRQ.
> >
> > And they usually take less than 100us to consume the 10 iterations.
> > Histogram of usecs consumed when we run out of loop iterations:
> >
> > [16, 32)        3 |                                                    |
> > [32, 64)     4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)     871 |@@@@@@@@@                                           |
> > [128, 256)     34 |                                                    |
> > [256, 512)      9 |                                                    |
> > [512, 1K)     262 |@@                                                  |
> > [1K, 2K)       35 |                                                    |
> > [2K, 4K)        1 |                                                    |
> >
> > Paul, is this expected? Is RCU not trying too hard to be nice?
> >
> >   # cat /sys/module/rcutree/parameters/blimit
> >   10
> >
> > Or should we perhaps just raise the loop limit? Breaking after less
> > than 100usec seems excessive :(
>
> No. Can we please stop twiddling a parameter here and there and go and
> fix this whole problem space properly. Increasing the loop count for
> RCU might work for your particular use case and cause issues in other
> scenarios.
>
> Btw, RCU seems to be a perfect candidate to delegate batches from
> softirq into a separate scheduler-controllable entity.

Indeed, it knows how many callbacks it has; I wish we knew how many
packets had arrived :)

> > Hm, Paolo raised this point as well, but the overload time is strictly
> > to stop paying attention to the fact ksoftirqd is running.
> > IOW current kernels behave as if they had overload_limit of infinity.
> >
> > The current code already prevents processing until ksoftirqd schedules
> > in, after raise_softirq() from a funky context.
>
> Correct, and it does so because we are just applying duct tape over
> and over.
> That said, I have no brilliant solution for that off the top of my
> head, but I'm not comfortable with applying more ad-hoc solutions
> which are contrary to the efforts of e.g. the audio folks.

We are trying: threaded NAPI was added in Feb 2021. We can create a
thread per NAPI instance, so all Rx runs in dedicated kernel threads.
Unfortunately, for the workloads I tested it negatively impacts RPS and
latency. If the workload doesn't run the CPUs too hot it works well,
but otherwise the scheduler gets in the way.

I have a vague recollection that Google proposed patches to the
scheduler at some point to support isosync processing, but they were
rejected? I'm very likely misremembering. But I do feel like we'll
need better scheduler support to move network processing to threads.
Maybe the upcoming BPF scheduler patches can help us with that. We'll
see.

The other way to go is to let the application take charge. We added
support for applications pledging that they will "busy poll" a given
queue. This turns off IRQs and expects the application to periodically
call into NAPI. I think that's the future for high-RPS applications,
but real-life results are scarce (other than pure forwarding
workloads, I guess, which are trivial).

Various folks in netdev have also experimented with using workqueues
and kthreads which are not mapped statically to NAPI instances.

So we are trying, and will continue to try. But there are unknowns
which make me think it's worth addressing the obvious silliness in
ksoftirqd behavior. One - I'm not sure whether we can get to a
paradigm which is as fast and as easy to use for 100% of use cases as
softirq. Two - the solution may need tuning and infrastructure,
leaving smaller users behind. And the ksoftirqd experience is getting
worse, which is why I posted the patches (I'm guessing scheduler
changes, I don't even want to know who changed their heuristics).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski 2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski 2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski @ 2022-12-22 22:12 ` Jakub Kicinski 2023-01-09 9:44 ` Peter Zijlstra 2023-03-03 14:17 ` Thomas Gleixner 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni 3 siblings, 2 replies; 38+ messages in thread From: Jakub Kicinski @ 2022-12-22 22:12 UTC (permalink / raw) To: peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski In networking we try to keep Tx packet queues small, so we limit how many bytes a socket may packetize and queue up. Tx completions (from NAPI) notify the sockets when packets have left the system (NIC Tx completion) and the socket schedules a tasklet to queue the next batch of frames. This leads to a situation where we go thru the softirq loop twice. First round we have pending = NET (from the NIC IRQ/NAPI), and the second iteration has pending = TASKLET (the socket tasklet). On two web workloads I looked at this condition accounts for 10% and 23% of all ksoftirqd wake ups respectively. We run NAPI which wakes some process up, we hit need_resched() and wake up ksoftirqd just to run the TSQ (TCP small queues) tasklet. Tweak the need_resched() condition to be ignored if all pending softIRQs are "non-deferred". The tasklet would run relatively soon, anyway, but once ksoftirqd is woken we're risking stalls. I did not see any negative impact on the latency in an RR test on a loaded machine with this change applied. 
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 kernel/softirq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index ad200d386ec1..4ac59ffb0d55 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -601,7 +601,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 
 	if (time_is_before_eq_jiffies(end) || !--max_restart)
 		limit = SOFTIRQ_OVERLOAD_TIME;
-	else if (need_resched())
+	else if (need_resched() && pending & ~SOFTIRQ_NOW_MASK)
 		limit = SOFTIRQ_DEFER_TIME;
 	else
 		goto restart;
-- 
2.38.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski @ 2023-01-09 9:44 ` Peter Zijlstra 2023-01-09 10:16 ` Eric Dumazet 2023-03-03 11:41 ` Thomas Gleixner 2023-03-03 14:17 ` Thomas Gleixner 1 sibling, 2 replies; 38+ messages in thread From: Peter Zijlstra @ 2023-01-09 9:44 UTC (permalink / raw) To: Jakub Kicinski; +Cc: tglx, jstultz, edumazet, netdev, linux-kernel On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: > In networking we try to keep Tx packet queues small, so we limit > how many bytes a socket may packetize and queue up. Tx completions > (from NAPI) notify the sockets when packets have left the system > (NIC Tx completion) and the socket schedules a tasklet to queue > the next batch of frames. > > This leads to a situation where we go thru the softirq loop twice. > First round we have pending = NET (from the NIC IRQ/NAPI), and > the second iteration has pending = TASKLET (the socket tasklet). So to me that sounds like you want to fix the network code to not do this then. Why can't the NAPI thing directly queue the next batch; why do you have to do a softirq roundtrip like this? > On two web workloads I looked at this condition accounts for 10% > and 23% of all ksoftirqd wake ups respectively. We run NAPI > which wakes some process up, we hit need_resched() and wake up > ksoftirqd just to run the TSQ (TCP small queues) tasklet. > > Tweak the need_resched() condition to be ignored if all pending > softIRQs are "non-deferred". The tasklet would run relatively > soon, anyway, but once ksoftirqd is woken we're risking stalls. > > I did not see any negative impact on the latency in an RR test > on a loaded machine with this change applied. Ignoring need_resched() will get you in trouble with RT people real fast. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2023-01-09 9:44 ` Peter Zijlstra @ 2023-01-09 10:16 ` Eric Dumazet 2023-01-09 19:12 ` Jakub Kicinski 2023-03-03 11:41 ` Thomas Gleixner 1 sibling, 1 reply; 38+ messages in thread From: Eric Dumazet @ 2023-01-09 10:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Jakub Kicinski, tglx, jstultz, netdev, linux-kernel On Mon, Jan 9, 2023 at 10:44 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: > > In networking we try to keep Tx packet queues small, so we limit > > how many bytes a socket may packetize and queue up. Tx completions > > (from NAPI) notify the sockets when packets have left the system > > (NIC Tx completion) and the socket schedules a tasklet to queue > > the next batch of frames. > > > > This leads to a situation where we go thru the softirq loop twice. > > First round we have pending = NET (from the NIC IRQ/NAPI), and > > the second iteration has pending = TASKLET (the socket tasklet). > > So to me that sounds like you want to fix the network code to not do > this then. Why can't the NAPI thing directly queue the next batch; why > do you have to do a softirq roundtrip like this? I think Jakub refers to tcp_wfree() code, which can be called from arbitrary contexts, including non NAPI ones, and with the socket locked (by this thread or another) or not locked at all (say if skb is freed from a TX completion handler or a qdisc drop) > > > On two web workloads I looked at this condition accounts for 10% > > and 23% of all ksoftirqd wake ups respectively. We run NAPI > > which wakes some process up, we hit need_resched() and wake up > > ksoftirqd just to run the TSQ (TCP small queues) tasklet. > > > > Tweak the need_resched() condition to be ignored if all pending > > softIRQs are "non-deferred". The tasklet would run relatively > > soon, anyway, but once ksoftirqd is woken we're risking stalls. 
> > > > I did not see any negative impact on the latency in an RR test > > on a loaded machine with this change applied. > > Ignoring need_resched() will get you in trouble with RT people real > fast. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2023-01-09 10:16 ` Eric Dumazet @ 2023-01-09 19:12 ` Jakub Kicinski 0 siblings, 0 replies; 38+ messages in thread From: Jakub Kicinski @ 2023-01-09 19:12 UTC (permalink / raw) To: Eric Dumazet, Peter Zijlstra; +Cc: tglx, jstultz, netdev, linux-kernel On Mon, 9 Jan 2023 11:16:45 +0100 Eric Dumazet wrote: > > On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: > > > In networking we try to keep Tx packet queues small, so we limit > > > how many bytes a socket may packetize and queue up. Tx completions > > > (from NAPI) notify the sockets when packets have left the system > > > (NIC Tx completion) and the socket schedules a tasklet to queue > > > the next batch of frames. > > > > > > This leads to a situation where we go thru the softirq loop twice. > > > First round we have pending = NET (from the NIC IRQ/NAPI), and > > > the second iteration has pending = TASKLET (the socket tasklet). > > > > So to me that sounds like you want to fix the network code to not do > > this then. Why can't the NAPI thing directly queue the next batch; why > > do you have to do a softirq roundtrip like this? > > I think Jakub refers to tcp_wfree() code, which can be called from > arbitrary contexts, > including non NAPI ones, and with the socket locked (by this thread or > another) or not locked at all > (say if skb is freed from a TX completion handler or a qdisc drop) Yes, fwiw. > > > On two web workloads I looked at this condition accounts for 10% > > > and 23% of all ksoftirqd wake ups respectively. We run NAPI > > > which wakes some process up, we hit need_resched() and wake up > > > ksoftirqd just to run the TSQ (TCP small queues) tasklet. > > > > > > Tweak the need_resched() condition to be ignored if all pending > > > softIRQs are "non-deferred". The tasklet would run relatively > > > soon, anyway, but once ksoftirqd is woken we're risking stalls. 
> > > > > > I did not see any negative impact on the latency in an RR test > > > on a loaded machine with this change applied. > > > > Ignoring need_resched() will get you in trouble with RT people real > > fast. Ah, you're right :/ Is it good enough if we throw || force_irqthreads() into the condition? Otherwise we can just postpone this optimization, the overload time horizon / limit patch is much more important. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2023-01-09 9:44 ` Peter Zijlstra 2023-01-09 10:16 ` Eric Dumazet @ 2023-03-03 11:41 ` Thomas Gleixner 1 sibling, 0 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 11:41 UTC (permalink / raw) To: Peter Zijlstra, Jakub Kicinski; +Cc: jstultz, edumazet, netdev, linux-kernel On Mon, Jan 09 2023 at 10:44, Peter Zijlstra wrote: > On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote: >> Tweak the need_resched() condition to be ignored if all pending >> softIRQs are "non-deferred". The tasklet would run relatively >> soon, anyway, but once ksoftirqd is woken we're risking stalls. >> >> I did not see any negative impact on the latency in an RR test >> on a loaded machine with this change applied. > > Ignoring need_resched() will get you in trouble with RT people real > fast. In this case not really. softirq processing is preemptible in RT, but it's still a major pain ... Thanks, tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski 2023-01-09 9:44 ` Peter Zijlstra @ 2023-03-03 14:17 ` Thomas Gleixner 1 sibling, 0 replies; 38+ messages in thread From: Thomas Gleixner @ 2023-03-03 14:17 UTC (permalink / raw) To: Jakub Kicinski, peterz Cc: jstultz, edumazet, netdev, linux-kernel, Jakub Kicinski Jakub! On Thu, Dec 22 2022 at 14:12, Jakub Kicinski wrote: > This leads to a situation where we go thru the softirq loop twice. > First round we have pending = NET (from the NIC IRQ/NAPI), and > the second iteration has pending = TASKLET (the socket tasklet). ... > diff --git a/kernel/softirq.c b/kernel/softirq.c > index ad200d386ec1..4ac59ffb0d55 100644 > --- a/kernel/softirq.c > +++ b/kernel/softirq.c > @@ -601,7 +601,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) > > if (time_is_before_eq_jiffies(end) || !--max_restart) > limit = SOFTIRQ_OVERLOAD_TIME; > - else if (need_resched()) > + else if (need_resched() && pending & ~SOFTIRQ_NOW_MASK) > limit = SOFTIRQ_DEFER_TIME; > else > goto restart; While this is the least of my softirq worries on PREEMPT_RT, Peter is right about real-time tasks being deferred on a PREEMPT_RT=n kernel. That's a real issue for low-latency audio which John Stultz is trying to resolve. Especially as the above check can go in circles. I fear we need to go back to the drawing board and come up with a real solution which takes these contradicting aspects into account. Let me stare at Peters and Johns patches for a while. Thanks tglx ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski ` (2 preceding siblings ...) 2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski @ 2023-04-20 17:24 ` Paolo Abeni 2023-04-20 17:41 ` Eric Dumazet ` (2 more replies) 3 siblings, 3 replies; 38+ messages in thread From: Paolo Abeni @ 2023-04-20 17:24 UTC (permalink / raw) To: Jakub Kicinski, peterz, tglx; +Cc: jstultz, edumazet, netdev, linux-kernel Hi all, On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote: > Catching up on LWN I run across the article about softirq > changes, and then I noticed fresh patches in Peter's tree. > So probably wise for me to throw these out there. > > My (can I say Meta's?) problem is the opposite to what the RT > sensitive people complain about. In the current scheme once > ksoftirqd is woken no network processing happens until it runs. > > When networking gets overloaded - that's probably fair, the problem > is that we confuse latency tweaks with overload protection. We have > a needs_resched() in the loop condition (which is a latency tweak) > Most often we defer to ksoftirqd because we're trying to be nice > and let user space respond quickly, not because there is an > overload. But the user space may not be nice, and sit on the CPU > for 10ms+. Also the sirq's "work allowance" is 2ms, which is > uncomfortably close to the timer tick, but that's another story. > > We have a sirq latency tracker in our prod kernel which catches > 8ms+ stalls of net Tx (packets queued to the NIC but there is > no NAPI cleanup within 8ms) and with these patches applied > on 5.19 fully loaded web machine sees a drop in stalls from > 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing > TCP retransmissions and ~10% drop in non-TLP incoming ones. > This is not a network-heavy workload so most of the rtx are > due to scheduling artifacts. 
>
> The network latency in a datacenter is somewhere around a neat
> 1000x lower than scheduling granularity (around 10us).
>
> These patches (patch 2 is "the meat") change what we recognize
> as overload. Instead of just checking if "ksoftirqd is woken"
> it also caps how long we consider ourselves to be in overload,
> a time limit which is different based on whether we yield due
> to real resource exhaustion vs just hitting that needs_resched().
>
> I hope the core concept is not entirely idiotic. It'd be great
> if we could get this in or fold an equivalent concept into ongoing
> work from others, because due to various "scheduler improvements"
> every time we upgrade the production kernel this problem is getting
> worse :(

Please allow me to revive this old thread.

My understanding is that we want to avoid adding more heuristics here,
preferring a consistent refactor.

I would like to propose a revert of:

4cd13c21b207 softirq: Let ksoftirqd do its job

and its follow-ups:

3c53776e29f8 Mark HI and TASKLET softirq synchronous
0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking

The problem originally addressed by 4cd13c21b207 can now be tackled
with threaded NAPI, available since:

29863d41bb6e net: implement threaded-able napi poll loop support

Reverting the mentioned commit should address the latency issues
mentioned by Jakub - I verified it solves a somewhat related problem
in my setup - and it reduces the layering of heuristics in this area.

A refactor introducing uniform overload detection and proper resource
control would be better, but I admit it's beyond me, and anyway it
could still land afterwards.

Any opinion more than welcome!

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni @ 2023-04-20 17:41 ` Eric Dumazet 2023-04-20 20:23 ` Paolo Abeni 2023-04-21 2:48 ` Jason Xing 2023-05-09 19:56 ` [tip: irq/core] Revert "softirq: Let ksoftirqd do its job" tip-bot2 for Paolo Abeni 2 siblings, 1 reply; 38+ messages in thread From: Eric Dumazet @ 2023-04-20 17:41 UTC (permalink / raw) To: Paolo Abeni; +Cc: Jakub Kicinski, peterz, tglx, jstultz, netdev, linux-kernel On Thu, Apr 20, 2023 at 7:24 PM Paolo Abeni <pabeni@redhat.com> wrote: > > Hi all, > On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote: > > Catching up on LWN I run across the article about softirq > > changes, and then I noticed fresh patches in Peter's tree. > > So probably wise for me to throw these out there. > > > > My (can I say Meta's?) problem is the opposite to what the RT > > sensitive people complain about. In the current scheme once > > ksoftirqd is woken no network processing happens until it runs. > > > > When networking gets overloaded - that's probably fair, the problem > > is that we confuse latency tweaks with overload protection. We have > > a needs_resched() in the loop condition (which is a latency tweak) > > Most often we defer to ksoftirqd because we're trying to be nice > > and let user space respond quickly, not because there is an > > overload. But the user space may not be nice, and sit on the CPU > > for 10ms+. Also the sirq's "work allowance" is 2ms, which is > > uncomfortably close to the timer tick, but that's another story. > > > > We have a sirq latency tracker in our prod kernel which catches > > 8ms+ stalls of net Tx (packets queued to the NIC but there is > > no NAPI cleanup within 8ms) and with these patches applied > > on 5.19 fully loaded web machine sees a drop in stalls from > > 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing > > TCP retransmissions and ~10% drop in non-TLP incoming ones. 
> > This is not a network-heavy workload so most of the rtx are > > due to scheduling artifacts. > > > > The network latency in a datacenter is somewhere around neat > > 1000x lower than scheduling granularity (around 10us). > > > > These patches (patch 2 is "the meat") change what we recognize > > as overload. Instead of just checking if "ksoftirqd is woken" > > it also caps how long we consider ourselves to be in overload, > > a time limit which is different based on whether we yield due > > to real resource exhaustion vs just hitting that needs_resched(). > > > > I hope the core concept is not entirely idiotic. It'd be great > > if we could get this in or fold an equivalent concept into ongoing > > work from others, because due to various "scheduler improvements" > > every time we upgrade the production kernel this problem is getting > > worse :( > > Please allow me to revive this old thread. > > My understanding is that we want to avoid adding more heuristics here, > preferring a consistent refactor. > > I would like to propose a revert of: > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > the its follow-ups: > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking > > The problem originally addressed by 4cd13c21b207 can now be tackled > with the threaded napi, available since: > > 29863d41bb6e net: implement threaded-able napi poll loop support > > Reverting the mentioned commit should address the latency issues > mentioned by Jakub - I verified it solves a somewhat related problem in > my setup - and reduces the layering of heuristics in this area. > > A refactor introducing uniform overload detection and proper resource > control will be better, but I admit it's beyond me and anyway it could > still land afterwards. > > Any opinion more then welcome! 
Seems fine, but I think few things need to be fixed first in napi_threaded_poll() to enable some important features that are currently in net_rx_action() only. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-20 17:41 ` Eric Dumazet @ 2023-04-20 20:23 ` Paolo Abeni 0 siblings, 0 replies; 38+ messages in thread From: Paolo Abeni @ 2023-04-20 20:23 UTC (permalink / raw) To: Eric Dumazet; +Cc: Jakub Kicinski, peterz, tglx, jstultz, netdev, linux-kernel On Thu, 2023-04-20 at 19:41 +0200, Eric Dumazet wrote: > On Thu, Apr 20, 2023 at 7:24 PM Paolo Abeni <pabeni@redhat.com> wrote: > > I would like to propose a revert of: > > > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > > > the its follow-ups: > > > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking > > > > The problem originally addressed by 4cd13c21b207 can now be tackled > > with the threaded napi, available since: > > > > 29863d41bb6e net: implement threaded-able napi poll loop support > > > > Reverting the mentioned commit should address the latency issues > > mentioned by Jakub - I verified it solves a somewhat related problem in > > my setup - and reduces the layering of heuristics in this area. > > > > A refactor introducing uniform overload detection and proper resource > > control will be better, but I admit it's beyond me and anyway it could > > still land afterwards. > > > > Any opinion more then welcome! > > Seems fine, but I think few things need to be fixed first in > napi_threaded_poll() > to enable some important features that are currently in net_rx_action() only. Thanks for the feedback. I fear I'll miss some relevant bits. On top of my head I think about RPS and skb_defer_free. Both should work even when napi threaded is enabled - with an additional softirq ;) Do you think we should be able to handle both inside the napi thread? Or do you refer to other features? Thanks! Paolo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni 2023-04-20 17:41 ` Eric Dumazet @ 2023-04-21 2:48 ` Jason Xing 2023-04-21 9:33 ` Paolo Abeni 2023-05-09 19:56 ` [tip: irq/core] Revert "softirq: Let ksoftirqd do its job" tip-bot2 for Paolo Abeni 2 siblings, 1 reply; 38+ messages in thread From: Jason Xing @ 2023-04-21 2:48 UTC (permalink / raw) To: Paolo Abeni Cc: Jakub Kicinski, peterz, tglx, jstultz, edumazet, netdev, linux-kernel On Fri, Apr 21, 2023 at 1:34 AM Paolo Abeni <pabeni@redhat.com> wrote: > > Hi all, > On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote: > > Catching up on LWN I run across the article about softirq > > changes, and then I noticed fresh patches in Peter's tree. > > So probably wise for me to throw these out there. > > > > My (can I say Meta's?) problem is the opposite to what the RT > > sensitive people complain about. In the current scheme once > > ksoftirqd is woken no network processing happens until it runs. > > > > When networking gets overloaded - that's probably fair, the problem > > is that we confuse latency tweaks with overload protection. We have > > a needs_resched() in the loop condition (which is a latency tweak) > > Most often we defer to ksoftirqd because we're trying to be nice > > and let user space respond quickly, not because there is an > > overload. But the user space may not be nice, and sit on the CPU > > for 10ms+. Also the sirq's "work allowance" is 2ms, which is > > uncomfortably close to the timer tick, but that's another story. > > > > We have a sirq latency tracker in our prod kernel which catches > > 8ms+ stalls of net Tx (packets queued to the NIC but there is > > no NAPI cleanup within 8ms) and with these patches applied > > on 5.19 fully loaded web machine sees a drop in stalls from > > 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing > > TCP retransmissions and ~10% drop in non-TLP incoming ones. 
> > This is not a network-heavy workload so most of the rtx are > > due to scheduling artifacts. > > > > The network latency in a datacenter is somewhere around neat > > 1000x lower than scheduling granularity (around 10us). > > > > These patches (patch 2 is "the meat") change what we recognize > > as overload. Instead of just checking if "ksoftirqd is woken" > > it also caps how long we consider ourselves to be in overload, > > a time limit which is different based on whether we yield due > > to real resource exhaustion vs just hitting that needs_resched(). > > > > I hope the core concept is not entirely idiotic. It'd be great > > if we could get this in or fold an equivalent concept into ongoing > > work from others, because due to various "scheduler improvements" > > every time we upgrade the production kernel this problem is getting > > worse :( > [...] > Please allow me to revive this old thread. Hi Paolo, So good to hear this :) > > My understanding is that we want to avoid adding more heuristics here, > preferring a consistent refactor. 
> > I would like to propose a revert of: > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > the its follow-ups: > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking More than this, I list some related patches mentioned in the above commit 3c53776e29f8: 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred") 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do not get to run") 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") > > The problem originally addressed by 4cd13c21b207 can now be tackled > with the threaded napi, available since: > > 29863d41bb6e net: implement threaded-able napi poll loop support > > Reverting the mentioned commit should address the latency issues > mentioned by Jakub - I verified it solves a somewhat related problem in > my setup - and reduces the layering of heuristics in this area. Sure, it is. I also can verify its usefulness in the real workload. Some days ago I also sent a heuristics patch [1] that can bypass the ksoftirqd if the user chooses to mask some type of softirq. Let the user decide it. But I observed that if we mask some softirqs, or we can say, completely revert the commit 4cd13c21b207, the load would go higher and the kernel itself may occupy/consume more time than before. They were tested under the similar workload launched by our applications. [1]: https://lore.kernel.org/all/20230410023041.49857-1-kerneljasonxing@gmail.com/ > > A refactor introducing uniform overload detection and proper resource > control will be better, but I admit it's beyond me and anyway it could > still land afterwards. +1 Thanks, Jason > > Any opinion more then welcome! > > Thanks, > > Paolo > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change 2023-04-21 2:48 ` Jason Xing @ 2023-04-21 9:33 ` Paolo Abeni 2023-04-21 9:46 ` Jason Xing 0 siblings, 1 reply; 38+ messages in thread From: Paolo Abeni @ 2023-04-21 9:33 UTC (permalink / raw) To: Jason Xing Cc: Jakub Kicinski, peterz, tglx, jstultz, edumazet, netdev, linux-kernel On Fri, 2023-04-21 at 10:48 +0800, Jason Xing wrote: > > > My understanding is that we want to avoid adding more heuristics here, > > preferring a consistent refactor. > > > > I would like to propose a revert of: > > > > 4cd13c21b207 softirq: Let ksoftirqd do its job > > > > the its follow-ups: > > > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous > > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking > > More than this, I list some related patches mentioned in the above > commit 3c53776e29f8: > 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred") > 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do > not get to run") > 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") The first 2 changes replace plain timers with HR ones, could possibly be reverted, too, but it should not be a big deal either way. I think instead we want to keep the third commit above, as it should be useful when napi threaded is enabled. Generally speaking I would keep the initial revert to the bare minimum. > > The problem originally addressed by 4cd13c21b207 can now be tackled > > with the threaded napi, available since: > > > > 29863d41bb6e net: implement threaded-able napi poll loop support > > > > Reverting the mentioned commit should address the latency issues > > mentioned by Jakub - I verified it solves a somewhat related problem in > > my setup - and reduces the layering of heuristics in this area. > > Sure, it is. I also can verify its usefulness in the real workload. 
> Some days ago I also sent a heuristics patch [1] that can bypass the > ksoftirqd if the user chooses to mask some type of softirq. Let the > user decide it. > > But I observed that if we mask some softirqs, or we can say, > completely revert the commit 4cd13c21b207, the load would go higher > and the kernel itself may occupy/consume more time than before. They > were tested under the similar workload launched by our applications. > > [1]: https://lore.kernel.org/all/20230410023041.49857-1-kerneljasonxing@gmail.com/ Thanks for the reference, I would have missed that patch otherwise. My understanding is that adding more knobs here is in the opposite direction of what Thomas is suggesting, and IMHO the 'now mask' should not be exposed to user-space. > Thanks for the feedback, Paolo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/3] softirq: uncontroversial change
  2023-04-21  9:33 ` Paolo Abeni
@ 2023-04-21  9:46 ` Jason Xing
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Xing @ 2023-04-21 9:46 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jakub Kicinski, peterz, tglx, jstultz, edumazet, netdev, linux-kernel

On Fri, Apr 21, 2023 at 5:33 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2023-04-21 at 10:48 +0800, Jason Xing wrote:
> >
> > > My understanding is that we want to avoid adding more heuristics here,
> > > preferring a consistent refactor.
> > >
> > > I would like to propose a revert of:
> > >
> > > 4cd13c21b207 softirq: Let ksoftirqd do its job
> > >
> > > and its follow-ups:
> > >
> > > 3c53776e29f8 Mark HI and TASKLET softirq synchronous
> > > 0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking
> >
> > More than this, I list some related patches mentioned in the above
> > commit 3c53776e29f8:
> > 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred")
> > 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do
> > not get to run")
> > 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()")
>
> [...]
> The first 2 changes replace plain timers with HR ones, could possibly
> be reverted, too, but it should not be a big deal either way.
>
> I think instead we want to keep the third commit above, as it should be
> useful when napi threaded is enabled.
>
> Generally speaking I would keep the initial revert to the bare minimum.

I agree with you :)

> > > The problem originally addressed by 4cd13c21b207 can now be tackled
> > > with the threaded napi, available since:
> > >
> > > 29863d41bb6e net: implement threaded-able napi poll loop support
> > >
> > > Reverting the mentioned commit should address the latency issues
> > > mentioned by Jakub - I verified it solves a somewhat related problem in
> > > my setup - and reduces the layering of heuristics in this area.
> >
> > Sure, it is. I can also verify its usefulness in the real workload.
> >
> > Some days ago I also sent a heuristics patch [1] that can bypass the
> > ksoftirqd if the user chooses to mask some type of softirq. Let the
> > user decide it.
> >
> > But I observed that if we mask some softirqs, or we can say,
> > completely revert the commit 4cd13c21b207, the load would go higher
> > and the kernel itself may occupy/consume more time than before. They
> > were tested under the similar workload launched by our applications.
> >
> > [1]: https://lore.kernel.org/all/20230410023041.49857-1-kerneljasonxing@gmail.com/
>
> Thanks for the reference, I would have missed that patch otherwise.
>
> My understanding is that adding more knobs here is in the opposite
> direction of what Thomas is suggesting, and IMHO the 'now mask' should
> not be exposed to user-space.

Could you please share the link about what Thomas is suggesting? I missed it.

At the beginning, I didn't have the guts to revert the commit directly.
Instead I wrote a compromise patch that is not as elegant, as you said.
Anyway, the idea is common, but reverting the whole commit may involve
more work. I will spend some time digging into this part. More
suggestions are also welcome :)

Thanks,
Jason

> Thanks for the feedback,
>
> Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread
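[Editor's note: for readers who want to try the replacement mechanism the thread keeps referring to, threaded NAPI is toggled per network device through sysfs. This is a sketch only: `eth0` is a placeholder interface name, the knob ships with the 29863d41bb6e series (v5.12 and later), and whether it helps depends on the driver's NAPI usage.]

```shell
# Read the current mode: 0 = classic softirq/NAPI, 1 = threaded NAPI.
cat /sys/class/net/eth0/threaded

# Move this device's NAPI processing into dedicated kernel threads
# (visible in ps as napi/eth0-<n>), which the scheduler can then
# place and prioritize like any other task.
echo 1 > /sys/class/net/eth0/threaded
```

With threaded NAPI enabled, network processing competes for CPU under normal scheduler control instead of relying on the ksoftirqd deferral heuristic being reverted below.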
* [tip: irq/core] Revert "softirq: Let ksoftirqd do its job"
  2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni
  2023-04-20 17:41 ` Eric Dumazet
  2023-04-21  2:48 ` Jason Xing
@ 2023-05-09 19:56 ` tip-bot2 for Paolo Abeni
  2 siblings, 0 replies; 38+ messages in thread
From: tip-bot2 for Paolo Abeni @ 2023-05-09 19:56 UTC (permalink / raw)
To: linux-tip-commits
Cc: Paolo Abeni, Thomas Gleixner, Jason Xing, Jakub Kicinski,
	Eric Dumazet, Sebastian Andrzej Siewior, Paul E. McKenney,
	Peter Zijlstra, netdev, x86, linux-kernel, maz

The following commit has been merged into the irq/core branch of tip:

Commit-ID:     d15121be7485655129101f3960ae6add40204463
Gitweb:        https://git.kernel.org/tip/d15121be7485655129101f3960ae6add40204463
Author:        Paolo Abeni <pabeni@redhat.com>
AuthorDate:    Mon, 08 May 2023 08:17:44 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 09 May 2023 21:50:27 +02:00

Revert "softirq: Let ksoftirqd do its job"

This reverts the following commits:

  4cd13c21b207 ("softirq: Let ksoftirqd do its job")
  3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")
  1342d8080f61 ("softirq: Don't skip softirq execution when softirq
                 thread is parking")

in a single change to avoid known bad intermediate states introduced by
a patch series reverting them individually.

Due to the mentioned commit, when the ksoftirqd threads take charge of
softirq processing, the system can experience high latencies.

In the past a few workarounds have been implemented for specific
side-effects of the initial ksoftirqd enforcement commit:

commit 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is
not deferred")

commit 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs
do not get to run")

commit 217f69743681 ("net: busy-poll: allow preemption in
sk_busy_loop()")

commit 3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")

But the latency problem still exists in real-life workloads, see the
link below.

The reverted commit intended to solve a live-lock scenario that can now
be addressed with the NAPI threaded mode, introduced with commit
29863d41bb6e ("net: implement threaded-able napi poll loop support"),
which is nowadays in a pretty stable status.

While a complete solution to put softirq processing under nice resource
control would be preferable, that has proven to be a very hard task. In
the short term, remove the main pain point, and also simplify a bit the
current softirq implementation.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/netdev/305d7742212cbe98621b16be782b0562f1012cb6.camel@redhat.com
Link: https://lore.kernel.org/r/57e66b364f1b6f09c9bc0316742c3b14f4ce83bd.1683526542.git.pabeni@redhat.com

---
 kernel/softirq.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 1b72551..807b34c 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -80,21 +80,6 @@ static void wakeup_softirqd(void)
 		wake_up_process(tsk);
 }
 
-/*
- * If ksoftirqd is scheduled, we do not want to process pending softirqs
- * right now. Let ksoftirqd handle this at its own rate, to get fairness,
- * unless we're doing some of the synchronous softirqs.
- */
-#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
-static bool ksoftirqd_running(unsigned long pending)
-{
-	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
-
-	if (pending & SOFTIRQ_NOW_MASK)
-		return false;
-	return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
-}
-
 #ifdef CONFIG_TRACE_IRQFLAGS
 DEFINE_PER_CPU(int, hardirqs_enabled);
 DEFINE_PER_CPU(int, hardirq_context);
@@ -236,7 +221,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 		goto out;
 
 	pending = local_softirq_pending();
-	if (!pending || ksoftirqd_running(pending))
+	if (!pending)
 		goto out;
 
 	/*
@@ -432,9 +417,6 @@ static inline bool should_wake_ksoftirqd(void)
 
 static inline void invoke_softirq(void)
 {
-	if (ksoftirqd_running(local_softirq_pending()))
-		return;
-
 	if (!force_irqthreads() || !__this_cpu_read(ksoftirqd)) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
@@ -468,7 +450,7 @@ asmlinkage __visible void do_softirq(void)
 
 	pending = local_softirq_pending();
 
-	if (pending && !ksoftirqd_running(pending))
+	if (pending)
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);

^ permalink raw reply related	[flat|nested] 38+ messages in thread
end of thread, other threads:[~2023-05-09 19:56 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-22 22:12 [PATCH 0/3] softirq: uncontroversial change Jakub Kicinski
2022-12-22 22:12 ` [PATCH 1/3] softirq: rename ksoftirqd_running() -> ksoftirqd_should_handle() Jakub Kicinski
2022-12-22 22:12 ` [PATCH 2/3] softirq: avoid spurious stalls due to need_resched() Jakub Kicinski
2023-01-31 22:32   ` Jakub Kicinski
2023-03-03 13:30   ` Thomas Gleixner
2023-03-03 15:18     ` Thomas Gleixner
2023-03-03 21:31     ` Jakub Kicinski
2023-03-03 22:37       ` Paul E. McKenney
2023-03-03 23:25         ` Dave Taht
2023-03-04  1:14           ` Paul E. McKenney
2023-03-03 23:36         ` Paul E. McKenney
2023-03-03 23:44           ` Jakub Kicinski
2023-03-04  1:25             ` Paul E. McKenney
2023-03-04  1:39               ` Jakub Kicinski
2023-03-04  3:11                 ` Paul E. McKenney
2023-03-04 20:48                   ` Paul E. McKenney
2023-03-05 20:43                     ` Thomas Gleixner
2023-03-05 22:42                       ` Paul E. McKenney
2023-03-05 23:00                         ` Frederic Weisbecker
2023-03-06  4:30                           ` Paul E. McKenney
2023-03-06 11:22                             ` Frederic Weisbecker
2023-03-06  9:13                         ` David Laight
2023-03-06 11:57                           ` Frederic Weisbecker
2023-03-06 14:57                             ` Paul E. McKenney
2023-03-07  0:51                               ` Jakub Kicinski
2022-12-22 22:12 ` [PATCH 3/3] softirq: don't yield if only expedited handlers are pending Jakub Kicinski
2023-01-09  9:44   ` Peter Zijlstra
2023-01-09 10:16     ` Eric Dumazet
2023-01-09 19:12       ` Jakub Kicinski
2023-03-03 11:41 ` Thomas Gleixner
2023-03-03 14:17   ` Thomas Gleixner
2023-04-20 17:24 ` [PATCH 0/3] softirq: uncontroversial change Paolo Abeni
2023-04-20 17:41   ` Eric Dumazet
2023-04-20 20:23     ` Paolo Abeni
2023-04-21  2:48   ` Jason Xing
2023-04-21  9:33     ` Paolo Abeni
2023-04-21  9:46       ` Jason Xing
2023-05-09 19:56 ` [tip: irq/core] Revert "softirq: Let ksoftirqd do its job" tip-bot2 for Paolo Abeni