* [PATCH] delayacct/sched: add SOFTIRQ delay
@ 2025-08-19 9:27 Tio Zhang
2025-08-19 10:14 ` Peter Zijlstra
2025-08-20 0:27 ` [PATCH] delayacct/sched: " kernel test robot
0 siblings, 2 replies; 9+ messages in thread
From: Tio Zhang @ 2025-08-19 9:27 UTC (permalink / raw)
To: akpm, wang.yaxin, fan.yu9, corbet, bsingharora, yang.yang29
Cc: linux-kernel, linux-doc, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, jiang.kun2, xu.xin16, wang.yong12, tiozhang, zyhtheonly,
zyhtheonly
Introduce SOFTIRQ delay, so we can separate softirq delay as
SOFTIRQ delay and hardirq delay as {IRQ - SOFTIRQ} delay.
A typical scenario is tasks delayed by the network:
if they are delayed by rx of net packets, i.e., net_rx_action(),
SOFTIRQ delay is almost the same as IRQ delay;
if they are delayed by, e.g., a bad driver or broken hardware,
SOFTIRQ delay is almost 0 while IRQ delay remains large.
Example tool usage can be found in
Documentation/accounting/delay-accounting.rst
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
---
Documentation/accounting/delay-accounting.rst | 5 ++++-
include/linux/delayacct.h | 18 ++++++++++------
include/uapi/linux/taskstats.h | 9 +++++++-
kernel/delayacct.c | 9 +++++++-
kernel/sched/core.c | 14 +++++++++----
kernel/sched/cputime.c | 21 ++++++++++++++-----
kernel/sched/sched.h | 6 +++++-
tools/accounting/getdelays.c | 7 +++++++
8 files changed, 70 insertions(+), 19 deletions(-)
diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
index 8ccc5af5ea1e..b6453723fbac 100644
--- a/Documentation/accounting/delay-accounting.rst
+++ b/Documentation/accounting/delay-accounting.rst
@@ -17,6 +17,7 @@ e) thrashing
f) direct compact
g) write-protect copy
h) IRQ/SOFTIRQ
+i) SOFTIRQ
and makes these statistics available to userspace through
the taskstats interface.
@@ -50,7 +51,7 @@ this structure. See
for a description of the fields pertaining to delay accounting.
It will generally be in the form of counters returning the cumulative
delay seen for cpu, sync block I/O, swapin, memory reclaim, thrash page
-cache, direct compact, write-protect copy, IRQ/SOFTIRQ etc.
+cache, direct compact, write-protect copy, IRQ/SOFTIRQ, SOFTIRQ etc.
Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
@@ -123,6 +124,8 @@ Get sum and peak of delays, since system boot, for all pids with tgid 242::
156 11215873 0.072ms 0.207403ms 0.033913ms
IRQ count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
+ SOFTIRQ count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
Get IO accounting for pid 1, it works only with -p::
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 800dcc360db2..b73d777d7a96 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -62,13 +62,18 @@ struct task_delay_info {
u64 irq_delay_max;
u64 irq_delay_min;
- u64 irq_delay; /* wait for IRQ/SOFTIRQ */
+ u64 irq_delay; /* wait for IRQ/SOFTIRQ */
+
+ u64 soft_delay_max;
+ u64 soft_delay_min;
+ u64 soft_delay; /* wait for SOFTIRQ */
u32 freepages_count; /* total count of memory reclaim */
u32 thrashing_count; /* total count of thrash waits */
u32 compact_count; /* total count of memory compact */
u32 wpcopy_count; /* total count of write-protect copy */
- u32 irq_count; /* total count of IRQ/SOFTIRQ */
+ u32 irq_count; /* total count of IRQ/SOFTIRQ */
+ u32 soft_count; /* total count of SOFTIRQ */
};
#endif
@@ -98,7 +103,7 @@ extern void __delayacct_compact_start(void);
extern void __delayacct_compact_end(void);
extern void __delayacct_wpcopy_start(void);
extern void __delayacct_wpcopy_end(void);
-extern void __delayacct_irq(struct task_struct *task, u32 delta);
+extern void __delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft);
static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -233,13 +238,14 @@ static inline void delayacct_wpcopy_end(void)
__delayacct_wpcopy_end();
}
-static inline void delayacct_irq(struct task_struct *task, u32 delta)
+static inline void delayacct_irq(struct task_struct *task, u32 delta,
+ u32 delta_soft)
{
if (!static_branch_unlikely(&delayacct_key))
return;
if (task->delays)
- __delayacct_irq(task, delta);
+ __delayacct_irq(task, delta, delta_soft);
}
#else
@@ -280,7 +286,7 @@ static inline void delayacct_wpcopy_start(void)
{}
static inline void delayacct_wpcopy_end(void)
{}
-static inline void delayacct_irq(struct task_struct *task, u32 delta)
+static inline void delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft)
{}
#endif /* CONFIG_TASK_DELAY_ACCT */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index 5929030d4e8b..23307f88e255 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
*/
-#define TASKSTATS_VERSION 16
+#define TASKSTATS_VERSION 17
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -230,6 +230,13 @@ struct taskstats {
__u64 irq_delay_max;
__u64 irq_delay_min;
+
+ /* v17: Delay waiting for SOFTIRQ */
+ __u64 soft_count;
+ __u64 soft_delay_total;
+
+ __u64 soft_delay_max;
+ __u64 soft_delay_min;
};
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 30e7912ebb0d..15f88ca0c0e6 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -189,6 +189,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
UPDATE_DELAY(compact);
UPDATE_DELAY(wpcopy);
UPDATE_DELAY(irq);
+ UPDATE_DELAY(soft);
raw_spin_unlock_irqrestore(&tsk->delays->lock, flags);
return 0;
@@ -289,7 +290,7 @@ void __delayacct_wpcopy_end(void)
&current->delays->wpcopy_delay_min);
}
-void __delayacct_irq(struct task_struct *task, u32 delta)
+void __delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft)
{
unsigned long flags;
@@ -300,6 +301,12 @@ void __delayacct_irq(struct task_struct *task, u32 delta)
task->delays->irq_delay_max = delta;
if (delta && (!task->delays->irq_delay_min || delta < task->delays->irq_delay_min))
task->delays->irq_delay_min = delta;
+ task->delays->soft_delay += delta_soft;
+ task->delays->soft_count++;
+ if (delta_soft > task->delays->soft_delay_max)
+ task->delays->soft_delay_max = delta_soft;
+ if (delta_soft && (!task->delays->soft_delay_min || delta_soft < task->delays->soft_delay_min))
+ task->delays->soft_delay_min = delta_soft;
raw_spin_unlock_irqrestore(&task->delays->lock, flags);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..30ba2e312356 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,11 +773,12 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
* In theory, the compile should just see 0 here, and optimize out the call
* to sched_rt_avg_update. But I don't trust it...
*/
- s64 __maybe_unused steal = 0, irq_delta = 0;
+ s64 __maybe_unused steal = 0, irq_delta = 0, soft_delta = 0;
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
if (irqtime_enabled()) {
- irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+ irq_delta = irq_time_read(cpu_of(rq), &soft_delta) - rq->prev_irq_time;
+ soft_delta -= rq->prev_soft_time;
/*
* Since irq_time is only updated on {soft,}irq_exit, we might run into
@@ -794,12 +795,17 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
* the current rq->clock timestamp, except that would require using
* atomic ops.
*/
- if (irq_delta > delta)
+ if (soft_delta > delta) { /* IRQ includes SOFTIRQ */
+ soft_delta = delta;
irq_delta = delta;
+ } else if (irq_delta > delta) {
+ irq_delta = delta;
+ }
rq->prev_irq_time += irq_delta;
+ rq->prev_soft_time += soft_delta;
delta -= irq_delta;
- delayacct_irq(rq->curr, irq_delta);
+ delayacct_irq(rq->curr, irq_delta, soft_delta);
}
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..17467f1f3e72 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -38,13 +38,14 @@ void disable_sched_clock_irqtime(void)
}
static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
- enum cpu_usage_stat idx)
+ u64 delta_soft, enum cpu_usage_stat idx)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
u64_stats_update_begin(&irqtime->sync);
cpustat[idx] += delta;
irqtime->total += delta;
+ irqtime->total_soft += delta_soft;
irqtime->tick_delta += delta;
u64_stats_update_end(&irqtime->sync);
}
@@ -57,17 +58,27 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
{
struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
unsigned int pc;
- s64 delta;
+ s64 delta, delta_soft = 0, cpu_clock;
int cpu;
if (!irqtime_enabled())
return;
cpu = smp_processor_id();
- delta = sched_clock_cpu(cpu) - irqtime->irq_start_time;
+ cpu_clock = sched_clock_cpu(cpu);
+ delta = cpu_clock - irqtime->irq_start_time;
irqtime->irq_start_time += delta;
pc = irq_count() - offset;
+ /*
+ * We only account softirq time when we are called by
+ * account_softirq_enter{,exit}
+ */
+ if ((offset & SOFTIRQ_OFFSET) || (pc & SOFTIRQ_OFFSET)) {
+ delta_soft = cpu_clock - irqtime->soft_start_time;
+ irqtime->soft_start_time += delta_soft;
+ }
+
/*
* We do not account for softirq time from ksoftirqd here.
* We want to continue accounting softirq time to ksoftirqd thread
@@ -75,9 +86,9 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
* that do not consume any time, but still wants to run.
*/
if (pc & HARDIRQ_MASK)
- irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
+ irqtime_account_delta(irqtime, delta, delta_soft, CPUTIME_IRQ);
else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
- irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
+ irqtime_account_delta(irqtime, delta, delta_soft, CPUTIME_SOFTIRQ);
}
static u64 irqtime_tick_accounted(u64 maxtime)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..b263cb046cfa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1219,6 +1219,7 @@ struct rq {
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
+ u64 prev_soft_time;
u64 psi_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
@@ -3135,8 +3136,10 @@ static inline void sched_core_tick(struct rq *rq) { }
struct irqtime {
u64 total;
+ u64 total_soft;
u64 tick_delta;
u64 irq_start_time;
+ u64 soft_start_time;
struct u64_stats_sync sync;
};
@@ -3153,7 +3156,7 @@ static inline int irqtime_enabled(void)
* Otherwise ksoftirqd's sum_exec_runtime is subtracted its own runtime
* and never move forward.
*/
-static inline u64 irq_time_read(int cpu)
+static inline u64 irq_time_read(int cpu, u64 *total_soft)
{
struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
unsigned int seq;
@@ -3162,6 +3165,7 @@ static inline u64 irq_time_read(int cpu)
do {
seq = __u64_stats_fetch_begin(&irqtime->sync);
total = irqtime->total;
+ *total_soft = irqtime->total_soft;
} while (__u64_stats_fetch_retry(&irqtime->sync, seq));
return total;
diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 21cb3c3d1331..7299cb60aa33 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -205,6 +205,7 @@ static int get_family_id(int sd)
* version >= 13 - supports WPCOPY statistics
* version >= 14 - supports IRQ statistics
* version >= 16 - supports *_max and *_min delay statistics
+ * version >= 17 - supports SOFTIRQ statistics
*
* Always verify version before accessing version-dependent fields
* to maintain backward compatibility.
@@ -296,6 +297,12 @@ static void print_delayacct(struct taskstats *t)
irq_count, irq_delay_total,
irq_delay_max, irq_delay_min);
}
+
+ if (t->version >= 17) {
+ PRINT_FILED_DELAY("SOFTIRQ", t->version, t,
+ soft_count, soft_delay_total,
+ soft_delay_max, soft_delay_min);
+ }
}
static void task_context_switch_counts(struct taskstats *t)
--
2.39.3 (Apple Git-145)
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] delayacct/sched: add SOFTIRQ delay
2025-08-19 9:27 [PATCH] delayacct/sched: add SOFTIRQ delay Tio Zhang
@ 2025-08-19 10:14 ` Peter Zijlstra
2025-08-20 8:13 ` Tio Zhang
2025-08-20 0:27 ` [PATCH] delayacct/sched: " kernel test robot
1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2025-08-19 10:14 UTC (permalink / raw)
To: akpm, wang.yaxin, fan.yu9, corbet, bsingharora, yang.yang29,
linux-kernel, linux-doc, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, jiang.kun2,
xu.xin16, wang.yong12, zyhtheonly, zyhtheonly
On Tue, Aug 19, 2025 at 05:27:39PM +0800, Tio Zhang wrote:
> Introduce SOFTIRQ delay, so we can separate softirq delay as
> SOFTIRQ delay and hardirq delay as {IRQ - SOFTIRQ} delay.
>
> A typical scenario is tasks delayed by the network:
> if they are delayed by rx of net packets, i.e., net_rx_action(),
> SOFTIRQ delay is almost the same as IRQ delay;
> if they are delayed by, e.g., a bad driver or broken hardware,
> SOFTIRQ delay is almost 0 while IRQ delay remains large.
>
> Example tool usage can be found in
> Documentation/accounting/delay-accounting.rst
accounting will be the death of us :/
How do you account ksoftirqd ?
* Re: [PATCH] delayacct/sched: add SOFTIRQ delay
2025-08-19 9:27 [PATCH] delayacct/sched: add SOFTIRQ delay Tio Zhang
2025-08-19 10:14 ` Peter Zijlstra
@ 2025-08-20 0:27 ` kernel test robot
1 sibling, 0 replies; 9+ messages in thread
From: kernel test robot @ 2025-08-20 0:27 UTC (permalink / raw)
To: Tio Zhang, akpm, wang.yaxin, fan.yu9, corbet, bsingharora,
yang.yang29
Cc: oe-kbuild-all, linux-kernel, linux-doc, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, jiang.kun2, xu.xin16, wang.yong12, tiozhang, zyhtheonly,
zyhtheonly
Hi Tio,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.17-rc2 next-20250819]
[cannot apply to tip/sched/core peterz-queue/sched/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Tio-Zhang/delayacct-sched-add-SOFTIRQ-delay/20250819-173756
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250819092739.GA31177%40didi-ThinkCentre-M930t-N000
patch subject: [PATCH] delayacct/sched: add SOFTIRQ delay
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20250820/202508200857.4KUgmreB-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250820/202508200857.4KUgmreB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508200857.4KUgmreB-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/build_utility.c:93:
kernel/sched/psi.c: In function 'psi_account_irqtime':
>> kernel/sched/psi.c:1024:15: error: too few arguments to function 'irq_time_read'
1024 | irq = irq_time_read(cpu);
| ^~~~~~~~~~~~~
In file included from kernel/sched/build_utility.c:52:
kernel/sched/sched.h:3159:19: note: declared here
3159 | static inline u64 irq_time_read(int cpu, u64 *total_soft)
| ^~~~~~~~~~~~~
vim +/irq_time_read +1024 kernel/sched/psi.c
eb414681d5a07d Johannes Weiner 2018-10-26 1004
52b1364ba0b105 Chengming Zhou 2022-08-26 1005 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
ddae0ca2a8fe12 John Stultz 2024-06-18 1006 void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
52b1364ba0b105 Chengming Zhou 2022-08-26 1007 {
ddae0ca2a8fe12 John Stultz 2024-06-18 1008 int cpu = task_cpu(curr);
52b1364ba0b105 Chengming Zhou 2022-08-26 1009 struct psi_group_cpu *groupc;
ddae0ca2a8fe12 John Stultz 2024-06-18 1010 s64 delta;
3840cbe24cf060 Johannes Weiner 2024-10-03 1011 u64 irq;
570c8efd5eb79c Peter Zijlstra 2025-05-23 1012 u64 now;
52b1364ba0b105 Chengming Zhou 2022-08-26 1013
a6fd16148fdd7e Yafang Shao 2025-01-03 1014 if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
0c2924079f5a83 Haifeng Xu 2023-09-26 1015 return;
0c2924079f5a83 Haifeng Xu 2023-09-26 1016
ddae0ca2a8fe12 John Stultz 2024-06-18 1017 if (!curr->pid)
ddae0ca2a8fe12 John Stultz 2024-06-18 1018 return;
ddae0ca2a8fe12 John Stultz 2024-06-18 1019
ddae0ca2a8fe12 John Stultz 2024-06-18 1020 lockdep_assert_rq_held(rq);
570c8efd5eb79c Peter Zijlstra 2025-05-23 1021 if (prev && task_psi_group(prev) == task_psi_group(curr))
52b1364ba0b105 Chengming Zhou 2022-08-26 1022 return;
52b1364ba0b105 Chengming Zhou 2022-08-26 1023
ddae0ca2a8fe12 John Stultz 2024-06-18 @1024 irq = irq_time_read(cpu);
ddae0ca2a8fe12 John Stultz 2024-06-18 1025 delta = (s64)(irq - rq->psi_irq_time);
ddae0ca2a8fe12 John Stultz 2024-06-18 1026 if (delta < 0)
ddae0ca2a8fe12 John Stultz 2024-06-18 1027 return;
ddae0ca2a8fe12 John Stultz 2024-06-18 1028 rq->psi_irq_time = irq;
52b1364ba0b105 Chengming Zhou 2022-08-26 1029
570c8efd5eb79c Peter Zijlstra 2025-05-23 1030 psi_write_begin(cpu);
570c8efd5eb79c Peter Zijlstra 2025-05-23 1031 now = cpu_clock(cpu);
3840cbe24cf060 Johannes Weiner 2024-10-03 1032
570c8efd5eb79c Peter Zijlstra 2025-05-23 1033 for_each_group(group, task_psi_group(curr)) {
34f26a15611afb Chengming Zhou 2022-09-07 1034 if (!group->enabled)
34f26a15611afb Chengming Zhou 2022-09-07 1035 continue;
34f26a15611afb Chengming Zhou 2022-09-07 1036
52b1364ba0b105 Chengming Zhou 2022-08-26 1037 groupc = per_cpu_ptr(group->pcpu, cpu);
52b1364ba0b105 Chengming Zhou 2022-08-26 1038
52b1364ba0b105 Chengming Zhou 2022-08-26 1039 record_times(groupc, now);
52b1364ba0b105 Chengming Zhou 2022-08-26 1040 groupc->times[PSI_IRQ_FULL] += delta;
52b1364ba0b105 Chengming Zhou 2022-08-26 1041
65457b74aa9437 Domenico Cerasuolo 2023-03-30 1042 if (group->rtpoll_states & (1 << PSI_IRQ_FULL))
65457b74aa9437 Domenico Cerasuolo 2023-03-30 1043 psi_schedule_rtpoll_work(group, 1, false);
570c8efd5eb79c Peter Zijlstra 2025-05-23 1044 }
570c8efd5eb79c Peter Zijlstra 2025-05-23 1045 psi_write_end(cpu);
52b1364ba0b105 Chengming Zhou 2022-08-26 1046 }
fd3db705f7496c Ingo Molnar 2025-05-28 1047 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
52b1364ba0b105 Chengming Zhou 2022-08-26 1048
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] delayacct/sched: add SOFTIRQ delay
2025-08-19 10:14 ` Peter Zijlstra
@ 2025-08-20 8:13 ` Tio Zhang
2025-08-20 8:19 ` [PATCH v2] delayacct/sched: " Tio Zhang
0 siblings, 1 reply; 9+ messages in thread
From: Tio Zhang @ 2025-08-20 8:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, wang.yaxin, fan.yu9, corbet, bsingharora, yang.yang29,
linux-kernel, linux-doc, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, jiang.kun2,
xu.xin16, zyhtheonly, 张元瀚 Tio Zhang
Peter Zijlstra <peterz@infradead.org> 于2025年8月19日周二 18:14写道:
>
> On Tue, Aug 19, 2025 at 05:27:39PM +0800, Tio Zhang wrote:
> > Introduce SOFTIRQ delay, so we can separate softirq delay as
> > SOFTIRQ delay and hardirq delay as {IRQ - SOFTIRQ} delay.
> >
> > A typical scenario is tasks delayed by the network:
> > if they are delayed by rx of net packets, i.e., net_rx_action(),
> > SOFTIRQ delay is almost the same as IRQ delay;
> > if they are delayed by, e.g., a bad driver or broken hardware,
> > SOFTIRQ delay is almost 0 while IRQ delay remains large.
> >
> > Example tool usage can be found in
> > Documentation/accounting/delay-accounting.rst
>
> accounting will be the death of us :/
>
> How do you account ksoftirqd ?
Delay accounting should count delay within the task's own context,
so ksoftirqd should not be taken into consideration in "SOFTIRQ delay".
When a task is delayed by ksoftirqd, it is really delayed
by ksoftirqd's preemption, not by softirq context:
--------------------------------------------------------------------------------------------
TASK A
<runs in A's context>
<IRQ context>
| -------------------------------------------------------
| counts in A's IRQ delay |
| -------------------------------------------------------
<SOFTIRQ context>
| -------------------------------------------------------
| counts in A's SOFTIRQ delay |
| -------------------------------------------------------
wakeup_softirqd
preempted by ksoftirqd
<A in rq waiting>
| ------------------------------------------------------------------------
| counts in A's CPU delay (A->sched_info.run_delay) |
| ------------------------------------------------------------------------
ksoftirqd gives the cpu
<runs in A's context>
--------------------------------------------------------------------------------------------
So when ksoftirqd plays a significant role, we will FIRST see
SOFTIRQ delay increasing in the task's delays, THEN see CPU delay
increasing.
We can always identify the tasks delayed by softirq.
Though this does not work on PREEMPT_RT (no IRQ delay, but always CPU delay).
Btw, I did miss excluding ksoftirqd in irqtime_account_irq; will add that in v2.
* [PATCH v2] delayacct/sched: add SOFTIRQ delay
2025-08-20 8:13 ` Tio Zhang
@ 2025-08-20 8:19 ` Tio Zhang
2025-08-27 3:14 ` wang.yaxin
0 siblings, 1 reply; 9+ messages in thread
From: Tio Zhang @ 2025-08-20 8:19 UTC (permalink / raw)
To: akpm, wang.yaxin, fan.yu9, corbet, bsingharora, yang.yang29
Cc: linux-kernel, linux-doc, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, jiang.kun2, xu.xin16, tiozhang, zyhtheonly, zyhtheonly
Introduce SOFTIRQ delay, so we can separate softirq delay as
SOFTIRQ delay and hardirq delay as {IRQ - SOFTIRQ} delay.
A typical scenario is tasks delayed by the network:
if they are delayed by rx of net packets, i.e., net_rx_action(),
SOFTIRQ delay is almost the same as IRQ delay;
if they are delayed by, e.g., a bad driver or broken hardware,
SOFTIRQ delay is almost 0 while IRQ delay remains large.
Example tool usage can be found in
Documentation/accounting/delay-accounting.rst
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
---
Documentation/accounting/delay-accounting.rst | 5 +++-
include/linux/delayacct.h | 18 ++++++++++-----
include/uapi/linux/taskstats.h | 9 +++++++-
kernel/delayacct.c | 9 +++++++-
kernel/sched/core.c | 14 +++++++----
kernel/sched/cputime.c | 23 +++++++++++++++----
kernel/sched/psi.c | 3 ++-
kernel/sched/sched.h | 6 ++++-
tools/accounting/getdelays.c | 7 ++++++
9 files changed, 74 insertions(+), 20 deletions(-)
diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
index 8ccc5af5ea1e..b6453723fbac 100644
--- a/Documentation/accounting/delay-accounting.rst
+++ b/Documentation/accounting/delay-accounting.rst
@@ -17,6 +17,7 @@ e) thrashing
f) direct compact
g) write-protect copy
h) IRQ/SOFTIRQ
+i) SOFTIRQ
and makes these statistics available to userspace through
the taskstats interface.
@@ -50,7 +51,7 @@ this structure. See
for a description of the fields pertaining to delay accounting.
It will generally be in the form of counters returning the cumulative
delay seen for cpu, sync block I/O, swapin, memory reclaim, thrash page
-cache, direct compact, write-protect copy, IRQ/SOFTIRQ etc.
+cache, direct compact, write-protect copy, IRQ/SOFTIRQ, SOFTIRQ etc.
Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
@@ -123,6 +124,8 @@ Get sum and peak of delays, since system boot, for all pids with tgid 242::
156 11215873 0.072ms 0.207403ms 0.033913ms
IRQ count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
+ SOFTIRQ count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
Get IO accounting for pid 1, it works only with -p::
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 800dcc360db2..b73d777d7a96 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -62,13 +62,18 @@ struct task_delay_info {
u64 irq_delay_max;
u64 irq_delay_min;
- u64 irq_delay; /* wait for IRQ/SOFTIRQ */
+ u64 irq_delay; /* wait for IRQ/SOFTIRQ */
+
+ u64 soft_delay_max;
+ u64 soft_delay_min;
+ u64 soft_delay; /* wait for SOFTIRQ */
u32 freepages_count; /* total count of memory reclaim */
u32 thrashing_count; /* total count of thrash waits */
u32 compact_count; /* total count of memory compact */
u32 wpcopy_count; /* total count of write-protect copy */
- u32 irq_count; /* total count of IRQ/SOFTIRQ */
+ u32 irq_count; /* total count of IRQ/SOFTIRQ */
+ u32 soft_count; /* total count of SOFTIRQ */
};
#endif
@@ -98,7 +103,7 @@ extern void __delayacct_compact_start(void);
extern void __delayacct_compact_end(void);
extern void __delayacct_wpcopy_start(void);
extern void __delayacct_wpcopy_end(void);
-extern void __delayacct_irq(struct task_struct *task, u32 delta);
+extern void __delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft);
static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -233,13 +238,14 @@ static inline void delayacct_wpcopy_end(void)
__delayacct_wpcopy_end();
}
-static inline void delayacct_irq(struct task_struct *task, u32 delta)
+static inline void delayacct_irq(struct task_struct *task, u32 delta,
+ u32 delta_soft)
{
if (!static_branch_unlikely(&delayacct_key))
return;
if (task->delays)
- __delayacct_irq(task, delta);
+ __delayacct_irq(task, delta, delta_soft);
}
#else
@@ -280,7 +286,7 @@ static inline void delayacct_wpcopy_start(void)
{}
static inline void delayacct_wpcopy_end(void)
{}
-static inline void delayacct_irq(struct task_struct *task, u32 delta)
+static inline void delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft)
{}
#endif /* CONFIG_TASK_DELAY_ACCT */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index 5929030d4e8b..23307f88e255 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
*/
-#define TASKSTATS_VERSION 16
+#define TASKSTATS_VERSION 17
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -230,6 +230,13 @@ struct taskstats {
__u64 irq_delay_max;
__u64 irq_delay_min;
+
+ /* v17: Delay waiting for SOFTIRQ */
+ __u64 soft_count;
+ __u64 soft_delay_total;
+
+ __u64 soft_delay_max;
+ __u64 soft_delay_min;
};
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 30e7912ebb0d..15f88ca0c0e6 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -189,6 +189,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
UPDATE_DELAY(compact);
UPDATE_DELAY(wpcopy);
UPDATE_DELAY(irq);
+ UPDATE_DELAY(soft);
raw_spin_unlock_irqrestore(&tsk->delays->lock, flags);
return 0;
@@ -289,7 +290,7 @@ void __delayacct_wpcopy_end(void)
&current->delays->wpcopy_delay_min);
}
-void __delayacct_irq(struct task_struct *task, u32 delta)
+void __delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft)
{
unsigned long flags;
@@ -300,6 +301,12 @@ void __delayacct_irq(struct task_struct *task, u32 delta)
task->delays->irq_delay_max = delta;
if (delta && (!task->delays->irq_delay_min || delta < task->delays->irq_delay_min))
task->delays->irq_delay_min = delta;
+ task->delays->soft_delay += delta_soft;
+ task->delays->soft_count++;
+ if (delta_soft > task->delays->soft_delay_max)
+ task->delays->soft_delay_max = delta_soft;
+ if (delta_soft && (!task->delays->soft_delay_min || delta_soft < task->delays->soft_delay_min))
+ task->delays->soft_delay_min = delta_soft;
raw_spin_unlock_irqrestore(&task->delays->lock, flags);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..30ba2e312356 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,11 +773,12 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
* In theory, the compile should just see 0 here, and optimize out the call
* to sched_rt_avg_update. But I don't trust it...
*/
- s64 __maybe_unused steal = 0, irq_delta = 0;
+ s64 __maybe_unused steal = 0, irq_delta = 0, soft_delta = 0;
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
if (irqtime_enabled()) {
- irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+ irq_delta = irq_time_read(cpu_of(rq), &soft_delta) - rq->prev_irq_time;
+ soft_delta -= rq->prev_soft_time;
/*
* Since irq_time is only updated on {soft,}irq_exit, we might run into
@@ -794,12 +795,17 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
* the current rq->clock timestamp, except that would require using
* atomic ops.
*/
- if (irq_delta > delta)
+ if (soft_delta > delta) { /* IRQ includes SOFTIRQ */
+ soft_delta = delta;
irq_delta = delta;
+ } else if (irq_delta > delta) {
+ irq_delta = delta;
+ }
rq->prev_irq_time += irq_delta;
+ rq->prev_soft_time += soft_delta;
delta -= irq_delta;
- delayacct_irq(rq->curr, irq_delta);
+ delayacct_irq(rq->curr, irq_delta, soft_delta);
}
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..7a553d411ae0 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -38,13 +38,14 @@ void disable_sched_clock_irqtime(void)
}
static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
- enum cpu_usage_stat idx)
+ u64 delta_soft, enum cpu_usage_stat idx)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
u64_stats_update_begin(&irqtime->sync);
cpustat[idx] += delta;
irqtime->total += delta;
+ irqtime->total_soft += delta_soft;
irqtime->tick_delta += delta;
u64_stats_update_end(&irqtime->sync);
}
@@ -57,17 +58,29 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
{
struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
unsigned int pc;
- s64 delta;
+ s64 delta, delta_soft = 0, cpu_clock;
int cpu;
if (!irqtime_enabled())
return;
cpu = smp_processor_id();
- delta = sched_clock_cpu(cpu) - irqtime->irq_start_time;
+ cpu_clock = sched_clock_cpu(cpu);
+ delta = cpu_clock - irqtime->irq_start_time;
irqtime->irq_start_time += delta;
pc = irq_count() - offset;
+ /*
+ * We only account softirq time when we are called by
+ * account_softirq_enter{,exit}
+ * and we do not account ksoftirqd here.
+ */
+ if (curr != this_cpu_ksoftirqd() &&
+ ((offset & SOFTIRQ_OFFSET) || (pc & SOFTIRQ_OFFSET))) {
+ delta_soft = cpu_clock - irqtime->soft_start_time;
+ irqtime->soft_start_time += delta_soft;
+ }
+
/*
* We do not account for softirq time from ksoftirqd here.
* We want to continue accounting softirq time to ksoftirqd thread
@@ -75,9 +88,9 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
* that do not consume any time, but still wants to run.
*/
if (pc & HARDIRQ_MASK)
- irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
+ irqtime_account_delta(irqtime, delta, delta_soft, CPUTIME_IRQ);
else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
- irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
+ irqtime_account_delta(irqtime, delta, delta_soft, CPUTIME_SOFTIRQ);
}
static u64 irqtime_tick_accounted(u64 maxtime)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 59fdb7ebbf22..07f0caf5042d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1009,6 +1009,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
struct psi_group_cpu *groupc;
s64 delta;
u64 irq;
+ u64 __maybe_unused soft_irq;
u64 now;
if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
@@ -1021,7 +1022,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
if (prev && task_psi_group(prev) == task_psi_group(curr))
return;
- irq = irq_time_read(cpu);
+ irq = irq_time_read(cpu, &soft_irq);
delta = (s64)(irq - rq->psi_irq_time);
if (delta < 0)
return;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..b263cb046cfa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1219,6 +1219,7 @@ struct rq {
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
+ u64 prev_soft_time;
u64 psi_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
@@ -3135,8 +3136,10 @@ static inline void sched_core_tick(struct rq *rq) { }
struct irqtime {
u64 total;
+ u64 total_soft;
u64 tick_delta;
u64 irq_start_time;
+ u64 soft_start_time;
struct u64_stats_sync sync;
};
@@ -3153,7 +3156,7 @@ static inline int irqtime_enabled(void)
* Otherwise ksoftirqd's sum_exec_runtime is subtracted its own runtime
* and never move forward.
*/
-static inline u64 irq_time_read(int cpu)
+static inline u64 irq_time_read(int cpu, u64 *total_soft)
{
struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
unsigned int seq;
@@ -3162,6 +3165,7 @@ static inline u64 irq_time_read(int cpu)
do {
seq = __u64_stats_fetch_begin(&irqtime->sync);
total = irqtime->total;
+ *total_soft = irqtime->total_soft;
} while (__u64_stats_fetch_retry(&irqtime->sync, seq));
return total;
diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 21cb3c3d1331..7299cb60aa33 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -205,6 +205,7 @@ static int get_family_id(int sd)
* version >= 13 - supports WPCOPY statistics
* version >= 14 - supports IRQ statistics
* version >= 16 - supports *_max and *_min delay statistics
+ * version >= 17 - supports SOFTIRQ statistics
*
* Always verify version before accessing version-dependent fields
* to maintain backward compatibility.
@@ -296,6 +297,12 @@ static void print_delayacct(struct taskstats *t)
irq_count, irq_delay_total,
irq_delay_max, irq_delay_min);
}
+
+ if (t->version >= 17) {
+ PRINT_FILED_DELAY("SOFTIRQ", t->version, t,
+ soft_count, soft_delay_total,
+ soft_delay_max, soft_delay_min);
+ }
}
static void task_context_switch_counts(struct taskstats *t)
--
2.39.3 (Apple Git-145)
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2] delayacct/sched: add SOFTIRQ delay
2025-08-20 8:19 ` [PATCH v2] delayacct/sched: " Tio Zhang
@ 2025-08-27 3:14 ` wang.yaxin
2025-08-28 9:26 ` [PATCH v3] " Tio Zhang
0 siblings, 1 reply; 9+ messages in thread
From: wang.yaxin @ 2025-08-27 3:14 UTC (permalink / raw)
To: tiozhang
Cc: akpm, fan.yu9, corbet, bsingharora, yang.yang29, linux-kernel,
linux-doc, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, jiang.kun2,
xu.xin16, tiozhang, zyhtheonly, zyhtheonly
> Taking the difference of two successive readings of a given
> counter (say cpu_delay_total) for a task will give the delay
>@@ -123,6 +124,8 @@ Get sum and peak of delays, since system boot, for all pids with tgid 242::
> 156 11215873 0.072ms 0.207403ms 0.033913ms
> IRQ count delay total delay average delay max delay min
> 0 0 0.000ms 0.000000ms 0.000000ms
>+ SOFTIRQ count delay total delay average delay max delay min
>+ 0 0 0.000ms 0.000000ms 0.000000ms
>
> Get IO accounting for pid 1, it works only with -p::
If possible, you could construct some abnormal scenarios that produce non-zero IRQ and SOFTIRQ
values, highlighting the differences between them. That would also cover testing of
the new feature: if the delay info is entirely zero, it fails to demonstrate such differences.
>+ /*
>+ * We only account softirq time when we are called by
>+ * account_softirq_enter{,exit}
>+ * and we do not account ksoftirqd here.
>+ */
>+ if (curr != this_cpu_ksoftirqd() &&
>+ ((offset & SOFTIRQ_OFFSET) || (pc & SOFTIRQ_OFFSET))) {
>+ delta_soft = cpu_clock - irqtime->soft_start_time;
>+ irqtime->soft_start_time += delta_soft;
>+ }
>+
> /*
> * We do not account for softirq time from ksoftirqd here.
> * We want to continue accounting softirq time to ksoftirqd thread
>@@ -75,9 +88,9 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
> * that do not consume any time, but still wants to run.
> */
> if (pc & HARDIRQ_MASK)
>- irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
>+ irqtime_account_delta(irqtime, delta, delta_soft, CPUTIME_IRQ);
> else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
>- irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
>+ irqtime_account_delta(irqtime, delta, delta_soft, CPUTIME_SOFTIRQ);
> }
>
> static u64 irqtime_tick_accounted(u64 maxtime)
As you mentioned, delta_soft represents SOFTIRQ time, but it appears to be accumulated
even under the (pc & HARDIRQ_MASK) condition. Is there a potential double-counting issue?
Replacing delta_soft with 0 when in HARDIRQ context might make it more intuitive.
Thanks,
Yaxin
* [PATCH v3] delayacct/sched: add SOFTIRQ delay
2025-08-27 3:14 ` wang.yaxin
@ 2025-08-28 9:26 ` Tio Zhang
2025-08-30 6:25 ` wang.yaxin
2025-08-30 6:29 ` wang.yaxin
0 siblings, 2 replies; 9+ messages in thread
From: Tio Zhang @ 2025-08-28 9:26 UTC (permalink / raw)
To: akpm, wang.yaxin, fan.yu9, corbet, bsingharora, yang.yang29
Cc: linux-kernel, linux-doc, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, jiang.kun2, xu.xin16, tiozhang, zyhtheonly, zyhtheonly
Introduce SOFTIRQ delay, so we can separate softirq as SOFTIRQ delay
and hardirq as {IRQ - SOFTIRQ} delay.
A typical scenario is tasks delayed by the network:
if they are delayed by rx net packets, i.e. net_rx_action(),
SOFTIRQ delay is almost the same as IRQ delay;
if they are delayed by, e.g., a bad driver or broken hardware,
SOFTIRQ delay is almost 0 while IRQ delay remains large.
Example tool usage can be found in
Documentation/accounting/delay-accounting.rst
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
---
Documentation/accounting/delay-accounting.rst | 60 ++++++++++++++++++-
include/linux/delayacct.h | 18 ++++--
include/uapi/linux/taskstats.h | 9 ++-
kernel/delayacct.c | 9 ++-
kernel/sched/core.c | 14 +++--
kernel/sched/cputime.c | 23 +++++--
kernel/sched/psi.c | 3 +-
kernel/sched/sched.h | 6 +-
tools/accounting/getdelays.c | 7 +++
9 files changed, 129 insertions(+), 20 deletions(-)
diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
index 8ccc5af5ea1e..be53d22f8c1b 100644
--- a/Documentation/accounting/delay-accounting.rst
+++ b/Documentation/accounting/delay-accounting.rst
@@ -17,6 +17,7 @@ e) thrashing
f) direct compact
g) write-protect copy
h) IRQ/SOFTIRQ
+i) SOFTIRQ
and makes these statistics available to userspace through
the taskstats interface.
@@ -50,7 +51,7 @@ this structure. See
for a description of the fields pertaining to delay accounting.
It will generally be in the form of counters returning the cumulative
delay seen for cpu, sync block I/O, swapin, memory reclaim, thrash page
-cache, direct compact, write-protect copy, IRQ/SOFTIRQ etc.
+cache, direct compact, write-protect copy, IRQ/SOFTIRQ, SOFTIRQ etc.
Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
@@ -123,6 +124,63 @@ Get sum and peak of delays, since system boot, for all pids with tgid 242::
156 11215873 0.072ms 0.207403ms 0.033913ms
IRQ count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
+ SOFTIRQ count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
+
+Get IRQ and SOFTIRQ delays:
+
+To enable, compile the kernel with::
+
+ CONFIG_IRQ_TIME_ACCOUNTING=y
+
+IRQ counts ALL IRQ context while SOFTIRQ counts only the context between
+account_softirq_{enter,exit}, excluding ksoftirqd.
+
+SOFTIRQ is mainly used to separate hardirq delays from softirq delays;
+its "count" should equal IRQ's and its "total" is entirely included in IRQ's.
+
+So,
+all IRQ context delay is IRQ,
+softirq context delay is SOFTIRQ,
+hardirq context delay is (IRQ - SOFTIRQ).
+
+An example::
+
+ # echo 0 > /proc/irq/${my_hardirq}/smp_affinity_list
+ (bound some hardirq to CPU 0)
+
+ # taskset -pc 0 ${my_pid}
+ (bound ${my_pid} to CPU 0, so ${my_pid} should be delayed by hardirq)
+
+ # ./getdelays -d -t ${my_pid}
+ print delayacct stats ON
+ TGID ${my_pid}
+
+ ......
+ IRQ count delay total delay average delay max delay min
+ 127 59074 0.000ms 0.009655ms 0.000943ms
+ SOFTIRQ count delay total delay average delay max delay min
+ 127 7255 0.000ms 0.005080ms 0.002175ms
+
+ (At most counted timings the IRQ/SOFTIRQ delay is 0, so the average is too small to show)
+ (IRQ is significantly bigger than SOFTIRQ here, so we learn that hardirq delays ${my_pid} more)
+
+ # iperf -s -p ${my_port} -D // on the machine running ${my_pid}
+ # iperf -c ${my_ip} -p ${my_port} // on another machine as client
+ (start having some softirq, also on CPU 0)
+
+ # ./getdelays -d -t ${my_pid}
+ print delayacct stats ON
+ TGID ${my_pid}
+
+ ......
+ IRQ count delay total delay average delay max delay min
+ 386 473515 0.001ms 0.032010ms 0.001954ms
+ SOFTIRQ count delay total delay average delay max delay min
+ 386 419761 0.001ms 0.029616ms 0.002155ms
+
+ (SOFTIRQ is getting very close to IRQ here, so we learn that softirq delays ${my_pid} more)
+
Get IO accounting for pid 1, it works only with -p::
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 800dcc360db2..b73d777d7a96 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -62,13 +62,18 @@ struct task_delay_info {
u64 irq_delay_max;
u64 irq_delay_min;
- u64 irq_delay; /* wait for IRQ/SOFTIRQ */
+ u64 irq_delay; /* wait for IRQ/SOFTIRQ */
+
+ u64 soft_delay_max;
+ u64 soft_delay_min;
+ u64 soft_delay; /* wait for SOFTIRQ */
u32 freepages_count; /* total count of memory reclaim */
u32 thrashing_count; /* total count of thrash waits */
u32 compact_count; /* total count of memory compact */
u32 wpcopy_count; /* total count of write-protect copy */
- u32 irq_count; /* total count of IRQ/SOFTIRQ */
+ u32 irq_count; /* total count of IRQ/SOFTIRQ */
+ u32 soft_count; /* total count of SOFTIRQ */
};
#endif
@@ -98,7 +103,7 @@ extern void __delayacct_compact_start(void);
extern void __delayacct_compact_end(void);
extern void __delayacct_wpcopy_start(void);
extern void __delayacct_wpcopy_end(void);
-extern void __delayacct_irq(struct task_struct *task, u32 delta);
+extern void __delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft);
static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -233,13 +238,14 @@ static inline void delayacct_wpcopy_end(void)
__delayacct_wpcopy_end();
}
-static inline void delayacct_irq(struct task_struct *task, u32 delta)
+static inline void delayacct_irq(struct task_struct *task, u32 delta,
+ u32 delta_soft)
{
if (!static_branch_unlikely(&delayacct_key))
return;
if (task->delays)
- __delayacct_irq(task, delta);
+ __delayacct_irq(task, delta, delta_soft);
}
#else
@@ -280,7 +286,7 @@ static inline void delayacct_wpcopy_start(void)
{}
static inline void delayacct_wpcopy_end(void)
{}
-static inline void delayacct_irq(struct task_struct *task, u32 delta)
+static inline void delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft)
{}
#endif /* CONFIG_TASK_DELAY_ACCT */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index 5929030d4e8b..23307f88e255 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
*/
-#define TASKSTATS_VERSION 16
+#define TASKSTATS_VERSION 17
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -230,6 +230,13 @@ struct taskstats {
__u64 irq_delay_max;
__u64 irq_delay_min;
+
+ /* v17: Delay waiting for SOFTIRQ */
+ __u64 soft_count;
+ __u64 soft_delay_total;
+
+ __u64 soft_delay_max;
+ __u64 soft_delay_min;
};
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 30e7912ebb0d..15f88ca0c0e6 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -189,6 +189,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
UPDATE_DELAY(compact);
UPDATE_DELAY(wpcopy);
UPDATE_DELAY(irq);
+ UPDATE_DELAY(soft);
raw_spin_unlock_irqrestore(&tsk->delays->lock, flags);
return 0;
@@ -289,7 +290,7 @@ void __delayacct_wpcopy_end(void)
&current->delays->wpcopy_delay_min);
}
-void __delayacct_irq(struct task_struct *task, u32 delta)
+void __delayacct_irq(struct task_struct *task, u32 delta, u32 delta_soft)
{
unsigned long flags;
@@ -300,6 +301,12 @@ void __delayacct_irq(struct task_struct *task, u32 delta)
task->delays->irq_delay_max = delta;
if (delta && (!task->delays->irq_delay_min || delta < task->delays->irq_delay_min))
task->delays->irq_delay_min = delta;
+ task->delays->soft_delay += delta_soft;
+ task->delays->soft_count++;
+ if (delta_soft > task->delays->soft_delay_max)
+ task->delays->soft_delay_max = delta_soft;
+ if (delta_soft && (!task->delays->soft_delay_min || delta_soft < task->delays->soft_delay_min))
+ task->delays->soft_delay_min = delta_soft;
raw_spin_unlock_irqrestore(&task->delays->lock, flags);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..30ba2e312356 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,11 +773,12 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
* In theory, the compile should just see 0 here, and optimize out the call
* to sched_rt_avg_update. But I don't trust it...
*/
- s64 __maybe_unused steal = 0, irq_delta = 0;
+ s64 __maybe_unused steal = 0, irq_delta = 0, soft_delta = 0;
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
if (irqtime_enabled()) {
- irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+ irq_delta = irq_time_read(cpu_of(rq), &soft_delta) - rq->prev_irq_time;
+ soft_delta -= rq->prev_soft_time;
/*
* Since irq_time is only updated on {soft,}irq_exit, we might run into
@@ -794,12 +795,17 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
* the current rq->clock timestamp, except that would require using
* atomic ops.
*/
- if (irq_delta > delta)
+ if (soft_delta > delta) { /* IRQ includes SOFTIRQ */
+ soft_delta = delta;
irq_delta = delta;
+ } else if (irq_delta > delta) {
+ irq_delta = delta;
+ }
rq->prev_irq_time += irq_delta;
+ rq->prev_soft_time += soft_delta;
delta -= irq_delta;
- delayacct_irq(rq->curr, irq_delta);
+ delayacct_irq(rq->curr, irq_delta, soft_delta);
}
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..5ce28036b149 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -38,13 +38,14 @@ void disable_sched_clock_irqtime(void)
}
static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
- enum cpu_usage_stat idx)
+ u64 delta_soft, enum cpu_usage_stat idx)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
u64_stats_update_begin(&irqtime->sync);
cpustat[idx] += delta;
irqtime->total += delta;
+ irqtime->total_soft += delta_soft;
irqtime->tick_delta += delta;
u64_stats_update_end(&irqtime->sync);
}
@@ -57,17 +58,29 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
{
struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
unsigned int pc;
- s64 delta;
+ s64 delta, delta_soft, cpu_clock;
int cpu;
if (!irqtime_enabled())
return;
cpu = smp_processor_id();
- delta = sched_clock_cpu(cpu) - irqtime->irq_start_time;
+ cpu_clock = sched_clock_cpu(cpu);
+ delta = cpu_clock - irqtime->irq_start_time;
irqtime->irq_start_time += delta;
pc = irq_count() - offset;
+ /*
+ * We only account softirq time when we are called by
+ * account_softirq_enter{,exit}
+ * and we do not account ksoftirqd here.
+ */
+ if (curr != this_cpu_ksoftirqd() &&
+ ((offset & SOFTIRQ_OFFSET) || (pc & SOFTIRQ_OFFSET))) {
+ delta_soft = cpu_clock - irqtime->soft_start_time;
+ irqtime->soft_start_time += delta_soft;
+ }
+
/*
* We do not account for softirq time from ksoftirqd here.
* We want to continue accounting softirq time to ksoftirqd thread
@@ -75,9 +88,9 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
* that do not consume any time, but still wants to run.
*/
if (pc & HARDIRQ_MASK)
- irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
+ irqtime_account_delta(irqtime, delta, 0, CPUTIME_IRQ);
else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
- irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
+ irqtime_account_delta(irqtime, delta, delta_soft, CPUTIME_SOFTIRQ);
}
static u64 irqtime_tick_accounted(u64 maxtime)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 59fdb7ebbf22..07f0caf5042d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1009,6 +1009,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
struct psi_group_cpu *groupc;
s64 delta;
u64 irq;
+ u64 __maybe_unused soft_irq;
u64 now;
if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
@@ -1021,7 +1022,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
if (prev && task_psi_group(prev) == task_psi_group(curr))
return;
- irq = irq_time_read(cpu);
+ irq = irq_time_read(cpu, &soft_irq);
delta = (s64)(irq - rq->psi_irq_time);
if (delta < 0)
return;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..b263cb046cfa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1219,6 +1219,7 @@ struct rq {
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
+ u64 prev_soft_time;
u64 psi_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
@@ -3135,8 +3136,10 @@ static inline void sched_core_tick(struct rq *rq) { }
struct irqtime {
u64 total;
+ u64 total_soft;
u64 tick_delta;
u64 irq_start_time;
+ u64 soft_start_time;
struct u64_stats_sync sync;
};
@@ -3153,7 +3156,7 @@ static inline int irqtime_enabled(void)
* Otherwise ksoftirqd's sum_exec_runtime is subtracted its own runtime
* and never move forward.
*/
-static inline u64 irq_time_read(int cpu)
+static inline u64 irq_time_read(int cpu, u64 *total_soft)
{
struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
unsigned int seq;
@@ -3162,6 +3165,7 @@ static inline u64 irq_time_read(int cpu)
do {
seq = __u64_stats_fetch_begin(&irqtime->sync);
total = irqtime->total;
+ *total_soft = irqtime->total_soft;
} while (__u64_stats_fetch_retry(&irqtime->sync, seq));
return total;
diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 21cb3c3d1331..7299cb60aa33 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -205,6 +205,7 @@ static int get_family_id(int sd)
* version >= 13 - supports WPCOPY statistics
* version >= 14 - supports IRQ statistics
* version >= 16 - supports *_max and *_min delay statistics
+ * version >= 17 - supports SOFTIRQ statistics
*
* Always verify version before accessing version-dependent fields
* to maintain backward compatibility.
@@ -296,6 +297,12 @@ static void print_delayacct(struct taskstats *t)
irq_count, irq_delay_total,
irq_delay_max, irq_delay_min);
}
+
+ if (t->version >= 17) {
+ PRINT_FILED_DELAY("SOFTIRQ", t->version, t,
+ soft_count, soft_delay_total,
+ soft_delay_max, soft_delay_min);
+ }
}
static void task_context_switch_counts(struct taskstats *t)
--
2.39.3 (Apple Git-145)
* Re: [PATCH v3] delayacct/sched: add SOFTIRQ delay
2025-08-28 9:26 ` [PATCH v3] " Tio Zhang
@ 2025-08-30 6:25 ` wang.yaxin
2025-08-30 6:29 ` wang.yaxin
1 sibling, 0 replies; 9+ messages in thread
From: wang.yaxin @ 2025-08-30 6:25 UTC (permalink / raw)
To: tiozhang
Cc: akpm, fan.yu9, corbet, bsingharora, yang.yang29, linux-kernel,
linux-doc, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, jiang.kun2,
xu.xin16, tiozhang, zyhtheonly, zyhtheonly
>Introduce SOFTIRQ delay, so we can separate softirq as SOFTIRQ delay
>and hardirq as {IRQ - SOFTIRQ} delay.
>
>A typical scenario is tasks delayed by the network:
>if they are delayed by rx net packets, i.e. net_rx_action(),
>SOFTIRQ delay is almost the same as IRQ delay;
>if they are delayed by, e.g., a bad driver or broken hardware,
>SOFTIRQ delay is almost 0 while IRQ delay remains large.
>
>Example tool usage can be found in
>Documentation/accounting/delay-accounting.rst
>
>Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
>---
A small suggestion: it would be clearer if you could include a changelog
when sending a new version of the patch next time. For example:
https://lore.kernel.org/all/20250828171242.59810-1-sj@kernel.org/
Thanks
Yaxin
* Re: [PATCH v3] delayacct/sched: add SOFTIRQ delay
2025-08-28 9:26 ` [PATCH v3] " Tio Zhang
2025-08-30 6:25 ` wang.yaxin
@ 2025-08-30 6:29 ` wang.yaxin
1 sibling, 0 replies; 9+ messages in thread
From: wang.yaxin @ 2025-08-30 6:29 UTC (permalink / raw)
To: tiozhang
Cc: akpm, fan.yu9, corbet, bsingharora, yang.yang29, linux-kernel,
linux-doc, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, jiang.kun2,
xu.xin16, tiozhang, zyhtheonly, zyhtheonly
>Introduce SOFTIRQ delay, so we can separate softirq as SOFTIRQ delay
>and hardirq as {IRQ - SOFTIRQ} delay.
>
>A typical scenario is tasks delayed by the network:
>if they are delayed by rx net packets, i.e. net_rx_action(),
>SOFTIRQ delay is almost the same as IRQ delay;
>if they are delayed by, e.g., a bad driver or broken hardware,
>SOFTIRQ delay is almost 0 while IRQ delay remains large.
>
>Example tool usage can be found in
>Documentation/accounting/delay-accounting.rst
>
>Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
Reviewed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Thanks
Yaxin
end of thread, other threads:[~2025-08-30 6:29 UTC | newest]
Thread overview: 9+ messages
2025-08-19 9:27 [PATCH] delayacct/sched: add SOFTIRQ delay Tio Zhang
2025-08-19 10:14 ` Peter Zijlstra
2025-08-20 8:13 ` Tio Zhang
2025-08-20 8:19 ` [PATCH v2] delayacct/sched: " Tio Zhang
2025-08-27 3:14 ` wang.yaxin
2025-08-28 9:26 ` [PATCH v3] " Tio Zhang
2025-08-30 6:25 ` wang.yaxin
2025-08-30 6:29 ` wang.yaxin
2025-08-20 0:27 ` [PATCH] delayacct/sched: " kernel test robot