cgroups.vger.kernel.org archive mirror
* [PATCH v9 0/2] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
@ 2025-05-11  3:07 Yafang Shao
  2025-05-11  3:07 ` [PATCH v9 1/2] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
  2025-05-11  3:08 ` [PATCH v9 2/2] sched: Annotate sched_clock_irqtime with __read_mostly Yafang Shao
  0 siblings, 2 replies; 7+ messages in thread
From: Yafang Shao @ 2025-05-11  3:07 UTC (permalink / raw)
  To: mingo, peterz, mkoutny, hannes
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, surenb, linux-kernel, cgroups, lkp,
	Yafang Shao

After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
container environment, we encountered several user-visible behavioral
changes:

- IRQ/softirq time spent while a task is interrupted is excluded from the
  cpuacct cgroup

  This breaks userspace applications that rely on CPU usage data from
  cgroups to monitor CPU pressure. This patchset resolves the issue by
  ensuring that IRQ/softirq time is included in the cgroup of the
  interrupted tasks.

- getrusage(2) does not include the time a task spends interrupted by
  IRQ/softirq

  Some services use getrusage(2) to check whether workloads are experiencing
  CPU pressure. Since IRQ/softirq time is no longer included in task runtime,
  getrusage(2) can no longer reflect the CPU pressure caused by heavy
  interrupts (see the sketch after this list).
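
The kind of check such services perform can be approximated with GNU
time, whose "User time" / "System time" fields come from the rusage
filled in by wait4(2). A sketch (the workload name is illustrative):

  # For a CPU-bound task, user+system time should approach wall time.
  # With CONFIG_IRQ_TIME_ACCOUNTING, heavy IRQ/softirq interruption is
  # charged to neither, so the task merely looks under-utilized.
  /usr/bin/time -v ./cpu_bound_workload 2>&1 | grep -E 'User time|System time'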

This patchset addresses the first issue, which is relatively
straightforward. Once this solution is accepted, I will address the second
issue in a follow-up patchset.

Enabling CONFIG_IRQ_TIME_ACCOUNTING excludes the time spent in IRQs from
the CPU utilization metric. This means we lose visibility into how long
the CPU was actually interrupted relative to its total utilization. This
misleads the user into treating the *interrupted IRQ time* as
*sleep time*, as follows:

  |<----Runtime---->|<----Sleep---->|<----Runtime---->|<---Sleep-->|

While in reality it is:

  |<----Runtime---->|<--Interrupted time-->|<----Runtime---->|<---Sleep-->|

Currently, the only ways to monitor interrupt time are IRQ PSI and the
IRQ time recorded in delay accounting. However, these metrics are
independent of CPU utilization, which makes it difficult to combine them
into a single, unified measure.
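
For reference, a minimal sketch of how these two sources are read today
(paths per Documentation/accounting/psi.rst; the numbers shown are
illustrative):

  # System-wide IRQ pressure (CONFIG_PSI + CONFIG_IRQ_TIME_ACCOUNTING):
  cat /proc/pressure/irq
  full avg10=1.24 avg60=0.98 avg300=1.01 total=27400423

  # Per-cgroup IRQ pressure on cgroup2:
  cat /sys/fs/cgroup/<group>/irq.pressure

  # Per-task IRQ delay is exposed via taskstats delay accounting
  # (see tools/accounting/getdelays.c).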

CPU utilization is a critical metric for almost all workloads, and
it's problematic if it fails to reflect the full extent of system
pressure. This situation is similar to iowait: when a task is in
iowait, it could be due to other tasks performing I/O. It doesn’t
matter if the I/O is being done by one of your tasks or by someone
else's; what matters is that your task is stalled and waiting on I/O.
Similarly, a comprehensive CPU utilization metric should reflect all
sources of pressure, including IRQ time, to provide a more accurate
representation of workload behavior.

One of the applications impacted by this issue is our Redis load-balancing
service. The setup operates as follows:

                   ----------------
                   | Load Balancer|
                   ----------------
                /    |      |        \
               /     |      |         \ 
          Server1 Server2 Server3 ... ServerN

Although the load balancer's algorithm is complex, it follows some core
principles:

- When server CPU utilization increases, it adds more servers and deploys
  additional instances to meet SLA requirements.
- When server CPU utilization decreases, it scales down by decommissioning
  servers and reducing the number of instances to save on costs.

On our servers, the majority of IRQ/softIRQ activity originates from
network traffic, and we consistently enable Receive Flow Steering
(RFS) [0]. This configuration ensures that softIRQs are more likely to
interrupt the tasks responsible for processing the corresponding
packets. As a result, the distribution of softIRQs is not random but
instead closely aligned with the packet-handling tasks.
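
For context, enabling RFS involves the two knobs below (a sketch per
Documentation/networking/scaling.rst; the device name and table sizes
are illustrative):

  # Size of the global socket flow table:
  echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  # Per-RX-queue flow count (roughly global size / number of queues):
  for q in /sys/class/net/eth0/queues/rx-*; do
      echo 2048 > $q/rps_flow_cnt
  done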

The load balancer malfunctions because IRQ time is excluded from the CPU
utilization calculations. Worse, there is no effective way to add the
IRQ time back into CPU utilization from the currently available metrics.
Therefore, we have to change the kernel code.

Link: https://lwn.net/Articles/381955/ [0]

Changes:
v8->v9:
- Rebase after the discussion with Ingo
  https://lore.kernel.org/all/aBuI36FCDbj20x28@gmail.com/
- Mark sched_clock_irqtime with __read_mostly to avoid a possible
  cacheline false-sharing issue

v7->v8: https://lore.kernel.org/all/20250103022409.2544-1-laoar.shao@gmail.com/
- Fix a build failure reported by the kernel test robot

v6->v7: https://lore.kernel.org/all/20241215032315.43698-1-laoar.shao@gmail.com/
- Fix psi_show() (Michal)

v5->v6: https://lore.kernel.org/all/20241211131729.43996-1-laoar.shao@gmail.com/
- Return EOPNOTSUPP in psi_show() if irqtime is disabled (Michal)
- Collect Reviewed-by from Michal 

v4->v5: https://lore.kernel.org/all/20241108132904.6932-1-laoar.shao@gmail.com/
- Don't use static key in the IRQ_TIME_ACCOUNTING=n case (Peter)
- Rename psi_irq_time to irq_time (Peter)
- Use CPUTIME_IRQ instead of CPUTIME_SOFTIRQ (Peter)

v3->v4: https://lore.kernel.org/all/20241101031750.1471-1-laoar.shao@gmail.com/
- Rebase

v2->v3:
- Add a helper account_irqtime() to avoid redundant code (Johannes)

v1->v2: https://lore.kernel.org/cgroups/20241008061951.3980-1-laoar.shao@gmail.com/
- Fix lockdep issues reported by kernel test robot <oliver.sang@intel.com>

v1: https://lore.kernel.org/all/20240923090028.16368-1-laoar.shao@gmail.com/

Yafang Shao (2):
  sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
  sched: Annotate sched_clock_irqtime with __read_mostly

 kernel/sched/core.c    | 33 +++++++++++++++++++++++++++++++--
 kernel/sched/cputime.c |  2 +-
 kernel/sched/psi.c     | 13 +++----------
 kernel/sched/sched.h   |  2 +-
 kernel/sched/stats.h   |  7 ++++---
 5 files changed, 40 insertions(+), 17 deletions(-)

-- 
2.43.5



* [PATCH v9 1/2] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
  2025-05-11  3:07 [PATCH v9 0/2] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
@ 2025-05-11  3:07 ` Yafang Shao
  2025-05-27 15:33   ` Michal Koutný
  2025-05-11  3:08 ` [PATCH v9 2/2] sched: Annotate sched_clock_irqtime with __read_mostly Yafang Shao
  1 sibling, 1 reply; 7+ messages in thread
From: Yafang Shao @ 2025-05-11  3:07 UTC (permalink / raw)
  To: mingo, peterz, mkoutny, hannes
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, surenb, linux-kernel, cgroups, lkp,
	Yafang Shao

After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
container environment, we observed several noticeable behavioral changes.

One of our IRQ-heavy services, a Redis deployment, reported a significant
reduction in CPU usage after upgrading to the new kernel with
CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
to handle an increased workload, CPU usage could not be raised. In other
words, even though the container's CPU usage appeared low, the service
was unable to take on more work and utilize the additional CPU
resources, which caused issues.

This behavior can be demonstrated using netperf:

  function start_server() {
      for j in `seq 1 3`; do
          netserver -p $[12345+j] > /dev/null &
      done
  }

  server_ip=$1
  function start_client() {
    # The same issue reproduces with cgroup2; cgroup1's cpuacct is used here.
    mkdir -p /sys/fs/cgroup/cpuacct/test
    echo $$ > /sys/fs/cgroup/cpuacct/test/cgroup.procs
    for j in `seq 1 3`; do
        port=$[12345+j]
        taskset -c 0 netperf -H ${server_ip} -l ${run_time:-30000}   \
                -t TCP_STREAM -p $port -- -D -m 1k -M 1K -s 8k -S 8k \
                > /dev/null &
    done
  }

  start_server
  start_client

We can verify the CPU usage of the test cgroup using cpuacct.stat. The
output shows:

  system: 53
  user: 2
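
For reference, a sketch of how such a sample is taken (cpuacct.stat
reports cumulative user/system time in USER_HZ ticks; assuming
USER_HZ=100, a per-second delta of 55 ticks corresponds to ~55% of one
CPU):

  # Diff two readings of the cumulative tick counters, one second apart:
  cat /sys/fs/cgroup/cpuacct/test/cpuacct.stat
  sleep 1
  cat /sys/fs/cgroup/cpuacct/test/cpuacct.stat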

The CPU usage of the cgroup is relatively low at around 55%, but this usage
doesn't increase, even with more netperf tasks. The reason is that CPU0 is
at 100% utilization, as confirmed by mpstat:

  02:56:22 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:23 PM    0    0.99    0.00   55.45    0.00    0.99   42.57    0.00    0.00    0.00    0.00

  02:56:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:24 PM    0    2.00    0.00   55.00    0.00    0.00   43.00    0.00    0.00    0.00    0.00

It is clear that the %soft time is excluded from the cgroup accounting of
the interrupted task. This behavior is unexpected. We should include IRQ
time in the cgroup's accounting to reflect the pressure the group is under.

After a thorough analysis, I discovered that this change in behavior is due
to commit 305e6835e055 ("sched: Do not account irq time to current task"),
which stopped charging IRQ time to the interrupted task. While I agree that
a task should not be penalized by random interrupts, the task itself cannot
make progress while interrupted. Therefore, the interrupted time should be
reported to the user.

The system metric in cpuacct.stat is crucial in indicating whether a
container is under heavy system pressure, including IRQ/softirq activity.
Hence, IRQ/softirq time should be included in the cpuacct system usage,
which also applies to cgroup2’s rstat.
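
On cgroup2, the same accounting is surfaced through cpu.stat via rstat.
A sketch of the fields involved (cgroup path and values illustrative):

  cat /sys/fs/cgroup/test/cpu.stat
  usage_usec 5530000
  user_usec 200000
  system_usec 5330000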

The reason this patch does not simply add the cgroup_account_*() calls to
irqtime_account_irq() is that doing so could incur a performance hit from
holding the rq_lock in that critical path. Taking inspiration from
commit ddae0ca2a8fe ("sched: Move psi_account_irqtime() out of
update_rq_clock_task() hotpath"), I have adapted the approach to handle
the accounting in a non-critical path, reducing the performance impact.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Koutný <mkoutny@suse.com>
---
 kernel/sched/core.c  | 33 +++++++++++++++++++++++++++++++--
 kernel/sched/psi.c   | 13 +++----------
 kernel/sched/sched.h |  2 +-
 kernel/sched/stats.h |  7 ++++---
 4 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba0..2a1ce5a88076 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5627,6 +5627,35 @@ static int __init setup_resched_latency_warn_ms(char *str)
 }
 __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static void account_irqtime(struct rq *rq, struct task_struct *curr,
+			    struct task_struct *prev)
+{
+	int cpu = smp_processor_id();
+	s64 delta;
+	u64 irq;
+
+	if (!irqtime_enabled())
+		return;
+
+	irq = irq_time_read(cpu);
+	delta = (s64)(irq - rq->irq_time);
+	if (delta < 0)
+		return;
+
+	rq->irq_time = irq;
+	psi_account_irqtime(rq, curr, prev, delta);
+	cgroup_account_cputime(curr, delta);
+	/* We account both softirq and irq into CPUTIME_IRQ */
+	cgroup_account_cputime_field(curr, CPUTIME_IRQ, delta);
+}
+#else
+static inline void account_irqtime(struct rq *rq, struct task_struct *curr,
+				   struct task_struct *prev)
+{
+}
+#endif
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -5649,7 +5678,7 @@ void sched_tick(void)
 	rq_lock(rq, &rf);
 	donor = rq->donor;
 
-	psi_account_irqtime(rq, donor, NULL);
+	account_irqtime(rq, donor, NULL);
 
 	update_rq_clock(rq);
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
@@ -6757,7 +6786,7 @@ static void __sched notrace __schedule(int sched_mode)
 		++*switch_count;
 
 		migrate_disable_switch(rq, prev);
-		psi_account_irqtime(rq, prev, next);
+		account_irqtime(rq, prev, next);
 		psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
 					     prev->se.sched_delayed);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 1396674fa722..4affc1f747bd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -990,15 +990,14 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev,
+			 s64 delta)
 {
 	int cpu = task_cpu(curr);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	s64 delta;
-	u64 irq;
 
-	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
+	if (static_branch_likely(&psi_disabled))
 		return;
 
 	if (!curr->pid)
@@ -1009,12 +1008,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	if (prev && task_psi_group(prev) == group)
 		return;
 
-	irq = irq_time_read(cpu);
-	delta = (s64)(irq - rq->psi_irq_time);
-	if (delta < 0)
-		return;
-	rq->psi_irq_time = irq;
-
 	do {
 		u64 now;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34ea70..b5e71dfad66c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1226,7 +1226,7 @@ struct rq {
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64			prev_irq_time;
-	u64			psi_irq_time;
+	u64			irq_time;
 #endif
 #ifdef CONFIG_PARAVIRT
 	u64			prev_steal_time;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 452826df6ae1..b5b626cb1b83 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,10 +111,11 @@ void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev);
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
+			 struct task_struct *prev, s64 delta);
 #else
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, s64 delta) {}
 #endif /*CONFIG_IRQ_TIME_ACCOUNTING */
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -228,7 +229,7 @@ static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, s64 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.43.5



* [PATCH v9 2/2] sched: Annotate sched_clock_irqtime with __read_mostly
  2025-05-11  3:07 [PATCH v9 0/2] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
  2025-05-11  3:07 ` [PATCH v9 1/2] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2025-05-11  3:08 ` Yafang Shao
  2025-05-27 15:33   ` Michal Koutný
  1 sibling, 1 reply; 7+ messages in thread
From: Yafang Shao @ 2025-05-11  3:08 UTC (permalink / raw)
  To: mingo, peterz, mkoutny, hannes
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, surenb, linux-kernel, cgroups, lkp,
	Yafang Shao, Eric Dumazet

Eric reported an issue [0] as follows,
: rebalance_domains() can attempt to change sched_balance_running
: more than 350,000 times per second on our servers.

: If sched_clock_irqtime and sched_balance_running share the
: same cache line, we see a very high cost on hosts with 480 threads
: dealing with many interrupts.

While the rebalance_domains() issue has been resolved [1], we should
proactively annotate sched_clock_irqtime with __read_mostly to prevent
potential cacheline false sharing. This optimization is particularly
justified since sched_clock_irqtime is only modified during TSC instability
events.
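
As a quick sanity check, the effect of the annotation can be confirmed
by verifying that the symbol moves into the .data..read_mostly section
(a sketch; object path and output assume an x86_64 build of this tree):

  objdump -t kernel/sched/cputime.o | grep sched_clock_irqtime
  0000000000000000 g     O .data..read_mostly	0000000000000004 sched_clock_irqtime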

Link: https://lore.kernel.org/all/20250423174634.3009657-1-edumazet@google.com/ [0]
Link: https://lore.kernel.org/all/20250416035823.1846307-1-tim.c.chen@linux.intel.com/ [1]

Reported-by: Eric Dumazet <edumazet@google.com>
Debugged-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 kernel/sched/cputime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6dab4854c6c0..c499a42ceda4 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -22,7 +22,7 @@
  */
 DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
 
-int sched_clock_irqtime;
+int __read_mostly sched_clock_irqtime;
 
 void enable_sched_clock_irqtime(void)
 {
-- 
2.43.5



* Re: [PATCH v9 1/2] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
  2025-05-11  3:07 ` [PATCH v9 1/2] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2025-05-27 15:33   ` Michal Koutný
  2025-05-28  2:10     ` Yafang Shao
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Koutný @ 2025-05-27 15:33 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, hannes, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, surenb,
	linux-kernel, cgroups, lkp


Hello.

On Sun, May 11, 2025 at 11:07:59AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> The CPU usage of the cgroup is relatively low at around 55%, but this usage
> doesn't increase, even with more netperf tasks. The reason is that CPU0 is
> at 100% utilization, as confirmed by mpstat:
> 
>   02:56:22 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>   02:56:23 PM    0    0.99    0.00   55.45    0.00    0.99   42.57    0.00    0.00    0.00    0.00
> 
>   02:56:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>   02:56:24 PM    0    2.00    0.00   55.00    0.00    0.00   43.00    0.00    0.00    0.00    0.00
> 
> It is clear that the %soft time is excluded from the cgroup accounting of
> the interrupted task. This behavior is unexpected. We should include IRQ
> time in the cgroup's accounting to reflect the pressure the group is under.

I think this would go against the intention of CONFIG_IRQ_TIME_ACCOUNTING
(someone more familiar may chime in).

> After a thorough analysis, I discovered that this change in behavior is due
> to commit 305e6835e055 ("sched: Do not account irq time to current task"),
> which stopped charging IRQ time to the interrupted task. While I agree that
> a task should not be penalized by random interrupts, the task itself cannot
> make progress while interrupted. Therefore, the interrupted time should be
> reported to the user.
> 
> The system metric in cpuacct.stat is crucial in indicating whether a
> container is under heavy system pressure, including IRQ/softirq activity.
> Hence, IRQ/softirq time should be included in the cpuacct system usage,
> which also applies to cgroup2’s rstat.

So I guess, it'd be better to add a separate entry in cpu.stat with
irq_usec (instead of bundling it into system_usec in spite of
CONFIG_IRQ_TIME_ACCOUNTING).

I admit, I'd be happier if irq.pressure values could be used for
that. Maybe not the PSI ratio itself but irq.pressure:total should be
that amount. WDYT?

Michal



* Re: [PATCH v9 2/2] sched: Annotate sched_clock_irqtime with __read_mostly
  2025-05-11  3:08 ` [PATCH v9 2/2] sched: Annotate sched_clock_irqtime with __read_mostly Yafang Shao
@ 2025-05-27 15:33   ` Michal Koutný
  2025-05-28  2:13     ` Yafang Shao
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Koutný @ 2025-05-27 15:33 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, hannes, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, surenb,
	linux-kernel, cgroups, lkp, Eric Dumazet


On Sun, May 11, 2025 at 11:08:00AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> Eric reported an issue [0] as follows,
> : rebalance_domains() can attempt to change sched_balance_running
> : more than 350,000 times per second on our servers.
> 
> : If sched_clock_irqtime and sched_balance_running share the
> : same cache line, we see a very high cost on hosts with 480 threads
> : dealing with many interrupts.

I'd say this patch could be independent from the "series".

> While the rebalance_domains() issue has been resolved [1], we should
> proactively annotate sched_clock_irqtime with __read_mostly to prevent
> potential cacheline false sharing. This optimization is particularly
> justified since sched_clock_irqtime is only modified during TSC instability
> events.
> 
> Link: https://lore.kernel.org/all/20250423174634.3009657-1-edumazet@google.com/ [0]
> Link: https://lore.kernel.org/all/20250416035823.1846307-1-tim.c.chen@linux.intel.com/ [1]
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Debugged-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Eric Dumazet <edumazet@google.com>

I can say
Reviewed-by: Michal Koutný <mkoutny@suse.com>

but it'd be good to have also Tested-by: wrt the cache traffic
reduction.
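
For instance, perf c2c, the standard tool for cache-line contention
analysis, could back that up (a sketch; flags per perf-c2c(1)):

  # Record cache-to-cache traffic system-wide for 10 seconds, then
  # inspect which cache lines are contended and by which symbols:
  perf c2c record -a -- sleep 10
  perf c2c report --stats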

0.02€,
Michal



* Re: [PATCH v9 1/2] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
  2025-05-27 15:33   ` Michal Koutný
@ 2025-05-28  2:10     ` Yafang Shao
  0 siblings, 0 replies; 7+ messages in thread
From: Yafang Shao @ 2025-05-28  2:10 UTC (permalink / raw)
  To: Michal Koutný
  Cc: mingo, peterz, hannes, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, surenb,
	linux-kernel, cgroups, lkp

On Tue, May 27, 2025 at 11:33 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Sun, May 11, 2025 at 11:07:59AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > The CPU usage of the cgroup is relatively low at around 55%, but this usage
> > doesn't increase, even with more netperf tasks. The reason is that CPU0 is
> > at 100% utilization, as confirmed by mpstat:
> >
> >   02:56:22 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >   02:56:23 PM    0    0.99    0.00   55.45    0.00    0.99   42.57    0.00    0.00    0.00    0.00
> >
> >   02:56:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >   02:56:24 PM    0    2.00    0.00   55.00    0.00    0.00   43.00    0.00    0.00    0.00    0.00
> >
> > It is clear that the %soft time is excluded from the cgroup accounting of
> > the interrupted task. This behavior is unexpected. We should include IRQ
> > time in the cgroup's accounting to reflect the pressure the group is under.
>
> I think this would go against the intention of CONFIG_IRQ_TIME_ACCOUNTING
> (someone more familiar may chime in).

Please refer to the discussion with Ingo :
https://lore.kernel.org/all/aBsGXCKX8-2_Cn9x@gmail.com/

>
> > After a thorough analysis, I discovered that this change in behavior is due
> > to commit 305e6835e055 ("sched: Do not account irq time to current task"),
> > which stopped charging IRQ time to the interrupted task. While I agree that
> > a task should not be penalized by random interrupts, the task itself cannot
> > make progress while interrupted. Therefore, the interrupted time should be
> > reported to the user.
> >
> > The system metric in cpuacct.stat is crucial in indicating whether a
> > container is under heavy system pressure, including IRQ/softirq activity.
> > Hence, IRQ/softirq time should be included in the cpuacct system usage,
> > which also applies to cgroup2’s rstat.
>
> So I guess, it'd be better to add a separate entry in cpu.stat with
> irq_usec (instead of bundling it into system_usec in spite of
> CONFIG_IRQ_TIME_ACCOUNTING).
>
> I admit, I'd be happier if irq.pressure values could be used for
> that. Maybe not the PSI ratio itself but irq.pressure:total should be
> that amount. WDYT?

Thank you for your suggestion. Both methods can effectively retrieve
the container’s IRQ usage. However, I prefer adding a new entry
irq_usec to cpu.stat since it aligns better with CPU utilization
metrics.
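
For illustration, the suggested entry might look like this (hypothetical:
irq_usec does not exist in cpu.stat at this point, and the values are
made up):

  cat /sys/fs/cgroup/test/cpu.stat
  usage_usec 5530000
  user_usec 200000
  system_usec 5330000
  irq_usec 4300000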


--
Regards
Yafang


* Re: [PATCH v9 2/2] sched: Annotate sched_clock_irqtime with __read_mostly
  2025-05-27 15:33   ` Michal Koutný
@ 2025-05-28  2:13     ` Yafang Shao
  0 siblings, 0 replies; 7+ messages in thread
From: Yafang Shao @ 2025-05-28  2:13 UTC (permalink / raw)
  To: Michal Koutný
  Cc: mingo, peterz, hannes, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, surenb,
	linux-kernel, cgroups, lkp, Eric Dumazet

On Tue, May 27, 2025 at 11:33 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Sun, May 11, 2025 at 11:08:00AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > Eric reported an issue [0] as follows,
> > : rebalance_domains() can attempt to change sched_balance_running
> > : more than 350,000 times per second on our servers.
> >
> > : If sched_clock_irqtime and sched_balance_running share the
> > : same cache line, we see a very high cost on hosts with 480 threads
> > : dealing with many interrupts.
>
> I'd say this patch could be independent from the "series".

I will send it separately.

>
> > While the rebalance_domains() issue has been resolved [1], we should
> > proactively annotate sched_clock_irqtime with __read_mostly to prevent
> > potential cacheline false sharing. This optimization is particularly
> > justified since sched_clock_irqtime is only modified during TSC instability
> > events.
> >
> > Link: https://lore.kernel.org/all/20250423174634.3009657-1-edumazet@google.com/ [0]
> > Link: https://lore.kernel.org/all/20250416035823.1846307-1-tim.c.chen@linux.intel.com/ [1]
> >
> > Reported-by: Eric Dumazet <edumazet@google.com>
> > Debugged-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Eric Dumazet <edumazet@google.com>
>
> I can say
> Reviewed-by: Michal Koutný <mkoutny@suse.com>

Thanks for the review.

>
> but it'd be good to have also Tested-by: wrt the cache traffic
> reduction.


-- 
Regards
Yafang

