public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
@ 2024-11-08 13:29 Yafang Shao
  2024-11-08 13:29 ` [PATCH v5 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Yafang Shao @ 2024-11-08 13:29 UTC (permalink / raw)
  To: mingo, peterz
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
	Yafang Shao

After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
container environment, we encountered several user-visible behavioral
changes:

- Interrupted IRQ/softirq time is excluded in the cpuacct cgroup

  This breaks userspace applications that rely on CPU usage data from
  cgroups to monitor CPU pressure. This patchset resolves the issue by
  ensuring that IRQ/softirq time is included in the cgroup of the
  interrupted tasks.

- getrusage(2) does not include time interrupted by IRQ/softirq

  Some services use getrusage(2) to check if workloads are experiencing CPU
  pressure. Since IRQ/softirq time is no longer included in task runtime,
  getrusage(2) can no longer reflect the CPU pressure caused by heavy
  interrupts.

This patchset addresses the first issue, which is relatively
straightforward. Once this solution is accepted, I will address the second
issue in a follow-up patchset.

Enabling CONFIG_IRQ_TIME_ACCOUNTING results in the CPU
utilization metric excluding the time spent in IRQs. This means we
lose visibility into how long the CPU was actually interrupted in
comparison to its total utilization. Currently, the only ways to
monitor interrupt time are through IRQ PSI or the IRQ time recorded in
delay accounting. However, these metrics are independent of CPU
utilization, which makes it difficult to combine them into a single,
unified measure.

CPU utilization is a critical metric for almost all workloads, and
it's problematic if it fails to reflect the full extent of system
pressure. This situation is similar to iowait: when a task is in
iowait, it could be due to other tasks performing I/O. It doesn’t
matter if the I/O is being done by one of your tasks or by someone
else's; what matters is that your task is stalled and waiting on I/O.
Similarly, a comprehensive CPU utilization metric should reflect all
sources of pressure, including IRQ time, to provide a more accurate
representation of workload behavior.

One of the applications impacted by this issue is our Redis load-balancing
service. The setup operates as follows:

                   ----------------
                   | Load Balancer|
                   ----------------
                /    |      |        \
               /     |      |         \ 
          Server1 Server2 Server3 ... ServerN

Although the load balancer's algorithm is complex, it follows some core
principles:

- When server CPU utilization increases, it adds more servers and deploys
  additional instances to meet SLA requirements.
- When server CPU utilization decreases, it scales down by decommissioning
  servers and reducing the number of instances to save on costs.

The load balancer is malfunctioning due to the exclusion of IRQ time from
CPU utilization calculations.

Changes:
v4->v5:
- Don't use static key in the IRQ_TIME_ACCOUNTING=n case (Peter)
- Rename psi_irq_time to irq_time (Peter)
- Use CPUTIME_IRQ instead of CPUTIME_SOFTIRQ (Peter)

v3->v4: https://lore.kernel.org/all/20241101031750.1471-1-laoar.shao@gmail.com/
- Rebase

v2->v3:
- Add a helper account_irqtime() to avoid redundant code (Johannes)

v1->v2: https://lore.kernel.org/cgroups/20241008061951.3980-1-laoar.shao@gmail.com/
- Fix lockdep issues reported by kernel test robot <oliver.sang@intel.com>

v1: https://lore.kernel.org/all/20240923090028.16368-1-laoar.shao@gmail.com/



Yafang Shao (4):
  sched: Define sched_clock_irqtime as static key
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

 kernel/sched/core.c    | 77 +++++++++++++++++++++++++++++-------------
 kernel/sched/cputime.c | 16 ++++-----
 kernel/sched/psi.c     | 11 ++----
 kernel/sched/sched.h   | 15 +++++++-
 kernel/sched/stats.h   |  7 ++--
 5 files changed, 81 insertions(+), 45 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v5 1/4] sched: Define sched_clock_irqtime as static key
  2024-11-08 13:29 [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
@ 2024-11-08 13:29 ` Yafang Shao
  2024-12-03 10:01   ` Michal Koutný
  2024-11-08 13:29 ` [PATCH v5 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2024-11-08 13:29 UTC (permalink / raw)
  To: mingo, peterz
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
	Yafang Shao

Since CPU time accounting is a performance-critical path, let's define
sched_clock_irqtime as a static key to minimize potential overhead.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/sched/cputime.c | 16 +++++++---------
 kernel/sched/sched.h   | 13 +++++++++++++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0bed0fa1acd9..5d9143dd0879 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -9,6 +9,8 @@
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 
+DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);
+
 /*
  * There are no locks covering percpu hardirq/softirq time.
  * They are only modified in vtime_account, on corresponding CPU
@@ -22,16 +24,14 @@
  */
 DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
 
-static int sched_clock_irqtime;
-
 void enable_sched_clock_irqtime(void)
 {
-	sched_clock_irqtime = 1;
+	static_branch_enable(&sched_clock_irqtime);
 }
 
 void disable_sched_clock_irqtime(void)
 {
-	sched_clock_irqtime = 0;
+	static_branch_disable(&sched_clock_irqtime);
 }
 
 static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
@@ -57,7 +57,7 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
 	s64 delta;
 	int cpu;
 
-	if (!sched_clock_irqtime)
+	if (!irqtime_enabled())
 		return;
 
 	cpu = smp_processor_id();
@@ -90,8 +90,6 @@ static u64 irqtime_tick_accounted(u64 maxtime)
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
-#define sched_clock_irqtime	(0)
-
 static u64 irqtime_tick_accounted(u64 dummy)
 {
 	return 0;
@@ -478,7 +476,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	if (vtime_accounting_enabled_this_cpu())
 		return;
 
-	if (sched_clock_irqtime) {
+	if (irqtime_enabled()) {
 		irqtime_account_process_tick(p, user_tick, 1);
 		return;
 	}
@@ -507,7 +505,7 @@ void account_idle_ticks(unsigned long ticks)
 {
 	u64 cputime, steal;
 
-	if (sched_clock_irqtime) {
+	if (irqtime_enabled()) {
 		irqtime_account_idle_ticks(ticks);
 		return;
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 081519ffab46..0c83ab35256e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3179,6 +3179,12 @@ struct irqtime {
 };
 
 DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
+DECLARE_STATIC_KEY_FALSE(sched_clock_irqtime);
+
+static inline int irqtime_enabled(void)
+{
+	return static_branch_likely(&sched_clock_irqtime);
+}
 
 /*
  * Returns the irqtime minus the softirq time computed by ksoftirqd.
@@ -3199,6 +3205,13 @@ static inline u64 irq_time_read(int cpu)
 	return total;
 }
 
+#else
+
+static inline int irqtime_enabled(void)
+{
+	return 0;
+}
+
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #ifdef CONFIG_CPU_FREQ
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled
  2024-11-08 13:29 [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
  2024-11-08 13:29 ` [PATCH v5 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
@ 2024-11-08 13:29 ` Yafang Shao
  2024-12-03 10:01   ` Michal Koutný
  2024-11-08 13:29 ` [PATCH v5 3/4] sched, psi: " Yafang Shao
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2024-11-08 13:29 UTC (permalink / raw)
  To: mingo, peterz
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
	Yafang Shao

sched_clock_irqtime may be disabled due to the clock source, in which case
IRQ time should not be accounted. Let's add a conditional check to avoid
unnecessary logic.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/sched/core.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dbfb5717d6af..a75dad9be4b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -740,29 +740,31 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	s64 __maybe_unused steal = 0, irq_delta = 0;
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+	if (irqtime_enabled()) {
+		irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
-	/*
-	 * Since irq_time is only updated on {soft,}irq_exit, we might run into
-	 * this case when a previous update_rq_clock() happened inside a
-	 * {soft,}IRQ region.
-	 *
-	 * When this happens, we stop ->clock_task and only update the
-	 * prev_irq_time stamp to account for the part that fit, so that a next
-	 * update will consume the rest. This ensures ->clock_task is
-	 * monotonic.
-	 *
-	 * It does however cause some slight miss-attribution of {soft,}IRQ
-	 * time, a more accurate solution would be to update the irq_time using
-	 * the current rq->clock timestamp, except that would require using
-	 * atomic ops.
-	 */
-	if (irq_delta > delta)
-		irq_delta = delta;
+		/*
+		 * Since irq_time is only updated on {soft,}irq_exit, we might run into
+		 * this case when a previous update_rq_clock() happened inside a
+		 * {soft,}IRQ region.
+		 *
+		 * When this happens, we stop ->clock_task and only update the
+		 * prev_irq_time stamp to account for the part that fit, so that a next
+		 * update will consume the rest. This ensures ->clock_task is
+		 * monotonic.
+		 *
+		 * It does however cause some slight miss-attribution of {soft,}IRQ
+		 * time, a more accurate solution would be to update the irq_time using
+		 * the current rq->clock timestamp, except that would require using
+		 * atomic ops.
+		 */
+		if (irq_delta > delta)
+			irq_delta = delta;
 
-	rq->prev_irq_time += irq_delta;
-	delta -= irq_delta;
-	delayacct_irq(rq->curr, irq_delta);
+		rq->prev_irq_time += irq_delta;
+		delta -= irq_delta;
+		delayacct_irq(rq->curr, irq_delta);
+	}
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  2024-11-08 13:29 [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
  2024-11-08 13:29 ` [PATCH v5 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
  2024-11-08 13:29 ` [PATCH v5 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
@ 2024-11-08 13:29 ` Yafang Shao
  2024-12-03 10:01   ` Michal Koutný
  2024-11-08 13:29 ` [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
  2024-11-15 13:41 ` [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Michal Koutný
  4 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2024-11-08 13:29 UTC (permalink / raw)
  To: mingo, peterz
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
	Yafang Shao

sched_clock_irqtime may be disabled due to the clock source. When it is
disabled, irq_time_read() does not change over time, so there is nothing to
account, and we can skip iterating the whole hierarchy on every tick and
context switch.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..210aec717dc7 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -998,7 +998,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	s64 delta;
 	u64 irq;
 
-	if (static_branch_likely(&psi_disabled))
+	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
 		return;
 
 	if (!curr->pid)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
  2024-11-08 13:29 [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
                   ` (2 preceding siblings ...)
  2024-11-08 13:29 ` [PATCH v5 3/4] sched, psi: " Yafang Shao
@ 2024-11-08 13:29 ` Yafang Shao
  2024-12-03 10:01   ` Michal Koutný
  2024-11-15 13:41 ` [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Michal Koutný
  4 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2024-11-08 13:29 UTC (permalink / raw)
  To: mingo, peterz
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
	Yafang Shao

After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
container environment, we observed several noticeable behavioral changes.

One of our IRQ-heavy services, a Redis-based workload, showed a
significant drop in reported CPU usage after upgrading to the new kernel
with CONFIG_IRQ_TIME_ACCOUNTING enabled. However, even after adding more
threads to handle an increased workload, the reported CPU usage would not
rise. In other words, although the container's CPU usage appeared low, it
could not take on more work to make use of the seemingly available CPU, which
caused issues.

This behavior can be demonstrated using netperf:

  function start_server() {
      for j in $(seq 1 3); do
          netserver -p $((12345 + j)) > /dev/null &
      done
  }

  server_ip=$1
  function start_client() {
    # The same test applies to cgroup2; adjust the cgroup paths accordingly.
    mkdir -p /sys/fs/cgroup/cpuacct/test
    echo $$ > /sys/fs/cgroup/cpuacct/test/cgroup.procs
    for j in $(seq 1 3); do
        port=$((12345 + j))
        taskset -c 0 netperf -H ${server_ip} -l ${run_time:-30000}   \
                -t TCP_STREAM -p $port -- -D -m 1k -M 1K -s 8k -S 8k \
                > /dev/null &
    done
  }

  start_server
  start_client

We can verify the CPU usage of the test cgroup using cpuacct.stat. The
output shows:

  system: 53
  user: 2

The CPU usage of the cgroup is relatively low at around 55%, but this usage
doesn't increase, even with more netperf tasks. The reason is that CPU0 is
at 100% utilization, as confirmed by mpstat:

  02:56:22 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:23 PM    0    0.99    0.00   55.45    0.00    0.99   42.57    0.00    0.00    0.00    0.00

  02:56:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:24 PM    0    2.00    0.00   55.00    0.00    0.00   43.00    0.00    0.00    0.00    0.00

It is clear that the %soft is excluded in the cgroup of the interrupted
task. This behavior is unexpected. We should include IRQ time in the
cgroup to reflect the pressure the group is under.

After a thorough analysis, I discovered that this change in behavior is due
to commit 305e6835e055 ("sched: Do not account irq time to current task"),
which altered whether IRQ time should be charged to the interrupted task.
While I agree that a task should not be penalized by random interrupts, the
task itself cannot progress while interrupted. Therefore, the interrupted
time should be reported to the user.

The system metric in cpuacct.stat is crucial in indicating whether a
container is under heavy system pressure, including IRQ/softirq activity.
Hence, IRQ/softirq time should be included in the cpuacct system usage,
which also applies to cgroup2’s rstat.

The reason this patch doesn't simply add the cgroup_account_*() helpers to
irqtime_account_irq() is that taking the rq_lock in that critical path
could hurt performance. Taking inspiration from
commit ddae0ca2a8fe ("sched: Move psi_account_irqtime() out of
update_rq_clock_task() hotpath"), I've adapted the approach to handle the
accounting in a non-critical path, reducing the performance impact.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c  | 33 +++++++++++++++++++++++++++++++--
 kernel/sched/psi.c   | 13 +++----------
 kernel/sched/sched.h |  2 +-
 kernel/sched/stats.h |  7 ++++---
 4 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a75dad9be4b9..61545db8ca4b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
 static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
 #endif /* CONFIG_SCHED_DEBUG */
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static void account_irqtime(struct rq *rq, struct task_struct *curr,
+			    struct task_struct *prev)
+{
+	int cpu = smp_processor_id();
+	s64 delta;
+	u64 irq;
+
+	if (!irqtime_enabled())
+		return;
+
+	irq = irq_time_read(cpu);
+	delta = (s64)(irq - rq->irq_time);
+	if (delta < 0)
+		return;
+
+	rq->irq_time = irq;
+	psi_account_irqtime(rq, curr, prev, delta);
+	cgroup_account_cputime(curr, delta);
+	/* We account both softirq and irq into CPUTIME_IRQ */
+	cgroup_account_cputime_field(curr, CPUTIME_IRQ, delta);
+}
+#else
+static inline void account_irqtime(struct rq *rq, struct task_struct *curr,
+				   struct task_struct *prev)
+{
+}
+#endif
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -5600,7 +5629,7 @@ void sched_tick(void)
 	rq_lock(rq, &rf);
 
 	curr = rq->curr;
-	psi_account_irqtime(rq, curr, NULL);
+	account_irqtime(rq, curr, NULL);
 
 	update_rq_clock(rq);
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
@@ -6683,7 +6712,7 @@ static void __sched notrace __schedule(int sched_mode)
 		++*switch_count;
 
 		migrate_disable_switch(rq, prev);
-		psi_account_irqtime(rq, prev, next);
+		account_irqtime(rq, prev, next);
 		psi_sched_switch(prev, next, block);
 
 		trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 210aec717dc7..1adb41b2ae1d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -990,15 +990,14 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev,
+			 s64 delta)
 {
 	int cpu = task_cpu(curr);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	s64 delta;
-	u64 irq;
 
-	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
+	if (static_branch_likely(&psi_disabled))
 		return;
 
 	if (!curr->pid)
@@ -1009,12 +1008,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	if (prev && task_psi_group(prev) == group)
 		return;
 
-	irq = irq_time_read(cpu);
-	delta = (s64)(irq - rq->psi_irq_time);
-	if (delta < 0)
-		return;
-	rq->psi_irq_time = irq;
-
 	do {
 		u64 now;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0c83ab35256e..690fc6f9d97c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1223,7 +1223,7 @@ struct rq {
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64			prev_irq_time;
-	u64			psi_irq_time;
+	u64			irq_time;
 #endif
 #ifdef CONFIG_PARAVIRT
 	u64			prev_steal_time;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 767e098a3bd1..17eefe5876a5 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,10 +111,11 @@ void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev);
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
+			 struct task_struct *prev, s64 delta);
 #else
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, s64 delta) {}
 #endif /*CONFIG_IRQ_TIME_ACCOUNTING */
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -215,7 +216,7 @@ static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, s64 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
  2024-11-08 13:29 [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
                   ` (3 preceding siblings ...)
  2024-11-08 13:29 ` [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2024-11-15 13:41 ` Michal Koutný
  2024-11-17  2:56   ` Yafang Shao
  4 siblings, 1 reply; 17+ messages in thread
From: Michal Koutný @ 2024-11-15 13:41 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2744 bytes --]

Hello Yafang.

On Fri, Nov 08, 2024 at 09:29:00PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> container environment, we encountered several user-visible behavioral
> changes:
> 
> - Interrupted IRQ/softirq time is excluded in the cpuacct cgroup
> 
>   This breaks userspace applications that rely on CPU usage data from
>   cgroups to monitor CPU pressure. This patchset resolves the issue by
>   ensuring that IRQ/softirq time is included in the cgroup of the
>   interrupted tasks.
> 
> - getrusage(2) does not include time interrupted by IRQ/softirq
> 
>   Some services use getrusage(2) to check if workloads are experiencing CPU
>   pressure. Since IRQ/softirq time is no longer included in task runtime,
>   getrusage(2) can no longer reflect the CPU pressure caused by heavy
>   interrupts.
 
I understand that IRQ/softirq time is difficult to attribute to an
"accountable" entity and that it's technically simplest to attribute it to
everyone/no one, i.e. to the root cgroup (or through a global stat without
cgroups).

> This patchset addresses the first issue, which is relatively
> straightforward. Once this solution is accepted, I will address the second
> issue in a follow-up patchset.

Is the first issue about cpuacct data or irq.pressure?

It sounds kind of both, and I noticed that documentation for irq.pressure
is lacking in Documentation/accounting/psi.rst. While you're touching this,
could you please add a paragraph or sentence explaining what this value
represents?

(Also, there is the same change both for cpuacct and
cgroup_base_stat_cputime_show(), right?)

>                    ----------------
>                    | Load Balancer|
>                    ----------------
>                 /    |      |        \
>                /     |      |         \ 
>           Server1 Server2 Server3 ... ServerN
> 
> Although the load balancer's algorithm is complex, it follows some core
> principles:
> 
> - When server CPU utilization increases, it adds more servers and deploys
>   additional instances to meet SLA requirements.
> - When server CPU utilization decreases, it scales down by decommissioning
>   servers and reducing the number of instances to save on costs.

Does "server" here refer to a whole node (a whole kernel) or to a cgroup
(i.e. multiple servers on top of one kernel)?

> The load balancer is malfunctioning due to the exclusion of IRQ time from
> CPU utilization calculations.

Could this be fixed by subtracting (global) IRQ time from (presumed
total) system capacity that the balancer uses for its decisions? (i.e.
without exact per-cgroup breakdown of IRQ time)

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
  2024-11-15 13:41 ` [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Michal Koutný
@ 2024-11-17  2:56   ` Yafang Shao
  2024-11-18 10:10     ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2024-11-17  2:56 UTC (permalink / raw)
  To: Michal Koutný
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel

On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Yafang.
>
> On Fri, Nov 08, 2024 at 09:29:00PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> > container environment, we encountered several user-visible behavioral
> > changes:
> >
> > - Interrupted IRQ/softirq time is excluded in the cpuacct cgroup
> >
> >   This breaks userspace applications that rely on CPU usage data from
> >   cgroups to monitor CPU pressure. This patchset resolves the issue by
> >   ensuring that IRQ/softirq time is included in the cgroup of the
> >   interrupted tasks.
> >
> > - getrusage(2) does not include time interrupted by IRQ/softirq
> >
> >   Some services use getrusage(2) to check if workloads are experiencing CPU
> >   pressure. Since IRQ/softirq time is no longer included in task runtime,
> >   getrusage(2) can no longer reflect the CPU pressure caused by heavy
> >   interrupts.
>
> I understand that IRQ/softirq time is difficult to attribute to an
> "accountable" entity and that it's technically simplest to attribute it to
> everyone/no one, i.e. to the root cgroup (or through a global stat without
> cgroups).

This issue is not about deciding which IRQ/softIRQ events should be
accounted for. Instead, it focuses on reflecting the interrupted
runtime of a task or a cgroup. I might be misunderstanding the
distinction between *charge* and *account*—or perhaps there is no
difference between them—but PATCH #4 captures exactly what I mean.
While IRQ/softIRQ time should not be attributed to the interrupted
task, it is crucial to have a metric that reflects this interrupted
runtime in CPU utilization.

The purpose of this patchset is to address this issue, conceptually
represented as:

   |<----Runtime---->|<----Interrupted time---->|<----Runtime---->|<---Sleep-->|

Without reflecting the *interrupted time* in CPU utilization, a gap—or
hole—is created:

    |<----Runtime---->|<----HOLE---->|<----Runtime---->|<---Sleep-->|

This gap will misleadingly appear as sleep time to the user:

  |<----Runtime---->|<----Sleep---->|<----Runtime---->|<---Sleep-->|

As a result, users may interpret this as underutilized CPU time and
attempt to increase their workloads to raise CPU runtime. However,
these efforts will be futile, as the observed runtime cannot increase
due to the missing metric for interrupted time.

>
> > This patchset addresses the first issue, which is relatively
> > straightforward. Once this solution is accepted, I will address the second
> > issue in a follow-up patchset.
>
> Is the first issue about cpuacct data or irq.pressure?

The data in question is from cpu.stat. Below is the cpu.stat file for cgroup2:

  $ cat cpu.stat
  usage_usec 0
  user_usec 0
  system_usec 0                            <<<< We should reflect the interrupted time here.
  core_sched.force_idle_usec 0
  nr_periods 0
  nr_throttled 0
  throttled_usec 0
  nr_bursts 0
  burst_usec 0

>
> It sounds kind of both and I noticed the docs for irq.pressure is
> lacking in Documentation/accounting/psi.rst. When you're touching this,
> could you please add a paragraph or sentence explaining what does this
> value represent?

I believe I have explained this clearly in the comments above.
However, if anything remains unclear, please feel free to ask for
further clarification.

>
> (Also, there is same change both for cpuacct and
> cgroup_base_stat_cputime_show(), right?)
>
> >                    ----------------
> >                    | Load Balancer|
> >                    ----------------
> >                 /    |      |        \
> >                /     |      |         \
> >           Server1 Server2 Server3 ... ServerN
> >
> > Although the load balancer's algorithm is complex, it follows some core
> > principles:
> >
> > - When server CPU utilization increases, it adds more servers and deploys
> >   additional instances to meet SLA requirements.
> > - When server CPU utilization decreases, it scales down by decommissioning
> >   servers and reducing the number of instances to save on costs.
>
> Does "server" here refer to a whole node (a whole kernel) or to a cgroup
> (i.e. multiple servers on top of one kernel)?

It is, in fact, a cgroup. These cgroups may be deployed across
different servers.

>
> > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > CPU utilization calculations.
>
> Could this be fixed by subtracting (global) IRQ time from (presumed
> total) system capacity that the balancer uses for its decisions? (i.e.
> without exact per-cgroup breakdown of IRQ time)

The issue here is that the global IRQ time may include the interrupted
time of tasks outside the target cgroup. As a result, I don't believe
it's possible to find a reliable solution without modifying the
kernel.

--
Regards
Yafang

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
  2024-11-17  2:56   ` Yafang Shao
@ 2024-11-18 10:10     ` Peter Zijlstra
  2024-11-18 12:12       ` Yafang Shao
  2024-11-22  3:17       ` Yafang Shao
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-11-18 10:10 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Michal Koutný, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	surenb, cgroups, linux-kernel

On Sun, Nov 17, 2024 at 10:56:21AM +0800, Yafang Shao wrote:
> On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:

> > > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > > CPU utilization calculations.
> >
> > Could this be fixed by subtracting (global) IRQ time from (presumed
> > total) system capacity that the balancer uses for its decisions? (i.e.
> > without exact per-cgroup breakdown of IRQ time)
> 
> The issue here is that the global IRQ time may include the interrupted
> time of tasks outside the target cgroup. As a result, I don't believe
> it's possible to find a reliable solution without modifying the
> kernel.

Since there is no relation between the interrupt and the interrupted
task (and through that its cgroup) -- all time might or might not be
part of your cgroup of interest. Consider it a random distribution if
you will.

What Michal suggests seems no less fair, and possibly more fair, than
what you propose:

 \Sum cgroup = total - IRQ

As opposed to what you propose:

 \Sum (cgroup + cgroup-IRQ) = total - remainder-IRQ

Like I argued earlier, if you have two cgroups, one doing a while(1)
loop (a proxy for computation) and one doing heavy IO or
networking, then per your accounting the computation cgroup will get a
significant amount of IRQ time 'injected', even though it evidently
does not belong to that group.

Injecting 'half' of the interrupts into the computation group, and missing
'half' of the interrupts from the network group, will give 'wrong'
load-balance results too.

I remain unconvinced that any of this makes sense.
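The contrast between the two attribution schemes can be put in numbers; a toy model (all values made up for illustration, not taken from this thread):

```python
# Toy model of the two accounting schemes on one CPU over a 100ms window.
total = 100.0        # wall-clock ms on this CPU
irq = 10.0           # ms spent in IRQ/softirq (all caused by "network" traffic)
runnable = {"compute": 60.0, "network": 30.0}  # task time per cgroup

# Scheme A (subtract global IRQ): cgroups are charged task time only.
scheme_a = dict(runnable)
assert sum(scheme_a.values()) == total - irq   # \Sum cgroup = total - IRQ

# Scheme B (the patchset): IRQ time is charged to whichever cgroup's task
# happened to be interrupted -- roughly proportional to runtime.
busy = sum(runnable.values())
scheme_b = {c: t + irq * (t / busy) for c, t in runnable.items()}

print(scheme_b)  # the compute cgroup is charged ~6.7ms of IRQ it did not cause
```

Under scheme B the while(1) cgroup absorbs IRQ time merely because it runs more, which is the injection effect described above.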


* Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
  2024-11-18 10:10     ` Peter Zijlstra
@ 2024-11-18 12:12       ` Yafang Shao
  2024-11-22  3:17       ` Yafang Shao
  1 sibling, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2024-11-18 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Koutný, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	surenb, cgroups, linux-kernel

On Mon, Nov 18, 2024 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 17, 2024 at 10:56:21AM +0800, Yafang Shao wrote:
> > On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> > > > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > > > CPU utilization calculations.
> > >
> > > Could this be fixed by subtracting (global) IRQ time from (presumed
> > > total) system capacity that the balancer uses for its decisions? (i.e.
> > > without exact per-cgroup breakdown of IRQ time)
> >
> > The issue here is that the global IRQ time may include the interrupted
> > time of tasks outside the target cgroup. As a result, I don't believe
> > it's possible to find a reliable solution without modifying the
> > kernel.
>
> Since there is no relation between the interrupt and the interrupted
> task (and through that its cgroup) -- all time might or might not be
> part of your cgroup of interest. Consider it a random distribution if
> you will.
>
> What Michal suggests seems no less fair, and possibly more fair, than
> what you propose:
>
>  \Sum cgroup = total - IRQ

The key issue here is determining how to reliably obtain the IRQ time. I
don't believe there is a dependable way to achieve this.

For example, consider a server with 16 CPUs. My cgroup contains 4
threads that can freely migrate across CPUs, while other tasks are
also running on the system simultaneously. In this scenario, how can
we accurately determine the IRQ time to subtract?

>
> As opposed to what you propose:
>
>  \Sum (cgroup + cgroup-IRQ) = total - remainder-IRQ
>
> Like I argued earlier, if you have two cgroups, one doing a while(1)
> loop (a proxy for computation) and one doing heavy IO or
> networking, then per your accounting the computation cgroup will get a
> significant amount of IRQ time 'injected', even though it evidently
> does not belong to that group.

That is precisely what the user wants. If my tasks are frequently
interrupted by IRQs, it indicates that my service may be experiencing
poor quality. In response, I would likely reduce the traffic sent to
it. If the issue persists and IRQ interruptions remain high, I would
then consider migrating the service to other servers.

>
> Injecting 'half' of the interrupts into the computation group, and missing
> 'half' of the interrupts from the network group, will give 'wrong'
> load-balance results too.
>
> I remain unconvinced that any of this makes sense.

If we are uncertain about which choice makes more sense, it might be
better to align this behavior with the case where
CONFIG_IRQ_TIME_ACCOUNTING=n.

-- 
Regards
Yafang


* Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
  2024-11-18 10:10     ` Peter Zijlstra
  2024-11-18 12:12       ` Yafang Shao
@ 2024-11-22  3:17       ` Yafang Shao
  1 sibling, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2024-11-22  3:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Koutný, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	surenb, cgroups, linux-kernel

On Mon, Nov 18, 2024 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 17, 2024 at 10:56:21AM +0800, Yafang Shao wrote:
> > On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> > > > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > > > CPU utilization calculations.
> > >
> > > Could this be fixed by subtracting (global) IRQ time from (presumed
> > > total) system capacity that the balancer uses for its decisions? (i.e.
> > > without exact per-cgroup breakdown of IRQ time)
> >
> > The issue here is that the global IRQ time may include the interrupted
> > time of tasks outside the target cgroup. As a result, I don't believe
> > it's possible to find a reliable solution without modifying the
> > kernel.
>
> Since there is no relation between the interrupt and the interrupted
> task (and through that its cgroup) -- all time might or might not be
> part of your cgroup of interest. Consider it a random distribution if
> you will.

Some points require further clarification.

On our servers, the majority of IRQ/softIRQ activity originates from
network traffic, and we consistently enable Receive Flow Steering
(RFS) [0]. This configuration ensures that softIRQs are more likely to
interrupt the tasks responsible for processing the corresponding
packets. As a result, the distribution of softIRQs is not random but
instead closely aligned with the packet-handling tasks.

[0]. https://lwn.net/Articles/381955/
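For context, RFS is configured through procfs/sysfs knobs; a configuration sketch (the interface name "eth0" and the table sizes are illustrative assumptions, not values from this thread):

```shell
# Global flow table that RFS uses to steer softirq work to the CPU where
# the consuming application runs:
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

# Per-RX-queue flow count, commonly rps_sock_flow_entries / nr_rx_queues:
for q in /sys/class/net/eth0/queues/rx-*; do
    echo 2048 > "$q/rps_flow_cnt"
done
```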


* Re: [PATCH v5 1/4] sched: Define sched_clock_irqtime as static key
  2024-11-08 13:29 ` [PATCH v5 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
@ 2024-12-03 10:01   ` Michal Koutný
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Koutný @ 2024-12-03 10:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel


On Fri, Nov 08, 2024 at 09:29:01PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 081519ffab46..0c83ab35256e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
...  
> +#else
> +
> +static inline int irqtime_enabled(void)
> +{
> +	return 0;
> +}
> +

You can reuse the removed macro here:

BTW, have you measured whether this change has an effect above rounding
error?

But generally,
Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [PATCH v5 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled
  2024-11-08 13:29 ` [PATCH v5 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
@ 2024-12-03 10:01   ` Michal Koutný
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Koutný @ 2024-12-03 10:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel


On Fri, Nov 08, 2024 at 09:29:02PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> sched_clock_irqtime may be disabled due to the clock source, in which case
> IRQ time should not be accounted. Let's add a conditional check to avoid
> unnecessary logic.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/sched/core.c | 44 +++++++++++++++++++++++---------------------
>  1 file changed, 23 insertions(+), 21 deletions(-)

This is actually a good catch!

Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [PATCH v5 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  2024-11-08 13:29 ` [PATCH v5 3/4] sched, psi: " Yafang Shao
@ 2024-12-03 10:01   ` Michal Koutný
  2024-12-03 10:15     ` Michal Koutný
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Koutný @ 2024-12-03 10:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel


On Fri, Nov 08, 2024 at 09:29:03PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> sched_clock_irqtime may be disabled due to the clock source. When disabled,
> irq_time_read() won't change over time, so there is nothing to account. We
> can save iterating the whole hierarchy on every tick and context switch.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  kernel/sched/psi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
  2024-11-08 13:29 ` [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2024-12-03 10:01   ` Michal Koutný
  2024-12-04  2:17     ` Yafang Shao
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Koutný @ 2024-12-03 10:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel


On Fri, Nov 08, 2024 at 09:29:04PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> The system metric in cpuacct.stat is crucial in indicating whether a
> container is under heavy system pressure, including IRQ/softirq activity.
> Hence, IRQ/softirq time should be included in the cpuacct system usage,
> which also applies to cgroup2’s rstat.

(snipped from cover letter thread)

On Mon, Nov 18, 2024 at 08:12:03PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> The key issue here is determining how to reliably obtain the IRQ time. I
> don't believe there is a dependable way to achieve this.
> 
> For example, consider a server with 16 CPUs. My cgroup contains 4
> threads that can freely migrate across CPUs, while other tasks are
> also running on the system simultaneously. In this scenario, how can
> we accurately determine the IRQ time to subtract?

I understand there's some IRQ noise that is a property of the CPU (the
noise is a function of the CPU).

Then there are cgroup workloads; on a single CPU the per-cgroup impact
depends on how much that given cgroup runs on the CPU (it's more exposed).
The whole cgroup's impact is the sum of these (i.e. it's a kind of scalar
product between the per-CPU IRQ noise and the cgroup's per-CPU CPU
consumption).

(In your usage, there's some correlation of IRQ noise with CPU
consumption).

> That is precisely what the user wants. If my tasks are frequently
> interrupted by IRQs, it indicates that my service may be experiencing
> poor quality. In response, I would likely reduce the traffic sent to
> it. If the issue persists and IRQ interruptions remain high, I would
> then consider migrating the service to other servers.

If I look at
	52b1364ba0b10 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")
(where it's clearer than after
	ddae0ca2a8fe1 ("sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath")
)

the IRQ pressure accounting takes into account the task (its cgroup) vs
IRQ time (~multiplication), and then it's summed a) over time
(psi_account_irqtime()) and b) over CPUs (collect_percpu_times()), so it's
the scalar product (squinting) I mentioned above.
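Written out with hypothetical numbers (purely illustrative, not taken from the commits cited):

```python
# Per-CPU view of the "scalar product": the cgroup's IRQ impact is the sum
# over CPUs of (IRQ noise on that CPU) x (cgroup's exposure on that CPU).
irq_noise = [0.20, 0.05, 0.00, 0.10]      # fraction of each CPU spent in IRQ
cgroup_share = [0.50, 0.10, 0.90, 0.25]   # cgroup's share of task time per CPU

impact = sum(noise * share for noise, share in zip(irq_noise, cgroup_share))
print(impact)  # ~0.13
```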

Therefore I think irq.pressure provides exactly the information that's
relevant for your scheduling decisions and the info cannot be fit into
cpuacct.stat.

Or what is irq.pressure missing or distorting for your scenario?

HTH,
Michal



* Re: [PATCH v5 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  2024-12-03 10:01   ` Michal Koutný
@ 2024-12-03 10:15     ` Michal Koutný
  2024-12-04  2:23       ` Yafang Shao
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Koutný @ 2024-12-03 10:15 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel


On Tue, Dec 03, 2024 at 11:01:41AM GMT, Michal Koutný <mkoutny@suse.com> wrote:
> On Fri, Nov 08, 2024 at 09:29:03PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > sched_clock_irqtime may be disabled due to the clock source. When disabled,
> > irq_time_read() won't change over time, so there is nothing to account. We
> > can save iterating the whole hierarchy on every tick and context switch.
> > 
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  kernel/sched/psi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Reviewed-by: Michal Koutný <mkoutny@suse.com>

On second thought, a similar guard may be useful in psi_show() too, since
there's a difference between zero pressure and unmeasured pressure (it'd
fail with EOPNOTSUPP).

(How common is it actually that tsc_init fails?)

Michal



* Re: [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
  2024-12-03 10:01   ` Michal Koutný
@ 2024-12-04  2:17     ` Yafang Shao
  0 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2024-12-04  2:17 UTC (permalink / raw)
  To: Michal Koutný
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel

On Tue, Dec 3, 2024 at 6:01 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Nov 08, 2024 at 09:29:04PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > The system metric in cpuacct.stat is crucial in indicating whether a
> > container is under heavy system pressure, including IRQ/softirq activity.
> > Hence, IRQ/softirq time should be included in the cpuacct system usage,
> > which also applies to cgroup2’s rstat.
>
> (snipped from cover letter thread)
>
> On Mon, Nov 18, 2024 at 08:12:03PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > The key issue here is determining how to reliably obtain the IRQ time. I
> > don't believe there is a dependable way to achieve this.
> >
> > For example, consider a server with 16 CPUs. My cgroup contains 4
> > threads that can freely migrate across CPUs, while other tasks are
> > also running on the system simultaneously. In this scenario, how can
> > we accurately determine the IRQ time to subtract?
>
> I understand there's some IRQ noise that is a property of the CPU (the
> noise is a function of the CPU).
>
> Then there are cgroup workloads; on a single CPU the per-cgroup impact
> depends on how much that given cgroup runs on the CPU (it's more exposed).
> The whole cgroup's impact is the sum of these (i.e. it's a kind of scalar
> product between the per-CPU IRQ noise and the cgroup's per-CPU CPU
> consumption).
>
> (In your usage, there's some correlation of IRQ noise with CPU
> consumption).
>
> > That is precisely what the user wants. If my tasks are frequently
> > interrupted by IRQs, it indicates that my service may be experiencing
> > poor quality. In response, I would likely reduce the traffic sent to
> > it. If the issue persists and IRQ interruptions remain high, I would
> > then consider migrating the service to other servers.
>
> If I look at
>         52b1364ba0b10 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")
> (where it's clearer than after
>         ddae0ca2a8fe1 ("sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath")
> )
>
> the IRQ pressure accounting takes into account the task (its cgroup) vs
> IRQ time (~multiplication), and then it's summed a) over time
> (psi_account_irqtime()) and b) over CPUs (collect_percpu_times()), so it's
> the scalar product (squinting) I mentioned above.
>
> Therefore I think irq.pressure provides exactly the information that's
> relevant for your scheduling decisions and the info cannot be fit into
> cpuacct.stat.
>
> Or what is irq.pressure missing or distorting for your scenario?

irq.pressure is a metric that operates independently of CPU
utilization. It can be used to monitor whether latency spikes in a
cgroup are caused by high IRQ pressure; this is exactly how we use
it in our production environment.

However, this metric alone doesn’t provide clear guidance on whether
to add or reduce CPUs for a workload. To address this, we’ve attempted
to combine irq.pressure with CPU utilization into a unified metric,
but achieving this has proven to be challenging.
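For illustration, the raw ingredients of such a unified metric can at least be read and parsed; a sketch that assumes the standard cgroup2 PSI line format (the sample string below is made up):

```python
# Parse the single "full" line exposed by irq.pressure, e.g.
#   full avg10=1.23 avg60=0.80 avg300=0.40 total=123456
def parse_psi_full(text):
    for line in text.splitlines():
        if line.startswith("full"):
            fields = dict(kv.split("=") for kv in line.split()[1:])
            return {k: float(v) for k, v in fields.items()}
    raise ValueError("no 'full' line in PSI output")

sample = "full avg10=1.23 avg60=0.80 avg300=0.40 total=123456"
print(parse_psi_full(sample)["avg10"])  # 1.23
```

Folding this together with cpu.stat's usage into one number is where the difficulty described above lies.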

--
Regards
Yafang


* Re: [PATCH v5 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  2024-12-03 10:15     ` Michal Koutný
@ 2024-12-04  2:23       ` Yafang Shao
  0 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2024-12-04  2:23 UTC (permalink / raw)
  To: Michal Koutný
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb, cgroups,
	linux-kernel

On Tue, Dec 3, 2024 at 6:15 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Dec 03, 2024 at 11:01:41AM GMT, Michal Koutný <mkoutny@suse.com> wrote:
> > On Fri, Nov 08, 2024 at 09:29:03PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > sched_clock_irqtime may be disabled due to the clock source. When disabled,
> > > irq_time_read() won't change over time, so there is nothing to account. We
> > > can save iterating the whole hierarchy on every tick and context switch.
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  kernel/sched/psi.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Reviewed-by: Michal Koutný <mkoutny@suse.com>
>
> On second thought, a similar guard may be useful in psi_show() too, since
> there's a difference between zero pressure and unmeasured pressure (it'd
> fail with EOPNOTSUPP).

I'll update the psi_show() function in the next version.

>
> (How common is it actually that tsc_init fails?)

This is not a common scenario, but it can occur randomly across a large fleet.

--
Regards
Yafang


end of thread, other threads:[~2024-12-04  2:23 UTC | newest]

Thread overview: 17+ messages
2024-11-08 13:29 [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Yafang Shao
2024-11-08 13:29 ` [PATCH v5 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
2024-12-03 10:01   ` Michal Koutný
2024-11-08 13:29 ` [PATCH v5 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
2024-12-03 10:01   ` Michal Koutný
2024-11-08 13:29 ` [PATCH v5 3/4] sched, psi: " Yafang Shao
2024-12-03 10:01   ` Michal Koutný
2024-12-03 10:15     ` Michal Koutný
2024-12-04  2:23       ` Yafang Shao
2024-11-08 13:29 ` [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-12-03 10:01   ` Michal Koutný
2024-12-04  2:17     ` Yafang Shao
2024-11-15 13:41 ` [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled Michal Koutný
2024-11-17  2:56   ` Yafang Shao
2024-11-18 10:10     ` Peter Zijlstra
2024-11-18 12:12       ` Yafang Shao
2024-11-22  3:17       ` Yafang Shao
