cgroups.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
@ 2024-10-08  6:19 Yafang Shao
  2024-10-08  6:19 ` [PATCH v2 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Yafang Shao @ 2024-10-08  6:19 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb
  Cc: cgroups, linux-kernel, Yafang Shao

After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
container environment, we encountered several user-visible behavioral
changes:

- Interrupted IRQ/softirq time is not accounted for in the cpuacct cgroup

  This breaks userspace applications that rely on CPU usage data from
  cgroups to monitor CPU pressure. This patchset resolves the issue by
  ensuring that IRQ/softirq time is accounted for in the cgroup of the
  interrupted tasks.

- getrusage(2) does not include time interrupted by IRQ/softirq 

  Some services use getrusage(2) to check if workloads are experiencing CPU
  pressure. Since IRQ/softirq time is no longer charged to task runtime,
  getrusage(2) can no longer reflect the CPU pressure caused by heavy
  interrupts.

This patchset addresses the first issue, which is relatively
straightforward. However, the second issue remains unresolved, as there
might be debate over whether interrupted time should be considered part of
a task’s usage. Nonetheless, it is important to report interrupted time to
the user via some metric, though that is a separate discussion.

Changes:
v1->v2: 
- Fix lockdep issues reported by kernel test robot <oliver.sang@intel.com> 

v1: https://lore.kernel.org/all/20240923090028.16368-1-laoar.shao@gmail.com/

Yafang Shao (4):
  sched: Define sched_clock_irqtime as static key
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING

 kernel/sched/core.c    | 83 ++++++++++++++++++++++++++++++------------
 kernel/sched/cputime.c | 16 ++++----
 kernel/sched/psi.c     | 12 +-----
 kernel/sched/sched.h   |  1 +
 kernel/sched/stats.h   |  7 ++--
 5 files changed, 74 insertions(+), 45 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/4] sched: Define sched_clock_irqtime as static key
  2024-10-08  6:19 [PATCH v2 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2024-10-08  6:19 ` Yafang Shao
  2024-10-08  6:19 ` [PATCH v2 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Yafang Shao @ 2024-10-08  6:19 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb
  Cc: cgroups, linux-kernel, Yafang Shao

Since CPU time accounting is a performance-critical path, let's define
sched_clock_irqtime as a static key to minimize potential overhead.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/sched/cputime.c | 16 +++++++---------
 kernel/sched/sched.h   |  1 +
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0bed0fa1acd9..d0b6ea737d04 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -7,6 +7,8 @@
  #include <asm/cputime.h>
 #endif
 
+DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 
 /*
@@ -22,16 +24,14 @@
  */
 DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
 
-static int sched_clock_irqtime;
-
 void enable_sched_clock_irqtime(void)
 {
-	sched_clock_irqtime = 1;
+	static_branch_enable(&sched_clock_irqtime);
 }
 
 void disable_sched_clock_irqtime(void)
 {
-	sched_clock_irqtime = 0;
+	static_branch_disable(&sched_clock_irqtime);
 }
 
 static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
@@ -57,7 +57,7 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
 	s64 delta;
 	int cpu;
 
-	if (!sched_clock_irqtime)
+	if (!static_branch_likely(&sched_clock_irqtime))
 		return;
 
 	cpu = smp_processor_id();
@@ -90,8 +90,6 @@ static u64 irqtime_tick_accounted(u64 maxtime)
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
-#define sched_clock_irqtime	(0)
-
 static u64 irqtime_tick_accounted(u64 dummy)
 {
 	return 0;
@@ -478,7 +476,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	if (vtime_accounting_enabled_this_cpu())
 		return;
 
-	if (sched_clock_irqtime) {
+	if (static_branch_likely(&sched_clock_irqtime)) {
 		irqtime_account_process_tick(p, user_tick, 1);
 		return;
 	}
@@ -507,7 +505,7 @@ void account_idle_ticks(unsigned long ticks)
 {
 	u64 cputime, steal;
 
-	if (sched_clock_irqtime) {
+	if (static_branch_likely(&sched_clock_irqtime)) {
 		irqtime_account_idle_ticks(ticks);
 		return;
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8063db62b027..db7d541eebff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3174,6 +3174,7 @@ struct irqtime {
 };
 
 DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
+DECLARE_STATIC_KEY_FALSE(sched_clock_irqtime);
 
 /*
  * Returns the irqtime minus the softirq time computed by ksoftirqd.
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled
  2024-10-08  6:19 [PATCH v2 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
  2024-10-08  6:19 ` [PATCH v2 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
@ 2024-10-08  6:19 ` Yafang Shao
  2024-10-08  6:19 ` [PATCH v2 3/4] sched, psi: " Yafang Shao
  2024-10-08  6:19 ` [PATCH v2 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
  3 siblings, 0 replies; 8+ messages in thread
From: Yafang Shao @ 2024-10-08  6:19 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb
  Cc: cgroups, linux-kernel, Yafang Shao

sched_clock_irqtime may be disabled due to the clock source, in which case
IRQ time should not be accounted. Let's add a conditional check to avoid
unnecessary logic.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/sched/core.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b6cc1cf499d6..8b633a14a60f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -735,29 +735,31 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	s64 __maybe_unused steal = 0, irq_delta = 0;
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+	if (static_branch_likely(&sched_clock_irqtime)) {
+		irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
-	/*
-	 * Since irq_time is only updated on {soft,}irq_exit, we might run into
-	 * this case when a previous update_rq_clock() happened inside a
-	 * {soft,}IRQ region.
-	 *
-	 * When this happens, we stop ->clock_task and only update the
-	 * prev_irq_time stamp to account for the part that fit, so that a next
-	 * update will consume the rest. This ensures ->clock_task is
-	 * monotonic.
-	 *
-	 * It does however cause some slight miss-attribution of {soft,}IRQ
-	 * time, a more accurate solution would be to update the irq_time using
-	 * the current rq->clock timestamp, except that would require using
-	 * atomic ops.
-	 */
-	if (irq_delta > delta)
-		irq_delta = delta;
+		/*
+		 * Since irq_time is only updated on {soft,}irq_exit, we might run into
+		 * this case when a previous update_rq_clock() happened inside a
+		 * {soft,}IRQ region.
+		 *
+		 * When this happens, we stop ->clock_task and only update the
+		 * prev_irq_time stamp to account for the part that fit, so that a next
+		 * update will consume the rest. This ensures ->clock_task is
+		 * monotonic.
+		 *
+		 * It does however cause some slight miss-attribution of {soft,}IRQ
+		 * time, a more accurate solution would be to update the irq_time using
+		 * the current rq->clock timestamp, except that would require using
+		 * atomic ops.
+		 */
+		if (irq_delta > delta)
+			irq_delta = delta;
 
-	rq->prev_irq_time += irq_delta;
-	delta -= irq_delta;
-	delayacct_irq(rq->curr, irq_delta);
+		rq->prev_irq_time += irq_delta;
+		delta -= irq_delta;
+		delayacct_irq(rq->curr, irq_delta);
+	}
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  2024-10-08  6:19 [PATCH v2 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
  2024-10-08  6:19 ` [PATCH v2 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
  2024-10-08  6:19 ` [PATCH v2 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
@ 2024-10-08  6:19 ` Yafang Shao
  2024-10-08 14:04   ` Johannes Weiner
  2024-10-08  6:19 ` [PATCH v2 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
  3 siblings, 1 reply; 8+ messages in thread
From: Yafang Shao @ 2024-10-08  6:19 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb
  Cc: cgroups, linux-kernel, Yafang Shao

sched_clock_irqtime may be disabled due to the clock source, in which case
IRQ time should not be accounted. Let's add a conditional check to avoid
unnecessary logic.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 020d58967d4e..49d9c75be0c8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1000,7 +1000,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	u64 now, irq;
 	s64 delta;
 
-	if (static_branch_likely(&psi_disabled))
+	if (static_branch_likely(&psi_disabled) ||
+	    !static_branch_likely(&sched_clock_irqtime))
 		return;
 
 	if (!curr->pid)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
  2024-10-08  6:19 [PATCH v2 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
                   ` (2 preceding siblings ...)
  2024-10-08  6:19 ` [PATCH v2 3/4] sched, psi: " Yafang Shao
@ 2024-10-08  6:19 ` Yafang Shao
  2024-10-08 14:23   ` Johannes Weiner
  3 siblings, 1 reply; 8+ messages in thread
From: Yafang Shao @ 2024-10-08  6:19 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, hannes, surenb
  Cc: cgroups, linux-kernel, Yafang Shao

After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
container environment, we observed several noticeable behavioral changes.

One of our IRQ-heavy services, such as Redis, reported a significant
reduction in CPU usage after upgrading to the new kernel with
CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
to handle an increased workload, the CPU usage could not be raised. In
other words, even though the container’s CPU usage appeared low, it was
unable to process more workloads to utilize additional CPU resources, which
caused issues.

This behavior can be demonstrated using netperf:

  function start_server() {
      for j in `seq 1 3`; do
          netserver -p $[12345+j] > /dev/null &
      done
  }

  server_ip=$1
  function start_client() {
    # That applies to cgroup2 as well.
    mkdir -p /sys/fs/cgroup/cpuacct/test
    echo $$ > /sys/fs/cgroup/cpuacct/test/cgroup.procs
    for j in `seq 1 3`; do
        port=$[12345+j]
        taskset -c 0 netperf -H ${server_ip} -l ${run_time:-30000}   \
                -t TCP_STREAM -p $port -- -D -m 1k -M 1K -s 8k -S 8k \
                > /dev/null &
    done
  }

  start_server
  start_client

We can verify the CPU usage of the test cgroup using cpuacct.stat. The
output shows:

  system: 53
  user: 2

The CPU usage of the cgroup is relatively low at around 55%, but this usage
doesn't increase, even with more netperf tasks. The reason is that CPU0 is
at 100% utilization, as confirmed by mpstat:

  02:56:22 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:23 PM    0    0.99    0.00   55.45    0.00    0.99   42.57    0.00    0.00    0.00    0.00

  02:56:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:24 PM    0    2.00    0.00   55.00    0.00    0.00   43.00    0.00    0.00    0.00    0.00

This behavior is unexpected. We should account for IRQ time to the cgroup
to reflect the pressure the group is under.

After a thorough analysis, I discovered that this change in behavior is due
to commit 305e6835e055 ("sched: Do not account irq time to current task"),
which altered whether IRQ time should be charged to the interrupted task.
While I agree that a task should not be penalized by random interrupts, the
task itself cannot progress while interrupted. Therefore, the interrupted
time should be reported to the user.

The system metric in cpuacct.stat is crucial in indicating whether a
container is under heavy system pressure, including IRQ/softirq activity.
Hence, IRQ/softirq time should be accounted for in the cpuacct system
usage, which also applies to cgroup2’s rstat.

This patch reintroduces IRQ/softirq accounting to cgroups.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/sched/core.c  | 39 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/psi.c   | 15 +++------------
 kernel/sched/stats.h |  7 ++++---
 3 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b633a14a60f..533e015f8777 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5587,7 +5587,24 @@ void sched_tick(void)
 	rq_lock(rq, &rf);
 
 	curr = rq->curr;
-	psi_account_irqtime(rq, curr, NULL);
+
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	if (static_branch_likely(&sched_clock_irqtime)) {
+		u64 now, irq;
+		s64 delta;
+
+		now = cpu_clock(cpu);
+		irq = irq_time_read(cpu);
+		delta = (s64)(irq - rq->psi_irq_time);
+		if (delta > 0) {
+			rq->psi_irq_time = irq;
+			psi_account_irqtime(rq, curr, NULL, now, delta);
+			cgroup_account_cputime(curr, delta);
+			/* We account both softirq and irq into softirq */
+			cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
+		}
+	}
+#endif
 
 	update_rq_clock(rq);
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
@@ -6667,7 +6684,25 @@ static void __sched notrace __schedule(int sched_mode)
 		++*switch_count;
 
 		migrate_disable_switch(rq, prev);
-		psi_account_irqtime(rq, prev, next);
+
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+		if (static_branch_likely(&sched_clock_irqtime)) {
+			u64 now, irq;
+			s64 delta;
+
+			now = cpu_clock(cpu);
+			irq = irq_time_read(cpu);
+			delta = (s64)(irq - rq->psi_irq_time);
+			if (delta > 0) {
+				rq->psi_irq_time = irq;
+				psi_account_irqtime(rq, prev, next, now, delta);
+				cgroup_account_cputime(prev, delta);
+				/* We account both softirq and irq into softirq */
+				cgroup_account_cputime_field(prev, CPUTIME_SOFTIRQ, delta);
+			}
+		}
+#endif
+
 		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
 
 		trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 49d9c75be0c8..ffa8aa372fbd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -992,16 +992,14 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev,
+			 u64 now, s64 delta)
 {
 	int cpu = task_cpu(curr);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	u64 now, irq;
-	s64 delta;
 
-	if (static_branch_likely(&psi_disabled) ||
-	    !static_branch_likely(&sched_clock_irqtime))
+	if (static_branch_likely(&psi_disabled))
 		return;
 
 	if (!curr->pid)
@@ -1012,13 +1010,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	if (prev && task_psi_group(prev) == group)
 		return;
 
-	now = cpu_clock(cpu);
-	irq = irq_time_read(cpu);
-	delta = (s64)(irq - rq->psi_irq_time);
-	if (delta < 0)
-		return;
-	rq->psi_irq_time = irq;
-
 	do {
 		if (!group->enabled)
 			continue;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 237780aa3c53..7c5979761021 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,10 +111,11 @@ void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev);
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
+			 struct task_struct *prev, u64 now, s64 delta);
 #else
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, u64 now, s64 delta) {}
 #endif /*CONFIG_IRQ_TIME_ACCOUNTING */
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -197,7 +198,7 @@ static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, u64 now, s64 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  2024-10-08  6:19 ` [PATCH v2 3/4] sched, psi: " Yafang Shao
@ 2024-10-08 14:04   ` Johannes Weiner
  0 siblings, 0 replies; 8+ messages in thread
From: Johannes Weiner @ 2024-10-08 14:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, surenb, cgroups,
	linux-kernel

On Tue, Oct 08, 2024 at 02:19:50PM +0800, Yafang Shao wrote:
> sched_clock_irqtime may be disabled due to the clock source, in which case
> IRQ time should not be accounted. Let's add a conditional check to avoid
> unnecessary logic.

Makes sense. When disabled, irq_time_read() won't change over time, so
there is nothing to account. We can save iterating the whole hierarchy
on every tick and context switch.

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
  2024-10-08  6:19 ` [PATCH v2 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2024-10-08 14:23   ` Johannes Weiner
  2024-10-09 11:40     ` Yafang Shao
  0 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2024-10-08 14:23 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, surenb, cgroups,
	linux-kernel

On Tue, Oct 08, 2024 at 02:19:51PM +0800, Yafang Shao wrote:
> @@ -5587,7 +5587,24 @@ void sched_tick(void)
>  	rq_lock(rq, &rf);
>  
>  	curr = rq->curr;
> -	psi_account_irqtime(rq, curr, NULL);
> +
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +	if (static_branch_likely(&sched_clock_irqtime)) {
> +		u64 now, irq;
> +		s64 delta;
> +
> +		now = cpu_clock(cpu);
> +		irq = irq_time_read(cpu);
> +		delta = (s64)(irq - rq->psi_irq_time);
> +		if (delta > 0) {
> +			rq->psi_irq_time = irq;
> +			psi_account_irqtime(rq, curr, NULL, now, delta);
> +			cgroup_account_cputime(curr, delta);
> +			/* We account both softirq and irq into softirq */
> +			cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
> +		}
> +	}
> +#endif
>  
>  	update_rq_clock(rq);
>  	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> @@ -6667,7 +6684,25 @@ static void __sched notrace __schedule(int sched_mode)
>  		++*switch_count;
>  
>  		migrate_disable_switch(rq, prev);
> -		psi_account_irqtime(rq, prev, next);
> +
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +		if (static_branch_likely(&sched_clock_irqtime)) {
> +			u64 now, irq;
> +			s64 delta;
> +
> +			now = cpu_clock(cpu);
> +			irq = irq_time_read(cpu);
> +			delta = (s64)(irq - rq->psi_irq_time);
> +			if (delta > 0) {
> +				rq->psi_irq_time = irq;
> +				psi_account_irqtime(rq, prev, next, now, delta);
> +				cgroup_account_cputime(prev, delta);
> +				/* We account both softirq and irq into softirq */
> +				cgroup_account_cputime_field(prev, CPUTIME_SOFTIRQ, delta);
> +			}
> +		}
> +#endif

This should be inside its own function - to avoid duplication of
course, but also the ifdefs and overly detailed accounting code in the
middle of core scheduling logic.

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static void account_irqtime(struct rq *rq, ...)
{
	...
}
#else
static inline void account_irqtime(...) {}
#endif

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
  2024-10-08 14:23   ` Johannes Weiner
@ 2024-10-09 11:40     ` Yafang Shao
  0 siblings, 0 replies; 8+ messages in thread
From: Yafang Shao @ 2024-10-09 11:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, surenb, cgroups,
	linux-kernel

On Tue, Oct 8, 2024 at 10:23 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 08, 2024 at 02:19:51PM +0800, Yafang Shao wrote:
> > @@ -5587,7 +5587,24 @@ void sched_tick(void)
> >       rq_lock(rq, &rf);
> >
> >       curr = rq->curr;
> > -     psi_account_irqtime(rq, curr, NULL);
> > +
> > +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> > +     if (static_branch_likely(&sched_clock_irqtime)) {
> > +             u64 now, irq;
> > +             s64 delta;
> > +
> > +             now = cpu_clock(cpu);
> > +             irq = irq_time_read(cpu);
> > +             delta = (s64)(irq - rq->psi_irq_time);
> > +             if (delta > 0) {
> > +                     rq->psi_irq_time = irq;
> > +                     psi_account_irqtime(rq, curr, NULL, now, delta);
> > +                     cgroup_account_cputime(curr, delta);
> > +                     /* We account both softirq and irq into softirq */
> > +                     cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
> > +             }
> > +     }
> > +#endif
> >
> >       update_rq_clock(rq);
> >       hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> > @@ -6667,7 +6684,25 @@ static void __sched notrace __schedule(int sched_mode)
> >               ++*switch_count;
> >
> >               migrate_disable_switch(rq, prev);
> > -             psi_account_irqtime(rq, prev, next);
> > +
> > +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> > +             if (static_branch_likely(&sched_clock_irqtime)) {
> > +                     u64 now, irq;
> > +                     s64 delta;
> > +
> > +                     now = cpu_clock(cpu);
> > +                     irq = irq_time_read(cpu);
> > +                     delta = (s64)(irq - rq->psi_irq_time);
> > +                     if (delta > 0) {
> > +                             rq->psi_irq_time = irq;
> > +                             psi_account_irqtime(rq, prev, next, now, delta);
> > +                             cgroup_account_cputime(prev, delta);
> > +                             /* We account both softirq and irq into softirq */
> > +                             cgroup_account_cputime_field(prev, CPUTIME_SOFTIRQ, delta);
> > +                     }
> > +             }
> > +#endif
>
> This should be inside its own function - to avoid duplication of
> course, but also the ifdefs and overly detailed accounting code in the
> middle of core scheduling logic.
>
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> static void account_irqtime(struct rq *rq, ...)
> {
>         ...
> }
> #else
> static inline void account_irqtime(...) {}
> #endif

Good suggestion. Will do it in the next version.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-10-09 11:40 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-08  6:19 [PATCH v2 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-10-08  6:19 ` [PATCH v2 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
2024-10-08  6:19 ` [PATCH v2 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
2024-10-08  6:19 ` [PATCH v2 3/4] sched, psi: " Yafang Shao
2024-10-08 14:04   ` Johannes Weiner
2024-10-08  6:19 ` [PATCH v2 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-10-08 14:23   ` Johannes Weiner
2024-10-09 11:40     ` Yafang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).