* [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
@ 2024-11-01 3:17 Yafang Shao
2024-11-01 3:17 ` [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
` (4 more replies)
0 siblings, 5 replies; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 3:17 UTC (permalink / raw)
To: mingo, peterz
Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
Yafang Shao
After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
container environment, we encountered several user-visible behavioral
changes:
- Interrupted IRQ/softirq time is not accounted for in the cpuacct cgroup
This breaks userspace applications that rely on CPU usage data from
cgroups to monitor CPU pressure. This patchset resolves the issue by
ensuring that IRQ/softirq time is accounted for in the cgroup of the
interrupted tasks.
- getrusage(2) does not include time interrupted by IRQ/softirq
Some services use getrusage(2) to check if workloads are experiencing CPU
pressure. Since IRQ/softirq time is no longer charged to task runtime,
getrusage(2) can no longer reflect the CPU pressure caused by heavy
interrupts.
This patchset addresses the first issue, which is relatively
straightforward. However, the second issue remains unresolved, as there
might be debate over whether interrupted time should be considered part of
a task’s usage. Nonetheless, it is important to report interrupted time to
the user via some metric, though that is a separate discussion.
Changes:
v3->v4:
- Rebase
v2->v3:
- Add a helper account_irqtime() to avoid redundant code (Johannes)
v1->v2: https://lore.kernel.org/cgroups/20241008061951.3980-1-laoar.shao@gmail.com/
- Fix lockdep issues reported by kernel test robot <oliver.sang@intel.com>
v1: https://lore.kernel.org/all/20240923090028.16368-1-laoar.shao@gmail.com/
Yafang Shao (4):
sched: Define sched_clock_irqtime as static key
sched: Don't account irq time if sched_clock_irqtime is disabled
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
kernel/sched/core.c | 77 +++++++++++++++++++++++++++++-------------
kernel/sched/cputime.c | 16 ++++-----
kernel/sched/psi.c | 11 ++----
kernel/sched/sched.h | 1 +
kernel/sched/stats.h | 7 ++--
5 files changed, 68 insertions(+), 44 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key
2024-11-01 3:17 [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2024-11-01 3:17 ` Yafang Shao
2024-11-01 10:06 ` Peter Zijlstra
2024-11-01 3:17 ` [PATCH v4 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
` (3 subsequent siblings)
4 siblings, 1 reply; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 3:17 UTC (permalink / raw)
To: mingo, peterz
Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
Yafang Shao
Since CPU time accounting is a performance-critical path, let's define
sched_clock_irqtime as a static key to minimize potential overhead.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
kernel/sched/cputime.c | 16 +++++++---------
kernel/sched/sched.h | 1 +
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0bed0fa1acd9..d0b6ea737d04 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -7,6 +7,8 @@
#include <asm/cputime.h>
#endif
+DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);
+
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
/*
@@ -22,16 +24,14 @@
*/
DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
-static int sched_clock_irqtime;
-
void enable_sched_clock_irqtime(void)
{
- sched_clock_irqtime = 1;
+ static_branch_enable(&sched_clock_irqtime);
}
void disable_sched_clock_irqtime(void)
{
- sched_clock_irqtime = 0;
+ static_branch_disable(&sched_clock_irqtime);
}
static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
@@ -57,7 +57,7 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
s64 delta;
int cpu;
- if (!sched_clock_irqtime)
+ if (!static_branch_likely(&sched_clock_irqtime))
return;
cpu = smp_processor_id();
@@ -90,8 +90,6 @@ static u64 irqtime_tick_accounted(u64 maxtime)
#else /* CONFIG_IRQ_TIME_ACCOUNTING */
-#define sched_clock_irqtime (0)
-
static u64 irqtime_tick_accounted(u64 dummy)
{
return 0;
@@ -478,7 +476,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
if (vtime_accounting_enabled_this_cpu())
return;
- if (sched_clock_irqtime) {
+ if (static_branch_likely(&sched_clock_irqtime)) {
irqtime_account_process_tick(p, user_tick, 1);
return;
}
@@ -507,7 +505,7 @@ void account_idle_ticks(unsigned long ticks)
{
u64 cputime, steal;
- if (sched_clock_irqtime) {
+ if (static_branch_likely(&sched_clock_irqtime)) {
irqtime_account_idle_ticks(ticks);
return;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 081519ffab46..038ce65d6635 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3179,6 +3179,7 @@ struct irqtime {
};
DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
+DECLARE_STATIC_KEY_FALSE(sched_clock_irqtime);
/*
* Returns the irqtime minus the softirq time computed by ksoftirqd.
--
2.43.5
* [PATCH v4 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled
2024-11-01 3:17 [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-11-01 3:17 ` [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
@ 2024-11-01 3:17 ` Yafang Shao
2024-11-01 3:17 ` [PATCH v4 3/4] sched, psi: " Yafang Shao
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 3:17 UTC (permalink / raw)
To: mingo, peterz
Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
Yafang Shao
sched_clock_irqtime may be disabled due to the clock source, in which case
IRQ time should not be accounted. Let's add a conditional check to avoid
unnecessary logic.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
kernel/sched/core.c | 44 +++++++++++++++++++++++---------------------
1 file changed, 23 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dbfb5717d6af..06a06f0897c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -740,29 +740,31 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
s64 __maybe_unused steal = 0, irq_delta = 0;
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
- irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+ if (static_branch_likely(&sched_clock_irqtime)) {
+ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
- /*
- * Since irq_time is only updated on {soft,}irq_exit, we might run into
- * this case when a previous update_rq_clock() happened inside a
- * {soft,}IRQ region.
- *
- * When this happens, we stop ->clock_task and only update the
- * prev_irq_time stamp to account for the part that fit, so that a next
- * update will consume the rest. This ensures ->clock_task is
- * monotonic.
- *
- * It does however cause some slight miss-attribution of {soft,}IRQ
- * time, a more accurate solution would be to update the irq_time using
- * the current rq->clock timestamp, except that would require using
- * atomic ops.
- */
- if (irq_delta > delta)
- irq_delta = delta;
+ /*
+ * Since irq_time is only updated on {soft,}irq_exit, we might run into
+ * this case when a previous update_rq_clock() happened inside a
+ * {soft,}IRQ region.
+ *
+ * When this happens, we stop ->clock_task and only update the
+ * prev_irq_time stamp to account for the part that fit, so that a next
+ * update will consume the rest. This ensures ->clock_task is
+ * monotonic.
+ *
+ * It does however cause some slight miss-attribution of {soft,}IRQ
+ * time, a more accurate solution would be to update the irq_time using
+ * the current rq->clock timestamp, except that would require using
+ * atomic ops.
+ */
+ if (irq_delta > delta)
+ irq_delta = delta;
- rq->prev_irq_time += irq_delta;
- delta -= irq_delta;
- delayacct_irq(rq->curr, irq_delta);
+ rq->prev_irq_time += irq_delta;
+ delta -= irq_delta;
+ delayacct_irq(rq->curr, irq_delta);
+ }
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
--
2.43.5
* [PATCH v4 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
2024-11-01 3:17 [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-11-01 3:17 ` [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
2024-11-01 3:17 ` [PATCH v4 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
@ 2024-11-01 3:17 ` Yafang Shao
2024-11-01 3:17 ` [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-11-01 10:54 ` [PATCH v4 0/4] sched: Fix " Peter Zijlstra
4 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 3:17 UTC (permalink / raw)
To: mingo, peterz
Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
Yafang Shao
sched_clock_irqtime may be disabled due to the clock source. When disabled,
irq_time_read() won't change over time, so there is nothing to account. We
can save iterating the whole hierarchy on every tick and context switch.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/psi.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..4d26a106f03b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -998,7 +998,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
s64 delta;
u64 irq;
- if (static_branch_likely(&psi_disabled))
+ if (static_branch_likely(&psi_disabled) ||
+ !static_branch_likely(&sched_clock_irqtime))
return;
if (!curr->pid)
--
2.43.5
* [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
2024-11-01 3:17 [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
` (2 preceding siblings ...)
2024-11-01 3:17 ` [PATCH v4 3/4] sched, psi: " Yafang Shao
@ 2024-11-01 3:17 ` Yafang Shao
2024-11-01 10:28 ` Peter Zijlstra
2024-11-01 10:54 ` [PATCH v4 0/4] sched: Fix " Peter Zijlstra
4 siblings, 1 reply; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 3:17 UTC (permalink / raw)
To: mingo, peterz
Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, hannes, surenb, cgroups, linux-kernel,
Yafang Shao
After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
container environment, we observed several noticeable behavioral changes.
One of our IRQ-heavy services, a Redis instance, reported a significant
reduction in CPU usage after upgrading to the new kernel with
CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
to handle an increased workload, its CPU usage could not be raised. In
other words, even though the container’s CPU usage appeared low, it was
unable to take on more work and use the remaining CPU capacity, which
caused issues.
This behavior can be demonstrated using netperf:
function start_server() {
	for j in `seq 1 3`; do
		netserver -p $[12345+j] > /dev/null &
	done
}

server_ip=$1
function start_client() {
	# That applies to cgroup2 as well.
	mkdir -p /sys/fs/cgroup/cpuacct/test
	echo $$ > /sys/fs/cgroup/cpuacct/test/cgroup.procs
	for j in `seq 1 3`; do
		port=$[12345+j]
		taskset -c 0 netperf -H ${server_ip} -l ${run_time:-30000} \
			-t TCP_STREAM -p $port -- -D -m 1k -M 1K -s 8k -S 8k \
			> /dev/null &
	done
}

start_server
start_client
We can verify the CPU usage of the test cgroup using cpuacct.stat. The
output shows:
system: 53
user: 2
The CPU usage of the cgroup is relatively low at around 55%, but this usage
doesn't increase, even with more netperf tasks. The reason is that CPU0 is
at 100% utilization, as confirmed by mpstat:
02:56:22 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:56:23 PM 0 0.99 0.00 55.45 0.00 0.99 42.57 0.00 0.00 0.00 0.00
02:56:23 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:56:24 PM 0 2.00 0.00 55.00 0.00 0.00 43.00 0.00 0.00 0.00 0.00
It is clear that %soft time is not accounted to the cgroup of the
interrupted task. This behavior is unexpected. We should account IRQ
time to the cgroup to reflect the pressure the group is under.
After a thorough analysis, I discovered that this change in behavior is due
to commit 305e6835e055 ("sched: Do not account irq time to current task"),
which altered whether IRQ time should be charged to the interrupted task.
While I agree that a task should not be penalized by random interrupts, the
task itself cannot progress while interrupted. Therefore, the interrupted
time should be reported to the user.
The system metric in cpuacct.stat is crucial in indicating whether a
container is under heavy system pressure, including IRQ/softirq activity.
Hence, IRQ/softirq time should be accounted for in the cpuacct system
usage, which also applies to cgroup2’s rstat.
This patch reintroduces IRQ/softirq accounting to cgroups.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++--
kernel/sched/psi.c | 14 +++-----------
kernel/sched/stats.h | 7 ++++---
3 files changed, 38 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06a06f0897c3..5ed2c5c8c911 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
#endif /* CONFIG_SCHED_DEBUG */
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static void account_irqtime(struct rq *rq, struct task_struct *curr,
+ struct task_struct *prev)
+{
+ int cpu = smp_processor_id();
+ s64 delta;
+ u64 irq;
+
+ if (!static_branch_likely(&sched_clock_irqtime))
+ return;
+
+ irq = irq_time_read(cpu);
+ delta = (s64)(irq - rq->psi_irq_time);
+ if (delta < 0)
+ return;
+
+ rq->psi_irq_time = irq;
+ psi_account_irqtime(rq, curr, prev, delta);
+ cgroup_account_cputime(curr, delta);
+ /* We account both softirq and irq into softirq */
+ cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
+}
+#else
+static inline void account_irqtime(struct rq *rq, struct task_struct *curr,
+ struct task_struct *prev)
+{
+}
+#endif
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -5600,7 +5629,7 @@ void sched_tick(void)
rq_lock(rq, &rf);
curr = rq->curr;
- psi_account_irqtime(rq, curr, NULL);
+ account_irqtime(rq, curr, NULL);
update_rq_clock(rq);
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
@@ -6683,7 +6712,7 @@ static void __sched notrace __schedule(int sched_mode)
++*switch_count;
migrate_disable_switch(rq, prev);
- psi_account_irqtime(rq, prev, next);
+ account_irqtime(rq, prev, next);
psi_sched_switch(prev, next, block);
trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4d26a106f03b..1adb41b2ae1d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -990,16 +990,14 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev,
+ s64 delta)
{
int cpu = task_cpu(curr);
struct psi_group *group;
struct psi_group_cpu *groupc;
- s64 delta;
- u64 irq;
- if (static_branch_likely(&psi_disabled) ||
- !static_branch_likely(&sched_clock_irqtime))
+ if (static_branch_likely(&psi_disabled))
return;
if (!curr->pid)
@@ -1010,12 +1008,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
if (prev && task_psi_group(prev) == group)
return;
- irq = irq_time_read(cpu);
- delta = (s64)(irq - rq->psi_irq_time);
- if (delta < 0)
- return;
- rq->psi_irq_time = irq;
-
do {
u64 now;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 767e098a3bd1..17eefe5876a5 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,10 +111,11 @@ void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev);
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
+ struct task_struct *prev, s64 delta);
#else
static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
- struct task_struct *prev) {}
+ struct task_struct *prev, s64 delta) {}
#endif /*CONFIG_IRQ_TIME_ACCOUNTING */
/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -215,7 +216,7 @@ static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
- struct task_struct *prev) {}
+ struct task_struct *prev, s64 delta) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
--
2.43.5
* Re: [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key
2024-11-01 3:17 ` [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
@ 2024-11-01 10:06 ` Peter Zijlstra
2024-11-01 11:55 ` Yafang Shao
0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2024-11-01 10:06 UTC (permalink / raw)
To: Yafang Shao
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, hannes, surenb, cgroups, linux-kernel
On Fri, Nov 01, 2024 at 11:17:47AM +0800, Yafang Shao wrote:
> Since CPU time accounting is a performance-critical path, let's define
> sched_clock_irqtime as a static key to minimize potential overhead.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> kernel/sched/cputime.c | 16 +++++++---------
> kernel/sched/sched.h | 1 +
> 2 files changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 0bed0fa1acd9..d0b6ea737d04 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -7,6 +7,8 @@
> #include <asm/cputime.h>
> #endif
>
> +DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);
> +
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>
> /*
> @@ -22,16 +24,14 @@
> */
> DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
>
> -static int sched_clock_irqtime;
> -
> void enable_sched_clock_irqtime(void)
> {
> - sched_clock_irqtime = 1;
> + static_branch_enable(&sched_clock_irqtime);
> }
>
> void disable_sched_clock_irqtime(void)
> {
> - sched_clock_irqtime = 0;
> + static_branch_disable(&sched_clock_irqtime);
> }
>
> static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
> @@ -57,7 +57,7 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
> s64 delta;
> int cpu;
>
> - if (!sched_clock_irqtime)
> + if (!static_branch_likely(&sched_clock_irqtime))
> return;
>
> cpu = smp_processor_id();
> @@ -90,8 +90,6 @@ static u64 irqtime_tick_accounted(u64 maxtime)
>
> #else /* CONFIG_IRQ_TIME_ACCOUNTING */
>
> -#define sched_clock_irqtime (0)
> -
> static u64 irqtime_tick_accounted(u64 dummy)
> {
> return 0;
This makes no sense... in the IRQ_TIME_ACCOUNTING=n case you shouldn't
be using the static key.
> @@ -478,7 +476,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
> if (vtime_accounting_enabled_this_cpu())
> return;
>
> - if (sched_clock_irqtime) {
> + if (static_branch_likely(&sched_clock_irqtime)) {
> irqtime_account_process_tick(p, user_tick, 1);
> return;
> }
> @@ -507,7 +505,7 @@ void account_idle_ticks(unsigned long ticks)
> {
> u64 cputime, steal;
>
> - if (sched_clock_irqtime) {
> + if (static_branch_likely(&sched_clock_irqtime)) {
> irqtime_account_idle_ticks(ticks);
> return;
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 081519ffab46..038ce65d6635 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3179,6 +3179,7 @@ struct irqtime {
> };
>
> DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
> +DECLARE_STATIC_KEY_FALSE(sched_clock_irqtime);
>
> /*
> * Returns the irqtime minus the softirq time computed by ksoftirqd.
> --
> 2.43.5
>
* Re: [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
2024-11-01 3:17 ` [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2024-11-01 10:28 ` Peter Zijlstra
2024-11-01 12:04 ` Yafang Shao
0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2024-11-01 10:28 UTC (permalink / raw)
To: Yafang Shao
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, hannes, surenb, cgroups, linux-kernel
On Fri, Nov 01, 2024 at 11:17:50AM +0800, Yafang Shao wrote:
> After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
> container environment, we observed several noticeable behavioral changes.
>
> One of our IRQ-heavy services, a Redis instance, reported a significant
> reduction in CPU usage after upgrading to the new kernel with
> CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
> to handle an increased workload, its CPU usage could not be raised. In
> other words, even though the container’s CPU usage appeared low, it was
> unable to take on more work and use the remaining CPU capacity, which
> caused issues.
> We can verify the CPU usage of the test cgroup using cpuacct.stat. The
> output shows:
>
> system: 53
> user: 2
>
> The CPU usage of the cgroup is relatively low at around 55%, but this usage
> doesn't increase, even with more netperf tasks. The reason is that CPU0 is
> at 100% utilization, as confirmed by mpstat:
>
> 02:56:22 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 02:56:23 PM 0 0.99 0.00 55.45 0.00 0.99 42.57 0.00 0.00 0.00 0.00
>
> 02:56:23 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 02:56:24 PM 0 2.00 0.00 55.00 0.00 0.00 43.00 0.00 0.00 0.00 0.00
>
> It is clear that %soft time is not accounted to the cgroup of the
> interrupted task. This behavior is unexpected. We should account IRQ
> time to the cgroup to reflect the pressure the group is under.
>
> After a thorough analysis, I discovered that this change in behavior is due
> to commit 305e6835e055 ("sched: Do not account irq time to current task"),
> which altered whether IRQ time should be charged to the interrupted task.
> While I agree that a task should not be penalized by random interrupts, the
> task itself cannot progress while interrupted. Therefore, the interrupted
> time should be reported to the user.
>
> The system metric in cpuacct.stat is crucial in indicating whether a
> container is under heavy system pressure, including IRQ/softirq activity.
> Hence, IRQ/softirq time should be accounted for in the cpuacct system
> usage, which also applies to cgroup2’s rstat.
>
> This patch reintroduces IRQ/softirq accounting to cgroups.
How !? what does it actually do?
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
> kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++--
> kernel/sched/psi.c | 14 +++-----------
> kernel/sched/stats.h | 7 ++++---
> 3 files changed, 38 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 06a06f0897c3..5ed2c5c8c911 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
> static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
> #endif /* CONFIG_SCHED_DEBUG */
>
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +static void account_irqtime(struct rq *rq, struct task_struct *curr,
> + struct task_struct *prev)
> +{
> + int cpu = smp_processor_id();
> + s64 delta;
> + u64 irq;
> +
> + if (!static_branch_likely(&sched_clock_irqtime))
> + return;
> +
> + irq = irq_time_read(cpu);
> + delta = (s64)(irq - rq->psi_irq_time);
At this point the variable is no longer exclusive to PSI and should
probably be renamed.
> + if (delta < 0)
> + return;
> +
> + rq->psi_irq_time = irq;
> + psi_account_irqtime(rq, curr, prev, delta);
> + cgroup_account_cputime(curr, delta);
> + /* We account both softirq and irq into softirq */
> + cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
This seems wrong.. we have CPUTIME_IRQ.
> +}
In fact, much of this seems like it's going about things sideways.
Why can't you just add the cgroup_account_*() garbage to
irqtime_account_irq()? That is where it's still split out into softirq
and irq.
But the much bigger question is -- how can you be sure that this
interrupt is in fact for the cgroup you're attributing it to? Could be
for an entirely different cgroup.
* Re: [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
2024-11-01 3:17 [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
` (3 preceding siblings ...)
2024-11-01 3:17 ` [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
@ 2024-11-01 10:54 ` Peter Zijlstra
2024-11-01 11:54 ` Yafang Shao
4 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2024-11-01 10:54 UTC (permalink / raw)
To: Yafang Shao
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, hannes, surenb, cgroups, linux-kernel
On Fri, Nov 01, 2024 at 11:17:46AM +0800, Yafang Shao wrote:
> After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> container environment, we encountered several user-visible behavioral
> changes:
>
> - Interrupted IRQ/softirq time is not accounted for in the cpuacct cgroup
>
> This breaks userspace applications that rely on CPU usage data from
> cgroups to monitor CPU pressure. This patchset resolves the issue by
> ensuring that IRQ/softirq time is accounted for in the cgroup of the
> interrupted tasks.
>
> - getrusage(2) does not include time interrupted by IRQ/softirq
>
> Some services use getrusage(2) to check if workloads are experiencing CPU
> pressure. Since IRQ/softirq time is no longer charged to task runtime,
> getrusage(2) can no longer reflect the CPU pressure caused by heavy
> interrupts.
>
> This patchset addresses the first issue, which is relatively
> straightforward.
So I don't think it is. I think they're both the same issue. You cannot
know for whom the work done by the (soft) interrupt is.
For instance, if you were to create 2 cgroups, and have one cgroup do a
while(1) loop, while you'd have that other cgroup do your netperf
workload, I suspect you'll see significant (soft)irq load on the
while(1) cgroup, even though it's guaranteed to not be from it.
Same with rusage -- rusage is fully task centric, and the work done by
(soft) irqs are not necessarily related to the task they interrupt.
So while you're trying to make the world conform to your legacy
monitoring view, perhaps you should fix your view of things.
* Re: [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
2024-11-01 10:54 ` [PATCH v4 0/4] sched: Fix " Peter Zijlstra
@ 2024-11-01 11:54 ` Yafang Shao
0 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 11:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, hannes, surenb, cgroups, linux-kernel
On Fri, Nov 1, 2024 at 6:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 01, 2024 at 11:17:46AM +0800, Yafang Shao wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> > container environment, we encountered several user-visible behavioral
> > changes:
> >
> > - Interrupted IRQ/softirq time is not accounted for in the cpuacct cgroup
> >
> > This breaks userspace applications that rely on CPU usage data from
> > cgroups to monitor CPU pressure. This patchset resolves the issue by
> > ensuring that IRQ/softirq time is accounted for in the cgroup of the
> > interrupted tasks.
> >
> > - getrusage(2) does not include time interrupted by IRQ/softirq
> >
> > Some services use getrusage(2) to check if workloads are experiencing CPU
> > pressure. Since IRQ/softirq time is no longer charged to task runtime,
> > getrusage(2) can no longer reflect the CPU pressure caused by heavy
> > interrupts.
> >
> > This patchset addresses the first issue, which is relatively
> > straightforward.
>
> So I don't think it is. I think they're both the same issue. You cannot
> know for whom the work done by the (soft) interrupt is.
>
> For instance, if you were to create 2 cgroups, and have one cgroup do a
> while(1) loop, while you'd have that other cgroup do your netperf
> workload, I suspect you'll see significant (soft)irq load on the
> while(1) cgroup, even though it's guaranteed to not be from it.
>
> Same with rusage -- rusage is fully task centric, and the work done by
> (soft) irqs are not necessarily related to the task they interrupt.
>
>
> So while you're trying to make the world conform to your legacy
> monitoring view, perhaps you should fix your view of things.
The issue here can't simply be addressed by adjusting my view of
things. Enabling CONFIG_IRQ_TIME_ACCOUNTING results in the CPU
utilization metric excluding the time spent in IRQs. This means we
lose visibility into how long the CPU was actually interrupted in
comparison to its total utilization. Currently, the only ways to
monitor interrupt time are through IRQ PSI or the IRQ time recorded in
delay accounting. However, these metrics are independent of CPU
utilization, which makes it difficult to combine them into a single,
unified measure.
CPU utilization is a critical metric for almost all workloads, and
it's problematic if it fails to reflect the full extent of system
pressure. This situation is similar to iowait: when a task is in
iowait, it could be due to other tasks performing I/O. It doesn’t
matter if the I/O is being done by one of your tasks or by someone
else's; what matters is that your task is stalled and waiting on I/O.
Similarly, a comprehensive CPU utilization metric should reflect all
sources of pressure, including IRQ time, to provide a more accurate
representation of workload behavior.
--
Regards
Yafang
* Re: [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key
2024-11-01 10:06 ` Peter Zijlstra
@ 2024-11-01 11:55 ` Yafang Shao
0 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 11:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, hannes, surenb, cgroups, linux-kernel
On Fri, Nov 1, 2024 at 6:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 01, 2024 at 11:17:47AM +0800, Yafang Shao wrote:
> > Since CPU time accounting is a performance-critical path, let's define
> > sched_clock_irqtime as a static key to minimize potential overhead.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> > kernel/sched/cputime.c | 16 +++++++---------
> > kernel/sched/sched.h | 1 +
> > 2 files changed, 8 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 0bed0fa1acd9..d0b6ea737d04 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -7,6 +7,8 @@
> > #include <asm/cputime.h>
> > #endif
> >
> > +DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);
> > +
> > #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> >
> > /*
> > @@ -22,16 +24,14 @@
> > */
> > DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
> >
> > -static int sched_clock_irqtime;
> > -
> > void enable_sched_clock_irqtime(void)
> > {
> > - sched_clock_irqtime = 1;
> > + static_branch_enable(&sched_clock_irqtime);
> > }
> >
> > void disable_sched_clock_irqtime(void)
> > {
> > - sched_clock_irqtime = 0;
> > + static_branch_disable(&sched_clock_irqtime);
> > }
> >
> > static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
> > @@ -57,7 +57,7 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
> > s64 delta;
> > int cpu;
> >
> > - if (!sched_clock_irqtime)
> > + if (!static_branch_likely(&sched_clock_irqtime))
> > return;
> >
> > cpu = smp_processor_id();
> > @@ -90,8 +90,6 @@ static u64 irqtime_tick_accounted(u64 maxtime)
> >
> > #else /* CONFIG_IRQ_TIME_ACCOUNTING */
> >
> > -#define sched_clock_irqtime (0)
> > -
> > static u64 irqtime_tick_accounted(u64 dummy)
> > {
> > return 0;
>
> This makes no sense... in the IRQ_TIME_ACCOUNTING=n case you shouldn't
> be using the static key.
Will change it.
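For reference, a minimal userspace sketch of the restructuring Peter is asking for — this is illustrative only, with a plain bool standing in for the static key and `IRQ_TIME_ACCOUNTING` standing in for the Kconfig symbol; the names and exact arrangement in v5 may differ. The point is that the key only exists when the feature is compiled in, and the `=n` case folds to a constant-false check the compiler can eliminate:

```c
#include <assert.h>
#include <stdbool.h>

#define IRQ_TIME_ACCOUNTING 1	/* comment out to model CONFIG_..=n */

#ifdef IRQ_TIME_ACCOUNTING

static bool sched_clock_irqtime;	/* stands in for the static key */

static void enable_sched_clock_irqtime(void)  { sched_clock_irqtime = true; }
static void disable_sched_clock_irqtime(void) { sched_clock_irqtime = false; }
static bool irqtime_enabled(void) { return sched_clock_irqtime; }

#else /* !IRQ_TIME_ACCOUNTING */

/* Feature compiled out: constant false, callers fold to nothing. */
static bool irqtime_enabled(void) { return false; }

#endif

/* Hot path: pays for the check only when the feature is compiled in. */
static int irqtime_account_irq(long delta, long *total)
{
	if (!irqtime_enabled())
		return 0;
	*total += delta;
	return 1;
}
```

With the feature compiled out there is no key to patch and no enable/disable helpers; callers of the check compile to nothing, which is what the old `#define sched_clock_irqtime (0)` achieved.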
--
Regards
Yafang
* Re: [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
2024-11-01 10:28 ` Peter Zijlstra
@ 2024-11-01 12:04 ` Yafang Shao
0 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2024-11-01 12:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, hannes, surenb, cgroups, linux-kernel
On Fri, Nov 1, 2024 at 6:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 01, 2024 at 11:17:50AM +0800, Yafang Shao wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
> > container environment, we observed several noticeable behavioral changes.
> >
> > One of our IRQ-heavy services, such as Redis, reported a significant
> > reduction in CPU usage after upgrading to the new kernel with
> > CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
> > to handle an increased workload, the CPU usage could not be raised. In
> > other words, even though the container’s CPU usage appeared low, it was
> > unable to process more workloads to utilize additional CPU resources, which
> > caused issues.
>
> > We can verify the CPU usage of the test cgroup using cpuacct.stat. The
> > output shows:
> >
> > system: 53
> > user: 2
> >
> > The CPU usage of the cgroup is relatively low at around 55%, but this usage
> > doesn't increase, even with more netperf tasks. The reason is that CPU0 is
> > at 100% utilization, as confirmed by mpstat:
> >
> > 02:56:22 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > 02:56:23 PM 0 0.99 0.00 55.45 0.00 0.99 42.57 0.00 0.00 0.00 0.00
> >
> > 02:56:23 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > 02:56:24 PM 0 2.00 0.00 55.00 0.00 0.00 43.00 0.00 0.00 0.00 0.00
> >
> > It is clear that the %soft is not accounted into the cgroup of the
> > interrupted task. This behavior is unexpected. We should account for IRQ
> > time to the cgroup to reflect the pressure the group is under.
> >
> > After a thorough analysis, I discovered that this change in behavior is due
> > to commit 305e6835e055 ("sched: Do not account irq time to current task"),
> > which altered whether IRQ time should be charged to the interrupted task.
> > While I agree that a task should not be penalized by random interrupts, the
> > task itself cannot progress while interrupted. Therefore, the interrupted
> > time should be reported to the user.
> >
> > The system metric in cpuacct.stat is crucial in indicating whether a
> > container is under heavy system pressure, including IRQ/softirq activity.
> > Hence, IRQ/softirq time should be accounted for in the cpuacct system
> > usage, which also applies to cgroup2’s rstat.
> >
> > This patch reintroduces IRQ/softirq accounting to cgroups.
>
> How !? what does it actually do?
It seems there's some misunderstanding due to the term *accounting*
here. What it actually does is track the interrupted time within a
cgroup.
>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++--
> > kernel/sched/psi.c | 14 +++-----------
> > kernel/sched/stats.h | 7 ++++---
> > 3 files changed, 38 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 06a06f0897c3..5ed2c5c8c911 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
> > static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
> > #endif /* CONFIG_SCHED_DEBUG */
> >
> > +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> > +static void account_irqtime(struct rq *rq, struct task_struct *curr,
> > + struct task_struct *prev)
> > +{
> > + int cpu = smp_processor_id();
> > + s64 delta;
> > + u64 irq;
> > +
> > + if (!static_branch_likely(&sched_clock_irqtime))
> > + return;
> > +
> > + irq = irq_time_read(cpu);
> > + delta = (s64)(irq - rq->psi_irq_time);
>
> At this point the variable is no longer exclusive to PSI and should
> probably be renamed.
OK.
>
> > + if (delta < 0)
> > + return;
> > +
> > + rq->psi_irq_time = irq;
> > + psi_account_irqtime(rq, curr, prev, delta);
> > + cgroup_account_cputime(curr, delta);
> > + /* We account both softirq and irq into softirq */
> > + cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
>
> This seems wrong.. we have CPUTIME_IRQ.
OK.
>
> > +}
>
> In fact, much of this seems like it's going about things sideways.
>
> Why can't you just add the cgroup_account_*() garbage to
> irqtime_account_irq()? That is were it's still split out into softirq
> and irq.
I previously implemented this in v1: link. However, in that version,
we had to hold the irq_lock within the critical path, which could
impact performance. Taking inspiration from commit ddae0ca2a8fe
("sched: Move psi_account_irqtime() out of update_rq_clock_task()
hotpath"), I've now adapted the approach to handle it in a
non-critical path, reducing the performance impact.
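The pattern borrowed from ddae0ca2a8fe can be sketched as follows — irqtime accumulates into a monotonic per-CPU counter in the irq path, and at context switch only the increment since the last snapshot is charged. This is a simplified userspace model; the struct and field names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

struct rq_sketch {
	uint64_t irq_time_snap;	/* counter value we last accounted up to */
	uint64_t charged;	/* total attributed so far */
};

/* @irq_now: current value of the monotonic per-CPU irqtime counter. */
static void account_irqtime(struct rq_sketch *rq, uint64_t irq_now)
{
	int64_t delta = (int64_t)(irq_now - rq->irq_time_snap);

	/*
	 * A negative delta means we read an inconsistent snapshot while
	 * racing with a concurrent update; skip this round, the time is
	 * picked up at the next context switch.
	 */
	if (delta < 0)
		return;

	rq->irq_time_snap = irq_now;
	rq->charged += (uint64_t)delta;	/* psi/cgroup charging in the kernel */
}
```

Because the snapshot only advances when a positive delta is consumed, no irq time is lost or double-counted across context switches, and the irq hotpath itself never takes the accounting locks.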
>
> But the much bigger question is -- how can you be sure that this
> interrupt is in fact for the cgroup you're attributing it to? Could be
> for an entirely different cgroup.
As I explained in another thread, identifying the exact culprit can be
challenging, but identifying the victim is straightforward: charging
the interrupted time to the victim's cgroup is precisely what this
patch set does.
--
Regards
Yafang
end of thread, other threads:[~2024-11-01 12:04 UTC | newest]
Thread overview: 11+ messages
2024-11-01 3:17 [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-11-01 3:17 ` [PATCH v4 1/4] sched: Define sched_clock_irqtime as static key Yafang Shao
2024-11-01 10:06 ` Peter Zijlstra
2024-11-01 11:55 ` Yafang Shao
2024-11-01 3:17 ` [PATCH v4 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled Yafang Shao
2024-11-01 3:17 ` [PATCH v4 3/4] sched, psi: " Yafang Shao
2024-11-01 3:17 ` [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING Yafang Shao
2024-11-01 10:28 ` Peter Zijlstra
2024-11-01 12:04 ` Yafang Shao
2024-11-01 10:54 ` [PATCH v4 0/4] sched: Fix " Peter Zijlstra
2024-11-01 11:54 ` Yafang Shao