From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, linux-mm@kvack.org, bsegall@google.com,
dietmar.eggemann@arm.com, hannes@cmpxchg.org,
juri.lelli@redhat.com, kprateek.nayak@amd.com,
linux-kernel@vger.kernel.org, mgorman@suse.de, mingo@redhat.com,
peterz@infradead.org, rostedt@goodmis.org, surenb@google.com,
vincent.guittot@linaro.org, vschneid@redhat.com
Cc: shakeel.butt@linux.dev, riel@surriel.com, kernel-team@meta.com,
Usama Arif <usama.arif@linux.dev>
Subject: [PATCH 0/1] sched/psi: skip irqtime accounting when no new irq time has elapsed
Date: Wed, 17 Jun 2026 10:50:05 -0700
Message-ID: <20260617175219.2494857-1-usama.arif@linux.dev>
psi_account_irqtime() reads irq_time_read() into a per-rq cumulative
counter and only bails out when the delta vs. the previously accounted
amount is negative. A delta of exactly zero is treated as "do the
work": psi_write_begin() is taken, cpu_clock(cpu) is read (which on
x86 ends up in native_sched_clock() / rdtsc) and the cgroup ancestor
chain is walked to add zero to every group's PSI_IRQ_FULL bucket.

The zero-delta case is common in practice -- it fires every time a
context switch crosses a PSI group boundary on a CPU that hasn't
serviced an interrupt between the two switches.
To find out how often this actually fires in the wild, I attached a
bpftrace probe to psi_account_irqtime() on a 176-thread AMD EPYC 9D64
server running a compute-intensive workload.

The probe also reads the value irq_time_read(cpu) would return directly
from the per-CPU cpu_irqtime variable, so it can separate delta == 0
from delta < 0. The bpftrace script was generated by Claude and is at
the end of the cover letter.
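For readers unfamiliar with the idiom: the probe tells the three cases apart by subtracting two unsigned 64-bit counters and reinterpreting the result as signed, which is what the (int64) cast in the script does. A minimal userspace sketch of that classification follows; classify_delta() and the enum are hypothetical names used only for illustration, not kernel symbols.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum delta_class { CLASS_NEGATIVE, CLASS_ZERO, CLASS_POSITIVE };

/*
 * Classify the change between the current irq time and the amount
 * already accounted. The subtraction is done on unsigned values
 * (well-defined wraparound) and the result is then viewed as signed.
 * Assumption: the usual two's complement conversion, as on GCC/Clang.
 */
enum delta_class classify_delta(uint64_t irq_time, uint64_t prev_time)
{
	int64_t delta = (int64_t)(irq_time - prev_time);

	if (delta == 0)
		return CLASS_ZERO;
	return delta > 0 ? CLASS_POSITIVE : CLASS_NEGATIVE;
}
```

A CPU that serviced no interrupt between two switches yields irq_time == prev_time, which lands in CLASS_ZERO rather than CLASS_NEGATIVE.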
Over a 30 s window under steady-state load:
  @total             17,229,311  (100.0%)
  @ret_curr_swapper   7,864,195  ( 45.6%)  curr->pid == 0
  @ret_samegrp          323,299  (  1.9%)  same cgroup as prev
  @reached_delta      9,041,817  ( 52.5%)
    @delta_positive   6,358,192  ( 36.9%)  real work
    @delta_zero       2,683,625  ( 15.6%)  work wasted (this patch)
    @delta_negative           0  (  0.0%)  monotonic clock
15.6 % of all psi_account_irqtime() calls -- and 29.7 % of the calls
that get past the early returns -- hit the delta == 0 case. delta < 0
did not occur once in the 30 s window. That works out to ~89 k calls/sec
on this host that today take the full seqcount write + cpu_clock() +
cgroup-chain walk purely to add 0 to every group's PSI_IRQ_FULL counter.
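The percentages and the per-second rate can be re-derived from the raw counters. A small sketch, with the counter values copied from the 30 s run; the constant and helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Counters copied from the 30 s bpftrace run. */
const uint64_t CALLS_TOTAL         = 17229311;
const uint64_t CALLS_REACHED_DELTA = 9041817;
const uint64_t CALLS_DELTA_ZERO    = 2683625;

/* Share of `part` in `whole`, as a percentage. */
double percent_of(uint64_t part, uint64_t whole)
{
	assert(whole != 0);
	return 100.0 * (double)part / (double)whole;
}

/* Average call rate over the sampling window (integer division). */
uint64_t per_second(uint64_t calls, uint64_t window_seconds)
{
	assert(window_seconds != 0);
	return calls / window_seconds;
}
```

percent_of(CALLS_DELTA_ZERO, CALLS_TOTAL) comes out at roughly 15.6, percent_of(CALLS_DELTA_ZERO, CALLS_REACHED_DELTA) at roughly 29.7, and per_second(CALLS_DELTA_ZERO, 30) at 89454.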
Extend the early-return to also cover delta == 0. rq->psi_irq_time
does not need updating in that case (it would store the same value
back) and no PSI bucket would change. The existing behaviour for
delta > 0 is untouched.
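To illustrate the effect, here is a userspace model of the check. It is a sketch, not the kernel code: struct model_rq, MODEL_DEPTH and model_account_irqtime() are hypothetical stand-ins, the walks counter stands in for the psi_write_begin() + cpu_clock() + ancestor walk, and skip_zero selects between the current delta < 0 test and the proposed delta <= 0 test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_DEPTH 4	/* hypothetical length of the cgroup ancestor chain */

struct model_rq {
	uint64_t psi_irq_time;		/* irq time already accounted */
	uint64_t irq_full[MODEL_DEPTH];	/* stand-in for PSI_IRQ_FULL buckets */
	unsigned long walks;		/* times the expensive path ran */
};

/*
 * Returns true if the expensive path ran.
 * skip_zero == false models the current check (bail on delta < 0),
 * skip_zero == true models the proposed one (bail on delta <= 0).
 */
bool model_account_irqtime(struct model_rq *rq, uint64_t irq_now,
			   bool skip_zero)
{
	int64_t delta;

	assert(rq);
	delta = (int64_t)(irq_now - rq->psi_irq_time);

	if (delta < 0 || (skip_zero && delta == 0))
		return false;

	rq->psi_irq_time = irq_now;
	rq->walks++;
	for (int i = 0; i < MODEL_DEPTH; i++)
		rq->irq_full[i] += (uint64_t)delta;
	return true;
}
```

Feeding both variants the irq time sequence 100, 100, 250, 250, 250 leaves identical bucket totals and an identical psi_irq_time, but the skip_zero variant takes the expensive path 2 times instead of 5.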
--------- psi_delta_exact.bt -------
#!/usr/bin/env bpftrace
#include <linux/sched.h>

kprobe:psi_account_irqtime
{
	$rq = (struct rq *)arg0;
	$curr = (struct task_struct *)arg1;
	$prev = (struct task_struct *)arg2;

	@total = count();

	// Early return: the incoming task is the idle task.
	if ($curr->pid == 0) {
		@ret_curr_swapper = count();
		return;
	}

	// Early return: prev and curr belong to the same cgroup.
	if ($prev != 0) {
		$pg_curr = $curr->cgroups->dfl_cgrp;
		$pg_prev = $prev->cgroups->dfl_cgrp;
		if ($pg_curr == $pg_prev) {
			@ret_samegrp = count();
			return;
		}
	}

	@reached_delta = count();

	// Read this CPU's cpu_irqtime and compare it with rq->psi_irq_time.
	$pcpu_off = *(uint64 *)(kaddr("__per_cpu_offset") + cpu * 8);
	$irq_time = *(uint64 *)(kaddr("cpu_irqtime") + $pcpu_off);
	$prev_time = $rq->psi_irq_time;
	$delta = (int64)($irq_time - $prev_time);

	if ($delta == 0) {
		@delta_zero = count();
	} else if ($delta > 0) {
		@delta_positive = count();
	} else {
		@delta_negative = count();
	}
}

interval:s:30 { exit(); }
--------- end bpftrace ---------
Usama Arif (1):
  sched/psi: skip irqtime accounting when no new irq time has elapsed

 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--
2.53.0-Meta