From: Johannes Weiner <hannes@cmpxchg.org>
To: Usama Arif <usama.arif@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, linux-mm@kvack.org, bsegall@google.com,
dietmar.eggemann@arm.com, juri.lelli@redhat.com,
kprateek.nayak@amd.com, linux-kernel@vger.kernel.org,
mgorman@suse.de, mingo@redhat.com, peterz@infradead.org,
rostedt@goodmis.org, surenb@google.com,
vincent.guittot@linaro.org, vschneid@redhat.com,
shakeel.butt@linux.dev, riel@surriel.com, kernel-team@meta.com,
Chengming Zhou <chengming.zhou@linux.dev>
Subject: Re: [PATCH 1/1] sched/psi: skip irqtime accounting when no new irq time has elapsed
Date: Thu, 18 Jun 2026 12:20:55 -0400
Message-ID: <ajQa56F2aOtfVzlo@cmpxchg.org>
In-Reply-To: <20260617175219.2494857-2-usama.arif@linux.dev>

On Wed, Jun 17, 2026 at 10:50:06AM -0700, Usama Arif wrote:
> psi_account_irqtime() reads irq_time_read() into a per-rq cumulative
> counter and only bails out when the delta vs. the previously accounted
> amount is negative. A delta of exactly zero is treated as "do the
> work": psi_write_begin() is taken, cpu_clock(cpu) is read (which on
> x86 ends up in native_sched_clock() / rdtsc) and the cgroup ancestor
> chain is walked to add zero to every group's PSI_IRQ_FULL bucket.
>
> The zero-delta case is common in practice -- it fires every time a
> context switch crosses a PSI group boundary on a CPU that hasn't
> serviced an interrupt between the two switches.
>
> Measured on a 176-thread AMD EPYC 9D64 server running a compute
> intensive production workload, instrumented with bpftrace over a 30s
> window (irq_time_read() read directly from the per-CPU cpu_irqtime so
> that delta == 0 and delta < 0 could be separated):
>
>   @total                17,229,311  (100.0%)
>     @ret_curr_swapper    7,864,195  ( 45.6%)  curr->pid == 0
>     @ret_samegrp           323,299  (  1.9%)  same cgroup as prev
>     @reached_delta       9,041,817  ( 52.5%)
>       @delta_positive    6,358,192  ( 36.9%)  real work
>       @delta_zero        2,683,625  ( 15.6%)  work wasted (this patch)
>       @delta_negative          (0)  (  0.0%)  monotonic clock
>
> So 15.6 % of all psi_account_irqtime() calls - and 29.7 % of the
> calls that get past the early returns - hit the delta == 0 case;
> delta < 0 did not occur once in the 30 s window. Under the current
> code each of those ~89 k calls per second performs the full seqcount
> write + cpu_clock() read + cgroup-chain walk just to add 0 to every
> group's PSI_IRQ_FULL counter.
>
> Extend the early-return to also cover delta == 0. rq->psi_irq_time
> does not need updating in that case (it would store the same value
> back) and no PSI bucket would change. The existing behaviour for
> delta > 0 is untouched.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

CCing Chengming as well, quote untrimmed.

> ---
> kernel/sched/psi.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index d9c9d9480a45..848955f8893d 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1023,7 +1023,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
>
>  	irq = irq_time_read(cpu);
>  	delta = (s64)(irq - rq->psi_irq_time);
> -	if (delta < 0)
> +	if (delta <= 0)
>  		return;
>  	rq->psi_irq_time = irq;
>
> --
> 2.53.0-Meta
>
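
For readers following along, the effect of the one-character change can be sketched in userspace. This is a minimal model and not the kernel code: struct toy_rq, its walks counter, and toy_account_irqtime() are hypothetical names made up for illustration, and the counter merely stands in for the seqcount write, the cpu_clock() read and the cgroup-chain walk described in the changelog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rq; only what the sketch needs. */
struct toy_rq {
	uint64_t psi_irq_time;	/* irq time already accounted, in ns */
	unsigned long walks;	/* times the expensive path would have run */
};

/*
 * Model of the patched early return. irq_now plays the role of the
 * value irq_time_read(cpu) would return. Returns true when the
 * accounting path runs, false when the call bails out early.
 */
static bool toy_account_irqtime(struct toy_rq *rq, uint64_t irq_now)
{
	int64_t delta = (int64_t)(irq_now - rq->psi_irq_time);

	/* With the patch, a zero delta bails out like a negative one. */
	if (delta <= 0)
		return false;

	rq->psi_irq_time = irq_now;
	rq->walks++;		/* placeholder for the real accounting work */
	return true;
}
```

In this model, calling the function twice with the same irq_now runs the placeholder work once, and the stored psi_irq_time is the same either way, which matches the changelog's argument that nothing needs updating when delta == 0.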