From: Usama Arif
To: Andrew Morton, david@kernel.org, linux-mm@kvack.org,
    bsegall@google.com, dietmar.eggemann@arm.com, hannes@cmpxchg.org,
    juri.lelli@redhat.com, kprateek.nayak@amd.com,
    linux-kernel@vger.kernel.org, mgorman@suse.de, mingo@redhat.com,
    peterz@infradead.org, rostedt@goodmis.org, surenb@google.com,
    vincent.guittot@linaro.org, vschneid@redhat.com
Cc: shakeel.butt@linux.dev, riel@surriel.com, kernel-team@meta.com,
    Usama Arif
Subject: [PATCH 0/1] sched/psi: skip irqtime accounting when no new irq time has elapsed
Date: Wed, 17 Jun 2026 10:50:05 -0700
Message-ID: <20260617175219.2494857-1-usama.arif@linux.dev>

psi_account_irqtime() reads irq_time_read() into a per-rq cumulative
counter and only bails out when the delta vs. the previously accounted
amount is negative. A delta of exactly zero is treated as "do the work":
psi_write_begin() is taken, cpu_clock(cpu) is read (which on x86 ends up
in native_sched_clock() / rdtsc) and the cgroup ancestor chain is walked
to add zero to every group's PSI_IRQ_FULL bucket. The zero-delta case is
common in practice -- it fires every time a context switch crosses a PSI
group boundary on a CPU that hasn't serviced an interrupt between the
two switches.
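The behaviour described above can be modelled in userspace. This is only a sketch under stated assumptions: struct rq_model, MODEL_MAX_DEPTH and account_irqtime_model() are hypothetical names, not kernel code, and the seqcount write and cpu_clock() read are reduced to a single "walks" counter. It shows that a check which bails out only on a negative delta still walks the whole group chain when the delta is zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_DEPTH 8	/* hypothetical cap on the ancestor chain */

/* Hypothetical stand-in for the per-rq state; not the kernel's struct rq. */
struct rq_model {
	uint64_t psi_irq_time;			/* irq time already accounted */
	unsigned int depth;			/* groups in the ancestor chain */
	uint64_t group_irq[MODEL_MAX_DEPTH];	/* per-group accumulated irq time */
	unsigned long walks;			/* times the expensive path ran */
};

/*
 * Model of the check as the text describes it: only a negative delta
 * returns early. Returns true when the expensive path ran.
 */
static bool account_irqtime_model(struct rq_model *rq, uint64_t irq_now)
{
	int64_t delta = (int64_t)(irq_now - rq->psi_irq_time);

	if (delta < 0)
		return false;

	rq->psi_irq_time = irq_now;
	rq->walks++;	/* stands in for seqcount write + clock read + walk */
	for (unsigned int i = 0; i < rq->depth && i < MODEL_MAX_DEPTH; i++)
		rq->group_irq[i] += (uint64_t)delta;	/* adds 0 when delta == 0 */

	return true;
}
```

Calling account_irqtime_model() twice with the same irq_now increments walks both times, while the second call leaves every group_irq[] entry unchanged.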
To find out how often this actually fires in the wild, I attached a
bpftrace probe to psi_account_irqtime() on a 176-thread AMD EPYC 9D64
server running a compute-intensive workload. The probe also reads the
value behind irq_time_read(cpu) directly from the per-CPU cpu_irqtime
variable so it can separate delta == 0 from delta < 0. The bpftrace
script was generated by Claude and is at the end of the cover letter.

Over a 30 s window under steady-state load:

  @total                 17,229,311  (100.0%)
    @ret_curr_swapper     7,864,195  ( 45.6%)   curr->pid == 0
    @ret_samegrp            323,299  (  1.9%)   same cgroup as prev
    @reached_delta        9,041,817  ( 52.5%)
      @delta_positive     6,358,192  ( 36.9%)   real work
      @delta_zero         2,683,625  ( 15.6%)   work wasted (this patch)
      @delta_negative             0  (  0.0%)   monotonic clock

15.6% of all psi_account_irqtime() calls -- and 29.7% of the calls that
get past the early returns -- hit the delta == 0 case. delta < 0 did not
occur once in the 30 s window.

That works out to ~89 k calls/sec on this host that today take the full
seqcount write + cpu_clock() + cgroup-chain walk purely to add 0 to
every group's PSI_IRQ_FULL counter.

Extend the early return to also cover delta == 0. rq->psi_irq_time does
not need updating in that case (it would store the same value back) and
no PSI bucket would change. The existing behaviour for delta > 0 is
untouched.
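The percentages and the per-second rate quoted above follow directly from the probe counters. The sketch below recomputes them; the struct and helper names are hypothetical, and only the counter values and the 30 s window come from the measurement.

```c
#include <stdint.h>

/* Counter values copied from the 30 s probe run reported above. */
struct probe_counts {
	uint64_t total;
	uint64_t ret_curr_swapper;
	uint64_t ret_samegrp;
	uint64_t reached_delta;
	uint64_t delta_positive;
	uint64_t delta_zero;
	uint64_t delta_negative;
};

static const struct probe_counts run = {
	.total            = 17229311ULL,
	.ret_curr_swapper =  7864195ULL,
	.ret_samegrp      =   323299ULL,
	.reached_delta    =  9041817ULL,
	.delta_positive   =  6358192ULL,
	.delta_zero       =  2683625ULL,
	.delta_negative   =        0ULL,
};

/* Share of `part` in `whole`, in percent. */
static double share_pct(uint64_t part, uint64_t whole)
{
	return 100.0 * (double)part / (double)whole;
}

/* Average event rate over a window given in seconds. */
static double per_second(uint64_t count, unsigned int window_s)
{
	return (double)count / (double)window_s;
}
```

share_pct(run.delta_zero, run.total) gives the share of all calls, share_pct(run.delta_zero, run.reached_delta) the share of calls that got past the early returns, and per_second(run.delta_zero, 30) the rate of zero-delta calls.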
--------- psi_delta_exact.bt -------
#!/usr/bin/env bpftrace
#include

kprobe:psi_account_irqtime
{
	$rq   = (struct rq *)arg0;
	$curr = (struct task_struct *)arg1;
	$prev = (struct task_struct *)arg2;

	@total = count();

	// Mirror the function's early returns so that only calls which
	// reach the delta computation are classified below.
	if ($curr->pid == 0) {
		@ret_curr_swapper = count();
		return;
	}

	if ($prev != 0) {
		$pg_curr = $curr->cgroups->dfl_cgrp;
		$pg_prev = $prev->cgroups->dfl_cgrp;
		if ($pg_curr == $pg_prev) {
			@ret_samegrp = count();
			return;
		}
	}

	@reached_delta = count();

	// Read this CPU's cpu_irqtime through its per-CPU offset.
	$pcpu_off  = *(uint64 *)(kaddr("__per_cpu_offset") + cpu * 8);
	$irq_time  = *(uint64 *)(kaddr("cpu_irqtime") + $pcpu_off);
	$prev_time = $rq->psi_irq_time;
	$delta     = (int64)($irq_time - $prev_time);

	if ($delta == 0) {
		@delta_zero = count();
	} else if ($delta > 0) {
		@delta_positive = count();
	} else {
		@delta_negative = count();
	}
}

// Stop after the 30 s measurement window.
interval:s:30
{
	exit();
}
--------- end bpftrace ---------

Usama Arif (1):
  sched/psi: skip irqtime accounting when no new irq time has elapsed

 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.53.0-Meta