From: zhidao su <soolaugust@gmail.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, jstultz@google.com,
kprateek.nayak@amd.com
Subject: Re: [PATCH v24 00/11] Donor Migration for Proxy Execution - sched/proxy_exec: perf bench sched messaging regression analysis
Date: Mon, 09 Mar 2026 14:47:49 +0800 [thread overview]
Message-ID: <69ae6d1a.170a0220.264e67.5f30@mx.google.com> (raw)
I ran perf bench sched messaging (-g 10 -l 1000) under both
sched_proxy_exec=0 and sched_proxy_exec=1 to help locate the source of
the reported regression.
Test environment:
Kernel: 7.0.0-rc2 (CONFIG_SCHED_PROXY_EXEC=y, no v24 patches)
Host CPU: Intel Core i7-10700 (8C/16T, Comet Lake)
Test env: virtme-ng (vng) with 4 vCPUs, 2GB RAM
Benchmark: perf bench sched messaging -g 10 -l 1000 (400 processes)
Runs: 7 per configuration
Note: measurements are from a QEMU/vng environment, not bare metal.
The relative comparison between PE=0 and PE=1 is meaningful, but
absolute numbers will differ on real hardware.
--- Wall-clock timing (7 runs each, 4 vCPU) ---
PE ON (sched_proxy_exec=1): avg=1.385s stdev=0.028s (2.0%)
PE OFF (sched_proxy_exec=0): avg=1.441s stdev=0.022s (1.5%)
Delta: -3.9% (PE ON is *faster*)
I also tested with 1 vCPU (single-threaded QEMU, more serialized):
PE ON: cycles 33,309,716,744 IPC 0.453
PE OFF: cycles 31,822,761,446 IPC 0.471
Delta: +4.7% cycles (PE slower on 1 vCPU)
The 1-vCPU result shows a small overhead at low parallelism, which
is consistent with the static cost of the rq->donor/curr split (see
below). With 4 vCPUs the scheduling benefit (fewer context-switches)
dominates.
The 3-5% regression K Prateek reported on v24 is larger than the 1-vCPU
overhead here, which suggests the regression is specific to the v24
Donor Migration patches, not to the base PE infrastructure.
--- perf stat comparison (single run, 4 vCPU) ---
metric PE=1 PE=0 delta
-----------------------------------------------------------
cycles 21,606,480,478 22,026,855,018 -1.91%
instructions 14,415,921,265 14,475,061,137 -0.41%
cache-misses 202,569,745 191,711,462 +5.66% *
context-switches 56,541 59,956 -5.70%
cpu-migrations 250 306 -18.30%
sys-time (s) 6.404 6.710 -4.56%
IPC 0.667 0.657 +1.53%
The one counter that goes the wrong direction is cache-misses (+5.66% with
PE). This likely traces back to the split of rq->curr and rq->donor in
sched.h:1137-1144:
Without CONFIG_SCHED_PROXY_EXEC:
union {
struct task_struct __rcu *donor; /* occupies same cacheline slot */
struct task_struct __rcu *curr;
};
With CONFIG_SCHED_PROXY_EXEC:
struct task_struct __rcu *donor; /* two separate pointers */
struct task_struct __rcu *curr; /* → one extra pointer on hot cacheline */
This adds 8 bytes to the rq hot cacheline. The extra pointer is always
present (even when no task is blocked), making it a *static* overhead of
the PE infrastructure.
--- sched tracepoint comparison ---
event PE=1 PE=0 delta
---------------------------------------------------
sched_switch 63,789 67,782 -5.89%
sched_wakeup 61,289 63,842 -4.00%
sched_migrate_task 209 194 +7.73%
PE reduces context-switches and wakeups by ~5-6%, which is the expected
benefit: the proxy-running mechanism avoids blocking the high-priority
waiter, leading to fewer voluntary switches. This translates directly into
the observed wall-clock improvement.
--- perf record symbol breakdown (cycles:k) ---
symbol PE=1% PE=0% delta pp
-------------------------------------------------------
rep_movs_alternative 10.76 10.93 -0.17
__refill_objects_node 7.95 8.05 -0.10
_raw_spin_lock 7.61 7.66 -0.05
__skb_datagram_iter 5.25 5.28 -0.03
clear_bhb_loop 3.55 3.64 -0.09
unix_stream_read_generic 3.53 3.42 +0.11
queued_spin_lock_slowpath 1.15 1.24 -0.09
With PE enabled, queued_spin_lock_slowpath is *lower* (-0.09pp), consistent
with the context-switch reduction above: fewer blocked waiters means less
spinlock contention.
The small increase in cache-misses (from the larger rq struct) does not
outweigh the scheduling benefit.
--- Summary ---
On upstream 7.0-rc2 with base PE infrastructure (no v24 patches), PE:
- Improves wall-clock time by ~4% in the 4-vCPU configuration
- Reduces context-switches by ~6%
- Adds ~6% cache-miss overhead (from rq->donor/curr split)
- Has negligible __schedule path overhead
The 3-5% regression K Prateek identified is therefore specific to the
v24 Donor Migration patches, not to the upstream PE base. This suggests
the regression source is in one of:
- proxy_migrate_task() cross-CPU donor migration overhead (patch 6)
- the PROXY_WAKING state machinery (patch 3/9)
- pick_again loop changes in pick_next_task() (patches 4/5/7)
The rq->donor cacheline overhead (sched.h:1137) is a static cost worth
noting for the long-term, but it does not cause the v24 regression.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>