* Re: [PATCH v24 00/11] Donor Migration for Proxy Execution - sched/proxy_exec: perf bench sched messaging regression analysis
From: zhidao su @ 2026-03-09 6:47 UTC
To: linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, jstultz,
kprateek.nayak
I ran perf bench sched messaging (-g 10 -l 1000) under both
sched_proxy_exec=0 and sched_proxy_exec=1 to help locate the source of
the reported regression.
Test environment:
Kernel: 7.0.0-rc2 (CONFIG_SCHED_PROXY_EXEC=y, no v24 patches)
Host CPU: Intel Core i7-10700 (8C/16T, Comet Lake)
Test env: virtme-ng (vng) with 4 vCPUs, 2GB RAM
Benchmark: perf bench sched messaging -g 10 -l 1000 (400 processes)
Runs: 7 per configuration
Note: measurements are from a QEMU/vng environment, not bare metal.
The relative comparison between PE=0 and PE=1 is meaningful, but
absolute numbers will differ on real hardware.
--- Wall-clock timing (7 runs each, 4 vCPU) ---
PE ON (sched_proxy_exec=1): avg=1.385s stdev=0.028s (2.0%)
PE OFF (sched_proxy_exec=0): avg=1.441s stdev=0.022s (1.5%)
Delta: -3.9% (PE ON is *faster*)
I also tested with 1 vCPU (single-threaded QEMU, more serialized):
PE ON: cycles 33,309,716,744 IPC 0.453
PE OFF: cycles 31,822,761,446 IPC 0.471
Delta: +4.7% cycles (PE slower on 1 vCPU)
The 1-vCPU result shows a small overhead at low parallelism, which
is consistent with the static cost of the rq->donor/curr split (see
below). With 4 vCPUs the scheduling benefit (fewer context-switches)
dominates.
The 3-5% regression K Prateek reported on v24 is larger than the 1-vCPU
overhead here, which suggests the regression is specific to the v24
Donor Migration patches, not to the base PE infrastructure.
--- perf stat comparison (single run, 4 vCPU) ---
metric            PE=1            PE=0            delta
-------------------------------------------------------
cycles            21,606,480,478  22,026,855,018  -1.91%
instructions      14,415,921,265  14,475,061,137  -0.41%
cache-misses         202,569,745     191,711,462  +5.66% *
context-switches          56,541          59,956  -5.70%
cpu-migrations               250             306  -18.30%
sys-time (s)               6.404           6.710  -4.56%
IPC                        0.667           0.657  +1.53%
The one counter that goes the wrong direction is cache-misses (+5.66% with
PE). This likely traces back to the split of rq->curr and rq->donor in
sched.h:1137-1144:
Without CONFIG_SCHED_PROXY_EXEC:

	union {
		struct task_struct __rcu *donor; /* aliases curr in the same slot */
		struct task_struct __rcu *curr;
	};

With CONFIG_SCHED_PROXY_EXEC:

	struct task_struct __rcu *donor; /* two separate pointers: */
	struct task_struct __rcu *curr;  /* one extra on the hot cacheline */
This adds 8 bytes to the rq hot cacheline. The extra pointer is always
present (even when no task is blocked), making it a *static* overhead of
the PE infrastructure.
--- sched tracepoint comparison ---
event               PE=1    PE=0    delta
-----------------------------------------
sched_switch        63,789  67,782  -5.89%
sched_wakeup        61,289  63,842  -4.00%
sched_migrate_task     209     194  +7.73%
PE reduces context-switches and wakeups by ~5-6%, which is the expected
benefit: the proxy-running mechanism avoids blocking the high-priority
waiter, leading to fewer voluntary switches. This translates directly into
the observed wall-clock improvement.
--- perf record symbol breakdown (cycles:k) ---
symbol                     PE=1 %  PE=0 %  delta (pp)
-----------------------------------------------------
rep_movs_alternative        10.76   10.93    -0.17
__refill_objects_node        7.95    8.05    -0.10
_raw_spin_lock               7.61    7.66    -0.05
__skb_datagram_iter          5.25    5.28    -0.03
clear_bhb_loop               3.55    3.64    -0.09
unix_stream_read_generic     3.53    3.42    +0.11
queued_spin_lock_slowpath    1.15    1.24    -0.09
With PE enabled, queued_spin_lock_slowpath is *lower* (-0.09pp), consistent
with the reduced context-switch and wakeup counts: fewer blocked tasks mean
less spinlock contention. The small increase in cache-misses (from the
larger rq struct) does not outweigh the scheduling benefit.
--- Summary ---
On upstream 7.0-rc2 with base PE infrastructure (no v24 patches), PE:
- Improves wall-clock by ~4% on 4-core workload
- Reduces context-switches by ~6%
- Adds ~6% cache-miss overhead (from rq->donor/curr split)
- Has negligible __schedule path overhead
The 3-5% regression K Prateek identified is therefore specific to the
v24 Donor Migration patches, not to the upstream PE base. This suggests
the regression source is in one of:
- proxy_migrate_task() cross-CPU donor migration overhead (patch 6)
- the PROXY_WAKING state machinery (patch 3/9)
- pick_again loop changes in pick_next_task() (patches 4/5/7)
The rq->donor cacheline overhead (sched.h:1137) is a static cost worth
noting for the long-term, but it does not cause the v24 regression.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>