* Re: [PATCH v24 00/11] Donor Migration for Proxy Execution - sched/proxy_exec: perf bench sched messaging regression analysis
@ 2026-03-09 6:47 zhidao su
From: zhidao su @ 2026-03-09 6:47 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, jstultz,
kprateek.nayak
I ran perf bench sched messaging (-g 10 -l 1000) under both
sched_proxy_exec=0 and sched_proxy_exec=1 to help locate the source of
the reported regression.
Test environment:

  Kernel:    7.0.0-rc2 (CONFIG_SCHED_PROXY_EXEC=y, no v24 patches)
  Host CPU:  Intel Core i7-10700 (8C/16T, Comet Lake)
  Test env:  virtme-ng (vng) with 4 vCPUs, 2GB RAM
  Benchmark: perf bench sched messaging -g 10 -l 1000 (400 processes)
  Runs:      7 per configuration

Note: measurements are from a QEMU/vng environment, not bare metal.
The relative comparison between PE=0 and PE=1 is meaningful, but
absolute numbers will differ on real hardware.
--- Wall-clock timing (7 runs each, 4 vCPU) ---

  PE ON  (sched_proxy_exec=1): avg=1.385s stdev=0.028s (2.0%)
  PE OFF (sched_proxy_exec=0): avg=1.441s stdev=0.022s (1.5%)
  Delta: -3.9% (PE ON is *faster*)
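
As a sanity check on whether the wall-clock delta exceeds run-to-run
noise, the summary statistics above can be fed into Welch's t statistic.
A minimal C sketch follows. The helper names are hypothetical, it assumes
the reported stdev values are sample standard deviations, and it uses a
small Newton iteration instead of sqrt() only to avoid a libm link
dependency.

```c
/* Relative change of `on` versus the baseline `off`, in percent. */
static double rel_delta_pct(double on, double off)
{
	return (on - off) / off * 100.0;
}

/* Newton-Raphson square root; stands in for sqrt() to avoid libm. */
static double sqrt_newton(double x)
{
	double r = x > 1.0 ? x : 1.0;
	int i;

	if (x <= 0.0)
		return 0.0;
	for (i = 0; i < 64; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/*
 * Welch's t statistic from summary data: mean, sample stdev and run
 * count for each configuration. A negative value means A is smaller
 * (here: faster) than B.
 */
static double welch_t(double mean_a, double sd_a, int n_a,
		      double mean_b, double sd_b, int n_b)
{
	/* Squared standard error of the difference of the two means. */
	double se2 = sd_a * sd_a / n_a + sd_b * sd_b / n_b;

	return (mean_a - mean_b) / sqrt_newton(se2);
}
```

Called as welch_t(1.385, 0.028, 7, 1.441, 0.022, 7), this gives a t
statistic of roughly -4.2, so the difference is several standard errors
wide. With only 7 runs per side in a VM, that is indicative rather than
conclusive.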
I also tested with 1 vCPU (single-threaded QEMU, more serialized):
PE ON: cycles 33,309,716,744 IPC 0.453
PE OFF: cycles 31,822,761,446 IPC 0.471
Delta: +4.7% cycles (PE slower on 1 vCPU)
The 1-vCPU result shows a small overhead at low parallelism, which
is consistent with the static cost of the rq->donor/curr split (see
below). With 4 vCPUs the scheduling benefit (fewer context-switches)
dominates.
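The 3-5% regression K Prateek reported on v24 does not show up in the
4-vCPU runs here, where base PE is faster than PE=0. The 1-vCPU overhead
(+4.7% cycles) is of similar size, but it is a cycle count at minimal
parallelism rather than a wall-clock result. This suggests the regression
is specific to the v24 Donor Migration patches, not to the base PE
infrastructure.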
--- perf stat comparison (single run, 4 vCPU) ---

metric                      PE=1            PE=0     delta
----------------------------------------------------------
cycles            21,606,480,478  22,026,855,018    -1.91%
instructions      14,415,921,265  14,475,061,137    -0.41%
cache-misses         202,569,745     191,711,462    +5.66% *
context-switches          56,541          59,956    -5.70%
cpu-migrations               250             306   -18.30%
sys-time (s)               6.404           6.710    -4.56%
IPC                        0.667           0.657    +1.53%
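
The delta and IPC columns are derived values: delta is
(PE=1 - PE=0) / PE=0, and IPC is instructions / cycles. A minimal C
sketch to recompute them from the raw counters follows. The struct and
function names are hypothetical and are not part of perf.

```c
#include <stdint.h>

/* Raw counters from one perf stat run (hypothetical container). */
struct perf_counts {
	uint64_t cycles;
	uint64_t instructions;
};

/* Instructions per cycle for one run. */
static double ipc(const struct perf_counts *c)
{
	return (double)c->instructions / (double)c->cycles;
}

/* Percent change of the PE=1 value relative to the PE=0 baseline. */
static double pct_delta(double pe_on, double pe_off)
{
	return (pe_on - pe_off) / pe_off * 100.0;
}
```

Feeding in the cycles and instructions rows from the table reproduces
the IPC row and its delta, which confirms that the IPC gain comes from a
cycle count that drops faster than the instruction count.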
The one counter that moves in the wrong direction is cache-misses (+5.66%
with PE). This likely traces back to the split of rq->curr and rq->donor in
sched.h:1137-1144:

Without CONFIG_SCHED_PROXY_EXEC:

    union {
        struct task_struct __rcu *donor; /* occupies same cacheline slot */
        struct task_struct __rcu *curr;
    };

With CONFIG_SCHED_PROXY_EXEC:

    struct task_struct __rcu *donor; /* two separate pointers */
    struct task_struct __rcu *curr;  /* one extra pointer on hot cacheline */

This adds 8 bytes (one pointer on a 64-bit build) to the rq hot cacheline.
The extra pointer is always present (even when no task is blocked), making
it a *static* overhead of the PE infrastructure.
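
The size difference can be illustrated outside the kernel with two mock
structs. This is only a sketch: the rq_mock_* types and the neighbouring
fields are hypothetical stand-ins, the __rcu annotation is dropped, and
the real struct rq layout has many more members.

```c
#include <stddef.h>

struct task_struct;	/* opaque here; only pointers are stored */

/* Mock of the layout without CONFIG_SCHED_PROXY_EXEC. */
struct rq_mock_union {
	unsigned long nr_running;	/* hypothetical neighbour */
	union {
		struct task_struct *donor;
		struct task_struct *curr;
	};
	struct task_struct *idle;	/* hypothetical neighbour */
};

/* Mock of the layout with CONFIG_SCHED_PROXY_EXEC. */
struct rq_mock_split {
	unsigned long nr_running;
	struct task_struct *donor;
	struct task_struct *curr;
	struct task_struct *idle;
};

/* In the union variant both names alias the same slot. */
_Static_assert(offsetof(struct rq_mock_union, donor) ==
	       offsetof(struct rq_mock_union, curr),
	       "donor and curr share one slot");

/* The split variant grows by exactly one pointer. */
_Static_assert(sizeof(struct rq_mock_split) -
	       sizeof(struct rq_mock_union) == sizeof(void *),
	       "split layout adds one pointer");

/* Bytes added by splitting donor and curr in the mock layout. */
static size_t rq_mock_growth(void)
{
	return sizeof(struct rq_mock_split) - sizeof(struct rq_mock_union);
}
```

On a real build, running pahole -C rq against a vmlinux with debug info
for both configurations would show where the two pointers actually land
relative to cacheline boundaries, which the mock cannot.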
--- sched tracepoint comparison ---

event                   PE=1      PE=0     delta
------------------------------------------------
sched_switch          63,789    67,782    -5.89%
sched_wakeup          61,289    63,842    -4.00%
sched_migrate_task       209       194    +7.73%
PE reduces context-switches and wakeups by ~4-6%, which is the expected
benefit: the proxy-running mechanism avoids blocking the high-priority
waiter, leading to fewer voluntary switches. This is consistent with
the observed wall-clock improvement.
--- perf record symbol breakdown (cycles:k) ---

symbol                        PE=1%    PE=0%   delta pp
-------------------------------------------------------
rep_movs_alternative          10.76    10.93      -0.17
__refill_objects_node          7.95     8.05      -0.10
_raw_spin_lock                 7.61     7.66      -0.05
__skb_datagram_iter            5.25     5.28      -0.03
clear_bhb_loop                 3.55     3.64      -0.09
unix_stream_read_generic       3.53     3.42      +0.11
queued_spin_lock_slowpath      1.15     1.24      -0.09
With PE enabled, queued_spin_lock_slowpath is *lower* (-0.09pp), consistent
with the context-switch story: fewer blocked tasks means less spinlock
contention. The small increase in cache-misses (likely from the larger rq
struct) does not outweigh the scheduling benefit.
--- Summary ---

On upstream 7.0-rc2 with base PE infrastructure (no v24 patches), PE:
  - Improves wall-clock by ~4% on the 4-vCPU workload
  - Reduces context-switches by ~6%
  - Adds ~6% cache misses (likely from the rq->donor/curr split)
  - Has negligible __schedule path overhead

The 3-5% regression K Prateek identified therefore appears to be specific
to the v24 Donor Migration patches, not to the upstream PE base. This
suggests the regression source is in one of:
  - proxy_migrate_task() cross-CPU donor migration overhead (patch 6)
  - the PROXY_WAKING state machinery (patch 3/9)
  - pick_again loop changes in pick_next_task() (patches 4/5/7)

The rq->donor cacheline overhead (sched.h:1137) is a static cost worth
noting for the long term, but it does not cause the v24 regression.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>