Re: [PATCH v24 00/11] Donor Migration for Proxy Execution - sched/proxy_exec: perf bench sched messaging regression analysis

From: zhidao su @ 2026-03-09  6:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, jstultz,
	kprateek.nayak

I ran perf bench sched messaging (-g 10 -l 1000) under both
sched_proxy_exec=0 and sched_proxy_exec=1 to help locate the source of
the reported regression.

Test environment:
  Kernel:    7.0.0-rc2 (CONFIG_SCHED_PROXY_EXEC=y, no v24 patches)
  Host CPU:  Intel Core i7-10700 (8C/16T, Comet Lake)
  Test env:  virtme-ng (vng) with 4 vCPUs, 2GB RAM
  Benchmark: perf bench sched messaging -g 10 -l 1000 (400 processes)
  Runs:      7 per configuration

Note: measurements are from a QEMU/vng environment, not bare metal.
The relative comparison between PE=0 and PE=1 is meaningful, but
absolute numbers will differ on real hardware.

--- Wall-clock timing (7 runs each, 4 vCPU) ---

  PE ON  (sched_proxy_exec=1): avg=1.385s  stdev=0.028s (2.0%)
  PE OFF (sched_proxy_exec=0): avg=1.441s  stdev=0.022s (1.5%)
  Delta: -3.9% (PE ON is *faster*)

I also tested with 1 vCPU (single-threaded QEMU, more serialized):
  PE ON:  cycles 33,309,716,744  IPC 0.453
  PE OFF: cycles 31,822,761,446  IPC 0.471
  Delta:  +4.7% cycles (PE slower on 1 vCPU)

The 1-vCPU result shows a small overhead at low parallelism, which
is consistent with the static cost of the rq->donor/curr split (see
below). With 4 vCPUs the scheduling benefit (fewer context-switches)
dominates.

The 3-5% regression K Prateek reported on v24 does not show up in the
4-vCPU runs here; base PE only costs extra cycles in the 1-vCPU case
(+4.7%). This suggests the regression is specific to the v24 Donor
Migration patches, not to the base PE infrastructure.

--- perf stat comparison (single run, 4 vCPU) ---

  metric               PE=1           PE=0        delta
  -----------------------------------------------------------
  cycles          21,606,480,478  22,026,855,018   -1.91%
  instructions    14,415,921,265  14,475,061,137   -0.41%
  cache-misses       202,569,745     191,711,462   +5.66%  *
  context-switches        56,541          59,956   -5.70%
  cpu-migrations             250             306  -18.30%
  sys-time (s)            6.404           6.710    -4.56%
  IPC                     0.667           0.657    +1.53%
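
For anyone who wants to re-check the table: the delta column is the
relative change of PE=1 against PE=0, and IPC is instructions divided
by cycles. A minimal sketch of that arithmetic follows; the helper names
pct_delta() and ipc() are made up for illustration and are not part of
perf.

```c
#include <assert.h>

/*
 * Relative change of the PE=1 value against the PE=0 baseline, in percent.
 * Negative means PE=1 is lower than PE=0.
 * (Hypothetical helper, not a perf API.)
 */
static double pct_delta(double pe_on, double pe_off)
{
    assert(pe_off != 0.0);  /* the baseline must be non-zero */
    return (pe_on - pe_off) / pe_off * 100.0;
}

/* Instructions per cycle from the two raw counters. */
static double ipc(double instructions, double cycles)
{
    assert(cycles != 0.0);
    return instructions / cycles;
}
```

Feeding in the cycles row, pct_delta(21606480478.0, 22026855018.0)
reproduces the -1.91% above, and the same helper applies to the
tracepoint counts further down.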

The one counter that goes the wrong direction is cache-misses (+5.66% with
PE). This likely traces back to the split of rq->curr and rq->donor in
sched.h:1137-1144:

  Without CONFIG_SCHED_PROXY_EXEC:
    union {
        struct task_struct __rcu *donor;  /* occupies same cacheline slot */
        struct task_struct __rcu *curr;
    };

  With CONFIG_SCHED_PROXY_EXEC:
    struct task_struct __rcu *donor;  /* two separate pointers */
    struct task_struct __rcu *curr;   /* → one extra pointer on hot cacheline */

This adds 8 bytes to the rq hot cacheline. The extra pointer is always
present (even when no task is blocked), making it a *static* overhead of
the PE infrastructure.
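
The size effect can be illustrated outside the kernel with a mock
struct. This is only a sketch: rq_mock_union, rq_mock_split and the
surrounding fields are invented stand-ins, not the real struct rq
layout, and they say nothing about where the real cacheline boundaries
fall.

```c
#include <assert.h>
#include <stddef.h>

struct task;  /* opaque stand-in for struct task_struct */

/* Mock of the !CONFIG_SCHED_PROXY_EXEC case: donor and curr share a slot. */
struct rq_mock_union {
    unsigned int nr_running;
    union {
        struct task *donor;
        struct task *curr;
    };
    struct task *idle;
};

/* Mock of the CONFIG_SCHED_PROXY_EXEC case: two separate pointers. */
struct rq_mock_split {
    unsigned int nr_running;
    struct task *donor;
    struct task *curr;
    struct task *idle;
};

/* In the union variant both names refer to the same offset. */
static_assert(offsetof(struct rq_mock_union, donor) ==
              offsetof(struct rq_mock_union, curr),
              "union members must alias");

/* Bytes by which the split variant is larger than the union variant. */
static size_t rq_mock_growth(void)
{
    return sizeof(struct rq_mock_split) - sizeof(struct rq_mock_union);
}

/* How far the field following curr is pushed back by the split. */
static size_t rq_mock_idle_shift(void)
{
    return offsetof(struct rq_mock_split, idle) -
           offsetof(struct rq_mock_union, idle);
}
```

Both helpers return the size of one pointer, i.e. 8 bytes on a 64-bit
build, which matches the figure quoted above. On the real struct rq,
pahole would be the tool to confirm the actual offsets.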

--- sched tracepoint comparison ---

  event                PE=1        PE=0       delta
  ---------------------------------------------------
  sched_switch         63,789      67,782     -5.89%
  sched_wakeup         61,289      63,842     -4.00%
  sched_migrate_task      209         194     +7.73%

PE reduces context switches by ~6% and wakeups by ~4%, which is the
expected benefit: the proxy-running mechanism avoids blocking the
high-priority waiter, leading to fewer voluntary switches. This is
consistent with the observed wall-clock improvement.

--- perf record symbol breakdown (cycles:k) ---

  symbol                       PE=1%   PE=0%   delta pp
  -------------------------------------------------------
  rep_movs_alternative         10.76   10.93    -0.17
  __refill_objects_node         7.95    8.05    -0.10
  _raw_spin_lock                7.61    7.66    -0.05
  __skb_datagram_iter           5.25    5.28    -0.03
  clear_bhb_loop                3.55    3.64    -0.09
  unix_stream_read_generic      3.53    3.42    +0.11
  queued_spin_lock_slowpath     1.15    1.24    -0.09

With PE enabled, queued_spin_lock_slowpath is *lower* (-0.09pp), consistent
with the context-switch story: fewer blocked tasks means less spinlock
contention. The small increase in cache-misses (likely from the larger rq
struct) does not outweigh the scheduling benefit.

--- Summary ---

On upstream 7.0-rc2 with base PE infrastructure (no v24 patches), PE:
  - Improves wall-clock by ~4% in the 4-vCPU runs
  - Reduces context-switches by ~6%
  - Adds ~6% cache-miss overhead (likely from the rq->donor/curr split)
  - Has negligible __schedule path overhead

The 3-5% regression K Prateek identified therefore appears to be specific
to the v24 Donor Migration patches, not to the upstream PE base. This
suggests the regression source is in one of:
  - proxy_migrate_task() cross-CPU donor migration overhead (patch 6)
  - the PROXY_WAKING state machinery (patch 3/9)
  - pick_again loop changes in pick_next_task() (patches 4/5/7)

The rq->donor cacheline overhead (sched.h:1137) is a static cost worth
noting for the long-term, but it does not cause the v24 regression.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
