public inbox for linux-kernel@vger.kernel.org
* Re: [PATCH v24 00/11] Donor Migration for Proxy Execution - sched/proxy_exec: perf bench sched messaging regression analysis
@ 2026-03-09  6:47 zhidao su
From: zhidao su @ 2026-03-09  6:47 UTC
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, jstultz,
	kprateek.nayak


I ran perf bench sched messaging (-g 10 -l 1000) under both
sched_proxy_exec=0 and sched_proxy_exec=1 to help locate the source of
the reported regression.

Test environment:
  Kernel:    7.0.0-rc2 (CONFIG_SCHED_PROXY_EXEC=y, no v24 patches)
  Host CPU:  Intel Core i7-10700 (8C/16T, Comet Lake)
  Test env:  virtme-ng (vng) with 4 vCPUs, 2GB RAM
  Benchmark: perf bench sched messaging -g 10 -l 1000 (400 processes)
  Runs:      7 per configuration

Note: measurements are from a QEMU/vng environment, not bare metal.
The relative comparison between PE=0 and PE=1 is meaningful, but
absolute numbers will differ on real hardware.

--- Wall-clock timing (7 runs each, 4 vCPU) ---

  PE ON  (sched_proxy_exec=1): avg=1.385s  stdev=0.028s (2.0%)
  PE OFF (sched_proxy_exec=0): avg=1.441s  stdev=0.022s (1.5%)
  Delta: -3.9% (PE ON is *faster*)

I also tested with 1 vCPU (single-threaded QEMU, more serialized):
  PE ON:  cycles 33,309,716,744  IPC 0.453
  PE OFF: cycles 31,822,761,446  IPC 0.471
  Delta:  +4.7% cycles (PE slower on 1 vCPU)

The 1-vCPU result shows a small overhead at low parallelism, which
is consistent with the static cost of the rq->donor/curr split (see
below). With 4 vCPUs the scheduling benefit (fewer context-switches)
dominates.

K Prateek reported a 3-5% regression on v24, whereas the base PE
infrastructure here is *faster* at 4 vCPUs and only ~4.7% slower in the
fully serialized 1-vCPU case. This suggests the regression is specific to
the v24 Donor Migration patches, not to the base PE infrastructure.

--- perf stat comparison (single run, 4 vCPU) ---

  metric               PE=1           PE=0        delta
  -----------------------------------------------------------
  cycles          21,606,480,478  22,026,855,018   -1.91%
  instructions    14,415,921,265  14,475,061,137   -0.41%
  cache-misses       202,569,745     191,711,462   +5.66%  *
  context-switches        56,541          59,956   -5.70%
  cpu-migrations             250             306  -18.30%
  sys-time (s)            6.404           6.710    -4.56%
  IPC                     0.667           0.657    +1.53%

The one counter that goes the wrong direction is cache-misses (+5.66% with
PE). This likely traces back to the split of rq->curr and rq->donor in
sched.h:1137-1144:

  Without CONFIG_SCHED_PROXY_EXEC:
    union {
        struct task_struct __rcu *donor;  /* occupies same cacheline slot */
        struct task_struct __rcu *curr;
    };

  With CONFIG_SCHED_PROXY_EXEC:
    struct task_struct __rcu *donor;  /* two separate pointers */
    struct task_struct __rcu *curr;   /* → one extra pointer on hot cacheline */

This adds 8 bytes to the rq hot cacheline. The extra pointer is always
present (even when no task is blocked), making it a *static* overhead of
the PE infrastructure.

--- sched tracepoint comparison ---

  event                PE=1        PE=0       delta
  ---------------------------------------------------
  sched_switch         63,789      67,782     -5.89%
  sched_wakeup         61,289      63,842     -4.00%
  sched_migrate_task      209         194     +7.73%

PE reduces context-switches and wakeups by ~5-6%, which is the expected
benefit: with proxy execution a blocked waiter stays on the runqueue as
donor and the lock owner runs in its place, so each contention is resolved
with fewer voluntary switches. This translates directly into the observed
wall-clock improvement.

--- perf record symbol breakdown (cycles:k) ---

  symbol                       PE=1%   PE=0%   delta pp
  -------------------------------------------------------
  rep_movs_alternative         10.76   10.93    -0.17
  __refill_objects_node         7.95    8.05    -0.10
  _raw_spin_lock                7.61    7.66    -0.05
  __skb_datagram_iter           5.25    5.28    -0.03
  clear_bhb_loop                3.55    3.64    -0.09
  unix_stream_read_generic      3.53    3.42    +0.11
  queued_spin_lock_slowpath     1.15    1.24    -0.09

With PE enabled, queued_spin_lock_slowpath is *lower* (-0.09pp), consistent
with the reduced switch and wakeup counts: fewer blocked tasks means less
spinlock contention. The small increase in cache-misses (from the larger rq
struct) does not outweigh this scheduling benefit.

--- Summary ---

On upstream 7.0-rc2 with base PE infrastructure (no v24 patches), PE:
  - Improves wall-clock by ~4% on the 4-vCPU workload
  - Reduces context-switches by ~6%
  - Adds ~6% cache-miss overhead (from rq->donor/curr split)
  - Has negligible __schedule path overhead

The 3-5% regression K Prateek identified is therefore specific to the
v24 Donor Migration patches, not to the upstream PE base. This suggests
the regression source is in one of:
  - proxy_migrate_task() cross-CPU donor migration overhead (patch 6)
  - the PROXY_WAKING state machinery (patch 3/9)
  - pick_again loop changes in pick_next_task() (patches 4/5/7)

The rq->donor cacheline overhead (sched.h:1137) is a static cost worth
noting for the long-term, but it does not cause the v24 regression.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
