From: zhidao su <soolaugust@gmail.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, jstultz@google.com,
kprateek.nayak@amd.com
Subject: Re: [PATCH v24 00/11] Donor Migration for Proxy Execution - sched/proxy_exec: perf bench sched messaging regression analysis
Date: Mon, 09 Mar 2026 14:47:49 +0800 [thread overview]
Message-ID: <69ae6d1a.170a0220.264e67.5f30@mx.google.com> (raw)
I ran perf bench sched messaging (-g 10 -l 1000) under both
sched_proxy_exec=0 and sched_proxy_exec=1 to help locate the source of
the reported regression.
Test environment:
Kernel: 7.0.0-rc2 (CONFIG_SCHED_PROXY_EXEC=y, no v24 patches)
Host CPU: Intel Core i7-10700 (8C/16T, Comet Lake)
Test env: virtme-ng (vng) with 4 vCPUs, 2GB RAM
Benchmark: perf bench sched messaging -g 10 -l 1000 (400 processes)
Runs: 7 per configuration
Note: measurements are from a QEMU/vng environment, not bare metal.
The relative comparison between PE=0 and PE=1 is meaningful, but
absolute numbers will differ on real hardware.
--- Wall-clock timing (7 runs each, 4 vCPU) ---
PE ON (sched_proxy_exec=1): avg=1.385s stdev=0.028s (2.0%)
PE OFF (sched_proxy_exec=0): avg=1.441s stdev=0.022s (1.5%)
Delta: -3.9% (PE ON is *faster*)
I also tested with 1 vCPU (single-threaded QEMU, more serialized):
PE ON: cycles 33,309,716,744 IPC 0.453
PE OFF: cycles 31,822,761,446 IPC 0.471
Delta: +4.7% cycles (PE slower on 1 vCPU)
The 1-vCPU result shows a small overhead at low parallelism, which
is consistent with the static cost of the rq->donor/curr split (see
below). With 4 vCPUs the scheduling benefit (fewer context-switches)
dominates.
The 3-5% regression K Prateek reported on v24 is larger than the 1-vCPU
overhead here, which suggests the regression is specific to the v24
Donor Migration patches, not to the base PE infrastructure.
--- perf stat comparison (single run, 4 vCPU) ---
metric PE=1 PE=0 delta
-----------------------------------------------------------
cycles 21,606,480,478 22,026,855,018 -1.91%
instructions 14,415,921,265 14,475,061,137 -0.41%
cache-misses 202,569,745 191,711,462 +5.66% *
context-switches 56,541 59,956 -5.70%
cpu-migrations 250 306 -18.30%
sys-time (s) 6.404 6.710 -4.56%
IPC 0.667 0.657 +1.53%
The one counter that goes the wrong direction is cache-misses (+5.66% with
PE). This likely traces back to the split of rq->curr and rq->donor in
sched.h:1137-1144:
Without CONFIG_SCHED_PROXY_EXEC:
union {
struct task_struct __rcu *donor; /* occupies same cacheline slot */
struct task_struct __rcu *curr;
};
With CONFIG_SCHED_PROXY_EXEC:
struct task_struct __rcu *donor; /* two separate pointers */
struct task_struct __rcu *curr; /* → one extra pointer on hot cacheline */
This adds 8 bytes to the rq hot cacheline. The extra pointer is always
present (even when no task is blocked), making it a *static* overhead of
the PE infrastructure.
--- sched tracepoint comparison ---
event PE=1 PE=0 delta
---------------------------------------------------
sched_switch 63,789 67,782 -5.89%
sched_wakeup 61,289 63,842 -4.00%
sched_migrate_task 209 194 +7.73%
PE reduces context-switches and wakeups by ~5-6%, which is the expected
benefit: the proxy-running mechanism avoids blocking the high-priority
waiter, leading to fewer voluntary switches. This translates directly into
the observed wall-clock improvement.
--- perf record symbol breakdown (cycles:k) ---
symbol PE=1% PE=0% delta pp
-------------------------------------------------------
rep_movs_alternative 10.76 10.93 -0.17
__refill_objects_node 7.95 8.05 -0.10
_raw_spin_lock 7.61 7.66 -0.05
__skb_datagram_iter 5.25 5.28 -0.03
clear_bhb_loop 3.55 3.64 -0.09
unix_stream_read_generic 3.53 3.42 +0.11
queued_spin_lock_slowpath 1.15 1.24 -0.09
With PE enabled, queued_spin_lock_slowpath is *lower* (-0.09pp), consistent
with the context-switch reduction above: fewer blocked waiters means less
spinlock contention.
The small increase in cache-misses (from the larger rq struct) does not
outweigh the scheduling benefit.
--- Summary ---
On upstream 7.0-rc2 with base PE infrastructure (no v24 patches), PE:
- Improves wall-clock time by ~4% in the 4-vCPU configuration
- Reduces context-switches by ~6%
- Adds ~6% cache-miss overhead (from rq->donor/curr split)
- Has negligible __schedule path overhead
The 3-5% regression K Prateek identified is therefore specific to the
v24 Donor Migration patches, not to the upstream PE base. This suggests
the regression source is in one of:
- proxy_migrate_task() cross-CPU donor migration overhead (patch 6)
- the PROXY_WAKING state machinery (patch 3/9)
- pick_again loop changes in pick_next_task() (patches 4/5/7)
The rq->donor cacheline overhead (sched.h:1137) is a static cost worth
noting for the long-term, but it does not cause the v24 regression.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>