public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Michael Jeanson <mjeanson@efficios.com>,
	Jens Axboe <axboe@kernel.dk>,
	Peter Zijlstra <peterz@infradead.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
	x86@kernel.org, Arnd Bergmann <arnd@arndb.de>,
	Heiko Carstens <hca@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Huacai Chen <chenhuacai@kernel.org>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Palmer Dabbelt <palmer@dabbelt.com>
Subject: Re: [patch V4 28/36] rseq: Switch to fast path processing on exit to user
Date: Fri, 12 Sep 2025 16:22:26 +0200	[thread overview]
Message-ID: <87frcrrbrx.ffs@tglx> (raw)
In-Reply-To: <cda37321-691d-47db-8b71-7a79dedb4bb6@efficios.com>

On Thu, Sep 11 2025 at 16:00, Mathieu Desnoyers wrote:
> On 2025-09-11 12:47, Thomas Gleixner wrote:
>> The normal case is that the fault is pretty much guaranteed to happen
>> because the scheduler places the child on a different CPU and therefore
>> the CPU/MM IDs need to be updated anyway.
>
> Is this "normal case" the behavior you wish the scheduler to have or
> is it based on metrics ?

I did measurements and that's how I came to the conclusion. I measured
forks vs. faults in the page which holds the RSEQ memory, aka. TLS.

> Here is the result of a sched_fork() instrumentation vs the first
> pick_next_task that follows for each thread (accounted with schedstats
> see diff at the end of this email, followed by my kernel config), and we
> can observe the following pattern with a parallel kernel build workload
> (64-core VM, on a physical machine that has 384 logical cpus):
>
> - When undercommitted, in which case the scheduler has idle core
>    targets to select (make -j32 on a 64-core VM), indeed the behavior
>    favors spreading the workload to different CPUs (in which case we
>    would hit a page fault on return from fork, as you predicted):
>    543 same-cpu vs 62k different-cpu.
>    Most of the same-cpu are on CPU 0 (402), surprisingly,
>
> - When using all the cores (make -j64 on a 64-core VM) (and
>    also for overcommit scenarios tested up to -j400), the
>    picture differs:
>    27k same-cpu vs 32k different-cpu, which is pretty much even.

Sure. In overcommit there is obviously a higher probability to stay on
the same CPU.

> This is a kernel build workload, which consist mostly of
> single-threaded processes (make, gcc). I therefore expect the
> rseq mm_cid to be always 0 for those cases.

Sure, but that does not mean the CPU stays the same.

>> The only cases where this is not true, are when there is no capacity to
>> do so or on UP or the parent was affine to a single CPU, which is what
>> the child inherits.
>
> Wrong. See above.

Huch? You just confirmed what I was saying. It's not true for
overcommit, i.e. there is no capacity to do place it on a different CPU
right away.

> Moreover, a common userspace process execution pattern is fork followed
> by exec, in which case whatever page faults happen on fork become
> irrelevant as soon as exec replaces the entire mapping.

That's correct, but for exec to happen it needs to complete the fork
first and go out to user space to do the exec without touching TLS and
without being placed on a different CPU, right?

So again, that depends on the ability of placement where this ends up.

>> the fault will happen in the fast path first, which means the exit code
>> has to go another round through the work loop instead of forcing the
>> fault right away on the first exit in the slowpath, where it can be
>> actually resolved.
>
> I think you misunderstood my improvement suggestion: I do _not_
> recommend that we pointlessly take the fast-path on fork, but rather
> than we keep the cached cpu_cid value (from the parent) around to check
> whether we actually need to store to userspace and take a page fault.
> Based on this, we can choose whether we use the fast-path or go directly
> to the slow-path.

That's not that trivial because you need extra an extra decision in the
scheduler to take the slowpath based on information whether that's the
first schedule and change after fork so the scheduler can set either
TIF_RSEQ or NOTIFY along with the slowpath flag.

But that's just not worth it. So forcing the slowpath after fork seemed
to be a sensible thing to do.

Anyway, I just took the numbers for a make -j128 build on a 64CPU
systems again with the force removed. I have to correct my claim about
vast majority for that case, but 69% of forked childs which fault in the
TLS/RSEQ page before exit or exec is definitely not a small number.

The pattern, which brought me to actually force the slowpath
unconditionally was:

  child()
  ...
  return to user:

    fast_path -> faults
    slow_path -> faults and handles the fault
    fast_path    (usually a NOOP, but not necessarily)
    done

Though I'm not insisting on keeping the fork force slowpath mechanics.

As we might eventually agree on, it depends on the workload and the
state of the system, so the milage may vary pretty much.

Thanks,

        tglx

  reply	other threads:[~2025-09-12 14:22 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-08 21:31 [patch V4 00/36] rseq: Optimize exit to user space Thomas Gleixner
2025-09-08 21:31 ` [patch V4 01/36] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-09-08 21:31 ` [patch V4 02/36] rseq: Condense the inline stubs Thomas Gleixner
2025-09-08 21:31 ` [patch V4 03/36] rseq: Move algorithm comment to top Thomas Gleixner
2025-09-08 21:31 ` [patch V4 04/36] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
2025-09-08 21:31 ` [patch V4 05/36] rseq: Simplify registration Thomas Gleixner
2025-09-08 21:31 ` [patch V4 06/36] rseq: Simplify the event notification Thomas Gleixner
2025-09-09 13:18   ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 07/36] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
2025-09-09  0:00   ` Sean Christopherson
2025-09-09 12:10     ` Thomas Gleixner
2025-09-09 13:21   ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 08/36] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
2025-09-09 13:25   ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 09/36] rseq: Introduce struct rseq_data Thomas Gleixner
2025-09-09 13:30   ` Mathieu Desnoyers
2025-09-12 20:44     ` Thomas Gleixner
2025-09-12 21:33       ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 10/36] entry: Cleanup header Thomas Gleixner
2025-09-08 21:31 ` [patch V4 11/36] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
2025-09-09 13:33   ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 12/36] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
2025-09-09 13:38   ` Mathieu Desnoyers
2025-09-09 14:10     ` Thomas Gleixner
2025-09-09 14:59       ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 13/36] sched: Move MM CID related functions to sched.h Thomas Gleixner
2025-09-08 21:31 ` [patch V4 14/36] rseq: Cache CPU ID and MM CID values Thomas Gleixner
2025-09-09 13:43   ` Mathieu Desnoyers
2025-09-09 14:13     ` Thomas Gleixner
2025-09-09 15:01       ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 15/36] rseq: Record interrupt from user space Thomas Gleixner
2025-09-09 13:53   ` Mathieu Desnoyers
2025-09-09 14:17     ` Thomas Gleixner
2025-09-09 15:05       ` Mathieu Desnoyers
2025-09-08 21:31 ` [patch V4 16/36] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-09-08 21:31 ` [patch V4 17/36] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
2025-09-08 21:32 ` [patch V4 18/36] rseq: Provide static branch for runtime debugging Thomas Gleixner
2025-09-08 21:32 ` [patch V4 19/36] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
2025-09-09 15:11   ` Mathieu Desnoyers
2025-09-08 21:32 ` [patch V4 20/36] rseq: Replace the original debug implementation Thomas Gleixner
2025-09-08 21:32 ` [patch V4 21/36] rseq: Make exit debugging static branch based Thomas Gleixner
2025-09-08 21:32 ` [patch V4 22/36] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
2025-09-08 21:32 ` [patch V4 23/36] rseq: Provide and use rseq_set_ids() Thomas Gleixner
2025-09-11 13:40   ` Mathieu Desnoyers
2025-09-11 16:02     ` Thomas Gleixner
2025-09-11 17:13       ` Mathieu Desnoyers
2025-09-08 21:32 ` [patch V4 24/36] rseq: Separate the signal delivery path Thomas Gleixner
2025-09-08 21:32 ` [patch V4 25/36] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
2025-09-08 21:32 ` [patch V4 26/36] rseq: Optimize event setting Thomas Gleixner
2025-09-11 14:03   ` Mathieu Desnoyers
2025-09-11 16:06     ` Thomas Gleixner
2025-09-11 17:15       ` Mathieu Desnoyers
2025-09-12  6:58         ` Thomas Gleixner
2025-09-08 21:32 ` [patch V4 27/36] rseq: Implement fast path for exit to user Thomas Gleixner
2025-09-11 14:27   ` Mathieu Desnoyers
2025-09-11 16:08     ` Thomas Gleixner
2025-09-08 21:32 ` [patch V4 28/36] rseq: Switch to fast path processing on " Thomas Gleixner
2025-09-11 14:44   ` Mathieu Desnoyers
2025-09-11 14:45     ` Mathieu Desnoyers
2025-09-11 16:50       ` Thomas Gleixner
2025-09-11 16:47     ` Thomas Gleixner
2025-09-11 20:00       ` Mathieu Desnoyers
2025-09-12 14:22         ` Thomas Gleixner [this message]
2025-09-12 15:44           ` Mathieu Desnoyers
2025-09-08 21:32 ` [patch V4 29/36] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
2025-09-08 21:32 ` [patch V4 30/36] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
2025-09-08 21:32 ` [patch V4 31/36] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
2025-09-17  6:16   ` [tip: core/core] " tip-bot2 for Thomas Gleixner
2025-09-08 21:32 ` [patch V4 32/36] x86: Use generic TIF bits Thomas Gleixner
2025-09-17  6:16   ` [tip: core/core] " tip-bot2 for Thomas Gleixner
2025-09-08 21:32 ` [patch V4 33/36] s390: " Thomas Gleixner
2025-09-11  9:11   ` Sven Schnelle
2025-09-11 11:03   ` Heiko Carstens
2025-09-17  6:16   ` [tip: core/core] " tip-bot2 for Thomas Gleixner
2025-09-08 21:32 ` [patch V4 34/36] loongarch: " Thomas Gleixner
2025-09-17  6:16   ` [tip: core/core] " tip-bot2 for Thomas Gleixner
2025-09-08 21:32 ` [patch V4 35/36] riscv: " Thomas Gleixner
2025-09-17  6:16   ` [tip: core/core] " tip-bot2 for Thomas Gleixner
2025-09-08 21:32 ` [patch V4 36/36] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-09-10 13:55 ` [patch V4 00/36] rseq: Optimize exit to user space Jens Axboe
2025-09-10 14:45   ` Michael Jeanson
2025-09-10 15:34     ` Jens Axboe
2025-09-10 14:54   ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87frcrrbrx.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=arnd@arndb.de \
    --cc=axboe@kernel.dk \
    --cc=boqun.feng@gmail.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=chenhuacai@kernel.org \
    --cc=decui@microsoft.com \
    --cc=hca@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mjeanson@efficios.com \
    --cc=palmer@dabbelt.com \
    --cc=paul.walmsley@sifive.com \
    --cc=paulmck@kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=seanjc@google.com \
    --cc=svens@linux.ibm.com \
    --cc=wei.liu@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox