public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Peter Zilstra <peterz@infradead.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, Florian Weimer <fweimer@redhat.com>,
	"carlos@redhat.com" <carlos@redhat.com>,
	libc-coord@lists.openwall.com
Subject: Re: [patch 00/12] rseq: Implement time slice extension mechanism
Date: Fri, 12 Sep 2025 15:26:22 -0400	[thread overview]
Message-ID: <a65dfd2c-b435-4d83-89d0-abc8002db7c7@efficios.com> (raw)
In-Reply-To: <87a52zr5sv.ffs@tglx>

[ For those just CC'd on this thread, the discussion is about time slice
   extension for userspace critical sections. We are specifically
   discussing the kernel ABI we plan to expose to userspace. ]

On 2025-09-12 12:31, Thomas Gleixner wrote:
> On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
>> On 2025-09-11 16:18, Thomas Gleixner wrote:
>>> It receives SIGSEGV because that means that it did not follow the rules
>>> and stuck an arbitrary syscall into the critical section.
>>
>> Not following the rules could also be done by just looping for a long
>> time in userspace within or after the critical section, in which case
>> the timer should catch it.
> 
> It's pretty much impossible to tell for the kernel without more
> overhead, whether that's actually a violation of the rules or not.
> 
> The operation after the grant can be interrupted (without resulting in
> scheduling), which is out of control of the task which got the extension
> granted.
> 
> The timer is there to ensure that there is an upper bound to the grant
> independent of the actual reason.

If the worse side-effect of this feature is that the slice extension
is not granted when users misbehave, IMHO this would increase the
likelihood of adoption compared to failure modes that end up killing the
offending processes.

> 
> Going through a different syscall is an obvious deviation from the rule.

AFAIU, the grant is cleared when a signal handler is delivered, which
makes it OK for signals to issue system calls even if they are nested
on top of a granted extension critical section.

> 
> As far I understood the earlier discussions, scheduler folks want to
> enforce that because of PREEMPT_NONE semantics, where a randomly chosen
> syscall might not result in an immediate reschedule because the work,
> which needs to be done takes arbitrary time to complete.
> 
> Though that's arguably not much different from
> 
>         syscall()
>                  -> tick -> NEED_RESCHED
>          do_tons_of_work();
>         exit_to_user()
>            schedule();
> 
> except that in the slice extension case, the latency increases by the
> slice extension time.
> 
> If we allow arbitrary syscalls to terminate the grant, then we need to
> stick an immediate schedule() into the syscall entry work function. We'd
> still need the separate yield() syscall to provide a side effect free
> way of termination.
> 
> I have no strong opinions either way. Peter?

If it happens to not be too bothersome to allow arbitrary system calls
to act as implicit rseq_slice_yield() rather than result in a
segmentation fault, I think it would make this feature more widely
adopted.

Another scenario I have in mind is a userspace critical section that
would typically benefit from slice extension, but seldomly requires
to issue a system call. In C and higher level languages, that could be
very much outside of the user control, such as accessing a
global-dynamic TLS variable located within a global-dynamic shared
object, which can trigger memory allocation under the hood on first
access.

Handling syscall within granted extension by killing the process
will likely reserve this feature to the niche use-cases.

> 
>>>> rseq->slice_request = true;  /* WRITE_ONCE() */
>>>> barrier();
>>>> critical_section();
>>>> barrier();
>>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>>> if (rseq->slice_grant)       /* READ_ONCE() */
>>>>      rseq_slice_yield();
>>>
>>> That should work as it's strictly CPU local. Good point, now that you
>>> said it it's obvious :)
>>>
>>> Let me rework it accordingly.
>>
>> I have two questions wrt ABI here:
>>
>> 1) Do we expect the slice requests to be done from C and higher level
>>      languages or only from assembly ?
> 
> It doesn't matter as long as the ordering is guaranteed.

OK, so I understand that you intend to target higher level languages
as well, which makes my second point (nesting) relevant.

> 
>> 2) Slice requests are a good fit for locking. Locking typically
>>      has nesting ability.
>>
>>      We should consider making the slice request ABI a 8-bit
>>      or 16-bit nesting counter to allow nesting of its users.
> 
> Making request a counter requires to keep request set when the
> extension is granted. So the states would be:
> 
>       request    granted
>       0          0               Neutral
>       >0         0               Requested
>       >=0        1               Granted

Yes.

> 
> That should work.
> 
> Though I'm not really convinced that unconditionally embeddeding it into
> random locking primitives is the right thing to do.

Me neither. I wonder what would be a good approach to integrate this
with locking APIs. Here are a few ideas, some worse than others:

- Extend pthread_mutexattr_t to set whether the mutex should be
   slice-extended. Downside: if a mutex has some long and some
   short critical sections, it's really a one-size fits all decision
   for all critical sections for that mutex.

- Extend the pthread_mutex_lock/trylock with new APIs to allow
   specifying whether slice-extension is needed for the upcoming critical
   section.

- Just let the pthread_mutex_lock caller explicitly request the
   slice extension *after* grabbing the lock. Downside: this opens
   a window of a few instructions where preemption can happen
   and slice extension would have been useful. Should we care ?

> 
> The extension makes only sense, when the actual critical section is
> small and likely to complete within the extension time, which is usually
> only true for highly optimized code and not for general usage, where the
> lock held section is arbitrary long and might even result in syscalls
> even if the critical section itself does not have an obvious explicit
> syscall embedded:
> 
>       lock(a)
>          lock(b) <- Contention results in syscall

Nested locking is another scenario where _typically_ we'd want the
slice extension for the outer lock if it is expected to be a short
critical section, and sometimes hit futex while the extension is granted
and clear the grant if this happens without killing the process.

> 
> Same applies for library functions within a critical section.

Yes.

> 
> That then immediately conflicts with the yield mechanism rules, because
> the extension could have been granted _before_ the syscall happens, so
> we'd have remove that restriction too.

Yes.

> 
> That said, we can make the ABI a counter and split the slice control
> word into two u16. So the decision function would be:
> 
>       get_usr(ctrl);
>       if (!ctrl.request)
>       	return;
>       ....
>       ctrl.granted = 1;
>       put_usr(ctrl);
> 
> Along with documentation why this should only be used nested when you
> know what you are doing.

Yes.

This would turn the end of critical section into a
decrement-and-test-for-zero. It's only when the request counter
decrements back to zero that userspace should handle the granted
flag and yield.

> 
>> 3) Slice requests are also a good fit for rseq critical sections.
>>      Of course someone could explicitly increment/decrement the
>>      slice request counter before/after the rseq critical sections, but
>>      I think we could do better there and integrate this directly within
>>      the struct rseq_cs as a new critical section flag. Basically, a
>>      critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>>      better name) set within its descriptor flags would behave as if
>>      the slice request counter is non-zero when preempted without
>>      requiring any extra instruction on the fast path. The only
>>      added overhead would be a check of the rseq->slice_grant flag
>>      when exiting the critical section to conditionally issue
>>      rseq_slice_yield().
> 
> Plus checking first whether rseq->slice.request is actually zero,
> i.e. whether the rseq critical section was the outermost one. If not,
> you cannot invoke the yield even if granted is true, right?

Right.

> 
> But mixing state spaces is not really a good idea at all. Let's not go
> there.

I agree, let's keep this (3) for later if there is a strong use-case
justifying the complexity.

What is important for right now though is to figure out the behavior
with respect to an ongoing rseq critical section when a time slice
extension is granted: is the rseq critical section aborted or does
it keep going on return to userspace ?

> 
> Also you'd make checking of rseq_cs unconditional, which means extra
> work in the grant decision function as it would then have to do:
> 
>           if (!usr->slice.ctrl.request) {
>              if (!usr->rseq_cs)
>                 return;
>              if (!valid_ptr(usr->rseq_cs))
>                 goto die;
>              if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
>                 return;
>              if (!(use->rseq_cs.flags & REQUEST))
>                 return;
>           }
> 
> IOW, we'd copy half of the rseq cs handling into that code.
> 
> Can we please keep it independent and simple?

Of course.

So in summary, here is my current understanding:

- It would be good to support nested slice-extension requests,

- It would be preferable to allow arbitrary system calls to
   cancel an ongoing slice-extension grant rather than kill the
   process if we want the slice-extension feature to be useful
   outside of niche use-cases.

Thoughts ?

Thanks,

Mathieu


> 
> Thanks,
> 
>          tglx


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

  reply	other threads:[~2025-09-12 19:26 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-09-09  0:04   ` Randy Dunlap
2025-09-11 15:41   ` Mathieu Desnoyers
2025-09-11 15:49     ` Mathieu Desnoyers
2025-09-22  5:28   ` Prakash Sangappa
2025-09-22  5:57     ` K Prateek Nayak
2025-09-22 13:57       ` Mathieu Desnoyers
2025-09-22 13:55     ` Mathieu Desnoyers
2025-09-23  0:57       ` Prakash Sangappa
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-09-09  3:10   ` K Prateek Nayak
2025-09-09  4:11     ` Randy Dunlap
2025-09-09 12:12       ` Thomas Gleixner
2025-09-09 16:01         ` Randy Dunlap
2025-09-11 15:42   ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
2025-09-11 15:43   ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
2025-09-11 15:50   ` Mathieu Desnoyers
2025-09-11 16:52     ` K Prateek Nayak
2025-09-11 17:18       ` Mathieu Desnoyers
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-09-09  9:52   ` K Prateek Nayak
2025-09-09 12:23     ` Thomas Gleixner
2025-09-10 11:15   ` K Prateek Nayak
2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-09-10  5:22   ` K Prateek Nayak
2025-09-10  7:49     ` Thomas Gleixner
2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-09-10 11:20   ` K Prateek Nayak
2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-09-09  8:14   ` K Prateek Nayak
2025-09-09 12:16     ` Thomas Gleixner
2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-09-10 11:23   ` K Prateek Nayak
2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-10  4:42   ` K Prateek Nayak
2025-09-10 11:28 ` K Prateek Nayak
2025-09-10 14:50   ` Thomas Gleixner
2025-09-11  3:03     ` K Prateek Nayak
2025-09-11  7:36       ` Prakash Sangappa
2025-09-11 15:27 ` Mathieu Desnoyers
2025-09-11 20:18   ` Thomas Gleixner
2025-09-12 12:33     ` Mathieu Desnoyers
2025-09-12 16:31       ` Thomas Gleixner
2025-09-12 19:26         ` Mathieu Desnoyers [this message]
2025-09-13 13:02           ` Thomas Gleixner
2025-09-19 17:30             ` Prakash Sangappa
2025-09-22 14:09               ` Mathieu Desnoyers
2025-09-23  1:01                 ` Prakash Sangappa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a65dfd2c-b435-4d83-89d0-abc8002db7c7@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=arnd@arndb.de \
    --cc=bigeasy@linutronix.de \
    --cc=boqun.feng@gmail.com \
    --cc=carlos@redhat.com \
    --cc=corbet@lwn.net \
    --cc=fweimer@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=libc-coord@lists.openwall.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=prakash.sangappa@oracle.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=vineethr@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox