[PATCH] rseq: don't promote transient TLS faults to SIGSEGV

* [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
@ 2026-06-08  2:15 Yuanhe Shu
  2026-06-08  8:29 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Yuanhe Shu @ 2026-06-08  2:15 UTC
  To: Mathieu Desnoyers, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, Thomas Gleixner, linux-kernel,
	Yuanhe Shu

On return to user space the rseq slow path writes the new cpu_id /
mm_cid into the user-space rseq TLS. rseq_update_usr() already
classifies its failures in rseq_event::fatal: the flag is set only
when corrupt user data is positively identified (e.g. a bad rseq_cs
signature or an out-of-bounds abort IP) and stays clear when the
access merely hit an unresolved page fault.

rseq_slowpath_update_usr() ignores that and calls force_sig(SIGSEGV)
on any failure, so a transient page fault on a still-registered rseq
area becomes a fatal SIGSEGV. This is reachable since glibc >= 2.35
registers rseq for every thread by default: a memcg OOM victim can die
of SIGSEGV (si_code=SI_KERNEL, si_addr=NULL) shortly after fork,
before returning to user space, because the CoW of the inherited TLS
page cannot be charged to the OOM-locked memcg and the rseq write
faults.

With oom_score_adj=-1000 the OOM killer finds no killable task, so the
rseq SIGSEGV is the sole outcome; otherwise the rseq SIGSEGV can be
delivered before the OOM killer queues SIGKILL, and the process exits
139 instead of 137, breaking OOMKilled detection in container
runtimes. LTP mm/oom03 and mm/oom05 reproduce it on v7.1-rc6+, and a
strace A/B with glibc.pthread.rseq as the sole variable shows the
SIGSEGV only when rseq is registered.

Only raise SIGSEGV when rseq_event::fatal is set. A non-fatal fault
leaves the cached IDs untouched and is retried on a later return to
user; a genuinely unmapped area keeps faulting and user space takes
SIGSEGV through its own access. All corruption and ROP-hardening
checks keep their SIGSEGV.

Signal delivery is left untouched: it must abort the interrupted
critical section before the handler runs and therefore cannot safely
defer a fault.

Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
---
Tested on v7.1-rc6+ (vanilla):
 - LTP mm/oom03 (14/14) and mm/oom05 (8/8): pass with the patch (the
   victim is reaped with SIGKILL); without it the rseq SIGSEGV makes
   the same cases fail.
 - strace A/B on the oom03 binary with glibc.pthread.rseq as the sole
   variable: 2 SIGSEGV (SI_KERNEL, si_addr=NULL) with rseq registered,
   0 without -- isolates the cause to the rseq slow path.
 - tools/testing/selftests/rseq: run_param_test.sh,
   run_syscall_errors_test.sh, run_legacy_check.sh and
   run_timeslice_test.sh all pass.
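
For context on the 139 versus 137 exit statuses mentioned in the description: both follow the common shell convention of 128 plus the terminating signal number (SIGSEGV is 11 and SIGKILL is 9 on Linux). The following is a minimal user-space sketch of how a supervising parent might derive that value. The helper names `shell_style_exit_code()` and `exit_code_of_child_killed_by()` are hypothetical and are not taken from any container runtime.

```c
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map a waitpid() status to the "128 + signo" convention. */
int shell_style_exit_code(int wstatus)
{
	if (WIFSIGNALED(wstatus))
		return 128 + WTERMSIG(wstatus);
	if (WIFEXITED(wstatus))
		return WEXITSTATUS(wstatus);
	return -1;	/* stopped or continued, not a termination */
}

/* Hypothetical helper: fork a child that dies from 'sig' and report its code. */
int exit_code_of_child_killed_by(int sig)
{
	int wstatus;
	pid_t pid;

	assert(sig > 0);

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: avoid leaving a core file, then die from the signal. */
		struct rlimit no_core = { 0, 0 };

		setrlimit(RLIMIT_CORE, &no_core);
		signal(sig, SIG_DFL);	/* fails harmlessly for SIGKILL */
		raise(sig);
		_exit(0);		/* only reached if the signal was not fatal */
	}

	if (waitpid(pid, &wstatus, 0) < 0)
		return -1;

	return shell_style_exit_code(wstatus);
}
```

A parent that only checks for the SIGKILL-derived value would therefore classify a child killed by SIGSEGV differently, which is the mismatch the description refers to.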

 kernel/rseq.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index e75e3a5e312c..38a19cef4ad0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -302,11 +302,18 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)

 	if (unlikely(!rseq_update_usr(t, regs, &ids))) {
 		/*
-		 * Clear the errors just in case this might survive magically, but
-		 * leave the rest intact.
+		 * rseq_update_usr() sets rseq_event::fatal only on corrupt
+		 * user data, which keeps its SIGSEGV. A clear fatal bit is an
+		 * unresolved page fault on a still-registered rseq area (e.g.
+		 * a CoW that cannot be charged to an OOM-locked memcg): that
+		 * is transient, so leave the cached IDs untouched and retry on
+		 * a later return to user instead of killing the task.
 		 */
+		bool fatal = t->rseq.event.fatal;
+
 		t->rseq.event.error = 0;
-		force_sig(SIGSEGV);
+		if (fatal)
+			force_sig(SIGSEGV);
 	}
 }

-- 
2.39.5 (Apple Git-154)

* Re: [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
  2026-06-08  2:15 [PATCH] rseq: don't promote transient TLS faults to SIGSEGV Yuanhe Shu
@ 2026-06-08  8:29 ` Peter Zijlstra
  2026-06-08  9:15 ` Thomas Gleixner
  2026-06-08 12:52 ` Mathieu Desnoyers
  2 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2026-06-08  8:29 UTC
  To: Yuanhe Shu
  Cc: Mathieu Desnoyers, Paul E . McKenney, Boqun Feng, Thomas Gleixner,
	linux-kernel

On Mon, Jun 08, 2026 at 10:15:53AM +0800, Yuanhe Shu wrote:
> On return to user space the rseq slow path writes the new cpu_id /
> mm_cid into the user-space rseq TLS. rseq_update_usr() already
> classifies its failures in rseq_event::fatal: the flag is set only
> when corrupt user data is positively identified (e.g. a bad rseq_cs
> signature or an out-of-bounds abort IP) and stays clear when the
> access merely hit an unresolved page fault.
> 
> rseq_slowpath_update_usr() ignores that and calls force_sig(SIGSEGV)
> on any failure, so a transient page fault on a still-registered rseq
> area becomes a fatal SIGSEGV. This is reachable since glibc >= 2.35
> registers rseq for every thread by default: a memcg OOM victim can die
> of SIGSEGV (si_code=SI_KERNEL, si_addr=NULL) shortly after fork,
> before returning to user space, because the CoW of the inherited TLS
> page cannot be charged to the OOM-locked memcg and the rseq write
> faults.
> 
> With oom_score_adj=-1000 the OOM killer finds no killable task, so the
> rseq SIGSEGV is the sole outcome; otherwise the rseq SIGSEGV can be
> delivered before the OOM killer queues SIGKILL, and the process exits
> 139 instead of 137, breaking OOMKilled detection in container
> runtimes. LTP mm/oom03 and mm/oom05 reproduce it on v7.1-rc6+, and a
> strace A/B with glibc.pthread.rseq as the sole variable shows the
> SIGSEGV only when rseq is registered.
> 
> Only raise SIGSEGV when rseq_event::fatal is set. A non-fatal fault
> leaves the cached IDs untouched and is retried on a later return to
> user; a genuinely unmapped area keeps faulting and user space takes
> SIGSEGV through its own access. All corruption and ROP-hardening
> checks keep their SIGSEGV.

But this will return to userspace with invalid (not updated) rseq
values. This can lead to data corruption.

If we cannot write new rseq values on return to userspace, we must not
return -- it really is that simple.
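
To illustrate why stale rseq values are dangerous, here is a toy model; it is not kernel or librseq code. It assumes a per-CPU counter whose slots are updated with a plain (non-atomic) load and store, which is only safe if `cpu_id` in the rseq area matches the CPU the thread really runs on. All `toy_*` names are hypothetical, and the real rseq abort-on-preemption mechanism is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 2

/* Hypothetical model of one thread; only the fields needed here. */
struct toy_thread {
	uint32_t running_on;	/* CPU the thread really executes on */
	uint32_t rseq_cpu_id;	/* value user space reads from its rseq area */
	long loaded;		/* "register" holding the value read from the slot */
};

/* First half of a non-atomic increment: read the slot chosen by cpu_id. */
static void toy_load(const long *slots, struct toy_thread *t)
{
	assert(t->rseq_cpu_id < TOY_NR_CPUS);
	t->loaded = slots[t->rseq_cpu_id];
}

/* Second half: write back the incremented value to the same slot. */
static void toy_store(long *slots, const struct toy_thread *t)
{
	assert(t->rseq_cpu_id < TOY_NR_CPUS);
	slots[t->rseq_cpu_id] = t->loaded + 1;
}

/*
 * Thread a stays on CPU 0. Thread b was migrated from CPU 0 to CPU 1.
 * Both increment "their" slot once, truly in parallel (interleaved here).
 * Returns the sum over all slots, which should be 2.
 */
long toy_two_increments(int kernel_updated_rseq)
{
	long slots[TOY_NR_CPUS] = { 0, 0 };
	struct toy_thread a = { .running_on = 0, .rseq_cpu_id = 0 };
	struct toy_thread b = {
		.running_on = 1,
		/* Stale value 0 if the kernel returned without updating. */
		.rseq_cpu_id = kernel_updated_rseq ? 1 : 0,
	};

	toy_load(slots, &a);
	toy_load(slots, &b);
	toy_store(slots, &a);
	toy_store(slots, &b);

	return slots[0] + slots[1];
}
```

In this model `toy_two_increments(1)` yields 2, while `toy_two_increments(0)` yields 1: with a stale `cpu_id` both threads write the same slot from different CPUs and one increment is lost.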

* Re: [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
  2026-06-08  2:15 [PATCH] rseq: don't promote transient TLS faults to SIGSEGV Yuanhe Shu
  2026-06-08  8:29 ` Peter Zijlstra
@ 2026-06-08  9:15 ` Thomas Gleixner
  2026-06-08 12:52 ` Mathieu Desnoyers
  2 siblings, 0 replies; 5+ messages in thread
From: Thomas Gleixner @ 2026-06-08  9:15 UTC
  To: Yuanhe Shu, Mathieu Desnoyers, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, linux-kernel, Yuanhe Shu

On Mon, Jun 08 2026 at 10:15, Yuanhe Shu wrote:
> On return to user space the rseq slow path writes the new cpu_id /
> mm_cid into the user-space rseq TLS. rseq_update_usr() already
> classifies its failures in rseq_event::fatal: the flag is set only
> when corrupt user data is positively identified (e.g. a bad rseq_cs
> signature or an out-of-bounds abort IP) and stays clear when the
> access merely hit an unresolved page fault.
>
> rseq_slowpath_update_usr() ignores that and calls force_sig(SIGSEGV)
> on any failure, so a transient page fault on a still-registered rseq
> area becomes a fatal SIGSEGV. This is reachable since glibc >= 2.35

It's not transient.

rseq_slowpath_update_usr() does the full pagefault resolution, which
means if that returns without resolving the fault, then it's game over.

We also cannot return to user space in that case because the rseq area,
which is not accessible, has not been updated.

Thanks,

        tglx

* Re: [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
  2026-06-08  2:15 [PATCH] rseq: don't promote transient TLS faults to SIGSEGV Yuanhe Shu
  2026-06-08  8:29 ` Peter Zijlstra
  2026-06-08  9:15 ` Thomas Gleixner
@ 2026-06-08 12:52 ` Mathieu Desnoyers
  2026-06-08 22:20   ` Thomas Gleixner
  2 siblings, 1 reply; 5+ messages in thread
From: Mathieu Desnoyers @ 2026-06-08 12:52 UTC
  To: Yuanhe Shu, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, Thomas Gleixner, linux-kernel

On 2026-06-07 22:15, Yuanhe Shu wrote:
> With oom_score_adj=-1000 the OOM killer finds no killable task, so the
> rseq SIGSEGV is the sole outcome; otherwise the rseq SIGSEGV can be
> delivered before the OOM killer queues SIGKILL, and the process exits
> 139 instead of 137, breaking OOMKilled detection in container
> runtimes

As Peter and Thomas said, this is not transient. We simply cannot return
to userspace with an out-of-date value.

It looks like an issue with the choice of which signal should be
delivered in priority: rseq force signal enqueues SIGSEGV, and you
would expect the OOM killer to issue SIGKILL, and somehow it's the
forced SIGSEGV that wins.

Perhaps look into fixing that instead if you really care about which
signal is emitted ? (and that's a big _if_)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

* Re: [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
  2026-06-08 12:52 ` Mathieu Desnoyers
@ 2026-06-08 22:20   ` Thomas Gleixner
  0 siblings, 0 replies; 5+ messages in thread
From: Thomas Gleixner @ 2026-06-08 22:20 UTC
  To: Mathieu Desnoyers, Yuanhe Shu, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, linux-kernel, Michal Hocko,
	David Rientjes, Shakeel Butt, linux-mm

On Mon, Jun 08 2026 at 08:52, Mathieu Desnoyers wrote:
> On 2026-06-07 22:15, Yuanhe Shu wrote:
>> With oom_score_adj=-1000 the OOM killer finds no killable task, so the
>> rseq SIGSEGV is the sole outcome; otherwise the rseq SIGSEGV can be
>> delivered before the OOM killer queues SIGKILL, and the process exits
>> 139 instead of 137, breaking OOMKilled detection in container
>> runtimes
> As Peter and Thomas said, this is not transient. We simply cannot return
> to userspace with an out-of-date value.
>
> It looks like an issue with the choice of which signal should be
> delivered in priority: rseq force signal enqueues SIGSEGV, and you
> would expect the OOM killer to issue SIGKILL, and somehow it's the
> forced SIGSEGV that wins.
>
> Perhaps look into fixing that instead if you really care about which
> signal is emitted ? (and that's a big _if_)

It's even worse. The proposed patch is actually creating an endless
loop unless there is really a signal pending at some point.

exit_to_user()
   rseq_update_usr();  // faults and defers the fault handling to rseq_slowpath_update_usr()

rseq_slowpath_update_usr()
   rseq_update_usr();  // Faults again and the fault cannot be resolved

   if (!fatal)         // Proposed solution....
      return;

So if there is no signal queued, then this will end up in exit_to_user()
again, which faults and defers the fault handling to
rseq_slowpath_update_usr() again, which just goes on in circles.
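
The loop described above can be modeled with a small user-space toy. This is only a sketch under stated assumptions: the `toy_*` names are hypothetical, the real exit-to-user path does far more work, and the unresolved fault is modeled as a permanently inaccessible rseq area that never sets the fatal bit. A pass limit stands in for "forever".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the task state relevant to the loop. */
struct toy_task {
	bool rseq_area_accessible;	/* can the kernel write the rseq area? */
	bool signal_pending;		/* is any signal queued for the task? */
	bool killed;			/* did the slow path force SIGSEGV? */
};

/* Models rseq_update_usr(): succeeds only if the area can be written. */
static bool toy_rseq_update_usr(const struct toy_task *t)
{
	return t->rseq_area_accessible;
}

/* Models the slow path, with or without the proposed "only if fatal" check. */
static void toy_slowpath_update_usr(struct toy_task *t, bool only_kill_on_fatal)
{
	if (toy_rseq_update_usr(t))
		return;

	/* Assumption: an unresolved page fault never sets the fatal bit. */
	if (!only_kill_on_fatal) {
		t->killed = true;
		t->signal_pending = true;
	}
}

/*
 * Models the exit-to-user loop. Returns the pass on which the task left
 * the loop, or -1 if it was still looping after max_passes.
 */
int toy_exit_to_user(struct toy_task *t, bool only_kill_on_fatal, int max_passes)
{
	assert(max_passes > 0);

	for (int pass = 1; pass <= max_passes; pass++) {
		if (toy_rseq_update_usr(t))
			return pass;	/* area updated, safe to return */

		toy_slowpath_update_usr(t, only_kill_on_fatal);

		if (t->signal_pending)
			return pass;	/* signal handling ends the loop */
	}

	return -1;
}
```

With the current behavior (`only_kill_on_fatal == false`) the first pass forces the signal and the loop ends. With the proposed behavior and no other signal pending, the model keeps going until `max_passes` is exhausted.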

IOW, this would create an unprivileged DoS attack - not a fatal one,
but at least one which eats up a full time slice in the kernel
forever. Use enough tasks, which register a rseq region and unregister it
after returning to user space ....

So no. And this comment in the patch does not make any sense at all:

> +		 * rseq_update_usr() sets rseq_event::fatal only on corrupt
> +		 * user data, which keeps its SIGSEGV. A clear fatal bit is an
> +		 * unresolved page fault on a still-registered rseq area (e.g.
> +		 * a CoW that cannot be charged to an OOM-locked memcg): that
> +		 * is transient, so leave the cached IDs untouched and retry on
> +		 * a later return to user instead of killing the task.

If the page fault handler fails to wait until the OOM locked memcg
figured out what to do, then that's a clear violation of expectation
vs. resolving a page fault in the context of user/kernel shared memory
with ABI constraints. But definitely not some transient failure which
can be hand waved away.

Not that it matters much whether the task dies from SIGSEGV or SIGKILL,
but that's clearly not a problem which can be papered over in the rseq
code.

Thanks,

        tglx
