* [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
From: Yuanhe Shu @ 2026-06-08 2:15 UTC
To: Mathieu Desnoyers, Peter Zijlstra
Cc: Paul E . McKenney, Boqun Feng, Thomas Gleixner, linux-kernel,
    Yuanhe Shu
On return to user space the rseq slow path writes the new cpu_id /
mm_cid into the user-space rseq TLS. rseq_update_usr() already
classifies its failures in rseq_event::fatal: the flag is set only
when corrupt user data is positively identified (e.g. a bad rseq_cs
signature or an out-of-bounds abort IP) and stays clear when the
access merely hit an unresolved page fault.

rseq_slowpath_update_usr() ignores that and calls force_sig(SIGSEGV)
on any failure, so a transient page fault on a still-registered rseq
area becomes a fatal SIGSEGV. This is reachable since glibc >= 2.35
registers rseq for every thread by default: a memcg OOM victim can die
of SIGSEGV (si_code=SI_KERNEL, si_addr=NULL) shortly after fork,
before returning to user space, because the CoW of the inherited TLS
page cannot be charged to the OOM-locked memcg and the rseq write
faults.

With oom_score_adj=-1000 the OOM killer finds no killable task, so the
rseq SIGSEGV is the sole outcome; otherwise the rseq SIGSEGV can be
delivered before the OOM killer queues SIGKILL, and the process exits
139 instead of 137, breaking OOMKilled detection in container
runtimes. LTP mm/oom03 and mm/oom05 reproduce it on v7.1-rc6+, and a
strace A/B with glibc.pthread.rseq as the sole variable shows the
SIGSEGV only when rseq is registered.
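
[Illustration, not part of the patch] The exit codes 139 and 137 above follow from the shell convention of reporting 128 plus the number of the terminating signal; on Linux SIGSEGV is 11 and SIGKILL is 9. A minimal user-space sketch of that mapping is shown here. The helper names shell_exit_code() and exit_code_when_killed_by() are hypothetical, and the child simply raises the signal on itself, so neither rseq nor the OOM killer is involved.

```c
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a wait(2) status to the exit code a POSIX shell reports in $?. */
int shell_exit_code(int wstatus)
{
	if (WIFSIGNALED(wstatus))
		return 128 + WTERMSIG(wstatus);	/* killed by a signal */
	if (WIFEXITED(wstatus))
		return WEXITSTATUS(wstatus);	/* normal exit */
	return -1;				/* stopped/continued: not final */
}

/*
 * Fork a child that terminates itself with @signo and return the exit
 * code a shell would show for it, or -1 if fork/waitpid failed.
 */
int exit_code_when_killed_by(int signo)
{
	struct rlimit no_core = { 0, 0 };
	int wstatus;
	pid_t pid;

	assert(signo > 0);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		setrlimit(RLIMIT_CORE, &no_core);	/* avoid leaving a core file */
		signal(signo, SIG_DFL);			/* no-op error for SIGKILL */
		raise(signo);
		_exit(0);	/* only reached if the signal was not fatal */
	}
	if (waitpid(pid, &wstatus, 0) != pid)
		return -1;
	return shell_exit_code(wstatus);
}
```

Calling exit_code_when_killed_by(SIGSEGV) and exit_code_when_killed_by(SIGKILL) yields the two values a supervisor would have to tell apart; the sketch says nothing about which of the two signals the kernel delivers first.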
Only raise SIGSEGV when rseq_event::fatal is set. A non-fatal fault
leaves the cached IDs untouched and is retried on a later return to
user; a genuinely unmapped area keeps faulting and user space takes
SIGSEGV through its own access. All corruption and ROP-hardening
checks keep their SIGSEGV.

Signal delivery is left untouched: it must abort the interrupted
critical section before the handler runs and therefore cannot safely
defer a fault.

Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
---
Tested on v7.1-rc6+ (vanilla):

- LTP mm/oom03 (14/14) and mm/oom05 (8/8): pass with the patch (the
  victim is reaped with SIGKILL); without it the rseq SIGSEGV makes
  the same cases fail.
- strace A/B on the oom03 binary with glibc.pthread.rseq as the sole
  variable: 2 SIGSEGV (SI_KERNEL, si_addr=NULL) with rseq registered,
  0 without -- isolates the cause to the rseq slow path.
- tools/testing/selftests/rseq: run_param_test.sh,
  run_syscall_errors_test.sh, run_legacy_check.sh and
  run_timeslice_test.sh all pass.

kernel/rseq.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index e75e3a5e312c..38a19cef4ad0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -302,11 +302,18 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)
 	if (unlikely(!rseq_update_usr(t, regs, &ids))) {
 		/*
-		 * Clear the errors just in case this might survive magically, but
-		 * leave the rest intact.
+		 * rseq_update_usr() sets rseq_event::fatal only on corrupt
+		 * user data, which keeps its SIGSEGV. A clear fatal bit is an
+		 * unresolved page fault on a still-registered rseq area (e.g.
+		 * a CoW that cannot be charged to an OOM-locked memcg): that
+		 * is transient, so leave the cached IDs untouched and retry on
+		 * a later return to user instead of killing the task.
 		 */
+		bool fatal = t->rseq.event.fatal;
+
 		t->rseq.event.error = 0;
-		force_sig(SIGSEGV);
+		if (fatal)
+			force_sig(SIGSEGV);
 	}
 }
--
2.39.5 (Apple Git-154)
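
[Illustration, not part of the patch] The behavioural change in the hunk above can be summarised as a small decision table. The following is a user-space model only: the enum, its values and model_slowpath() are hypothetical names, and the two boolean inputs stand in for the return value of rseq_update_usr() and for rseq_event::fatal as described in the commit message.

```c
#include <stdbool.h>

/* Possible results of the slow path in this toy model. */
enum slowpath_outcome {
	UPDATED_RETURN_TO_USER,		/* rseq area written, task continues */
	SIGSEGV_FORCED,			/* force_sig(SIGSEGV) is called */
	NOT_UPDATED_RETURN_TO_USER,	/* rseq area not written, task continues */
};

/*
 * @update_ok: models rseq_update_usr() succeeding
 * @fatal:     models rseq_event::fatal being set on failure
 * @patched:   false = behaviour before the patch, true = with the patch
 */
enum slowpath_outcome model_slowpath(bool update_ok, bool fatal, bool patched)
{
	if (update_ok)
		return UPDATED_RETURN_TO_USER;
	/* Before the patch every failure is fatal; with it only flagged ones. */
	if (!patched || fatal)
		return SIGSEGV_FORCED;
	return NOT_UPDATED_RETURN_TO_USER;
}
```

The only input combination whose result differs between the two variants is a failed update with the fatal flag clear: unpatched it forces SIGSEGV, patched it continues to user space without the rseq area having been written.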
* Re: [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
From: Peter Zijlstra @ 2026-06-08 8:29 UTC
To: Yuanhe Shu
Cc: Mathieu Desnoyers, Paul E . McKenney, Boqun Feng, Thomas Gleixner,
linux-kernel
On Mon, Jun 08, 2026 at 10:15:53AM +0800, Yuanhe Shu wrote:
> On return to user space the rseq slow path writes the new cpu_id /
> mm_cid into the user-space rseq TLS. rseq_update_usr() already
> classifies its failures in rseq_event::fatal: the flag is set only
> when corrupt user data is positively identified (e.g. a bad rseq_cs
> signature or an out-of-bounds abort IP) and stays clear when the
> access merely hit an unresolved page fault.
>
> rseq_slowpath_update_usr() ignores that and calls force_sig(SIGSEGV)
> on any failure, so a transient page fault on a still-registered rseq
> area becomes a fatal SIGSEGV. This is reachable since glibc >= 2.35
> registers rseq for every thread by default: a memcg OOM victim can die
> of SIGSEGV (si_code=SI_KERNEL, si_addr=NULL) shortly after fork,
> before returning to user space, because the CoW of the inherited TLS
> page cannot be charged to the OOM-locked memcg and the rseq write
> faults.
>
> With oom_score_adj=-1000 the OOM killer finds no killable task, so the
> rseq SIGSEGV is the sole outcome; otherwise the rseq SIGSEGV can be
> delivered before the OOM killer queues SIGKILL, and the process exits
> 139 instead of 137, breaking OOMKilled detection in container
> runtimes. LTP mm/oom03 and mm/oom05 reproduce it on v7.1-rc6+, and a
> strace A/B with glibc.pthread.rseq as the sole variable shows the
> SIGSEGV only when rseq is registered.
>
> Only raise SIGSEGV when rseq_event::fatal is set. A non-fatal fault
> leaves the cached IDs untouched and is retried on a later return to
> user; a genuinely unmapped area keeps faulting and user space takes
> SIGSEGV through its own access. All corruption and ROP-hardening
> checks keep their SIGSEGV.

But this will return to userspace with invalid (not updated) rseq
values. This can lead to data corruption.

If we cannot write new rseq values on return to userspace, we must not
return -- it really is that simple.
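
[Illustration, not part of the original message] One way a stale cpu_id can corrupt data is when user space uses it to index per-CPU state. The sketch below is a deterministic single-threaded toy model, not real rseq code: struct toy_rseq, percpu_count and interleaved_increment() are hypothetical names, and the function replays one fixed interleaving of two plain (non-atomic) increments by threads A and B.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical stand-in for the user-visible rseq area; only cpu_id is modelled. */
struct toy_rseq {
	uint32_t cpu_id;	/* CPU number as last written by the kernel */
};

/* Per-CPU counters: safe only if each slot is used by one CPU at a time. */
uint64_t percpu_count[NR_CPUS];

/*
 * Replay "A loads, B loads, A stores, B stores" with each thread picking
 * its slot from its own cpu_id. Returns the sum of all counters, which
 * is 2 when no increment was lost.
 */
uint64_t interleaved_increment(const struct toy_rseq *a, const struct toy_rseq *b)
{
	uint64_t a_val, b_val, sum = 0;
	int i;

	assert(a->cpu_id < NR_CPUS && b->cpu_id < NR_CPUS);

	for (i = 0; i < NR_CPUS; i++)
		percpu_count[i] = 0;

	a_val = percpu_count[a->cpu_id];	/* A loads its slot */
	b_val = percpu_count[b->cpu_id];	/* B loads its slot */
	percpu_count[a->cpu_id] = a_val + 1;	/* A stores */
	percpu_count[b->cpu_id] = b_val + 1;	/* B stores */

	for (i = 0; i < NR_CPUS; i++)
		sum += percpu_count[i];
	return sum;
}
```

With A on CPU 0 and B on CPU 1, distinct cpu_id values give a sum of 2. If B migrated from CPU 0 to CPU 1 but its cpu_id still reads 0, both threads hit slot 0 and one increment is lost.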
* Re: [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
From: Thomas Gleixner @ 2026-06-08 9:15 UTC
To: Yuanhe Shu, Mathieu Desnoyers, Peter Zijlstra
Cc: Paul E . McKenney, Boqun Feng, linux-kernel, Yuanhe Shu
On Mon, Jun 08 2026 at 10:15, Yuanhe Shu wrote:
> On return to user space the rseq slow path writes the new cpu_id /
> mm_cid into the user-space rseq TLS. rseq_update_usr() already
> classifies its failures in rseq_event::fatal: the flag is set only
> when corrupt user data is positively identified (e.g. a bad rseq_cs
> signature or an out-of-bounds abort IP) and stays clear when the
> access merely hit an unresolved page fault.
>
> rseq_slowpath_update_usr() ignores that and calls force_sig(SIGSEGV)
> on any failure, so a transient page fault on a still-registered rseq
> area becomes a fatal SIGSEGV. This is reachable since glibc >= 2.35

It's not transient.

rseq_slowpath_update_usr() does the full pagefault resolution, which
means if that returns without resolving the fault, then it's game over.

We also cannot return to user space in that case because the rseq area,
which is not accessible, has not been updated.

Thanks,

        tglx
* Re: [PATCH] rseq: don't promote transient TLS faults to SIGSEGV
From: Mathieu Desnoyers @ 2026-06-08 12:52 UTC
To: Yuanhe Shu, Peter Zijlstra
Cc: Paul E . McKenney, Boqun Feng, Thomas Gleixner, linux-kernel
On 2026-06-07 22:15, Yuanhe Shu wrote:
> With oom_score_adj=-1000 the OOM killer finds no killable task, so the
> rseq SIGSEGV is the sole outcome; otherwise the rseq SIGSEGV can be
> delivered before the OOM killer queues SIGKILL, and the process exits
> 139 instead of 137, breaking OOMKilled detection in container
> runtimes

As Peter and Thomas said, this is not transient. We simply cannot return
to userspace with an out-of-date value.

It looks like an issue with the choice of which signal should be
delivered in priority: rseq force signal enqueues SIGSEGV, and you
would expect the OOM killer to issue SIGKILL, and somehow it's the
forced SIGSEGV that wins.

Perhaps look into fixing that instead if you really care about which
signal is emitted? (and that's a big _if_)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com