public inbox for linux-edac@vger.kernel.org
 help / color / mirror / Atom feed
* [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE
@ 2024-07-23 14:47 Andrew Zaborowski
  2024-07-23 14:47 ` [RESEND][PATCH 2/3] execve: Ensure SIGBUS delivered on memory failure Andrew Zaborowski
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Andrew Zaborowski @ 2024-07-23 14:47 UTC (permalink / raw)
  To: linux-edac, linux-mm
  Cc: Kees Cook, Tony Luck, Eric Biederman, Borislav Petkov

Uncorrected memory errors for user pages are signaled to processes
using SIGBUS or, if the error happens in a syscall, an error retval
from the syscall.  The SIGBUS is documented in
Documentation/mm/hwpoison.rst#failure-recovery-modes

But there are corner cases where we cannot or don't want to return a
plain error from the syscall.  Subsequent commits covers two such cases:
execve and rseq.  Current code, in both places, will kill the task with a
SIGSEGV on error.  While not explicitly stated, it can be argued that it
should be a SIGBUS, for consistency and for the benefit of the userspace
signal handlers.  Even if the process cannot handle the signal, perhaps
the parent process can.  This was the case in the scenario that
motivated this patch.

In both cases, the architecture's exception handler (MCE handler on x86)
will queue a call to memory_failure.  This doesn't work because the
syscall-specific code sees the -EFAULT and terminates the task before
the queued work runs.

To fix this: 1. let pending work run in the error cases in both places.

And 2. on MCE, ensure memory_failure() is passed MF_ACTION_REQUIRED so
that the SIGBUS is queued.  Normally when the MCE is in a syscall,
a fixup of return IP and a call to kill_me_never() are what we want.
But in this case it's necessary to queue kill_me_maybe() which will set
MF_ACTION_REQUIRED which is checked by memory_failure().

To do this the syscall code will set current->kill_on_efault, a new
task_struct flag.  Check that flag in
arch/x86/kernel/cpu/mce/core.c:do_machine_check()

Note: the flag is not x86 specific even if only x86 handling is being
added here.  The definition could be guarded by #ifdef
CONFIG_MEMORY_FAILURE, but it would then need set/clear utilities.

Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
---
Resending through an SMTP server that won't add the company footer.

This is a v2 of
https://lore.kernel.org/linux-mm/20240501015340.3014724-1-andrew.zaborowski@intel.com/
In the v1 the existing flag current->in_execve was being reused instead
of adding a new one.  Kees Cook commented in
https://lore.kernel.org/linux-mm/202405010915.465AF19@keescook/ that
current->in_execve is going away.  Lacking a better idea and seeing
that execve() and rseq() would benefit from using a common mechanism, I
decided to add this new flag.

Perhaps with a better name current->kill_on_efault could replace
brpm->point_of_no_return to offset the pain of having this extra flag.
---
 arch/x86/kernel/cpu/mce/core.c | 18 +++++++++++++++++-
 include/linux/sched.h          |  2 ++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index ad0623b65..13f2ace3d 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1611,7 +1611,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 			if (p)
 				SetPageHWPoison(p);
 		}
-	} else {
+	} else if (!current->kill_on_efault) {
 		/*
 		 * Handle an MCE which has happened in kernel space but from
 		 * which the kernel can recover: ex_has_fault_handler() has
@@ -1628,6 +1628,22 @@ noinstr void do_machine_check(struct pt_regs *regs)
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
 			queue_task_work(&m, msg, kill_me_never);
+	} else {
+		/*
+		 * Even with recovery code extra handling is required when
+		 * we're not returning to userspace after error (e.g. in
+		 * execve() beyond the point of no return) to ensure that
+		 * a SIGBUS is delivered.
+		 */
+		if (m.kflags & MCE_IN_KERNEL_RECOV) {
+			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
+				mce_panic("Failed kernel mode recovery", &m, msg);
+		}
+
+		if (!mce_usable_address(&m))
+			queue_task_work(&m, msg, kill_me_now);
+		else
+			queue_task_work(&m, msg, kill_me_maybe);
 	}
 
 out:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6e..0cde1ba11 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -975,6 +975,8 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
+	/* Kill task on user memory access error */
+	unsigned                        kill_on_efault:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [RESEND][PATCH 3/3] rseq: Ensure SIGBUS delivered on memory failure
@ 2024-08-08  2:33 Andrew Zaborowski
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Zaborowski @ 2024-08-08  2:33 UTC (permalink / raw)
  To: Peter Zijlstra, linux-edac@vger.kernel.org, linux-mm@kvack.org,
	Eric Biederman, Borislav Petkov, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, oleg@redhat.com, Tony Luck

[Sorry if breaking threading]

Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Aug 05, 2024 at 09:37:48PM -0700, Kees Cook wrote:
> > On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
> > > Uncorrected memory errors for user pages are signaled to processes
> > > using SIGBUS or, if the error happens in a syscall, an error retval
> > > from the syscall.  The SIGBUS is documented in
> > > Documentation/mm/hwpoison.rst#failure-recovery-modes
> > >
> > > Once a user task sets t->rseq in the rseq() syscall, if the kernel
> > > cannot access the memory pointed to by t->rseq->rseq_cs, that initial
> > > rseq() and all future syscalls should return an error so understandably
> > > the code just kills the task.
> > >
> > > To ensure that SIGBUS is used set the new t->kill_on_efault flag and
> > > run queued task work on rseq_get_rseq_cs() errors to give memory_failure
> > > the chance to run.
> > >
> > > Note: the rseq checks run inside resume_user_mode_work() so whenever
> > > _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
> > > I'm not concerned that these extra flag operations are in a hot path,
> > > except with CONFIG_DEBUG_RSEQ.
> > >
> > > Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> > > ---
> > >  kernel/rseq.c | 25 +++++++++++++++++++++----
> >
> > Can an rseq maintainer please review this? I can carry it via the execve
> > tree with the related patches...
>
> *sigh*,.. because get_maintainers just doesn't work or something?

Oops, I should have re-run it on the v2.

>
> Anyway, I'm confused by the signal code (as always), why isn't the
> task_work_run() in get_signal() sufficient?

Ok, good point, I only considered the task_work_run() call inside
resume_user_mode_work() which was after the signals were already
delivered.

So the reason it's not sufficient seems to be that, given a SIGSEGV
and a SIGBUS, dequeue_synchronous_signal() returns the one that was
queued first i.e. the SIGSEGV.

dequeue_signal() would have returned the one with lower value, which
would have been the SIGBUS.

>
> At some point we're going to run into trouble with sprinkling
> task_work_run() around willy nilly :/

Right, this isn't ideal.

get_signal has this comment:

/*
 * Signals generated by the execution of an instruction
 * need to be delivered before any other pending signals
 * so that the instruction pointer in the signal stack
 * frame points to the faulting instruction.
 */
...
signr = dequeue_synchronous_signal(&ksig->info);
if (!signr)
    signr = dequeue_signal(current, &current->blocked, &ksig->info, &type);

In this case the two signals will have the same userspace IP.  The
SIGBUS will have a non-NULL .si_addr and is more specific.

I wonder if dequeue_synchronous_signal should have extra logic to
prefer the SIGBUS, or if memory_failure() should try to put the SIGBUS
at the top of current->pending.list.  In any case the code would need
to be careful when doing this and I can't think of the exact
conditions to be checked right now.

Best regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-08-10  9:24 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-23 14:47 [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Andrew Zaborowski
2024-07-23 14:47 ` [RESEND][PATCH 2/3] execve: Ensure SIGBUS delivered on memory failure Andrew Zaborowski
2024-07-23 14:47 ` [RESEND][PATCH 3/3] rseq: " Andrew Zaborowski
2024-08-06  4:37   ` Kees Cook
2024-08-06  7:51     ` Peter Zijlstra
2024-08-06 14:19       ` Mathieu Desnoyers
2024-08-06  4:36 ` [RESEND][PATCH 1/3] x86: Add task_struct flag to force SIGBUS on MCE Kees Cook
2024-08-06  8:35   ` Borislav Petkov
     [not found]     ` <SA1PR11MB69926BFE8EFDA7B3C3D84560E7B82@SA1PR11MB6992.namprd11.prod.outlook.com>
2024-08-08  0:01       ` Andrew Zaborowski
2024-08-08 14:56         ` Borislav Petkov
2024-08-09  1:22           ` Andrew Zaborowski
2024-08-09  8:34             ` Borislav Petkov
     [not found]               ` <SA1PR11MB69927AE28B46583DCB5C97DEE7BA2@SA1PR11MB6992.namprd11.prod.outlook.com>
2024-08-10  1:20                 ` Andrew Zaborowski
2024-08-10  3:21                   ` Borislav Petkov
2024-08-10  3:55                     ` Andrew Zaborowski
2024-08-10  9:25                       ` Borislav Petkov
  -- strict thread matches above, loose matches on Subject: below --
2024-08-08  2:33 [RESEND][PATCH 3/3] rseq: Ensure SIGBUS delivered on memory failure Andrew Zaborowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox