Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Andy Lutomirski" <luto@kernel.org>
To: "Pasha Tatashin" <pasha.tatashin@soleen.com>
Cc: "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, "Andrew Morton" <akpm@linux-foundation.org>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	"Borislav Petkov" <bp@alien8.de>,
	"Christian Brauner" <brauner@kernel.org>,
	bristot@redhat.com, "Ben Segall" <bsegall@google.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	dianders@chromium.org, dietmar.eggemann@arm.com,
	eric.devolder@oracle.com, hca@linux.ibm.com,
	"hch@infradead.org" <hch@infradead.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Jacob Pan" <jacob.jun.pan@linux.intel.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	jpoimboe@kernel.org, "Joerg Roedel" <jroedel@suse.de>,
	juri.lelli@redhat.com,
	"Kent Overstreet" <kent.overstreet@linux.dev>,
	kinseyho@google.com,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	lstoakes@gmail.com, mgorman@suse.de, mic@digikod.net,
	michael.christie@oracle.com, "Ingo Molnar" <mingo@redhat.com>,
	mjguzik@gmail.com, "Michael S. Tsirkin" <mst@redhat.com>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Petr Mladek" <pmladek@suse.com>,
	"Rick P Edgecombe" <rick.p.edgecombe@intel.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Uladzislau Rezki" <urezki@gmail.com>,
	vincent.guittot@linaro.org, vschneid@redhat.com
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks
Date: Mon, 11 Mar 2024 16:34:11 -0700	[thread overview]
Message-ID: <1ac305b1-d28f-44f6-88e5-c85d9062f9e8@app.fastmail.com> (raw)
In-Reply-To: <CA+CK2bA22AP2jrbHjdN8nYFbYX2xJXQt+=4G3Rjw_Lyn5NOyKA@mail.gmail.com>

On Mon, Mar 11, 2024, at 4:10 PM, Pasha Tatashin wrote:
> On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>>
>>
>> On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
>> > Add dynamic_stack_fault() calls to the kernel faults, and also declare
>> > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
>> > enabled on x86 architecture.
>> >
>> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> > ---
>> >  arch/x86/Kconfig        | 1 +
>> >  arch/x86/kernel/traps.c | 3 +++
>> >  arch/x86/mm/fault.c     | 3 +++
>> >  3 files changed, 7 insertions(+)
>> >
>> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> > index 5edec175b9bf..9bb0da3110fa 100644
>> > --- a/arch/x86/Kconfig
>> > +++ b/arch/x86/Kconfig
>> > @@ -197,6 +197,7 @@ config X86
>> >       select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD
>> >       select HAVE_ARCH_USERFAULTFD_MINOR      if X86_64 && USERFAULTFD
>> >       select HAVE_ARCH_VMAP_STACK             if X86_64
>> > +     select HAVE_ARCH_DYNAMIC_STACK          if X86_64
>> >       select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
>> >       select HAVE_ARCH_WITHIN_STACK_FRAMES
>> >       select HAVE_ASM_MODVERSIONS
>> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> > index c3b2f863acf0..cc05401e729f 100644
>> > --- a/arch/x86/kernel/traps.c
>> > +++ b/arch/x86/kernel/traps.c
>> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
>> >       }
>> >  #endif
>> >
>> > +     if (dynamic_stack_fault(current, address))
>> > +             return;
>> > +
>>
>> Sorry, but no, you can't necessarily do this.  I say this as the person who write this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt.  The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed.  And I this may well be is the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there's is no architectural guarantee that returning will do the right thing.
>>
>> Now we do have some history of getting a special exception, e.g. for espfix64.  But espfix64 is a very special case, and the situation you're looking at is very general.  So unless Intel and AMD are both wiling to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it.
>
> Hi Andy,
>
> Thank you for the insightful feedback.
>
> I'm somewhat confused about why we end up in exc_double_fault() in the
> first place. My initial assumption was that dynamic_stack_fault()
> would only be needed within do_kern_addr_fault(). However, while
> testing in QEMU, I found that when using memset() on a stack variable,
> code like this:
>
> rep stos %rax,%es:(%rdi)
>
> causes a double fault instead of a regular fault. I added it to
> exc_double_fault() as a result, but I'm curious if you have any
> insights into why this behavior occurs.
>

Imagine you're a CPU running kernel code, on a fairly traditional architecture like x86.  The code tries to access some swapped out user memory.  You say "sorry, that memory is not present" and generate a page fault.  You save the current state *to the stack* and chance the program counter to point to the page fault handler.  The page fault handler does its thing, then pops the old state off the stack and resumes the faulting code.

A few microseconds later, the kernel fills up its stack and then does:

PUSH something

but that would write to a not-present stack page, because you already filled the stack.  Okay, a page fault -- no big deal, we know how to handle that.  So you push the current state to the stack.  On wait, you *can't* push the current state to the stack, because that would involve writing to an unmapped page of memory.

So you trigger a double-fault.  You push some state to the double-fault handler's special emergency stack.  But wait, *what* state do you push?  Is it the state that did the "PUSH something" and overflowed the stack?  Or is some virtual state that's a mixture of that and the failed page fault handler?  What if the stack wasn't quite full and you actually succeeded in pushing the old stack pointer but not the old program counter?  What saved state goes where?

This is a complicated mess, so the people who designed all this said 'hey, wait a minute, let's not call double faults a "fault" -- let's call them an "abort"' so we can stop confusing ourselves and ship CPUs to customers.  And "abort" means "the saved state is not well defined -- don't rely on it having any particular meaning".

So, until a few years ago, we would just print something like "PANIC: double fault" and kill the whole system.  A few years ago, I decided this was lame, and I wanted to have stack guard pages, so i added real fancy new logic: instead, we do our best to display the old state, but it's a guess and all we're doing with it is printk -- if it's wrong, it's annoying, but that's all.  And then we kill the running thread -- instead of trying to return (and violating our sacred contract with the x86 architecture), we *reset* the current crashing thread's state to a known-good state.  Then we return to *that* state.  Now we're off the emergency stack and we're running something resembling normal kernel code, but we can't return, as there is nowhere to return to.  But that's fine -- instead we kill the current thread, kind of like _exit().  That never returns, so it's okay that we can't return.

But your patch adds a return statement to this whole mess, which will return to the moderately-likely-to-be-corrupt state that caused a double fault inside the microcode for the page fault path, and you have stepped outside the well-defined path in the x86 architecture, and you've triggered something akin to Undefined Behavior.  The CPU won't catch fire, but it reserves the right to execute from an incorrect RSP and/or RIP, to be in the middle of an instruction, etc.

(For that matter, what if there was exactly enough room to enter the page fault handler, but the very first instruction of the page fault handler overflowed the stack?  Then you allocate more memory, get lucky and successfully resume the page fault handler, and then promptly OOPS because you run the page fault handler and it thinks you got a kernel page fault?  My OOPS code handles that, but, again, it's not trying to recover.)

>> There are some other options: you could pre-map
>
> Pre-mapping would be expensive. It would mean pre-mapping the dynamic
> pages for every scheduled thread, and we'd still need to check the
> access bit every time a thread leaves the CPU.

That's a write to four consecutive words in memory, with no locking required.

> Dynamic thread faults
> should be considered rare events and thus shouldn't significantly
> affect the performance of normal context switch operations. With 8K
> stacks, we might encounter only 0.00001% of stacks requiring an extra
> page, and even fewer needing 16K.

Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :)

>
>> Also, I think the whole memory allocation concept in this whole series is a bit odd.  Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock.  We may be in the page allocator.  Panicing due to kernel stack allocation  would be very unpleasant.
>
> We never block during handling stack faults. There's a per-CPU page
> pool, guaranteeing availability for the faulting thread. The thread
> simply takes pages from this per-CPU data structure and refills the
> pool when leaving the CPU. The faulting routine is efficient,
> requiring a fixed number of loads without any locks, stalling, or even
> cmpxchg operations.

You can't block when scheduling, either.  What if you can't refill the pool?

next prev parent reply	other threads:[~2024-03-11 23:34 UTC|newest]

Thread overview: 98+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-11 16:46 [RFC 00/14] Dynamic Kernel Stacks Pasha Tatashin
2024-03-11 16:46 ` [RFC 01/14] task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END check Pasha Tatashin
2024-03-17 14:36   ` Christophe JAILLET
2024-03-17 15:13     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 02/14] fork: Clean-up ifdef logic around stack allocation Pasha Tatashin
2024-03-11 16:46 ` [RFC 03/14] fork: Clean-up naming of vm_strack/vm_struct variables in vmap stacks code Pasha Tatashin
2024-03-17 14:42   ` Christophe JAILLET
2024-03-19 16:32     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 04/14] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE Pasha Tatashin
2024-03-17 14:45   ` Christophe JAILLET
2024-03-17 15:14     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 05/14] fork: check charging success before zeroing stack Pasha Tatashin
2024-03-12 15:57   ` Kirill A. Shutemov
2024-03-12 16:52     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset() Pasha Tatashin
2024-03-12  7:15   ` Nikolay Borisov
2024-03-12 16:53     ` Pasha Tatashin
2024-03-14  7:55       ` Christophe Leroy
2024-03-14 13:52         ` Pasha Tatashin
2024-03-17 14:48   ` Christophe JAILLET
2024-03-17 15:15     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 07/14] fork: use the first page in stack to store vm_stack in cached_stacks Pasha Tatashin
2024-03-11 16:46 ` [RFC 08/14] fork: separate vmap stack alloction and free calls Pasha Tatashin
2024-03-14 15:18   ` Jeff Xie
2024-03-14 17:14     ` Pasha Tatashin
2024-03-17 14:51   ` Christophe JAILLET
2024-03-17 15:15     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 09/14] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range_noflush() public functions Pasha Tatashin
2024-03-11 16:46 ` [RFC 10/14] fork: Dynamic Kernel Stacks Pasha Tatashin
2024-03-11 19:32   ` Randy Dunlap
2024-03-11 19:55     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 11/14] x86: add support for " Pasha Tatashin
2024-03-11 22:17   ` Andy Lutomirski
2024-03-11 23:10     ` Pasha Tatashin
2024-03-11 23:33       ` Thomas Gleixner
2024-03-11 23:34       ` Andy Lutomirski [this message]
2024-03-12  0:08         ` Pasha Tatashin
2024-03-12  0:23           ` Pasha Tatashin
2024-03-11 23:34     ` Dave Hansen
2024-03-11 23:41       ` Andy Lutomirski
2024-03-11 23:56         ` Nadav Amit
2024-03-12  0:02           ` Andy Lutomirski
2024-03-12  7:20             ` Nadav Amit
2024-03-12  0:53           ` Dave Hansen
2024-03-12  1:25             ` H. Peter Anvin
2024-03-12  2:16               ` Andy Lutomirski
2024-03-12  2:20                 ` H. Peter Anvin
2024-03-12 21:58   ` Andi Kleen
2024-03-13 10:23   ` Thomas Gleixner
2024-03-13 13:43     ` Pasha Tatashin
2024-03-13 15:28       ` Pasha Tatashin
2024-03-13 16:12         ` Thomas Gleixner
2024-03-14 14:03           ` Pasha Tatashin
2024-03-14 18:26             ` Thomas Gleixner
2024-03-11 16:46 ` [RFC 12/14] task_stack.h: Clean-up stack_not_used() implementation Pasha Tatashin
2024-03-11 16:46 ` [RFC 13/14] task_stack.h: Add stack_not_used() support for dynamic stack Pasha Tatashin
2024-03-11 16:46 ` [RFC 14/14] fork: Dynamic Kernel Stack accounting Pasha Tatashin
2024-03-11 17:09 ` [RFC 00/14] Dynamic Kernel Stacks Mateusz Guzik
2024-03-11 18:58   ` Pasha Tatashin
2024-03-11 19:21     ` Mateusz Guzik
2024-03-11 19:55       ` Pasha Tatashin
2024-03-12 17:18 ` H. Peter Anvin
2024-03-12 19:45   ` Pasha Tatashin
2024-03-12 21:36     ` H. Peter Anvin
2024-03-14 19:05       ` Kent Overstreet
2024-03-14 19:23         ` Pasha Tatashin
2024-03-14 19:28           ` Kent Overstreet
2024-03-14 19:34             ` Pasha Tatashin
2024-03-14 19:49               ` Kent Overstreet
2024-03-12 22:18     ` David Laight
2024-03-14 19:43   ` Matthew Wilcox
2024-03-14 19:53     ` Kent Overstreet
2024-03-14 19:57       ` Matthew Wilcox
2024-03-14 19:58         ` Kent Overstreet
2024-03-15  3:13         ` Pasha Tatashin
2024-03-15  3:39           ` H. Peter Anvin
2024-03-16 19:17             ` Pasha Tatashin
2024-03-17  0:41               ` Matthew Wilcox
2024-03-17  1:32                 ` Kent Overstreet
2024-03-17 14:19                 ` Pasha Tatashin
2024-03-17 14:43               ` Brian Gerst
2024-03-17 16:15                 ` Pasha Tatashin
2024-03-17 21:30                   ` Brian Gerst
2024-03-18 14:59                     ` Pasha Tatashin
2024-03-18 21:02                       ` Brian Gerst
2024-03-19 14:56                         ` Pasha Tatashin
2024-03-17 18:57               ` David Laight
2024-03-18 15:09                 ` Pasha Tatashin
2024-03-18 15:13                   ` Pasha Tatashin
2024-03-18 15:19                   ` Matthew Wilcox
2024-03-18 15:30                     ` Pasha Tatashin
2024-03-18 15:53                       ` David Laight
2024-03-18 16:57                         ` Pasha Tatashin
2024-03-18 15:38               ` David Laight
2024-03-18 17:00                 ` Pasha Tatashin
2024-03-18 17:37                   ` Pasha Tatashin
2024-03-15  4:17           ` H. Peter Anvin
2024-03-17  0:47     ` H. Peter Anvin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1ac305b1-d28f-44f6-88e5-c85d9062f9e8@app.fastmail.com \
    --to=luto@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=brauner@kernel.org \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dianders@chromium.org \
    --cc=dietmar.eggemann@arm.com \
    --cc=eric.devolder@oracle.com \
    --cc=hca@linux.ibm.com \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=jacob.jun.pan@linux.intel.com \
    --cc=jgg@ziepe.ca \
    --cc=jpoimboe@kernel.org \
    --cc=jroedel@suse.de \
    --cc=juri.lelli@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=kinseyho@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lstoakes@gmail.com \
    --cc=mgorman@suse.de \
    --cc=mic@digikod.net \
    --cc=michael.christie@oracle.com \
    --cc=mingo@redhat.com \
    --cc=mjguzik@gmail.com \
    --cc=mst@redhat.com \
    --cc=npiggin@gmail.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterz@infradead.org \
    --cc=pmladek@suse.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rostedt@goodmis.org \
    --cc=surenb@google.com \
    --cc=tglx@linutronix.de \
    --cc=urezki@gmail.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).