[RFD] x86: Curing the exception and syscall trainwreck in hardware

public inbox for linux-kernel@vger.kernel.org

* [RFD] x86: Curing the exception and syscall trainwreck in hardware
@ 2020-08-24 12:24 Thomas Gleixner
  2020-08-24 13:52 ` Andrew Cooper
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2020-08-24 12:24 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger,
	Sasha Levin, Andrew Cooper, Dirk Hohndel, Jan Kiszka,
	Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow,
	David Kaplan, Tony Luck

It's a sad state of affairs that I have to write this mail at all and it's
nothing other than an act of desperation.

x86 exception handling, including the various ways of syscall entry/exit,
is a constant source of trouble. Aside from being a functional disaster,
quite a few of these issues have severe security implications.

There are similar issues on the virtualization side including the handling
of essential MSRs which are required to run a guest OS and even more so
with the upcoming virt specific exceptions of various vendors.

We have been asking the vendors for more than a decade to fix this
situation, but even the most trivial requests, like an IRET variant which
does not reenable NMIs unconditionally and other small things which would
make our life less miserable, aren't happening.

Instead of fixing the underlying design failures first and creating a solid
base, the vendors add even more ill-defined exception variants on top of
the existing pile. Unsurprisingly these add-ons are creating more
problems than they solve, but being based on the existing house of cards
that's obviously expected.

This really has to stop and the underlying issues have to be resolved
before more problems are inflicted upon operating systems and hypervisors.
The amount of code to work around these issues is already by far larger than
the actual functional code. Some of these workarounds are just bandaids
which try to prevent the most obvious damage, but they are mostly based on
the hope that the unfixable corner cases never happen.

There has been talk about solutions for years, but it's just talk and we
have not yet seen a coordinated effort across the x86 vendors to come up
with a sane replacement for the x86 exception and syscall trainwreck.

The important word here is 'coordinated'. We are not at all interested
in different solutions from different vendors. It's going to be
challenging enough to maintain ONE parallel exception/syscall handling
implementation.  In other words, the kernel is going to support exactly
ONE new exception/syscall handling mechanism and is not going to accommodate
every vendor.

So I call on the x86 vendors to sit together and come up with a unified
and consolidated base on which each of the vendors can build their
differentiating features.

Aside from coordination between the x86 vendors this also requires
coordination with the people who finally have to deal with that on the
software side. The prevailing hardware engineering principle "That can
be fixed in software" does not work; it never worked - especially not in
the area of x86 exception and syscall handling.

This coordination must include all major operating systems and hypervisors
whether open source or proprietary to ensure that the different
requirements are met. This kind of coordination has happened in the context
of the hardware vulnerability mitigations already in a fruitful way so
this request is not asking for something impossible.

If the x86 vendors are unable to talk to each other and coordinate on a
solution, then the ultimate backstop might be to take the first reasonable
design specification and the first reasonable silicon implementation of it
as the ONE alternative solution to the existing trainwreck. How the other
vendors are going to deal with that is none of our business. That's the
least useful and least desired outcome and will only happen when the x86
vendors are not able to get their act together and sort that out upfront.

Thanks,

	Thomas

* Re: [RFD] x86: Curing the exception and syscall trainwreck in hardware
  2020-08-24 12:24 [RFD] x86: Curing the exception and syscall trainwreck in hardware Thomas Gleixner
@ 2020-08-24 13:52 ` Andrew Cooper
  2020-08-25  4:39   ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Cooper @ 2020-08-24 13:52 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger,
	Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc,
	H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan,
	Tony Luck, Andrew Cooper

On 24/08/2020 13:24, Thomas Gleixner wrote:
> It's a sad state of affairs that I have to write this mail at all and it's
> nothing else than an act of desperation.
>
> The x86 exception handling including the various ways of syscall entry/exit
> are a constant source of trouble. Aside of being a functional disaster
> quite some of these issues have severe security implications.
>
> There are similar issues on the virtualization side including the handling
> of essential MSRs which are required to run a guest OS and even more so
> with the upcoming virt specific exceptions of various vendors.
>
> We are asking the vendors for more than a decade to fix this situation, but
> even the most trivial requests like an IRET variant which does not reenable
> NMIs unconditionally and other small things which would make our life less
> miserable aren't happening.
>
> Instead of fixing the underlying design fails first and creating a solid
> base the vendors add even more ill defined exception variants on top of
> the existing pile. Unsurprisingly these add-ons are creating more
> problems than they solve, but being based on the existing house of cards
> that's obviously expected.
>
> This really has to stop and the underlying issues have to be resolved
> before more problems are inflicted upon operating systems and hypervisors.
> The amount of code to workaround these issues is already by far larger than
> the actual functional code. Some of these workarounds are just bandaids
> which try to prevent the most obvious damage, but they are mostly based on
> the hope that the unfixable corner cases never happen.
>
> There is talk about solutions for years, but it's just talk and we have not
> yet seen a coordinated effort across the x86 vendors to come up with a
> sane replacement for the x86 exception and syscall trainwreck.
>
> The important word here is 'coordinated'. We are not at all interested
> in different solutions from different vendors. It's going to be
> challenging enough to maintain ONE parallel exception/syscall handling
> implementation.  In other words, the kernel is going to support exactly
> ONE new exception/syscall handling mechanism and not going to accommodate
> every vendor.
>
> So I call on the x86 vendors to sit together and come up with a unified
> and consolidated base on which each of the vendors can build their
> differentiating features.
>
> Aside of coordination between the x86 vendors this also requires
> coordination with the people who finally have to deal with that on the
> software side. The prevailing hardware engineering principle "That can
> be fixed in software" does not work; it never worked - especially not in
> the area of x86 exception and syscall handling.
>
> This coordination must include all major operating systems and hypervisors
> whether open source or proprietary to ensure that the different
> requirements are met. This kind of coordination has happened in the context
> of the hardware vulnerability mitigations already in a fruitful way so
> this request is not asking for something impossible.
>
> If the x86 vendors are unable to talk to each other and coordinate on a
> solution, then the ultimate backstop might be to take the first reasonable
> design specification and the first reasonable silicon implementation of it
> as the ONE alternative solution to the existing trainwreck. How the other
> vendors are going to deal with that is none of our business. That's the
> least useful and least desired outcome and will only happen when the x86
> vendors are not able to get their act together and sort that out upfront.

And to help with coordination, here is something prepared (slightly)
earlier.

https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing

This identifies the problems from software's perspective, along with
proposing behaviour which ought to resolve the issues.

It is still a work-in-progress.  The #VE section still needs updating in
light of the publication of the recent TDX spec.

Review and feedback welcome.

Thanks,

~Andrew

[-- Attachment #2: x86 Stack Switching - draft 2.1.pdf --]
[-- Type: application/pdf, Size: 108930 bytes --]

* TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-24 13:52 ` Andrew Cooper
@ 2020-08-25  4:39   ` Sean Christopherson
  2020-08-25 15:25     ` Dave Hansen
  2020-08-25 16:49     ` Andy Lutomirski
  0 siblings, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2020-08-25  4:39 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Thomas Gleixner, LKML, x86, Linus Torvalds, Tom Lendacky, Pu Wen,
	Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka,
	Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow,
	David Kaplan, Tony Luck, Andy Lutomirski

+Andy

On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> And to help with coordination, here is something prepared (slightly)
> earlier.
> 
> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> 
> This identifies the problems from software's perspective, along with
> proposing behaviour which ought to resolve the issues.
> 
> It is still a work-in-progress.  The #VE section still needs updating in
> light of the publication of the recent TDX spec.

For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
something we (Linux) as the guest kernel actually want to handle gracefully
(where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
would require one of two things:

  a) The guest kernel to not accept/validate the GPA->HPA mapping for the
     relevant pages, e.g. code or scratch data.

  b) The host VMM to remap the GPA (making the GPA->HPA pending again).

(a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
(b) requires either a buggy or malicious host VMM.

I ask because, if the answer is "no, panic at will", then we shouldn't need
to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
Ditto for a #VE in NMI entry before it gets to a thread stack.

Am I missing anything?

* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25  4:39   ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson
@ 2020-08-25 15:25     ` Dave Hansen
  2020-08-25 16:49     ` Andy Lutomirski
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2020-08-25 15:25 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Thomas Gleixner, LKML, x86, Linus Torvalds, Tom Lendacky, Pu Wen,
	Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka,
	Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow,
	David Kaplan, Tony Luck, Andy Lutomirski, Sean Christopherson

On 8/24/20 9:39 PM, Sean Christopherson wrote:
> +Andy
> 
> On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
>> And to help with coordination, here is something prepared (slightly)
>> earlier.
>>
>> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
>>
>> This identifies the problems from software's perspective, along with
>> proposing behaviour which ought to resolve the issues.
>>
>> It is still a work-in-progress.  The #VE section still needs updating in
>> light of the publication of the recent TDX spec.
> 
> For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> something we (Linux) as the guest kernel actually want to handle gracefully
> (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> would require one of two things:
> 
>   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
>      relevant pages, e.g. code or scratch data.
> 
>   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> 
> (a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
> (b) requires either a buggy or malicious host VMM.
> 
> I ask because, if the answer is "no, panic at will", then we shouldn't need
> to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
> Ditto for a #VE in NMI entry before it gets to a thread stack.
> 
> Am I missing anything?

No, that was my expectation as well.  My only concern is that someone
might unintentionally put a #VE'ing instruction in one of the tricky
entry paths, like if we decided we needed CPUID for serialization or
used a WRMSR that #VE's.

It's just something we need to look out for when mucking in the entry
paths.  But, it's not that hard given how few things actually #VE.

* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25  4:39   ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson
  2020-08-25 15:25     ` Dave Hansen
@ 2020-08-25 16:49     ` Andy Lutomirski
  2020-08-25 17:19       ` Sean Christopherson
  1 sibling, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2020-08-25 16:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin,
	Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin,
	Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck,
	Andy Lutomirski

On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> +Andy
>
> On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > And to help with coordination, here is something prepared (slightly)
> > earlier.
> >
> > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> >
> > This identifies the problems from software's perspective, along with
> > proposing behaviour which ought to resolve the issues.
> >
> > It is still a work-in-progress.  The #VE section still needs updating in
> > light of the publication of the recent TDX spec.
>
> For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> something we (Linux) as the guest kernel actually want to handle gracefully
> (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> would require one of two things:
>
>   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
>      relevant pages, e.g. code or scratch data.
>
>   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
>
> (a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
> (b) requires either a buggy or malicious host VMM.
>
> I ask because, if the answer is "no, panic at will", then we shouldn't need
> to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.

Or malicious hypervisor action, and that's a problem.

Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
actual SYSCALL text or the first memory it accesses -- I don't have a
TDX spec so I don't know the details).  The user does SYSCALL, the
kernel hits the funny GPA, and #VE is delivered.  The microcode will
write the IRET frame, with mostly user-controlled contents, wherever
RSP points, and RSP is also user controlled.  Calling this a "panic"
is charitable -- it's really game over against an attacker who is
moderately clever.

The kernel can't do anything about this because it's game over before
the kernel has had the chance to execute any instructions.
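
To make the scenario above concrete, here is a minimal sketch in C that
models only where the exception frame lands. It is an illustration, not
kernel code: the struct, the function name and the addresses are
hypothetical, and the model assumes the usual 64-bit delivery rules (a
five-word frame, 16-byte stack alignment, and an IST entry replacing RSP
unconditionally when one is configured for the vector).

```c
#include <stdint.h>

/* A 64-bit exception frame without error code: SS, RSP, RFLAGS, CS, RIP. */
#define FRAME_WORDS 5

/* Hypothetical, heavily simplified view of the CPU state at delivery time. */
struct cpu_model {
	uint64_t rsp;     /* current RSP; in the SYSCALL gap this is still the user's value */
	uint64_t ist_top; /* top of the vector's IST stack, or 0 if the vector has no IST */
};

/*
 * Return the lowest address the frame is written to.
 *
 * Without an IST entry, an exception taken at CPL0 does not switch
 * stacks, so the frame goes wherever RSP happens to point. With an IST
 * entry, the hardware loads the IST stack pointer first.
 */
static uint64_t frame_base(const struct cpu_model *cpu)
{
	uint64_t top = cpu->ist_top ? cpu->ist_top : cpu->rsp;

	top &= ~0xfULL; /* the stack is aligned to 16 bytes before the push */
	return top - FRAME_WORDS * sizeof(uint64_t);
}
```

With rsp set by userspace to a kernel address and ist_top == 0,
frame_base() returns an address just below that user-chosen value, which
is the "game over" case. With a non-zero ist_top the user-supplied RSP no
longer decides where the frame goes; that is what burning an IST for the
vector buys.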

* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 16:49     ` Andy Lutomirski
@ 2020-08-25 17:19       ` Sean Christopherson
  2020-08-25 17:28         ` Andy Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2020-08-25 17:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin,
	Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin,
	Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck

On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote:
> On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > +Andy
> >
> > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > > And to help with coordination, here is something prepared (slightly)
> > > earlier.
> > >
> > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> > >
> > > This identifies the problems from software's perspective, along with
> > > proposing behaviour which ought to resolve the issues.
> > >
> > > It is still a work-in-progress.  The #VE section still needs updating in
> > > light of the publication of the recent TDX spec.
> >
> > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> > something we (Linux) as the guest kernel actually want to handle gracefully
> > (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> > would require one of two things:
> >
> >   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
> >      relevant pages, e.g. code or scratch data.
> >
> >   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> >
> > (a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
> > (b) requires either a buggy or malicious host VMM.
> >
> > I ask because, if the answer is "no, panic at will", then we shouldn't need
> > to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
> 
> Or malicious hypervisor action, and that's a problem.
> 
> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> actual SYSCALL text or the first memory it accesses -- I don't have a
> TDX spec so I don't know the details).

You can thank our legal department :-)

> The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered.
> The microcode will write the IRET frame, with mostly user-controlled contents,
> wherever RSP points, and RSP is also user controlled.  Calling this a "panic"
> is charitable -- it's really game over against an attacker who is moderately
> clever.
> 
> The kernel can't do anything about this because it's game over before
> the kernel has had the chance to execute any instructions.

Hrm, I was thinking that SMAP=1 would give the necessary protections, but
in typing that out I realized userspace can throw in an RSP value that
points at kernel memory.  Duh.

One thought would be to have the TDX module (thing that runs in SEAM and
sits between the VMM and the guest) provide a TDCALL (hypercall from guest
to TDX module) to the guest that would allow the guest to specify a very
limited number of GPAs that must never generate a #VE, e.g. go straight to
guest shutdown if a disallowed GPA would go pending.  That seems doable
from a TDX perspective without incurring noticeable overhead (assuming the
list of GPAs is very small) and should be easy to support in the guest,
e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
page and its scratch data.
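
The idea above can be sketched as a small C model. Everything in it is
hypothetical: no such TDCALL is described in this thread, so the names
protect_gpa() and on_gpa_pending(), the list size of 4 and the 4K page
granularity are assumptions made purely for illustration. It only shows
the decision the TDX module would have to make when a GPA is about to go
pending.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROTECTED_GPAS 4       /* "very limited number"; 4 is an assumption */
#define PAGE_MASK_4K (~0xfffULL)   /* assume protection at 4K page granularity */

enum pending_action { INJECT_VE, SHUTDOWN_GUEST };

/* Hypothetical per-guest list kept by the TDX module. */
struct protected_gpas {
	uint64_t page[MAX_PROTECTED_GPAS];
	size_t count;
};

/* Models the guest's boot-time call; fails once the small list is full. */
static bool protect_gpa(struct protected_gpas *list, uint64_t gpa)
{
	if (list->count >= MAX_PROTECTED_GPAS)
		return false;
	list->page[list->count++] = gpa & PAGE_MASK_4K;
	return true;
}

/* Models the module's choice when the VMM makes a GPA pending again. */
static enum pending_action on_gpa_pending(const struct protected_gpas *list,
					  uint64_t gpa)
{
	for (size_t i = 0; i < list->count; i++)
		if (list->page[i] == (gpa & PAGE_MASK_4K))
			return SHUTDOWN_GUEST; /* never deliver #VE for this page */
	return INJECT_VE;
}
```

A guest would register the SYSCALL entry page and its scratch data once
during boot. The linear scan is deliberate: with a list this short, the
lookup cost on the remap path should stay negligible, which matches the
"without incurring noticeable overhead" assumption.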

* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:19       ` Sean Christopherson
@ 2020-08-25 17:28         ` Andy Lutomirski
  2020-08-25 17:35           ` Luck, Tony
  2020-08-26 19:16           ` Sean Christopherson
  0 siblings, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2020-08-25 17:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Andrew Cooper, Thomas Gleixner, LKML, X86 ML,
	Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger,
	Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc,
	H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan,
	Tony Luck

On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote:
> > On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > +Andy
> > >
> > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > > > And to help with coordination, here is something prepared (slightly)
> > > > earlier.
> > > >
> > > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> > > >
> > > > This identifies the problems from software's perspective, along with
> > > > proposing behaviour which ought to resolve the issues.
> > > >
> > > > It is still a work-in-progress.  The #VE section still needs updating in
> > > > light of the publication of the recent TDX spec.
> > >
> > > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> > > something we (Linux) as the guest kernel actually want to handle gracefully
> > > (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> > > would require one of two things:
> > >
> > >   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
> > >      relevant pages, e.g. code or scratch data.
> > >
> > >   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> > >
> > > (a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
> > > (b) requires either a buggy or malicious host VMM.
> > >
> > > I ask because, if the answer is "no, panic at will", then we shouldn't need
> > > to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> > > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
> >
> > Or malicious hypervisor action, and that's a problem.
> >
> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> > actual SYSCALL text or the first memory it accesses -- I don't have a
> > TDX spec so I don't know the details).
>
> You can thank our legal department :-)
>
> > The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered.
> > The microcode will write the IRET frame, with mostly user-controlled contents,
> > wherever RSP points, and RSP is also user controlled.  Calling this a "panic"
> > is charitable -- it's really game over against an attacker who is moderately
> > clever.
> >
> > The kernel can't do anything about this because it's game over before
> > the kernel has had the chance to execute any instructions.
>
> Hrm, I was thinking that SMAP=1 would give the necessary protections, but
> in typing that out I realized userspace can throw in an RSP value that
> points at kernel memory.  Duh.
>
> One thought would be to have the TDX module (thing that runs in SEAM and
> sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> to TDX module) to the guest that would allow the guest to specify a very
> limited number of GPAs that must never generate a #VE, e.g. go straight to
> guest shutdown if a disallowed GPA would go pending.  That seems doable
> from a TDX perspective without incurring noticeable overhead (assuming the
> list of GPAs is very small) and should be easy to support in the guest,
> e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> page and its scratch data.

I guess you could do that, but this is getting gross.  The x86
architecture has really gone off the rails here.

* RE: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:28         ` Andy Lutomirski
@ 2020-08-25 17:35           ` Luck, Tony
  2020-08-25 17:41             ` Andy Lutomirski
                               ` (2 more replies)
  2020-08-26 19:16           ` Sean Christopherson
  1 sibling, 3 replies; 15+ messages in thread
From: Luck, Tony @ 2020-08-25 17:35 UTC (permalink / raw)
  To: Andy Lutomirski, Christopherson, Sean J
  Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin,
	Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin,
	Mallick, Asit K, Gordon Tetlow, David Kaplan

> > Or malicious hypervisor action, and that's a problem.
> >
> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> > actual SYSCALL text or the first memory it accesses -- I don't have a
> > TDX spec so I don't know the details).

Is it feasible to defend against a malicious (or buggy) hypervisor?

Obviously, we can't leave holes that guests can exploit. But the hypervisor
can crash the system no matter how clever TDX is.

-Tony

* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:35           ` Luck, Tony
@ 2020-08-25 17:41             ` Andy Lutomirski
  2020-08-25 17:59             ` Andrew Cooper
  2020-08-25 19:49             ` Thomas Gleixner
  2 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2020-08-25 17:41 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Christopherson, Sean J, Andrew Cooper,
	Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky,
	Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka,
	Tony W Wang-oc, H. Peter Anvin, Mallick, Asit K, Gordon Tetlow,
	David Kaplan

On Tue, Aug 25, 2020 at 10:36 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> > > Or malicious hypervisor action, and that's a problem.
> > >
> > > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> > > actual SYSCALL text or the first memory it accesses -- I don't have a
> > > TDX spec so I don't know the details).
>
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

Crashing the system is one thing.  Corrupting the system in a way that
could allow code execution is another thing entirely.  And the whole
point of TDX is to defend the guest against the hypervisor.

* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:35           ` Luck, Tony
  2020-08-25 17:41             ` Andy Lutomirski
@ 2020-08-25 17:59             ` Andrew Cooper
  2020-08-25 18:38               ` Dave Hansen
  2020-08-25 19:49             ` Thomas Gleixner
  2 siblings, 1 reply; 15+ messages in thread
From: Andrew Cooper @ 2020-08-25 17:59 UTC (permalink / raw)
  To: Luck, Tony, Andy Lutomirski, Christopherson, Sean J
  Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin,
	Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin,
	Mallick, Asit K, Gordon Tetlow, David Kaplan, Andrew Cooper

On 25/08/2020 18:35, Luck, Tony wrote:
>>> Or malicious hypervisor action, and that's a problem.
>>>
>>> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
>>> actual SYSCALL text or the first memory it accesses -- I don't have a
>>> TDX spec so I don't know the details).
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

You have to be more specific about what you mean by "malicious" hypervisor.

Nothing can protect against a hypervisor which refuses to schedule the
Trusted Domain.  The guest cannot protect against availability
maliciousness.  However, you can use market forces to fix that problem. 
(I'll take my credit card elsewhere if you don't schedule my VM, etc)

Things are more complicated when it comes to integrity or
confidentiality of the TD, but the prevailing feeling seems to be
"crashing obviously and reliably if something goes wrong is ok".

If I've read the TDX spec/whitepaper properly, the main hypervisor can
write to all the encrypted pages.  This will destroy data, break the
MAC, and yield #PF inside the SEAM hypervisor, or the TD when the cache
line is next referenced.

Cunning timing on behalf of a malicious hypervisor (hitting the SYSCALL
gap) will cause the guest's #PF handler to run on a user stack, opening
a privilege escalation hole.

Whatever you might want to say about the exact integrity/confidentiality
expectations, I think "the hypervisor can open a user=>kernel privilege
escalation hole inside the TD" is not what people would consider acceptable.

On AMD parts, this is why the #VC handler is IST, in an attempt to at
least notice this damage and crash.  There is no way TDX can get away
with requiring #PF to be IST as well.

~Andrew

* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:59             ` Andrew Cooper
@ 2020-08-25 18:38               ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2020-08-25 18:38 UTC (permalink / raw)
  To: Andrew Cooper, Luck, Tony, Andy Lutomirski,
	Christopherson, Sean J
  Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin,
	Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin,
	Mallick, Asit K, Gordon Tetlow, David Kaplan

On 8/25/20 10:59 AM, Andrew Cooper wrote:
> If I've read the TDX spec/whitepaper properly, the main hypervisor can
> write to all the encrypted pages.  This will destroy data, break the
> MAC, and yield #PF inside the SEAM hypervisor, or the TD when the cache
> line is next referenced.

I think you're talking about:

> Attempting to access a private KeyID by software outside the SEAM
> mode would cause a page-fault exception (#PF).

I don't think that ever results in a TD guest #PF.  "A MAC-verification
failure would be fatal to the TD and lead to its termination."  In this
context, I think that means that the TD stops running and can not be
reentered.

* RE: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:35           ` Luck, Tony
  2020-08-25 17:41             ` Andy Lutomirski
  2020-08-25 17:59             ` Andrew Cooper
@ 2020-08-25 19:49             ` Thomas Gleixner
  2 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-08-25 19:49 UTC (permalink / raw)
  To: Luck, Tony, Andy Lutomirski, Christopherson, Sean J
  Cc: Andrew Cooper, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen,
	Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka,
	Tony W Wang-oc, H. Peter Anvin, Mallick, Asit K, Gordon Tetlow,
	David Kaplan

On Tue, Aug 25 2020 at 17:35, Tony Luck wrote:
>> > Or malicious hypervisor action, and that's a problem.
>> >
>> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
>> > actual SYSCALL text or the first memory it accesses -- I don't have a
>> > TDX spec so I don't know the details).
>
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

If it crashes and burns reliably then fine, but is that guaranteed?

I have serious doubts about that given the history and fragility of all
of this and I really have zero interest in dealing with the fallout a
year from now.

Thanks,

        tglx


* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:28         ` Andy Lutomirski
  2020-08-25 17:35           ` Luck, Tony
@ 2020-08-26 19:16           ` Sean Christopherson
  2020-08-30 15:37             ` Andy Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2020-08-26 19:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin,
	Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin,
	Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck

On Tue, Aug 25, 2020 at 10:28:53AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > One thought would be to have the TDX module (thing that runs in SEAM and
> > sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> > to TDX module) to the guest that would allow the guest to specify a very
> > limited number of GPAs that must never generate a #VE, e.g. go straight to
> > guest shutdown if a disallowed GPA would go pending.  That seems doable
> > from a TDX perspective without incurring noticeable overhead (assuming the
> > list of GPAs is very small) and should be easy to support in the guest,
> > e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> > page and its scratch data.
> 
> I guess you could do that, but this is getting gross.  The x86
> architecture has really gone off the rails here.

Does it suck less than using an IST?  Honest question.

I will add my voice to the "fix SYSCALL" train, but the odds of that getting
a proper fix in time to intercept TDX are not good.  On the other hand,
"fixing" the SYSCALL issue in the TDX module is much more feasible, but only
if we see real value in such an approach as opposed to just using an IST.  I
personally like the idea of a TDX module solution as I think it would be
simpler for the kernel to implement/support, and would mean we wouldn't need
to roll back IST usage for #VE if the heavens should part and bestow upon us
a sane SYSCALL.
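
A minimal sketch of the proposal quoted above, written as a toy Python
model rather than anything resembling real TDX module or guest code. The
class ToyTdxModule, its method names, and the limit of four entries are
assumptions made up for illustration; the thread does not define an
interface. The guest registers a few GPAs at boot, and a fault on one of
them ends in shutdown instead of a #VE.

```python
# Toy model of a "never #VE on these GPAs" list. Every name here is
# hypothetical; none of this is a real TDX module or Linux interface.

MAX_PROTECTED_GPAS = 4  # assumed limit, standing in for "a very limited number"
PAGE_SHIFT = 12         # 4 KiB pages


class GuestShutdown(Exception):
    """Models the TD being shut down instead of receiving a #VE."""


class ToyTdxModule:
    def __init__(self):
        self.protected_pfns = set()

    def tdcall_protect_gpa(self, gpa):
        """Guest-side registration, modelled as a TDCALL made during boot."""
        pfn = gpa >> PAGE_SHIFT
        full = len(self.protected_pfns) >= MAX_PROTECTED_GPAS
        if pfn not in self.protected_pfns and full:
            return False  # list is full, request rejected
        self.protected_pfns.add(pfn)
        return True

    def ept_violation(self, gpa):
        """Decide what a fault on `gpa` turns into for the guest."""
        if (gpa >> PAGE_SHIFT) in self.protected_pfns:
            # A #VE here could land in the SYSCALL gap, so stop the guest.
            raise GuestShutdown(hex(gpa))
        return "#VE"
```

In this model the lookup is a set membership test on the page frame
number, which is one way to read "without incurring noticeable overhead"
for a very small list.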


* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-26 19:16           ` Sean Christopherson
@ 2020-08-30 15:37             ` Andy Lutomirski
  2020-08-30 18:37               ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2020-08-30 15:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Andrew Cooper, Thomas Gleixner, LKML, X86 ML,
	Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger,
	Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc,
	H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan,
	Tony Luck

On Wed, Aug 26, 2020 at 12:16 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Aug 25, 2020 at 10:28:53AM -0700, Andy Lutomirski wrote:
> > On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > One thought would be to have the TDX module (thing that runs in SEAM and
> > > sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> > > to TDX module) to the guest that would allow the guest to specify a very
> > > limited number of GPAs that must never generate a #VE, e.g. go straight to
> > > guest shutdown if a disallowed GPA would go pending.  That seems doable
> > > from a TDX perspective without incurring noticeable overhead (assuming the
> > > > list of GPAs is very small) and should be easy to support in the guest,
> > > e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> > > page and its scratch data.
> >
> > I guess you could do that, but this is getting gross.  The x86
> > architecture has really gone off the rails here.
>
> Does it suck less than using an IST?  Honest question.
>
> I will add my voice to the "fix SYSCALL" train, but the odds of that getting
> a proper fix in time to intercept TDX are not good.  On the other hand,
> "fixing" the SYSCALL issue in the TDX module is much more feasible, but only
> if we see real value in such an approach as opposed to just using an IST.  I
> personally like the idea of a TDX module solution as I think it would be
> simpler for the kernel to implement/support, and would mean we wouldn't need
> to roll back IST usage for #VE if the heavens should part and bestow upon us
> a sane SYSCALL.

There's no such thing as "just" using an IST.  Using IST opens a huge
can of worms due to its recursion issues.

The TDX module solution is utterly gross but may well suck less than
using an IST.
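
The recursion issue can be illustrated with a toy Python model. This is a
sketch under simplifying assumptions, not a description of any kernel
code: ToyCpu, the addresses, and the five-slot frame are made up, and the
16-byte stack alignment done by real hardware is left out. The point it
shows is that an IST entry always restarts from the same fixed top of
stack, so a second exception on the same IST vector overwrites the frame
of the first one, while a non-IST entry pushes below the current stack
pointer.

```python
# Toy model of exception stack selection. Addresses and sizes are made up.

IST_TOP = 0x2000  # fixed top of the IST stack (illustrative value)


class ToyCpu:
    def __init__(self, rsp):
        self.rsp = rsp
        self.mem = {}  # address -> (name, value), one entry per 8-byte slot

    def deliver(self, return_rip, use_ist):
        """Push an exception frame and return the address of its RIP slot."""
        old_rsp = self.rsp
        # An IST entry ignores the current stack pointer and always starts
        # from the same fixed top of stack.
        addr = IST_TOP if use_ist else old_rsp
        frame = [("ss", 0), ("rsp", old_rsp), ("rflags", 0),
                 ("cs", 0), ("rip", return_rip)]
        for name, value in frame:
            addr -= 8
            self.mem[addr] = (name, value)
        self.rsp = addr
        return addr

    def saved_rip(self, frame_addr):
        """Return the RIP currently stored in the frame at `frame_addr`."""
        return self.mem[frame_addr][1]


# Nested delivery on an IST vector: both frames land at the same address.
cpu = ToyCpu(rsp=0x8000)
outer = cpu.deliver(0x1111, use_ist=True)
inner = cpu.deliver(0x2222, use_ist=True)
ist_clobbered = outer == inner and cpu.saved_rip(outer) == 0x2222
```

With use_ist=False the second frame is placed below the first and the
outer return address survives.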


* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-30 15:37             ` Andy Lutomirski
@ 2020-08-30 18:37               ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2020-08-30 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Andrew Cooper, Thomas Gleixner, LKML, X86 ML,
	Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin,
	Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin,
	Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck

On Sun, Aug 30, 2020 at 8:37 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> There's no such thing as "just" using an IST.  Using IST opens a huge
> can of worms due to its recursion issues.

I absolutely despise all the x86 "indirect system structures". They
are horrible garbage. IST is only yet another example of that kind of
brokenness, and annoys me particularly because it (and swapgs) were
actually making x86 _worse_.

The old i386 exception model was actually better than what x86-64 did,
and IST is a big part of the problem. Just have a supervisor stack,
and push the state on it. Stop playing games with multiple stacks
depending on some magical indirect system state.

Other examples of stupid and bad indirection:

 - the GDT and LDT.

   The kernel should never have to use them. It would be much better
if the segment "shadow" state would stop being shadow state, and be
the REAL state that the kernel (and user space, for that matter)
accesses.

   Yeah, we got halfway there with MSR_FS/GS_BASE, but what a complete
garbage crock that was. So now we're forced to use the selector *and*
the base register, and they may be out of sync with each other, so
you have the worst of both worlds.

   Keep the GDT and LDT around for compatibility reasons, so that old
broken programs that want to load the segment state the old-fashioned
way can do so. But make it clear that that is purely for legacy, and
make the modern code just save and restore the actual true
non-indirect segment state.

   For new models, give us a way to load base/limit/permissions
directly, and reset them on kernel entry. No more descriptor table
indirection games.

 - the IDT and the TSS segment.

   Exact same arguments as above. Keep them around for legacy
programs, but let us just set "this is the entrypoint, this is the
kernel stack" as registers. Christ, we're probably better off with one
single entry-point for the whole kernel (ok, give us a separate one
for NMI/MCE/doublefault, since they are _so_ special, and maybe
separate "CPU exceptions" from "external interrupts"), together with
just a register that says what the exception was.

 - swapgs needs to die.

   The kernel GS/FS segments should just be separate segment registers
from user space. No "swapping" needed. In CPL0, "gs" just means
something different from user space. No save/restore code for it, no
swapping, no nothing.
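
The selector/base mismatch from the first bullet can be shown with a toy
Python model. It is a sketch only: ToySegReg and the table contents are
invented for illustration, and it covers just the non-null selector case.
Loading a selector pulls the base from the descriptor table, while
writing the base directly (as WRMSR to MSR_FS_BASE does) leaves the
selector alone, so the two can disagree and both have to be saved.

```python
# Toy model of one segment register. Names and values are illustrative.

class ToySegReg:
    def __init__(self, table):
        self.table = table  # selector -> base, stands in for the GDT/LDT
        self.selector = 0
        self.base = 0

    def load_selector(self, selector):
        """Like loading a non-null selector: base comes from the table."""
        self.selector = selector
        self.base = self.table[selector]

    def write_base(self, base):
        """Like a direct base write: the selector is left untouched."""
        self.base = base

    def in_sync(self):
        """True if the base still matches what the selector describes."""
        return self.table.get(self.selector) == self.base


fs = ToySegReg({0x2B: 0x1000})
fs.load_selector(0x2B)
fs.write_base(0x7F0000000000)
saved = (fs.selector, fs.base)  # both halves are needed to restore
```

Restoring from the selector alone would bring back the table's base and
lose the directly written one, which is the "worst of both worlds" case.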

Honestly, I think %rsp/%rip could work like that too. Just make "rsp"
and "rip" be a completely different register in kernel mode - rename
it in the front-end of the CPU or whatever.

Imagine not having to save/restore rsp/rip on kernel entry/exit at
all, because returning to user mode just implicitly starts using
ursp/urip. And a context switch uses (fast) MSR's to save/restore the
user state (or, since it's actually a real register in the register
file, just a new "mov" instruction to access the user registers).
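
As a toy Python model of that wish, and nothing more: no shipping CPU
works this way, and ToyBankedCpu, its methods, and the assumption that
kernel entry resets the kernel copies to a fixed entry point and stack
top are all invented here. The sketch shows user rsp/rip surviving a
kernel round trip without any save or restore, and a context switch
swapping them through a kernel-only accessor.

```python
# Toy model of banked rsp/rip. All names and behaviours are assumptions.

KERNEL, USER = 0, 3


class ToyBankedCpu:
    def __init__(self, entry_rip, kernel_stack_top):
        self.entry_rip = entry_rip
        self.kernel_stack_top = kernel_stack_top
        self.cpl = USER
        self.banks = {
            KERNEL: {"rsp": kernel_stack_top, "rip": entry_rip},
            USER: {"rsp": 0, "rip": 0},
        }

    def get(self, name):
        """Read rsp/rip as seen from the current privilege level."""
        return self.banks[self.cpl][name]

    def set(self, name, value):
        """Write rsp/rip as seen from the current privilege level."""
        self.banks[self.cpl][name] = value

    def enter_kernel(self):
        # Assumed: entry flips the bank and resets the kernel copies.
        self.banks[KERNEL] = {"rsp": self.kernel_stack_top,
                              "rip": self.entry_rip}
        self.cpl = KERNEL

    def return_to_user(self):
        # Nothing to restore: the user bank was never touched.
        self.cpl = USER

    def swap_user_state(self, new_state):
        """Context switch: kernel-only access to the user bank."""
        assert self.cpl == KERNEL
        old = dict(self.banks[USER])
        self.banks[USER] = dict(new_state)
        return old
```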

                 Linus



Thread overview: 15+ messages
2020-08-24 12:24 [RFD] x86: Curing the exception and syscall trainwreck in hardware Thomas Gleixner
2020-08-24 13:52 ` Andrew Cooper
2020-08-25  4:39   ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson
2020-08-25 15:25     ` Dave Hansen
2020-08-25 16:49     ` Andy Lutomirski
2020-08-25 17:19       ` Sean Christopherson
2020-08-25 17:28         ` Andy Lutomirski
2020-08-25 17:35           ` Luck, Tony
2020-08-25 17:41             ` Andy Lutomirski
2020-08-25 17:59             ` Andrew Cooper
2020-08-25 18:38               ` Dave Hansen
2020-08-25 19:49             ` Thomas Gleixner
2020-08-26 19:16           ` Sean Christopherson
2020-08-30 15:37             ` Andy Lutomirski
2020-08-30 18:37               ` Linus Torvalds
