* [RFD] x86: Curing the exception and syscall trainwreck in hardware

From: Thomas Gleixner @ 2020-08-24 12:24 UTC
To: LKML
Cc: x86, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger,
    Sasha Levin, Andrew Cooper, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc,
    H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck

It's a sad state of affairs that I have to write this mail at all and it's
nothing else than an act of desperation.

The x86 exception handling, including the various ways of syscall
entry/exit, is a constant source of trouble. Aside from being a functional
disaster, quite a few of these issues have severe security implications.

There are similar issues on the virtualization side including the handling
of essential MSRs which are required to run a guest OS and even more so
with the upcoming virt specific exceptions of various vendors.

We have been asking the vendors for more than a decade to fix this
situation, but even the most trivial requests like an IRET variant which
does not reenable NMIs unconditionally and other small things which would
make our life less miserable aren't happening.

Instead of fixing the underlying design failures first and creating a solid
base, the vendors add even more ill defined exception variants on top of
the existing pile. Unsurprisingly these add-ons are creating more
problems than they solve, but being based on the existing house of cards
that's obviously expected.

This really has to stop and the underlying issues have to be resolved
before more problems are inflicted upon operating systems and hypervisors.
The amount of code to work around these issues is already by far larger than
the actual functional code. Some of these workarounds are just bandaids
which try to prevent the most obvious damage, but they are mostly based on
the hope that the unfixable corner cases never happen.

There has been talk about solutions for years, but it's just talk and we
have not yet seen a coordinated effort across the x86 vendors to come up
with a sane replacement for the x86 exception and syscall trainwreck.

The important word here is 'coordinated'. We are not at all interested
in different solutions from different vendors. It's going to be
challenging enough to maintain ONE parallel exception/syscall handling
implementation. In other words, the kernel is going to support exactly
ONE new exception/syscall handling mechanism and is not going to
accommodate every vendor.

So I call on the x86 vendors to sit together and come up with a unified
and consolidated base on which each of the vendors can build their
differentiating features.

Aside from coordination between the x86 vendors this also requires
coordination with the people who finally have to deal with that on the
software side. The prevailing hardware engineering principle "That can
be fixed in software" does not work; it never worked - especially not in
the area of x86 exception and syscall handling.

This coordination must include all major operating systems and hypervisors
whether open source or proprietary to ensure that the different
requirements are met. This kind of coordination has happened in the context
of the hardware vulnerability mitigations already in a fruitful way so
this request is not asking for something impossible.

If the x86 vendors are unable to talk to each other and coordinate on a
solution, then the ultimate backstop might be to take the first reasonable
design specification and the first reasonable silicon implementation of it
as the ONE alternative solution to the existing trainwreck. How the other
vendors are going to deal with that is none of our business. That's the
least useful and least desired outcome and will only happen when the x86
vendors are not able to get their act together and sort that out upfront.

Thanks,

        Thomas
* Re: [RFD] x86: Curing the exception and syscall trainwreck in hardware 2020-08-24 12:24 [RFD] x86: Curing the exception and syscall trainwreck in hardware Thomas Gleixner @ 2020-08-24 13:52 ` Andrew Cooper 2020-08-25 4:39 ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson 0 siblings, 1 reply; 15+ messages in thread From: Andrew Cooper @ 2020-08-24 13:52 UTC (permalink / raw) To: Thomas Gleixner, LKML Cc: x86, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck, Andrew Cooper [-- Attachment #1: Type: text/plain, Size: 4091 bytes --] On 24/08/2020 13:24, Thomas Gleixner wrote: > It's a sad state of affairs that I have to write this mail at all and it's > nothing else than an act of desperation. > > The x86 exception handling including the various ways of syscall entry/exit > are a constant source of trouble. Aside of being a functional disaster > quite some of these issues have severe security implications. > > There are similar issues on the virtualization side including the handling > of essential MSRs which are required to run a guest OS and even more so > with the upcoming virt specific exceptions of various vendors. > > We are asking the vendors for more than a decade to fix this situation, but > even the most trivial requests like an IRET variant which does not reenable > NMIs unconditionally and other small things which would make our life less > miserable aren't happening. > > Instead of fixing the underlying design fails first and creating a solid > base the vendors add even more ill defined exception variants on top of > the existing pile. Unsurprisingly these add-ons are creating more > problems than they solve, but being based on the existing house of cards > that's obviously expected. 
> > This really has to stop and the underlying issues have to be resolved > before more problems are inflicted upon operating systems and hypervisors. > The amount of code to workaround these issues is already by far larger than > the actual functional code. Some of these workarounds are just bandaids > which try to prevent the most obvious damage, but they are mostly based on > the hope that the unfixable corner cases never happen. > > There is talk about solutions for years, but it's just talk and we have not > yet seen a coordinated effort accross the x86 vendors to come up with a > sane replacement for the x86 exception and syscall trainwreck. > > The important word here is 'coordinated'. We are not at all interested > in different solutions from different vendors. It's going to be > challenging enough to maintain ONE parallel exception/syscall handling > implementation. In other words, the kernel is going to support exactly > ONE new exception/syscall handling mechanism and not going to accomodate > every vendor. > > So I call on the x86 vendors to sit together and come up with a unified > and consolidated base on which each of the vendors can build their > differentiating features. > > Aside of coordination between the x86 vendors this also requires > coordination with the people who finally have to deal with that on the > software side. The prevailing hardware engineering principle "That can > be fixed in software" does not work; it never worked - especially not in > the area of x86 exception and syscall handling. > > This coordination must include all major operating systems and hypervisors > whether open source or proprietary to ensure that the different > requirements are met. This kind of coordination has happened in the context > of the hardware vulnerability mitigations already in a fruitful way so > this request is not asking for something impossible. 
> > If the x86 vendors are unable to talk to each other and coordinate on a > solution, then the ultimate backstop might be to take the first reasonable > design specification and the first reasonable silicon implementation of it > as the ONE alternative solution to the existing trainwreck. How the other > vendors are going to deal with that is none of our business. That's the > least useful and least desired outcome and will only happen when the x86 > vendors are not able to get their act together and sort that out upfront. And to help with coordination, here is something prepared (slightly) earlier. https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing This identifies the problems from software's perspective, along with proposing behaviour which ought to resolve the issues. It is still a work-in-progress. The #VE section still needs updating in light of the publication of the recent TDX spec. Review and feedback welcome. Thanks, ~Andrew [-- Attachment #2: x86 Stack Switching - draft 2.1.pdf --] [-- Type: application/pdf, Size: 108930 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-24 13:52 ` Andrew Cooper @ 2020-08-25 4:39 ` Sean Christopherson 2020-08-25 15:25 ` Dave Hansen 2020-08-25 16:49 ` Andy Lutomirski 0 siblings, 2 replies; 15+ messages in thread From: Sean Christopherson @ 2020-08-25 4:39 UTC (permalink / raw) To: Andrew Cooper Cc: Thomas Gleixner, LKML, x86, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck, Andy Lutomirski +Andy On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote: > And to help with coordination, here is something prepared (slightly) > earlier. > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing > > This identifies the problems from software's perspective, along with > proposing behaviour which ought to resolve the issues. > > It is still a work-in-progress. The #VE section still needs updating in > light of the publication of the recent TDX spec. For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this something we (Linux) as the guest kernel actually want to handle gracefully (where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap would require one of two things: a) The guest kernel to not accept/validate the GPA->HPA mapping for the relevant pages, e.g. code or scratch data. b) The host VMM to remap the GPA (making the GPA->HPA pending again). (a) is only possible if there's a fatal buggy guest kernel (or perhaps vBIOS). (b) requires either a buggy or malicious host VMM. I ask because, if the answer is "no, panic at will", then we shouldn't need to burn an IST for TDX #VE. Exceptions won't morph to #VE and hitting an instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug. Ditto for a #VE in NMI entry before it gets to a thread stack. Am I missing anything? 
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 4:39 ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson @ 2020-08-25 15:25 ` Dave Hansen 2020-08-25 16:49 ` Andy Lutomirski 1 sibling, 0 replies; 15+ messages in thread From: Dave Hansen @ 2020-08-25 15:25 UTC (permalink / raw) To: Andrew Cooper Cc: Thomas Gleixner, LKML, x86, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck, Andy Lutomirski, Sean Christopherson On 8/24/20 9:39 PM, Sean Christopherson wrote: > +Andy > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote: >> And to help with coordination, here is something prepared (slightly) >> earlier. >> >> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing >> >> This identifies the problems from software's perspective, along with >> proposing behaviour which ought to resolve the issues. >> >> It is still a work-in-progress. The #VE section still needs updating in >> light of the publication of the recent TDX spec. > > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this > something we (Linux) as the guest kernel actually want to handle gracefully > (where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap > would require one of two things: > > a) The guest kernel to not accept/validate the GPA->HPA mapping for the > relevant pages, e.g. code or scratch data. > > b) The host VMM to remap the GPA (making the GPA->HPA pending again). > > (a) is only possible if there's a fatal buggy guest kernel (or perhaps vBIOS). > (b) requires either a buggy or malicious host VMM. > > I ask because, if the answer is "no, panic at will", then we shouldn't need > to burn an IST for TDX #VE. 
Exceptions won't morph to #VE and hitting an > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug. > Ditto for a #VE in NMI entry before it gets to a thread stack. > > Am I missing anything? No, that was my expectation as well. My only concern is that someone might unintentionally put a #VE'ing instruction in one of the tricky entry paths, like if we decided we needed CPUID for serialization or used a WRMSR that #VE's. It's just something we need to look out for when mucking in the entry paths. But, it's not that hard given how few things actually #VE. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 4:39 ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson 2020-08-25 15:25 ` Dave Hansen @ 2020-08-25 16:49 ` Andy Lutomirski 2020-08-25 17:19 ` Sean Christopherson 1 sibling, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2020-08-25 16:49 UTC (permalink / raw) To: Sean Christopherson Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck, Andy Lutomirski On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > +Andy > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote: > > And to help with coordination, here is something prepared (slightly) > > earlier. > > > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing > > > > This identifies the problems from software's perspective, along with > > proposing behaviour which ought to resolve the issues. > > > > It is still a work-in-progress. The #VE section still needs updating in > > light of the publication of the recent TDX spec. > > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this > something we (Linux) as the guest kernel actually want to handle gracefully > (where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap > would require one of two things: > > a) The guest kernel to not accept/validate the GPA->HPA mapping for the > relevant pages, e.g. code or scratch data. > > b) The host VMM to remap the GPA (making the GPA->HPA pending again). > > (a) is only possible if there's a fatal buggy guest kernel (or perhaps vBIOS). > (b) requires either a buggy or malicious host VMM. 
>
> I ask because, if the answer is "no, panic at will", then we shouldn't need
> to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.

Or malicious hypervisor action, and that's a problem.

Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
actual SYSCALL text or the first memory it accesses -- I don't have a
TDX spec so I don't know the details).  The user does SYSCALL, the
kernel hits the funny GPA, and #VE is delivered.  The microcode will
write the IRET frame, with mostly user-controlled contents, wherever
RSP points, and RSP is also user controlled.  Calling this a "panic"
is charitable -- it's really game over against an attacker who is
moderately clever.

The kernel can't do anything about this because it's game over before
the kernel has had the chance to execute any instructions.
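An illustration of the failure mode described in the message above: a toy C model of how a 64-bit x86 CPU chooses where to push an exception frame. This is only a sketch under stated assumptions, not kernel code or a full description of hardware behaviour. The names `struct cpu_model`, `exception_frame_top` and `do_syscall` are hypothetical, and any addresses fed to it are made up. The model relies on three facts: an IDT entry with a nonzero IST index always switches to the corresponding TSS stack, an exception taken at CPL3 switches to RSP0, and an exception taken at CPL0 without IST keeps using the current RSP. SYSCALL enters CPL0 without touching RSP, which is what opens the gap.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only. All names here are illustrative, not real kernel symbols. */
struct cpu_model {
    unsigned cpl;        /* current privilege level: 0 = kernel, 3 = user */
    uint64_t rsp;        /* current stack pointer */
    uint64_t tss_rsp0;   /* stack loaded on a CPL3 -> CPL0 exception */
    uint64_t tss_ist[8]; /* IST stacks; index 0 means "no IST" */
};

/* Address from which the CPU would start pushing the exception frame. */
static uint64_t exception_frame_top(const struct cpu_model *cpu,
                                    unsigned ist_index)
{
    assert(ist_index < 8);

    if (ist_index != 0)
        return cpu->tss_ist[ist_index]; /* IST: unconditional stack switch */
    if (cpu->cpl == 3)
        return cpu->tss_rsp0;           /* privilege change: use RSP0 */

    /* Same privilege level: keep the current RSP, aligned to 16 bytes. */
    return cpu->rsp & ~0xfULL;
}

/* SYSCALL switches to CPL0 but leaves RSP exactly as user space set it. */
static void do_syscall(struct cpu_model *cpu)
{
    cpu->cpl = 0;
}
```

In this model, an exception arriving after `do_syscall()` but before the kernel has loaded its own stack pointer gets its frame written at the value user space left in `rsp`, unless the handler's IDT entry names an IST stack. That is the trade-off the thread is arguing about.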
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 16:49 ` Andy Lutomirski @ 2020-08-25 17:19 ` Sean Christopherson 2020-08-25 17:28 ` Andy Lutomirski 0 siblings, 1 reply; 15+ messages in thread From: Sean Christopherson @ 2020-08-25 17:19 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote: > On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson > <sean.j.christopherson@intel.com> wrote: > > > > +Andy > > > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote: > > > And to help with coordination, here is something prepared (slightly) > > > earlier. > > > > > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing > > > > > > This identifies the problems from software's perspective, along with > > > proposing behaviour which ought to resolve the issues. > > > > > > It is still a work-in-progress. The #VE section still needs updating in > > > light of the publication of the recent TDX spec. > > > > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this > > something we (Linux) as the guest kernel actually want to handle gracefully > > (where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap > > would require one of two things: > > > > a) The guest kernel to not accept/validate the GPA->HPA mapping for the > > relevant pages, e.g. code or scratch data. > > > > b) The host VMM to remap the GPA (making the GPA->HPA pending again). > > > > (a) is only possible if there's a fatal buggy guest kernel (or perhaps vBIOS). > > (b) requires either a buggy or malicious host VMM. 
> >
> > I ask because, if the answer is "no, panic at will", then we shouldn't need
> > to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
>
> Or malicious hypervisor action, and that's a problem.
>
> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> actual SYSCALL text or the first memory it accesses -- I don't have a
> TDX spec so I don't know the details).

You can thank our legal department :-)

> The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered.
> The microcode will write the IRET frame, with mostly user-controlled contents,
> wherever RSP points, and RSP is also user controlled.  Calling this a "panic"
> is charitable -- it's really game over against an attacker who is moderately
> clever.
>
> The kernel can't do anything about this because it's game over before
> the kernel has had the chance to execute any instructions.

Hrm, I was thinking that SMAP=1 would give the necessary protections, but
in typing that out I realized userspace can throw in an RSP value that
points at kernel memory.  Duh.

One thought would be to have the TDX module (thing that runs in SEAM and
sits between the VMM and the guest) provide a TDCALL (hypercall from guest
to TDX module) to the guest that would allow the guest to specify a very
limited number of GPAs that must never generate a #VE, e.g. go straight to
guest shutdown if a disallowed GPA would go pending.  That seems doable
from a TDX perspective without incurring noticeable overhead (assuming the
list of GPAs is very small) and should be easy to support in the guest,
e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
page and its scratch data.
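The proposal in the message above can be sketched as a small fixed-size table plus one lookup. This is purely an illustration of the idea as worded in the mail, not an existing TDX module interface: `protect_gpa`, `on_pending_access`, `MAX_PROTECTED_GPAS` and the 4 KiB page granularity are all assumptions made for the example, and no real TDCALL leaf is implied.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit; the mail only says "a very limited number of GPAs". */
#define MAX_PROTECTED_GPAS 4
/* Assumption: protection is tracked per 4 KiB page. */
#define GPA_PAGE_MASK (~0xfffULL)

enum pending_action { DELIVER_VE, SHUTDOWN_GUEST };

struct protected_gpas {
    uint64_t gpa[MAX_PROTECTED_GPAS];
    size_t count;
};

/*
 * Stand-in for the proposed guest-to-module call made during boot.
 * Returns false when the (deliberately tiny) table is already full.
 */
static bool protect_gpa(struct protected_gpas *p, uint64_t gpa)
{
    if (p->count == MAX_PROTECTED_GPAS)
        return false;
    p->gpa[p->count++] = gpa & GPA_PAGE_MASK;
    return true;
}

/*
 * Module-side decision when a guest access hits a GPA whose mapping has
 * gone pending: protected pages shut the guest down instead of raising #VE.
 */
static enum pending_action on_pending_access(const struct protected_gpas *p,
                                             uint64_t gpa)
{
    for (size_t i = 0; i < p->count; i++)
        if (p->gpa[i] == (gpa & GPA_PAGE_MASK))
            return SHUTDOWN_GUEST;
    return DELIVER_VE;
}
```

A linear scan is enough here because the list is assumed to hold only a handful of entries, such as the SYSCALL entry page and its scratch data. That assumption is also why the mail expects no noticeable overhead.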
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 17:19 ` Sean Christopherson @ 2020-08-25 17:28 ` Andy Lutomirski 2020-08-25 17:35 ` Luck, Tony 2020-08-26 19:16 ` Sean Christopherson 0 siblings, 2 replies; 15+ messages in thread From: Andy Lutomirski @ 2020-08-25 17:28 UTC (permalink / raw) To: Sean Christopherson Cc: Andy Lutomirski, Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote: > > On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson > > <sean.j.christopherson@intel.com> wrote: > > > > > > +Andy > > > > > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote: > > > > And to help with coordination, here is something prepared (slightly) > > > > earlier. > > > > > > > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing > > > > > > > > This identifies the problems from software's perspective, along with > > > > proposing behaviour which ought to resolve the issues. > > > > > > > > It is still a work-in-progress. The #VE section still needs updating in > > > > light of the publication of the recent TDX spec. > > > > > > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this > > > something we (Linux) as the guest kernel actually want to handle gracefully > > > (where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap > > > would require one of two things: > > > > > > a) The guest kernel to not accept/validate the GPA->HPA mapping for the > > > relevant pages, e.g. code or scratch data. > > > > > > b) The host VMM to remap the GPA (making the GPA->HPA pending again). 
> > > > > > (a) is only possible if there's a fatal buggy guest kernel (or perhaps vBIOS). > > > (b) requires either a buggy or malicious host VMM. > > > > > > I ask because, if the answer is "no, panic at will", then we shouldn't need > > > to burn an IST for TDX #VE. Exceptions won't morph to #VE and hitting an > > > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug. > > > > Or malicious hypervisor action, and that's a problem. > > > > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the > > actual SYSCALL text or the first memory it accesses -- I don't have a > > TDX spec so I don't know the details). > > You can thank our legal department :-) > > > The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered. > > The microcode wil write the IRET frame, with mostly user-controlled contents, > > wherever RSP points, and RSP is also user controlled. Calling this a "panic" > > is charitable -- it's really game over against an attacker who is moderately > > clever. > > > > The kernel can't do anything about this because it's game over before > > the kernel has had the chance to execute any instructions. > > Hrm, I was thinking that SMAP=1 would give the necessary protections, but > in typing that out I realized userspace can throw in an RSP value that > points at kernel memory. Duh. > > One thought would be to have the TDX module (thing that runs in SEAM and > sits between the VMM and the guest) provide a TDCALL (hypercall from guest > to TDX module) to the guest that would allow the guest to specify a very > limited number of GPAs that must never generate a #VE, e.g. go straight to > guest shutdown if a disallowed GPA would go pending. That seems doable > from a TDX perspective without incurring noticeable overhead (assuming the > list of GPAs is very small) and should be easy to to support in the guest, > e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL > page and its scratch data. 
I guess you could do that, but this is getting gross. The x86 architecture has really gone off the rails here. ^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 17:28 ` Andy Lutomirski @ 2020-08-25 17:35 ` Luck, Tony 2020-08-25 17:41 ` Andy Lutomirski ` (2 more replies) 2020-08-26 19:16 ` Sean Christopherson 1 sibling, 3 replies; 15+ messages in thread From: Luck, Tony @ 2020-08-25 17:35 UTC (permalink / raw) To: Andy Lutomirski, Christopherson, Sean J Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Mallick, Asit K, Gordon Tetlow, David Kaplan > > Or malicious hypervisor action, and that's a problem. > > > > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the > > actual SYSCALL text or the first memory it accesses -- I don't have a > > TDX spec so I don't know the details). Is it feasible to defend against a malicious (or buggy) hypervisor? Obviously, we can't leave holes that guests can exploit. But the hypervisor can crash the system no matter how clever TDX is. -Tony ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 17:35 ` Luck, Tony @ 2020-08-25 17:41 ` Andy Lutomirski 2020-08-25 17:59 ` Andrew Cooper 2020-08-25 19:49 ` Thomas Gleixner 2 siblings, 0 replies; 15+ messages in thread From: Andy Lutomirski @ 2020-08-25 17:41 UTC (permalink / raw) To: Luck, Tony Cc: Andy Lutomirski, Christopherson, Sean J, Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Mallick, Asit K, Gordon Tetlow, David Kaplan On Tue, Aug 25, 2020 at 10:36 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > Or malicious hypervisor action, and that's a problem. > > > > > > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the > > > actual SYSCALL text or the first memory it accesses -- I don't have a > > > TDX spec so I don't know the details). > > Is it feasible to defend against a malicious (or buggy) hypervisor? > > Obviously, we can't leave holes that guests can exploit. But the hypervisor > can crash the system no matter how clever TDX is. Crashing the system is one thing. Corrupting the system in a way that could allow code execution is another thing entirely. And the whole point of TDX is to defend the guest against the hypervisor. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

From: Andrew Cooper @ 2020-08-25 17:59 UTC
To: Luck, Tony, Andy Lutomirski, Christopherson, Sean J
Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
    Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel,
    Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Mallick, Asit K,
    Gordon Tetlow, David Kaplan, Andrew Cooper

On 25/08/2020 18:35, Luck, Tony wrote:
>>> Or malicious hypervisor action, and that's a problem.
>>>
>>> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
>>> actual SYSCALL text or the first memory it accesses -- I don't have a
>>> TDX spec so I don't know the details).
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

You have to be more specific about what you mean by "malicious" hypervisor.

Nothing can protect against a hypervisor which refuses to schedule the
Trusted Domain.  The guest cannot protect against availability
maliciousness.  However, you can use market forces to fix that problem.
(I'll take my credit card elsewhere if you don't schedule my VM, etc)

Things are more complicated when it comes to integrity or
confidentiality of the TD, but the prevailing feeling seems to be
"crashing obviously and reliably if something goes wrong is ok".

If I've read the TDX spec/whitepaper properly, the main hypervisor can
write to all the encrypted pages.  This will destroy data, break the
MAC, and yield a #PF inside the SEAM hypervisor, or the TD when the cache
line is next referenced.

Cunning timing on behalf of a malicious hypervisor (hitting the SYSCALL
gap) will cause the guest's #PF handler to run on a user stack, opening
a privilege escalation hole.

Whatever you might want to say about the exact integrity/confidentiality
expectations, I think "the hypervisor can open a user=>kernel privilege
escalation hole inside the TD" is not what people would consider acceptable.

On AMD parts, this is why the #VC handler is IST, in an attempt to at
least notice this damage and crash.  There is no way TDX can get away
with requiring #PF to be IST as well.

~Andrew
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 17:59 ` Andrew Cooper @ 2020-08-25 18:38 ` Dave Hansen 0 siblings, 0 replies; 15+ messages in thread From: Dave Hansen @ 2020-08-25 18:38 UTC (permalink / raw) To: Andrew Cooper, Luck, Tony, Andy Lutomirski, Christopherson, Sean J Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Mallick, Asit K, Gordon Tetlow, David Kaplan On 8/25/20 10:59 AM, Andrew Cooper wrote: > If I've read the TDX spec/whitepaper properly, the main hypervisor can > write to all the encrypted pages. This will destroy data, break the > MAC, and yields #PF inside the SEAM hypervisor, or the TD when the cache > line is next referenced. I think you're talking about: > Attempting to access a private KeyID by software outside the SEAM > mode would cause a page-fault exception (#PF). I don't think that ever results in a TD guest #PF. "A MAC-verification failure would be fatal to the TD and lead to its termination." In this context, I think that means that the TD stops running and can not be reentered. ^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) 2020-08-25 17:35 ` Luck, Tony 2020-08-25 17:41 ` Andy Lutomirski 2020-08-25 17:59 ` Andrew Cooper @ 2020-08-25 19:49 ` Thomas Gleixner 2 siblings, 0 replies; 15+ messages in thread From: Thomas Gleixner @ 2020-08-25 19:49 UTC (permalink / raw) To: Luck, Tony, Andy Lutomirski, Christopherson, Sean J Cc: Andrew Cooper, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Mallick, Asit K, Gordon Tetlow, David Kaplan On Tue, Aug 25 2020 at 17:35, Tony Luck wrote: >> > Or malicious hypervisor action, and that's a problem. >> > >> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the >> > actual SYSCALL text or the first memory it accesses -- I don't have a >> > TDX spec so I don't know the details). > > Is it feasible to defend against a malicious (or buggy) hypervisor? > > Obviously, we can't leave holes that guests can exploit. But the hypervisor > can crash the system no matter how clever TDX is. If it crashes and burns reliably then fine, but is that guaranteed? I have serious doubts about that given the history and fragility of all of this and I really have zero interest in dealing with the fallout a year from now. Thanks, tglx ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-25 17:28 ` Andy Lutomirski
  2020-08-25 17:35 ` Luck, Tony
@ 2020-08-26 19:16 ` Sean Christopherson
  2020-08-30 15:37 ` Andy Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2020-08-26 19:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck

On Tue, Aug 25, 2020 at 10:28:53AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > One thought would be to have the TDX module (thing that runs in SEAM and
> > sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> > to TDX module) to the guest that would allow the guest to specify a very
> > limited number of GPAs that must never generate a #VE, e.g. go straight to
> > guest shutdown if a disallowed GPA would go pending. That seems doable
> > from a TDX perspective without incurring noticeable overhead (assuming the
> > list of GPAs is very small) and should be easy to support in the guest,
> > e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> > page and its scratch data.
>
> I guess you could do that, but this is getting gross. The x86
> architecture has really gone off the rails here.

Does it suck less than using an IST? Honest question.

I will add my voice to the "fix SYSCALL" train, but the odds of that getting a proper fix in time to intercept TDX are not good. On the other hand, "fixing" the SYSCALL issue in the TDX module is much more feasible, but only if we see real value in such an approach as opposed to just using an IST.
I personally like the idea of a TDX module solution as I think it would be simpler for the kernel to implement/support, and would mean we wouldn't need to roll back IST usage for #VE if the heavens should part and bestow upon us a sane SYSCALL.

^ permalink raw reply [flat|nested] 15+ messages in thread
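The TDCALL idea discussed in the message above can be modelled in a few lines of C. This is only a sketch under stated assumptions: nothing here comes from a TDX specification, and `struct td_guest`, `td_protect_gpa()`, `td_handle_pending_gpa()` and the limit of four entries are hypothetical names and values chosen for illustration. It shows the one property the proposal relies on: a GPA registered by the guest never leads to #VE injection, it leads to guest shutdown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit; the thread only says the list must be "very small". */
#define MAX_PROTECTED_GPAS 4
#define GPA_PAGE_MASK (~(uint64_t)0xfff)   /* 4 KiB page granularity */

enum pending_action {
    ACTION_INJECT_VE,       /* ordinary page: deliver #VE to the guest */
    ACTION_SHUTDOWN_GUEST,  /* protected page: never deliver #VE */
};

/* Hypothetical per-guest state kept by the TDX module. */
struct td_guest {
    uint64_t protected_gpa[MAX_PROTECTED_GPAS];
    size_t nr_protected;
};

/* Models the proposed TDCALL. Returns 0 on success, -1 if the list is full. */
int td_protect_gpa(struct td_guest *td, uint64_t gpa)
{
    if (td->nr_protected >= MAX_PROTECTED_GPAS)
        return -1;
    td->protected_gpa[td->nr_protected++] = gpa & GPA_PAGE_MASK;
    return 0;
}

/* Decides what happens when an access to 'gpa' would otherwise raise #VE. */
enum pending_action td_handle_pending_gpa(const struct td_guest *td, uint64_t gpa)
{
    uint64_t page = gpa & GPA_PAGE_MASK;

    for (size_t i = 0; i < td->nr_protected; i++) {
        if (td->protected_gpa[i] == page)
            return ACTION_SHUTDOWN_GUEST;
    }
    return ACTION_INJECT_VE;
}
```

In this model a guest would call `td_protect_gpa()` once or twice during boot for the SYSCALL entry page and its scratch data. The linear scan is cheap only because the list is tiny, which matches the "very small" assumption in the proposal.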
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-26 19:16 ` Sean Christopherson
@ 2020-08-30 15:37 ` Andy Lutomirski
  2020-08-30 18:37 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2020-08-30 15:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck

On Wed, Aug 26, 2020 at 12:16 PM Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>
> On Tue, Aug 25, 2020 at 10:28:53AM -0700, Andy Lutomirski wrote:
> > On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > One thought would be to have the TDX module (thing that runs in SEAM and
> > > sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> > > to TDX module) to the guest that would allow the guest to specify a very
> > > limited number of GPAs that must never generate a #VE, e.g. go straight to
> > > guest shutdown if a disallowed GPA would go pending. That seems doable
> > > from a TDX perspective without incurring noticeable overhead (assuming the
> > > list of GPAs is very small) and should be easy to support in the guest,
> > > e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> > > page and its scratch data.
> >
> > I guess you could do that, but this is getting gross. The x86
> > architecture has really gone off the rails here.
>
> Does it suck less than using an IST? Honest question.
>
> I will add my voice to the "fix SYSCALL" train, but the odds of that getting
> a proper fix in time to intercept TDX are not good.
> On the other hand,
> "fixing" the SYSCALL issue in the TDX module is much more feasible, but only
> if we see real value in such an approach as opposed to just using an IST. I
> personally like the idea of a TDX module solution as I think it would be
> simpler for the kernel to implement/support, and would mean we wouldn't need
> to roll back IST usage for #VE if the heavens should part and bestow upon us
> a sane SYSCALL.

There's no such thing as "just" using an IST. Using IST opens a huge can of worms due to its recursion issues.

The TDX module solution is utterly gross but may well suck less than using an IST.

^ permalink raw reply [flat|nested] 15+ messages in thread
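The recursion issue mentioned in the message above can be illustrated with a toy stack model in C. This is a simplified sketch, not kernel code: `struct stack_model` and both functions are hypothetical, and a single return address stands in for the whole hardware exception frame. The one architectural fact it relies on is that an IST entry loads the stack pointer from a fixed TSS slot on every delivery, whereas a non-IST exception taken in kernel mode keeps pushing below the current stack pointer.

```c
#include <stdint.h>

#define STACK_WORDS 8

/* Toy stack: grows downward, sp is the index of the last pushed slot. */
struct stack_model {
    uint64_t slot[STACK_WORDS];
    int sp;   /* STACK_WORDS means "empty" */
};

/* Non-IST delivery in kernel mode: push below whatever is already there. */
void deliver_on_current_stack(struct stack_model *s, uint64_t return_rip)
{
    s->slot[--s->sp] = return_rip;
}

/*
 * IST delivery: the stack pointer is reset to the fixed top of the IST
 * stack on every entry, so a nested delivery on the same IST slot
 * overwrites the frame of the handler it interrupted.
 */
void deliver_on_ist(struct stack_model *s, uint64_t return_rip)
{
    s->sp = STACK_WORDS;
    s->slot[--s->sp] = return_rip;
}
```

Delivering twice through `deliver_on_current_stack()` leaves both return addresses intact; delivering twice through `deliver_on_ist()` leaves only the second one, which is why a handler running on an IST stack cannot safely be re-entered without extra workarounds.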
* Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)
  2020-08-30 15:37 ` Andy Lutomirski
@ 2020-08-30 18:37 ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2020-08-30 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Andrew Cooper, Thomas Gleixner, LKML, X86 ML, Tom Lendacky, Pu Wen, Stephen Hemminger, Sasha Levin, Dirk Hohndel, Jan Kiszka, Tony W Wang-oc, H. Peter Anvin, Asit Mallick, Gordon Tetlow, David Kaplan, Tony Luck

On Sun, Aug 30, 2020 at 8:37 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> There's no such thing as "just" using an IST. Using IST opens a huge
> can of worms due to its recursion issues.

I absolutely despise all the x86 "indirect system structures". They are horrible garbage. IST is only yet another example of that kind of brokenness, and annoys me particularly because it (and swapgs) were actually making x86 _worse_. The old i386 exception model was actually better than what x86-64 did, and IST is a big part of the problem.

Just have a supervisor stack, and push the state on it. Stop playing games with multiple stacks depending on some magical indirect system state.

Other examples of stupid and bad indirection:

 - the GDT and LDT. The kernel should never have to use them. It would be much better if the segment "shadow" state would stop being shadow state, and be the REAL state that the kernel (and user space, for that matter) accesses.

   Yeah, we got halfway there with MSR_FS/GS_BASE, but what a complete garbage crock that was. So now we're forced to use the selector *and* the base register, and they may be out of sync with each other, so you have the worst of both worlds.

   Keep the GDT and LDT around for compatibility reasons, so that old broken programs that want to load the segment state the old-fashioned way can do so.
   But make it clear that that is purely for legacy, and make the modern code just save and restore the actual true non-indirect segment state.

   For new models, give us a way to load base/limit/permissions directly, and reset them on kernel entry. No more descriptor table indirection games.

 - the IDT and the TSS segment. Exact same arguments as above. Keep them around for legacy programs, but let us just set "this is the entrypoint, this is the kernel stack" as registers.

   Christ, we're probably better off with one single entry-point for the whole kernel (ok, give us a separate one for NMI/MCE/doublefault, since they are _so_ special, and maybe separate "CPU exceptions" from "external interrupts"), together with just a register that says what the exception was.

 - swapgs needs to die. The kernel GS/FS segments should just be separate segment registers from user space. No "swapping" needed. In CPL0, "gs" just means something different from user space. No save/restore code for it, no swapping, no nothing.

Honestly, I think %rsp/%rip could work like that too. Just make "rsp" and "rip" be a completely different register in kernel mode - rename it in the front-end of the CPU or whatever. Imagine not having to save/restore rsp/rip on kernel entry/exit at all, because returning to user mode just implicitly starts using ursp/urip.

And a context switch uses (fast) MSRs to save/restore the user state (or, since it's actually a real register in the register file, just a new "mov" instruction to access the user registers).

              Linus

^ permalink raw reply [flat|nested] 15+ messages in thread
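The entry-point layout suggested in the IDT/TSS item above can be sketched as a small C classifier. This is a hedged illustration, not a proposed hardware interface: the "cause register" is hypothetical, and encoding it with today's x86 vector numbers (NMI = 2, #DF = 8, #PF = 14, #MC = 18, #VE = 20, external interrupts from 32 upward) is an assumption made here so the example stays concrete. `select_entry()` and the `entry_point` names are likewise invented for the example.

```c
#include <stdint.h>

/* Existing x86 vector numbers, reused as values of a hypothetical
 * "cause" register that says what the exception was. */
enum x86_vector {
    VEC_NMI = 2,
    VEC_DF  = 8,    /* double fault */
    VEC_PF  = 14,   /* page fault */
    VEC_MC  = 18,   /* machine check */
    VEC_VE  = 20,   /* virtualization exception */
    VEC_FIRST_EXTERNAL = 32,
};

enum entry_point {
    ENTRY_SPECIAL,        /* NMI / MCE / double fault */
    ENTRY_CPU_EXCEPTION,  /* every other CPU exception */
    ENTRY_EXTERNAL_IRQ,   /* device and IPI vectors */
};

/* Picks one of a handful of fixed entry points from the cause value,
 * instead of indexing a 256-entry descriptor table. */
enum entry_point select_entry(uint8_t cause)
{
    switch (cause) {
    case VEC_NMI:
    case VEC_DF:
    case VEC_MC:
        return ENTRY_SPECIAL;
    default:
        return cause >= VEC_FIRST_EXTERNAL ? ENTRY_EXTERNAL_IRQ
                                           : ENTRY_CPU_EXCEPTION;
    }
}
```

With this shape, software dispatch on the cause value replaces the per-vector table lookup, and only the three "special" events need an entry path of their own.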
end of thread, other threads:[~2020-08-30 18:37 UTC | newest]

Thread overview: 15+ messages
2020-08-24 12:24 [RFD] x86: Curing the exception and syscall trainwreck in hardware Thomas Gleixner
2020-08-24 13:52 ` Andrew Cooper
2020-08-25 4:39 ` TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware) Sean Christopherson
2020-08-25 15:25 ` Dave Hansen
2020-08-25 16:49 ` Andy Lutomirski
2020-08-25 17:19 ` Sean Christopherson
2020-08-25 17:28 ` Andy Lutomirski
2020-08-25 17:35 ` Luck, Tony
2020-08-25 17:41 ` Andy Lutomirski
2020-08-25 17:59 ` Andrew Cooper
2020-08-25 18:38 ` Dave Hansen
2020-08-25 19:49 ` Thomas Gleixner
2020-08-26 19:16 ` Sean Christopherson
2020-08-30 15:37 ` Andy Lutomirski
2020-08-30 18:37 ` Linus Torvalds