* [LSF/MM TOPIC] Address space isolation inside the kernel @ 2019-02-07 7:24 Mike Rapoport 2019-02-14 19:21 ` Kees Cook ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Mike Rapoport @ 2019-02-07 7:24 UTC (permalink / raw) To: lsf-pc; +Cc: linux-mm, James Bottomley (Joint proposal with James Bottomley) Address space isolation has been used to protect the kernel from userspace and userspace programs from each other since the invention of virtual memory. Assuming that kernel bugs and therefore vulnerabilities are inevitable, it might be worth isolating parts of the kernel to minimize the damage that these vulnerabilities can cause. There is already ongoing work in a similar direction, like XPFO [1] and temporary mappings proposed for the kernel text poking [2]. We have several vague ideas of how we can take this even further and make different parts of the kernel run in different address spaces: * Remove most of the kernel mappings from the syscall entry and add a trampoline when the syscall processing needs to call the "core kernel". * Make the parts of the kernel that execute in a namespace use their own mappings for the namespace private data * Extend EXPORT_SYMBOL to include a trampoline so that the code running in modules won't map the entire kernel * Execute BPF programs in a dedicated address space These are very general possible directions. We are exploring some of them now to understand if the security value is worth the complexity and the performance impact. We believe it would be helpful to discuss the general idea of address space isolation inside the kernel, both from the technical aspect of how it can be achieved simply and efficiently and from the isolation aspect of what actual security guarantees it usefully provides. [1] https://lore.kernel.org/lkml/cover.1547153058.git.khalid.aziz@oracle.com/ [2] https://lore.kernel.org/lkml/20190129003422.9328-4-rick.p.edgecombe@intel.com/ -- Sincerely yours, Mike. 
^ permalink raw reply [flat|nested] 18+ messages in thread
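[Editorial note] The EXPORT_SYMBOL-with-trampoline direction in the proposal above can be sketched abstractly. The following is a toy userspace model in Python, purely illustrative: all names are hypothetical, and a real kernel implementation would switch address spaces inside the trampoline rather than merely dispatch. The point it shows is that module code only ever holds trampolines for exported symbols, never a mapping of the whole core kernel.

```python
# Toy model (illustrative only, NOT kernel code): a "module" is handed
# trampolines for exported symbols instead of direct visibility into
# everything the "core kernel" contains.

CORE_KERNEL = {
    "kmalloc": lambda n: bytearray(n),       # exported
    "do_secret_things": lambda: "secret",    # NOT exported
}
EXPORTED = {"kmalloc"}

def make_trampoline(name):
    if name not in EXPORTED:
        raise PermissionError(f"{name} is not exported")
    core_fn = CORE_KERNEL[name]
    def trampoline(*args):
        # In the real proposal this is where the address-space switch
        # into the core kernel mapping (and back) would happen.
        return core_fn(*args)
    return trampoline

def load_module(wanted_symbols):
    """A 'module' only ever sees trampolines, never CORE_KERNEL itself."""
    return {name: make_trampoline(name) for name in wanted_symbols}
```

With this structure an unexported symbol simply cannot be reached from module context, which is the isolation property the trampoline idea is after.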
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-02-07 7:24 [LSF/MM TOPIC] Address space isolation inside the kernel Mike Rapoport @ 2019-02-14 19:21 ` Kees Cook [not found] ` <CA+VK+GOpjXQ2-CLZt6zrW6m-=WpWpvcrXGSJ-723tRDMeAeHmg@mail.gmail.com> 2019-02-16 12:19 ` Balbir Singh 2 siblings, 0 replies; 18+ messages in thread From: Kees Cook @ 2019-02-14 19:21 UTC (permalink / raw) To: Mike Rapoport; +Cc: lsf-pc, Linux-MM, James Bottomley On Wed, Feb 6, 2019 at 11:24 PM Mike Rapoport <rppt@linux.ibm.com> wrote: > Address space isolation has been used to protect the kernel from > userspace and userspace programs from each other since the invention of > virtual memory. Well, traditionally the kernel's protection has been one-sided: we've left userspace mapped while in the kernel, which has led to countless exploits. SMEP/SMAP (or similar for other architectures, like ARM's PXN/PAN) have finally mitigated that, but we're still left with a lot of older machines (and other architectures) that would benefit from unmapping userspace while in the kernel. > Assuming that kernel bugs and therefore vulnerabilities are inevitable, > it might be worth isolating parts of the kernel to minimize the damage > that these vulnerabilities can cause. Yes please. :) Two cases jump to mind: 1) Make regions unwritable to avoid write-anywhere data modification attacks. For code and rodata, this is already done with regular page table bits making them read-only for the entire lifetime of the kernel. For areas that need writing but are sensitive (e.g. the page tables themselves, and generally function pointer tables), there needs to be a way to keep modifications isolated to given code (to block write-anywhere attacks), keeping them read-only through all other accesses. This could be done with per-CPU page tables, a faster version of the "write rarely" patch set[1], or maybe with the kernel text poking (mentioned in your email). 
Attacking the page tables directly is now the common way to gain execute control over the kernel, since so much of the rest of memory is locked down[2]. How can we keep page tables read-only except for when the page table code needs to write to them? 2) Make a region unreadable to avoid read-anywhere memory disclosure attacks. This means it's either unmapped (for both data and code cases) or we gain execute-not-read hardware bits (for code cases). Unmapping code means a reduction in ROP gadgets; unmapping data means a reduction in memory disclosure surface. Note that while both coarse (CET) and fine-grained (function-prototype-checking) CFI vastly reduce the availability of ROP gadgets, the kernel still has a lot of functions that return void and take a single unsigned long, so anything to remove more code from visibility is good. > There is already ongoing work in a similar direction, like XPFO [1] and > temporary mappings proposed for the kernel text poking [2]. > > We have several vague ideas of how we can take this even further and make > different parts of the kernel run in different address spaces: > * Remove most of the kernel mappings from the syscall entry and add a > trampoline when the syscall processing needs to call the "core > kernel". Defining this boundary may be very tricky, but maybe the same logic used for CFI and function graph analysis could be used to find the existing bright lines between code regions... > * Make the parts of the kernel that execute in a namespace use their > own mappings for the namespace private data > * Extend EXPORT_SYMBOL to include a trampoline so that the code > running in modules won't map the entire kernel > * Execute BPF programs in a dedicated address space Pushing drivers into isolated regions would be very interesting. If it needs context-switching, though, we're headed to microkernel fun. > These are very general possible directions. 
We are exploring some of > them now to understand if the security value is worth the complexity > and the performance impact. > > We believe it would be helpful to discuss the general idea of address > space isolation inside the kernel, both from the technical aspect of > how it can be achieved simply and efficiently and from the isolation > aspect of what actual security guarantees it usefully provides. > > [1] https://lore.kernel.org/lkml/cover.1547153058.git.khalid.aziz@oracle.com/ > [2] https://lore.kernel.org/lkml/20190129003422.9328-4-rick.p.edgecombe@intel.com/ I won't be able to make it to the conference, but I'm very interested in finding ways forward on this topic. :) -Kees [1] https://patchwork.kernel.org/project/kernel-hardening/list/?series=79855 [2] https://www.blackhat.com/docs/asia-18/asia-18-WANG-KSMA-Breaking-Android-kernel-isolation-and-Rooting-with-ARM-MMU-features.pdf -- Kees Cook ^ permalink raw reply [flat|nested] 18+ messages in thread
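[Editorial note] The "write rarely" pattern Kees describes above — data writable only from one dedicated code path and read-only on every other access — can be modeled abstractly. A toy Python sketch, purely illustrative: the kernel proposal would enforce this with page-table permissions or per-CPU page tables, not language-level checks, and all names here are hypothetical.

```python
# Toy model of "write rarely": a value that may only be modified
# inside an explicit write window, analogous to briefly mapping the
# backing page writable and then read-only again.
from contextlib import contextmanager

class WriteRarely:
    def __init__(self, value):
        self._value = value
        self._writable = False

    @property
    def value(self):
        return self._value

    @contextmanager
    def write_window(self):
        # Analogous to temporarily making the page writable for the
        # one trusted code path.
        self._writable = True
        try:
            yield self
        finally:
            self._writable = False

    def set(self, new_value):
        if not self._writable:
            # Analogous to a write fault on a read-only mapping:
            # a write-anywhere primitive lands here and fails.
            raise PermissionError("write outside write_window")
        self._value = new_value
```

The security value is that an attacker's arbitrary-write primitive executes outside the write window and therefore faults, while the legitimate updater still works.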
[parent not found: <CA+VK+GOpjXQ2-CLZt6zrW6m-=WpWpvcrXGSJ-723tRDMeAeHmg@mail.gmail.com>]
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel [not found] ` <CA+VK+GOpjXQ2-CLZt6zrW6m-=WpWpvcrXGSJ-723tRDMeAeHmg@mail.gmail.com> @ 2019-02-16 11:13 ` Paul Turner 2019-04-25 20:47 ` Jonathan Adams 0 siblings, 1 reply; 18+ messages in thread From: Paul Turner @ 2019-02-16 11:13 UTC (permalink / raw) To: lsf-pc, linux-mm, James Bottomley, Mike Rapoport; +Cc: Jonathan Adams [-- Attachment #1: Type: text/plain, Size: 2641 bytes --] I wanted to second the proposal for address space isolation. We have some new techniques to introduce here also, built around some new ideas using page faults that we believe are interesting. To wit, page faults uniquely allow us to fork speculative and non-speculative execution, as we can control the retired path within the fault itself (which, as it turns out, will obviously never be executed speculatively). This lets us provide isolation against variant 1 gadgets, as well as guarantee what data may or may not be cache-present for the purposes of L1TF and Meltdown mitigation. I'm not sure whether or not I'll be able to attend (I have a newborn and there's a lot of other scheduling I'm trying to work out). But Jonathan Adams (cc'd) has been working on this and can speak to it. We also have some write-ups to publish independently of this. Thanks, - Paul (Joint proposal with James Bottomley) > > Address space isolation has been used to protect the kernel from > userspace and userspace programs from each other since the invention of > virtual memory. > > Assuming that kernel bugs and therefore vulnerabilities are inevitable, > it might be worth isolating parts of the kernel to minimize the damage > that these vulnerabilities can cause. > > There is already ongoing work in a similar direction, like XPFO [1] and > temporary mappings proposed for the kernel text poking [2]. 
> > We have several vague ideas of how we can take this even further and make > different parts of the kernel run in different address spaces: > * Remove most of the kernel mappings from the syscall entry and add a > trampoline when the syscall processing needs to call the "core > kernel". > * Make the parts of the kernel that execute in a namespace use their > own mappings for the namespace private data > * Extend EXPORT_SYMBOL to include a trampoline so that the code > running in modules won't map the entire kernel > * Execute BPF programs in a dedicated address space > > These are very general possible directions. We are exploring some of > them now to understand if the security value is worth the complexity > and the performance impact. > > We believe it would be helpful to discuss the general idea of address > space isolation inside the kernel, both from the technical aspect of > how it can be achieved simply and efficiently and from the isolation > aspect of what actual security guarantees it usefully provides. > > [1] > https://lore.kernel.org/lkml/cover.1547153058.git.khalid.aziz@oracle.com/ > [2] > https://lore.kernel.org/lkml/20190129003422.9328-4-rick.p.edgecombe@intel.com/ > > -- > Sincerely yours, > Mike. > [-- Attachment #2: Type: text/html, Size: 3409 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-02-16 11:13 ` Paul Turner @ 2019-04-25 20:47 ` Jonathan Adams 2019-04-25 21:56 ` James Bottomley 0 siblings, 1 reply; 18+ messages in thread From: Jonathan Adams @ 2019-04-25 20:47 UTC (permalink / raw) To: Paul Turner; +Cc: lsf-pc, linux-mm, James Bottomley, Mike Rapoport It looks like the MM track isn't full, and I think this topic is an important thing to discuss. Cheers, - Jonathan On Sat, Feb 16, 2019 at 3:14 AM Paul Turner <pjt@google.com> wrote: > > I wanted to second the proposal for address space isolation. > > We have some new techniques to introduce here also, built around some new ideas using page faults that we believe are interesting. > > To wit, page faults uniquely allow us to fork speculative and non-speculative execution, as we can control the retired path within the fault itself (which, as it turns out, will obviously never be executed speculatively). > > This lets us provide isolation against variant 1 gadgets, as well as guarantee what data may or may not be cache-present for the purposes of L1TF and Meltdown mitigation. > > I'm not sure whether or not I'll be able to attend (I have a newborn and there's a lot of other scheduling I'm trying to work out). But Jonathan Adams (cc'd) has been working on this and can speak to it. We also have some write-ups to publish independently of this. > > Thanks, > > - Paul > >> (Joint proposal with James Bottomley) >> >> Address space isolation has been used to protect the kernel from >> userspace and userspace programs from each other since the invention of >> virtual memory. >> >> Assuming that kernel bugs and therefore vulnerabilities are inevitable, >> it might be worth isolating parts of the kernel to minimize the damage >> that these vulnerabilities can cause. >> >> There is already ongoing work in a similar direction, like XPFO [1] and >> temporary mappings proposed for the kernel text poking [2]. 
>> >> We have several vague ideas of how we can take this even further and >> make different parts of the kernel run in different address spaces: >> * Remove most of the kernel mappings from the syscall entry and add a >> trampoline when the syscall processing needs to call the "core >> kernel". >> * Make the parts of the kernel that execute in a namespace use their >> own mappings for the namespace private data >> * Extend EXPORT_SYMBOL to include a trampoline so that the code >> running in modules won't map the entire kernel >> * Execute BPF programs in a dedicated address space >> >> These are very general possible directions. We are exploring some of >> them now to understand if the security value is worth the complexity >> and the performance impact. >> >> We believe it would be helpful to discuss the general idea of address >> space isolation inside the kernel, both from the technical aspect of >> how it can be achieved simply and efficiently and from the isolation >> aspect of what actual security guarantees it usefully provides. >> >> [1] https://lore.kernel.org/lkml/cover.1547153058.git.khalid.aziz@oracle.com/ >> [2] https://lore.kernel.org/lkml/20190129003422.9328-4-rick.p.edgecombe@intel.com/ >> >> -- >> Sincerely yours, >> Mike. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-04-25 20:47 ` Jonathan Adams @ 2019-04-25 21:56 ` James Bottomley 2019-04-25 22:25 ` Paul Turner 0 siblings, 1 reply; 18+ messages in thread From: James Bottomley @ 2019-04-25 21:56 UTC (permalink / raw) To: Jonathan Adams, Paul Turner; +Cc: lsf-pc, linux-mm, Mike Rapoport On Thu, 2019-04-25 at 13:47 -0700, Jonathan Adams wrote: > It looks like the MM track isn't full, and I think this topic is an > important thing to discuss. Mike just posted the RFC patches for this using a ROP gadget preventer as a demo: https://lore.kernel.org/linux-mm/1556228754-12996-1-git-send-email-rppt@linux.ibm.com but, unfortunately, he won't be at LSF/MM. James ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-04-25 21:56 ` James Bottomley @ 2019-04-25 22:25 ` Paul Turner 2019-04-25 22:31 ` [Lsf-pc] " Alexei Starovoitov 0 siblings, 1 reply; 18+ messages in thread From: Paul Turner @ 2019-04-25 22:25 UTC (permalink / raw) To: James Bottomley; +Cc: Jonathan Adams, lsf-pc, linux-mm, Mike Rapoport [-- Attachment #1: Type: text/plain, Size: 845 bytes --] On Thu, Apr 25, 2019 at 2:56 PM James Bottomley < James.Bottomley@hansenpartnership.com> wrote: > On Thu, 2019-04-25 at 13:47 -0700, Jonathan Adams wrote: > > It looks like the MM track isn't full, and I think this topic is an > > important thing to discuss. > > Mike just posted the RFC patches for this using a ROP gadget preventer > as a demo: > > > https://lore.kernel.org/linux-mm/1556228754-12996-1-git-send-email-rppt@linux.ibm.com > > but, unfortunately, he won't be at LSF/MM. > > James > Mike's proposal is quite different, and targeted at restricting ROP execution. The work proposed by Jonathan aims to transparently restrict speculative execution to provide generic mitigation against Spectre-V1 gadgets (and similar) and potentially eliminate the current need for page table switches under most syscalls due to Meltdown. [-- Attachment #2: Type: text/html, Size: 1416 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Address space isolation inside the kernel 2019-04-25 22:25 ` Paul Turner @ 2019-04-25 22:31 ` [Lsf-pc] " Alexei Starovoitov 2019-04-25 22:40 ` Paul Turner 0 siblings, 1 reply; 18+ messages in thread From: Alexei Starovoitov @ 2019-04-25 22:31 UTC (permalink / raw) To: Paul Turner Cc: James Bottomley, linux-mm, lsf-pc, Mike Rapoport, Jonathan Adams, Daniel Borkmann, Jann Horn On Thu, Apr 25, 2019 at 3:27 PM Paul Turner via Lsf-pc <lsf-pc@lists.linux-foundation.org> wrote: > > On Thu, Apr 25, 2019 at 2:56 PM James Bottomley < > James.Bottomley@hansenpartnership.com> wrote: > > > On Thu, 2019-04-25 at 13:47 -0700, Jonathan Adams wrote: > > > It looks like the MM track isn't full, and I think this topic is an > > > important thing to discuss. > > > > Mike just posted the RFC patches for this using a ROP gadget preventer > > as a demo: > > > > > > https://lore.kernel.org/linux-mm/1556228754-12996-1-git-send-email-rppt@linux.ibm.com > > > > but, unfortunately, he won't be at LSF/MM. > > > > James > > > > Mike's proposal is quite different, and targeted at restricting ROP > execution. > The work proposed by Jonathan aims to transparently restrict > speculative execution to provide generic mitigation against Spectre-V1 > gadgets (and similar) and potentially eliminate the current need for > page table switches under most syscalls due to Meltdown. sounds very interesting. "v1 gadgets" would include unpriv bpf code too? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Address space isolation inside the kernel 2019-04-25 22:31 ` [Lsf-pc] " Alexei Starovoitov @ 2019-04-25 22:40 ` Paul Turner 0 siblings, 0 replies; 18+ messages in thread From: Paul Turner @ 2019-04-25 22:40 UTC (permalink / raw) To: Alexei Starovoitov Cc: James Bottomley, linux-mm, lsf-pc, Mike Rapoport, Jonathan Adams, Daniel Borkmann, Jann Horn [-- Attachment #1: Type: text/plain, Size: 1224 bytes --] On Thu, Apr 25, 2019 at 3:31 PM Alexei Starovoitov < alexei.starovoitov@gmail.com> wrote: > On Thu, Apr 25, 2019 at 3:27 PM Paul Turner via Lsf-pc > <lsf-pc@lists.linux-foundation.org> wrote: > > > > On Thu, Apr 25, 2019 at 2:56 PM James Bottomley < > > James.Bottomley@hansenpartnership.com> wrote: > > > > > On Thu, 2019-04-25 at 13:47 -0700, Jonathan Adams wrote: > > > > It looks like the MM track isn't full, and I think this topic is an > > > > important thing to discuss. > > > > > > Mike just posted the RFC patches for this using a ROP gadget preventer > > > as a demo: > > > > > > > > > > https://lore.kernel.org/linux-mm/1556228754-12996-1-git-send-email-rppt@linux.ibm.com > > > > > > but, unfortunately, he won't be at LSF/MM. > > > > > > James > > > > > > > Mike's proposal is quite different, and targeted at restricting ROP > > execution. > > The work proposed by Jonathan aims to transparently restrict > > speculative execution to provide generic mitigation against Spectre-V1 > > gadgets (and similar) and potentially eliminate the current need for > > page table switches under most syscalls due to Meltdown. > > sounds very interesting. > "v1 gadgets" would include unpriv bpf code too? > Yes [-- Attachment #2: Type: text/html, Size: 2085 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-02-07 7:24 [LSF/MM TOPIC] Address space isolation inside the kernel Mike Rapoport 2019-02-14 19:21 ` Kees Cook [not found] ` <CA+VK+GOpjXQ2-CLZt6zrW6m-=WpWpvcrXGSJ-723tRDMeAeHmg@mail.gmail.com> @ 2019-02-16 12:19 ` Balbir Singh 2019-02-16 16:30 ` James Bottomley 2 siblings, 1 reply; 18+ messages in thread From: Balbir Singh @ 2019-02-16 12:19 UTC (permalink / raw) To: Mike Rapoport; +Cc: lsf-pc, linux-mm, James Bottomley On Thu, Feb 07, 2019 at 09:24:22AM +0200, Mike Rapoport wrote: > (Joint proposal with James Bottomley) > > Address space isolation has been used to protect the kernel from > userspace and userspace programs from each other since the invention of > virtual memory. > > Assuming that kernel bugs and therefore vulnerabilities are inevitable, > it might be worth isolating parts of the kernel to minimize the damage > that these vulnerabilities can cause. > Is address space isolation limited to user space and kernel space? Where does the hypervisor fit into the picture? > There is already ongoing work in a similar direction, like XPFO [1] and > temporary mappings proposed for the kernel text poking [2]. > > We have several vague ideas of how we can take this even further and make > different parts of the kernel run in different address spaces: > * Remove most of the kernel mappings from the syscall entry and add a > trampoline when the syscall processing needs to call the "core > kernel". > * Make the parts of the kernel that execute in a namespace use their > own mappings for the namespace private data Is the key reason for removing mappings to keep the processor from speculating on data/text from those mappings? SMAP/SMEP provide a level of isolation from access and execution. For namespaces, does allocating the right memory protection key work? At some point we'll need to recycle the keys. It'll be an interesting discussion and I'd love to attend if invited. Balbir Singh. 
^ permalink raw reply [flat|nested] 18+ messages in thread
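[Editorial note] On the protection-keys question above: x86 MPK exposes only 16 keys per process, so any scheme that ties keys to namespaces has to allocate from a small pool and eventually recycle freed keys, exactly the constraint Balbir raises. A hypothetical toy allocator in Python just to make the constraint concrete (this is not the kernel's pkey code):

```python
class PkeyPool:
    """Toy allocator for a fixed pool of protection keys.

    x86 MPK has 16 keys; key 0 is the default, leaving 15 to hand
    out.  Once exhausted, a key must be freed (and its pages retagged)
    before a new user can get one -- the recycling problem.
    """
    NUM_KEYS = 16

    def __init__(self):
        self._free = list(range(1, self.NUM_KEYS))  # key 0 reserved

    def alloc(self):
        if not self._free:
            raise RuntimeError("out of protection keys")
        return self._free.pop(0)

    def free(self, key):
        assert 1 <= key < self.NUM_KEYS
        self._free.append(key)  # recycled for the next user
```

With far more namespaces than keys, keys would have to be multiplexed or recycled aggressively, which is one reason keys alone may not substitute for separate address spaces.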
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-02-16 12:19 ` Balbir Singh @ 2019-02-16 16:30 ` James Bottomley 2019-02-17 8:01 ` Balbir Singh 2019-02-17 19:34 ` Matthew Wilcox 0 siblings, 2 replies; 18+ messages in thread From: James Bottomley @ 2019-02-16 16:30 UTC (permalink / raw) To: Balbir Singh, Mike Rapoport; +Cc: lsf-pc, linux-mm On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote: > On Thu, Feb 07, 2019 at 09:24:22AM +0200, Mike Rapoport wrote: > > (Joint proposal with James Bottomley) > > > > Address space isolation has been used to protect the kernel from > > userspace and userspace programs from each other since the > > invention of virtual memory. > > > > Assuming that kernel bugs and therefore vulnerabilities are > > inevitable, it might be worth isolating parts of the kernel to > > minimize the damage that these vulnerabilities can cause. > > > > Is address space isolation limited to user space and kernel space? > Where does the hypervisor fit into the picture? It doesn't really. The work is driven by the Nabla HAP measure https://blog.hansenpartnership.com/measuring-the-horizontal-attack-profile-of-nabla-containers/ Although the results are spectacular (building a container that's measurably more secure than a hypervisor-based system), they come at the price of emulating a lot of the kernel and thus damaging the precise resource control advantage containers have. The idea then is to render parts of the kernel syscall interface safe enough that they have a security profile equivalent to the emulated one and can thus be called directly instead of being emulated, hoping to restore most of the container resource management properties. In theory, I suppose it would buy you protection from things like the kata containers host breach: https://nabla-containers.github.io/2018/11/28/fs/ > > There is already ongoing work in a similar direction, like XPFO [1] > > and temporary mappings proposed for the kernel text poking [2]. 
> > > > We have several vague ideas of how we can take this even further and > > make different parts of the kernel run in different address spaces: > > * Remove most of the kernel mappings from the syscall entry and add > > a > > trampoline when the syscall processing needs to call the "core > > kernel". > > * Make the parts of the kernel that execute in a namespace use > > their > > own mappings for the namespace private data > > Is the key reason for removing mappings to keep the processor > from speculating on data/text from those mappings? SMAP/SMEP provide > a level of isolation from access and execution. Not really, it's to reduce the exploitability of the code path and limit the exposure of data which can be compromised when you're exploited. > For namespaces, does allocating the right memory protection key > work? At some point we'll need to recycle the keys. I don't think anyone mentioned memory keys and namespaces ... I take it you're thinking of SEV/MKTME? The idea being to shield one container's execution from another using memory encryption? We've speculated it's possible but the actual mechanism we were looking at is tagging pages to namespaces (essentially using the mount namespace and tags on the page cache) so the kernel would refuse to map a page into the wrong namespace. This approach doesn't seem to be as promising as the separated address space one because the security properties are harder to measure. James > It'll be an interesting discussion and I'd love to attend if invited > > Balbir Singh. > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-02-16 16:30 ` James Bottomley @ 2019-02-17 8:01 ` Balbir Singh 2019-02-17 16:43 ` James Bottomley 2019-02-17 19:34 ` Matthew Wilcox 1 sibling, 1 reply; 18+ messages in thread From: Balbir Singh @ 2019-02-17 8:01 UTC (permalink / raw) To: James Bottomley; +Cc: Mike Rapoport, lsf-pc, linux-mm On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote: > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote: > > On Thu, Feb 07, 2019 at 09:24:22AM +0200, Mike Rapoport wrote: > > > (Joint proposal with James Bottomley) > > > > > > Address space isolation has been used to protect the kernel from > > > the userspace and userspace programs from each other since the > > > invention of the virtual memory. > > > > > > Assuming that kernel bugs and therefore vulnerabilities are > > > inevitable it might be worth isolating parts of the kernel to > > > minimize damage that these vulnerabilities can cause. > > > > > > > Is Address Space limited to user space and kernel space, where does > > the hypervisor fit into the picture? > > It doesn't really. The work is driven by the Nabla HAP measure > > https://blog.hansenpartnership.com/measuring-the-horizontal-attack-profile-of-nabla-containers/ > > Although the results are spectacular (building a container that's > measurably more secure than a hypervisor based system), they come at > the price of emulating a lot of the kernel and thus damaging the > precise resource control advantage containers have. The idea then is > to render parts of the kernel syscall interface safe enough that they > have a security profile equivalent to the emulated one and can thus be > called directly instead of being emulated, hoping to restore most of > the container resource management properties. 
> > In theory, I suppose it would buy you protection from things like the > kata containers host breach: > > https://nabla-containers.github.io/2018/11/28/fs/ > Thanks, so it's largely to prevent escaping the container namespace. Since the topic thread was generic, I thought I'd ask. > > > > There is already ongoing work in a similar direction, like XPFO [1] > > > and temporary mappings proposed for the kernel text poking [2]. > > > > > > We have several vague ideas of how we can take this even further and > > > make different parts of the kernel run in different address spaces: > > > * Remove most of the kernel mappings from the syscall entry and add > > > a > > > trampoline when the syscall processing needs to call the "core > > > kernel". > > > * Make the parts of the kernel that execute in a namespace use > > > their > > > own mappings for the namespace private data > > > > Is the key reason for removing mappings to keep the processor > > from speculating on data/text from those mappings? SMAP/SMEP provide > > a level of isolation from access and execution. > > Not really, it's to reduce the exploitability of the code path and > limit the exposure of data which can be compromised when you're > exploited. > Yep, understood > > For namespaces, does allocating the right memory protection key > > work? At some point we'll need to recycle the keys. > > I don't think anyone mentioned memory keys and namespaces ... I take it > you're thinking of SEV/MKTME? The idea being to shield one container's I was wondering why keys are not sufficient? I know no one mentioned it, but something I thought I'd bring up. > execution from another using memory encryption? We've speculated it's > possible but the actual mechanism we were looking at is tagging pages > to namespaces (essentially using the mount namespace and tags on the > page cache) so the kernel would refuse to map a page into the wrong > namespace. 
This approach doesn't seem to be as promising as the > separated address space one because the security properties are harder > to measure. > Thanks for clarifying the scope Balbir > James > > > > It'll be an interesting discussion and I'd love to attend if invited > > > > Balbir Singh. > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-02-17 8:01 ` Balbir Singh @ 2019-02-17 16:43 ` James Bottomley 0 siblings, 0 replies; 18+ messages in thread From: James Bottomley @ 2019-02-17 16:43 UTC (permalink / raw) To: Balbir Singh; +Cc: Mike Rapoport, lsf-pc, linux-mm On Sun, 2019-02-17 at 19:01 +1100, Balbir Singh wrote: > On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote: > > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote: > > > On Thu, Feb 07, 2019 at 09:24:22AM +0200, Mike Rapoport wrote: > > > > (Joint proposal with James Bottomley) > > > > > > > > Address space isolation has been used to protect the kernel > > > > from the userspace and userspace programs from each other since > > > > the invention of the virtual memory. > > > > > > > > Assuming that kernel bugs and therefore vulnerabilities are > > > > inevitable it might be worth isolating parts of the kernel to > > > > minimize damage that these vulnerabilities can cause. > > > > > > > > > > Is Address Space limited to user space and kernel space, where > > > does the hypervisor fit into the picture? > > > > It doesn't really. The work is driven by the Nabla HAP measure > > > > https://blog.hansenpartnership.com/measuring-the-horizontal-attack- > > profile-of-nabla-containers/ > > > > Although the results are spectacular (building a container that's > > measurably more secure than a hypervisor based system), they come > > at the price of emulating a lot of the kernel and thus damaging the > > precise resource control advantage containers have. The idea then > > is to render parts of the kernel syscall interface safe enough that > > they have a security profile equivalent to the emulated one and can > > thus be called directly instead of being emulated, hoping to > > restore most of the container resource management properties. 
> > > > In theory, I suppose it would buy you protection from things like > > the kata containers host breach: > > > > https://nabla-containers.github.io/2018/11/28/fs/ > > > > Thanks, so it's largely to prevent escaping the container namespace. > Since the topic thread was generic, I thought I'd ask Actually, that's not quite it either. The motivation is certainly container security but the current thrust of the work is generic kernel security ... the rising tide principle. James ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel 2019-02-16 16:30 ` James Bottomley 2019-02-17 8:01 ` Balbir Singh @ 2019-02-17 19:34 ` Matthew Wilcox 2019-02-17 20:09 ` James Bottomley 1 sibling, 1 reply; 18+ messages in thread From: Matthew Wilcox @ 2019-02-17 19:34 UTC (permalink / raw) To: James Bottomley; +Cc: Balbir Singh, Mike Rapoport, lsf-pc, linux-mm On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote: > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote: > > For namespaces, does allocating the right memory protection key > > work? At some point we'll need to recycle the keys > > I don't think anyone mentioned memory keys and namespaces ... I take it > you're thinking of SEV/MKTME? I thought he meant Protection Keys https://en.wikipedia.org/wiki/Memory_protection#Protection_keys > The idea being to shield one container's > execution from another using memory encryption? We've speculated it's > possible but the actual mechanism we were looking at is tagging pages > to namespaces (essentially using the mount namespace and tags on the > page cache) so the kernel would refuse to map a page into the wrong > namespace. This approach doesn't seem to be as promising as the > separated address space one because the security properties are harder > to measure. What do you mean by "tags on the page cache"? Is that different from the radix tree tags (now renamed to XArray marks), which are search keys? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel
  2019-02-17 19:34   ` Matthew Wilcox
@ 2019-02-17 20:09     ` James Bottomley
  2019-02-17 21:54       ` Balbir Singh
  2019-02-17 22:01       ` Balbir Singh
  0 siblings, 2 replies; 18+ messages in thread
From: James Bottomley @ 2019-02-17 20:09 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Balbir Singh, Mike Rapoport, lsf-pc, linux-mm

On Sun, 2019-02-17 at 11:34 -0800, Matthew Wilcox wrote:
> On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote:
> > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote:
> > > For namespaces, does allocating the right memory protection key
> > > work? At some point we'll need to recycle the keys
> >
> > I don't think anyone mentioned memory keys and namespaces ... I
> > take it you're thinking of SEV/MKTME?
>
> I thought he meant Protection Keys
> https://en.wikipedia.org/wiki/Memory_protection#Protection_keys

Really? I wasn't really considering that, mainly because on parisc we
use them to implement no-execute, so they'd have to be repurposed.

> > The idea being to shield one container's execution from another
> > using memory encryption? We've speculated it's possible but the
> > actual mechanism we were looking at is tagging pages to namespaces
> > (essentially using the mount namespace and tags on the page cache)
> > so the kernel would refuse to map a page into the wrong namespace.
> > This approach doesn't seem to be as promising as the separated
> > address space one because the security properties are harder to
> > measure.
>
> What do you mean by "tags on the page cache"? Is that different from
> the radix tree tags (now renamed to XArray marks), which are search
> keys?

Tagging the page cache to namespaces means having a set of mount
namespaces per page in the page cache and not allowing placing the page
into a VMA unless the owning task's nsproxy is one of the tagged mount
namespaces. The idea was to introduce kernel supported fencing between
containers, particularly if they were handling sensitive data, so that
if a container used an exploit to map another container's page, the
mapping would fail. However, since sensitive data should be on an
encrypted filesystem, it looks like SEV/MKTME coupled with file based
encryption might provide a better mechanism.

James
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel
  2019-02-17 20:09     ` James Bottomley
@ 2019-02-17 21:54       ` Balbir Singh
  2019-02-17 22:01       ` Balbir Singh
  1 sibling, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2019-02-17 21:54 UTC (permalink / raw)
To: James Bottomley; +Cc: Matthew Wilcox, Mike Rapoport, lsf-pc, linux-mm

On Sun, Feb 17, 2019 at 12:09:06PM -0800, James Bottomley wrote:
> On Sun, 2019-02-17 at 11:34 -0800, Matthew Wilcox wrote:
> > On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote:
> > > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote:
> > > > For namespaces, does allocating the right memory protection key
> > > > work? At some point we'll need to recycle the keys
> > >
> > > I don't think anyone mentioned memory keys and namespaces ... I
> > > take it you're thinking of SEV/MKTME?
> >
> > I thought he meant Protection Keys
> > https://en.wikipedia.org/wiki/Memory_protection#Protection_keys
>
> Really? I wasn't really considering that, mainly because on parisc we
> use them to implement no-execute, so they'd have to be repurposed.

Yes, but x86 and powerpc have the capability to use them for no-read,
no-write and no-execute (powerpc). I agree that this might not work
well across all architectures, but it might be an option for
architectures that support it.

> > > The idea being to shield one container's execution from another
> > > using memory encryption? We've speculated it's possible but the
> > > actual mechanism we were looking at is tagging pages to namespaces
> > > (essentially using the mount namespace and tags on the page cache)
> > > so the kernel would refuse to map a page into the wrong namespace.
> > > This approach doesn't seem to be as promising as the separated
> > > address space one because the security properties are harder to
> > > measure.
> >
> > What do you mean by "tags on the page cache"? Is that different
> > from the radix tree tags (now renamed to XArray marks), which are
> > search keys?
>
> Tagging the page cache to namespaces means having a set of mount
> namespaces per page in the page cache and not allowing placing the
> page into a VMA unless the owning task's nsproxy is one of the tagged
> mount namespaces. The idea was to introduce kernel supported fencing
> between containers, particularly if they were handling sensitive data,
> so that if a container used an exploit to map another container's
> page, the mapping would fail. However, since sensitive data should be
> on an encrypted filesystem, it looks like SEV/MKTME coupled with file
> based encryption might provide a better mechanism.
>
> James

Balbir Singh
* Re: [LSF/MM TOPIC] Address space isolation inside the kernel
  2019-02-17 20:09     ` James Bottomley
  2019-02-17 21:54       ` Balbir Singh
@ 2019-02-17 22:01       ` Balbir Singh
  2019-02-17 22:20         ` [Lsf-pc] " James Bottomley
  1 sibling, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2019-02-17 22:01 UTC (permalink / raw)
To: James Bottomley; +Cc: Matthew Wilcox, Mike Rapoport, lsf-pc, linux-mm

On Sun, Feb 17, 2019 at 12:09:06PM -0800, James Bottomley wrote:
> On Sun, 2019-02-17 at 11:34 -0800, Matthew Wilcox wrote:
> > On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote:
> > > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote:
> > > > For namespaces, does allocating the right memory protection key
> > > > work? At some point we'll need to recycle the keys
> > >
> > > I don't think anyone mentioned memory keys and namespaces ... I
> > > take it you're thinking of SEV/MKTME?
> >
> > I thought he meant Protection Keys
> > https://en.wikipedia.org/wiki/Memory_protection#Protection_keys
>
> Really? I wasn't really considering that, mainly because on parisc we
> use them to implement no-execute, so they'd have to be repurposed.
>
> > > The idea being to shield one container's execution from another
> > > using memory encryption? We've speculated it's possible but the
> > > actual mechanism we were looking at is tagging pages to namespaces
> > > (essentially using the mount namespace and tags on the page cache)
> > > so the kernel would refuse to map a page into the wrong namespace.
> > > This approach doesn't seem to be as promising as the separated
> > > address space one because the security properties are harder to
> > > measure.
> >
> > What do you mean by "tags on the page cache"? Is that different
> > from the radix tree tags (now renamed to XArray marks), which are
> > search keys?
>
> Tagging the page cache to namespaces means having a set of mount
> namespaces per page in the page cache and not allowing placing the
> page into a VMA unless the owning task's nsproxy is one of the tagged
> mount namespaces. The idea was to introduce kernel supported fencing
> between containers, particularly if they were handling sensitive data,
> so that if a container used an exploit to map another container's
> page, the mapping would fail. However, since sensitive data should be
> on an encrypted filesystem, it looks like SEV/MKTME coupled with file
> based encryption might provide a better mechanism.

Splitting out this point to a different email, I think being able to
tag page cache is quite interesting and in the long run might help us
to get things like mincore() right across shared boundaries.

But any fencing will come in the way of sharing and density of
containers. I still don't see how a container can map page cache it
does not have the right permissions for. In an ideal world any
writable (sensitive) pages should go to the writable bits of the union
mount filesystem, which is private to the container (but I could be
making things up without trying them out).

Balbir Singh.
* Re: [Lsf-pc] [LSF/MM TOPIC] Address space isolation inside the kernel
  2019-02-17 22:01       ` Balbir Singh
@ 2019-02-17 22:20         ` James Bottomley
  2019-02-18 11:15           ` Balbir Singh
  0 siblings, 1 reply; 18+ messages in thread
From: James Bottomley @ 2019-02-17 22:20 UTC (permalink / raw)
To: Balbir Singh; +Cc: linux-mm, lsf-pc, Matthew Wilcox, Mike Rapoport

On Mon, 2019-02-18 at 09:01 +1100, Balbir Singh wrote:
> On Sun, Feb 17, 2019 at 12:09:06PM -0800, James Bottomley wrote:
> > On Sun, 2019-02-17 at 11:34 -0800, Matthew Wilcox wrote:
> > > On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote:
> > > > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote:
> > > > > For namespaces, does allocating the right memory protection
> > > > > key work? At some point we'll need to recycle the keys
> > > >
> > > > I don't think anyone mentioned memory keys and namespaces ... I
> > > > take it you're thinking of SEV/MKTME?
> > >
> > > I thought he meant Protection Keys
> > > https://en.wikipedia.org/wiki/Memory_protection#Protection_keys
> >
> > Really? I wasn't really considering that, mainly because on parisc
> > we use them to implement no-execute, so they'd have to be
> > repurposed.
> >
> > > > The idea being to shield one container's execution from another
> > > > using memory encryption? We've speculated it's possible but
> > > > the actual mechanism we were looking at is tagging pages to
> > > > namespaces (essentially using the mount namespace and tags on
> > > > the page cache) so the kernel would refuse to map a page into
> > > > the wrong namespace. This approach doesn't seem to be as
> > > > promising as the separated address space one because the
> > > > security properties are harder to measure.
> > >
> > > What do you mean by "tags on the page cache"? Is that different
> > > from the radix tree tags (now renamed to XArray marks), which are
> > > search keys?
> >
> > Tagging the page cache to namespaces means having a set of mount
> > namespaces per page in the page cache and not allowing placing the
> > page into a VMA unless the owning task's nsproxy is one of the
> > tagged mount namespaces. The idea was to introduce kernel
> > supported fencing between containers, particularly if they were
> > handling sensitive data, so that if a container used an exploit to
> > map another container's page, the mapping would fail. However,
> > since sensitive data should be on an encrypted filesystem, it looks
> > like SEV/MKTME coupled with file based encryption might provide a
> > better mechanism.
>
> Splitting out this point to a different email, I think being able to
> tag page cache is quite interesting and in the long run might help
> us to get things like mincore() right across shared boundaries.
>
> But any fencing will come in the way of sharing and density of
> containers. I still don't see how a container can map page cache it
> does not have the right permissions for. In an ideal world any
> writable (sensitive) pages should go to the writable bits of the
> union mount filesystem, which is private to the container (but I
> could be making things up without trying them out).

As I said before, it's about reducing the horizontal attack profile
(HAP). If the kernel were perfectly free from bugs and exploits,
containment would be perfect and the HAP would be zero. In the real
world, where the kernel is trusted (it's your kernel) but potentially
vulnerable (it's not free from possibly exploitable defects), the HAP
is non-zero and the question becomes how you prevent one tenant from
exploiting a defect to interfere with or exfiltrate data from another
tenant.

The idea behind page tagging is that modern techniques (like ROP
attacks) use existing code sequences within the kernel to perform the
exploit, so if all code sequences that map pages contain tag guards,
the defences against one container accessing another's pages remain in
place even in the face of exploits.

James
* Re: [Lsf-pc] [LSF/MM TOPIC] Address space isolation inside the kernel
  2019-02-17 22:20         ` [Lsf-pc] " James Bottomley
@ 2019-02-18 11:15           ` Balbir Singh
  0 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2019-02-18 11:15 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-mm, lsf-pc, Matthew Wilcox, Mike Rapoport

On Sun, Feb 17, 2019 at 02:20:50PM -0800, James Bottomley wrote:
> On Mon, 2019-02-18 at 09:01 +1100, Balbir Singh wrote:
> > On Sun, Feb 17, 2019 at 12:09:06PM -0800, James Bottomley wrote:
> > > On Sun, 2019-02-17 at 11:34 -0800, Matthew Wilcox wrote:
> > > > On Sat, Feb 16, 2019 at 08:30:16AM -0800, James Bottomley wrote:
> > > > > On Sat, 2019-02-16 at 23:19 +1100, Balbir Singh wrote:
> > > > > > For namespaces, does allocating the right memory protection
> > > > > > key work? At some point we'll need to recycle the keys
> > > > >
> > > > > I don't think anyone mentioned memory keys and namespaces ... I
> > > > > take it you're thinking of SEV/MKTME?
> > > >
> > > > I thought he meant Protection Keys
> > > > https://en.wikipedia.org/wiki/Memory_protection#Protection_keys
> > >
> > > Really? I wasn't really considering that, mainly because on parisc
> > > we use them to implement no-execute, so they'd have to be
> > > repurposed.
> > >
> > > > > The idea being to shield one container's execution from
> > > > > another using memory encryption? We've speculated it's
> > > > > possible but the actual mechanism we were looking at is
> > > > > tagging pages to namespaces (essentially using the mount
> > > > > namespace and tags on the page cache) so the kernel would
> > > > > refuse to map a page into the wrong namespace. This approach
> > > > > doesn't seem to be as promising as the separated address
> > > > > space one because the security properties are harder to
> > > > > measure.
> > > >
> > > > What do you mean by "tags on the page cache"? Is that different
> > > > from the radix tree tags (now renamed to XArray marks), which
> > > > are search keys?
> > >
> > > Tagging the page cache to namespaces means having a set of mount
> > > namespaces per page in the page cache and not allowing placing the
> > > page into a VMA unless the owning task's nsproxy is one of the
> > > tagged mount namespaces. The idea was to introduce kernel
> > > supported fencing between containers, particularly if they were
> > > handling sensitive data, so that if a container used an exploit to
> > > map another container's page, the mapping would fail. However,
> > > since sensitive data should be on an encrypted filesystem, it
> > > looks like SEV/MKTME coupled with file based encryption might
> > > provide a better mechanism.
> >
> > Splitting out this point to a different email, I think being able to
> > tag page cache is quite interesting and in the long run might help
> > us to get things like mincore() right across shared boundaries.
> >
> > But any fencing will come in the way of sharing and density of
> > containers. I still don't see how a container can map page cache it
> > does not have the right permissions for. In an ideal world any
> > writable (sensitive) pages should go to the writable bits of the
> > union mount filesystem, which is private to the container (but I
> > could be making things up without trying them out).
>
> As I said before, it's about reducing the horizontal attack profile
> (HAP). If the kernel were perfectly free from bugs and exploits,
> containment would be perfect and the HAP would be zero. In the real
> world, where the kernel is trusted (it's your kernel) but potentially
> vulnerable (it's not free from possibly exploitable defects), the HAP
> is non-zero and the question becomes how you prevent one tenant from
> exploiting a defect to interfere with or exfiltrate data from another
> tenant.
>
> The idea behind page tagging is that modern techniques (like ROP
> attacks) use existing code sequences within the kernel to perform the
> exploit, so if all code sequences that map pages contain tag guards,
> the defences against one container accessing another's pages remain
> in place even in the face of exploits.

Agreed, and I believe in defense in depth. I'd love to participate and
see what the final proposal looks like and what elements are used.

Balbir Singh.
end of thread, other threads:[~2019-04-25 22:41 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2019-02-07  7:24 [LSF/MM TOPIC] Address space isolation inside the kernel Mike Rapoport
2019-02-14 19:21 ` Kees Cook
     [not found] ` <CA+VK+GOpjXQ2-CLZt6zrW6m-=WpWpvcrXGSJ-723tRDMeAeHmg@mail.gmail.com>
2019-02-16 11:13   ` Paul Turner
2019-04-25 20:47     ` Jonathan Adams
2019-04-25 21:56       ` James Bottomley
2019-04-25 22:25         ` Paul Turner
2019-04-25 22:31         ` [Lsf-pc] " Alexei Starovoitov
2019-04-25 22:40           ` Paul Turner
2019-02-16 12:19 ` Balbir Singh
2019-02-16 16:30   ` James Bottomley
2019-02-17  8:01     ` Balbir Singh
2019-02-17 16:43       ` James Bottomley
2019-02-17 19:34     ` Matthew Wilcox
2019-02-17 20:09       ` James Bottomley
2019-02-17 21:54         ` Balbir Singh
2019-02-17 22:01         ` Balbir Singh
2019-02-17 22:20           ` [Lsf-pc] " James Bottomley
2019-02-18 11:15             ` Balbir Singh