Linux Security Modules development
 help / color / mirror / Atom feed
* Re: [PATCH v7 1/2] mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
From: Michal Hocko @ 2019-06-21  9:11 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Andrew Morton, Christoph Lameter, Kees Cook, Masahiro Yamada,
	James Morris, Serge E. Hallyn, Nick Desaulniers,
	Kostya Serebryany, Dmitry Vyukov, Sandeep Patil, Laura Abbott,
	Randy Dunlap, Jann Horn, Mark Rutland, Marco Elver,
	Linux Memory Management List, linux-security-module,
	Kernel Hardening
In-Reply-To: <CAG_fn=UFj0Lzy3FgMV_JBKtxCiwE03HVxnR8=f9a7=4nrUFXSw@mail.gmail.com>

On Fri 21-06-19 10:57:35, Alexander Potapenko wrote:
> On Fri, Jun 21, 2019 at 9:09 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > > index fd5c95ff9251..2f75dd0d0d81 100644
> > > --- a/kernel/kexec_core.c
> > > +++ b/kernel/kexec_core.c
> > > @@ -315,7 +315,7 @@ static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
> > >               arch_kexec_post_alloc_pages(page_address(pages), count,
> > >                                           gfp_mask);
> > >
> > > -             if (gfp_mask & __GFP_ZERO)
> > > +             if (want_init_on_alloc(gfp_mask))
> > >                       for (i = 0; i < count; i++)
> > >                               clear_highpage(pages + i);
> > >       }
> >
> > I am not really sure I follow here. Why do we want to handle
> > want_init_on_alloc here? The allocated memory comes from the page
> > allocator and so it will get zeroed there. arch_kexec_post_alloc_pages
> > might touch the content there but is there any actual risk of any kind
> > of leak?
> You're right, we don't want to initialize this memory if init_on_alloc is on.
> We need something along the lines of:
>   if (!static_branch_unlikely(&init_on_alloc))
>     if (gfp_mask & __GFP_ZERO)
>       // clear the pages
> 
> Another option would be to disable initialization in alloc_pages() using a flag.

Or we can simply not care and keen the code the way it is. First of all
it seems that nobody actually does use __GFP_ZERO unless I have missed
soemthing
	- kimage_alloc_pages(KEXEC_CONTROL_MEMORY_GFP, order); # GFP_KERNEL | __GFP_NORETRY
	- kimage_alloc_pages(gfp_mask, 0);
		- kimage_alloc_page(image, GFP_KERNEL, KIMAGE_NO_DEST);
		- kimage_alloc_page(image, GFP_HIGHUSER, maddr);

but even if we actually had a user do we care about double intialization
for something kexec related? It is not any hot path AFAIR.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v7 1/2] mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
From: Alexander Potapenko @ 2019-06-21  8:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Christoph Lameter, Kees Cook, Masahiro Yamada,
	James Morris, Serge E. Hallyn, Nick Desaulniers,
	Kostya Serebryany, Dmitry Vyukov, Sandeep Patil, Laura Abbott,
	Randy Dunlap, Jann Horn, Mark Rutland, Marco Elver,
	Linux Memory Management List, linux-security-module,
	Kernel Hardening
In-Reply-To: <20190621070905.GA3429@dhcp22.suse.cz>

On Fri, Jun 21, 2019 at 9:09 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 17-06-19 17:10:49, Alexander Potapenko wrote:
> > The new options are needed to prevent possible information leaks and
> > make control-flow bugs that depend on uninitialized values more
> > deterministic.
> >
> > init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
> > objects with zeroes. Initialization is done at allocation time at the
> > places where checks for __GFP_ZERO are performed.
> >
> > init_on_free=1 makes the kernel initialize freed pages and heap objects
> > with zeroes upon their deletion. This helps to ensure sensitive data
> > doesn't leak via use-after-free accesses.
> >
> > Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
> > returns zeroed memory. The two exceptions are slab caches with
> > constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
> > zero-initialized to preserve their semantics.
> >
> > Both init_on_alloc and init_on_free default to zero, but those defaults
> > can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
> > CONFIG_INIT_ON_FREE_DEFAULT_ON.
> >
> > Slowdown for the new features compared to init_on_free=0,
> > init_on_alloc=0:
> >
> > hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
> > hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
> >
> > Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
> > Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
> > Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
> > Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
> >
> > The slowdown for init_on_free=0, init_on_alloc=0 compared to the
> > baseline is within the standard error.
> >
> > The new features are also going to pave the way for hardware memory
> > tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
> > hooks to set the tags for heap objects. With MTE, tagging will have the
> > same cost as memory initialization.
> >
> > Although init_on_free is rather costly, there are paranoid use-cases where
> > in-memory data lifetime is desired to be minimized. There are various
> > arguments for/against the realism of the associated threat models, but
> > given that we'll need the infrastructre for MTE anyway, and there are
> > people who want wipe-on-free behavior no matter what the performance cost,
> > it seems reasonable to include it in this series.
>
> Thanks for reworking the original implemenation. This looks much better!
>
> > Signed-off-by: Alexander Potapenko <glider@google.com>
> > Acked-by: Kees Cook <keescook@chromium.org>
> > To: Andrew Morton <akpm@linux-foundation.org>
> > To: Christoph Lameter <cl@linux.com>
> > To: Kees Cook <keescook@chromium.org>
> > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: James Morris <jmorris@namei.org>
> > Cc: "Serge E. Hallyn" <serge@hallyn.com>
> > Cc: Nick Desaulniers <ndesaulniers@google.com>
> > Cc: Kostya Serebryany <kcc@google.com>
> > Cc: Dmitry Vyukov <dvyukov@google.com>
> > Cc: Sandeep Patil <sspatil@android.com>
> > Cc: Laura Abbott <labbott@redhat.com>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Marco Elver <elver@google.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-security-module@vger.kernel.org
> > Cc: kernel-hardening@lists.openwall.com
>
> Acked-by: Michal Hocko <mhocko@suse.cz> # page allocator parts.
>
> kmalloc based parts look good to me as well but I am not sure I fill
> qualified to give my ack there without much more digging and I do not
> have much time for that now.
>
> [...]
> > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > index fd5c95ff9251..2f75dd0d0d81 100644
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -315,7 +315,7 @@ static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
> >               arch_kexec_post_alloc_pages(page_address(pages), count,
> >                                           gfp_mask);
> >
> > -             if (gfp_mask & __GFP_ZERO)
> > +             if (want_init_on_alloc(gfp_mask))
> >                       for (i = 0; i < count; i++)
> >                               clear_highpage(pages + i);
> >       }
>
> I am not really sure I follow here. Why do we want to handle
> want_init_on_alloc here? The allocated memory comes from the page
> allocator and so it will get zeroed there. arch_kexec_post_alloc_pages
> might touch the content there but is there any actual risk of any kind
> of leak?
You're right, we don't want to initialize this memory if init_on_alloc is on.
We need something along the lines of:
  if (!static_branch_unlikely(&init_on_alloc))
    if (gfp_mask & __GFP_ZERO)
      // clear the pages

Another option would be to disable initialization in alloc_pages() using a flag.
>
> > diff --git a/mm/dmapool.c b/mm/dmapool.c
> > index 8c94c89a6f7e..e164012d3491 100644
> > --- a/mm/dmapool.c
> > +++ b/mm/dmapool.c
> > @@ -378,7 +378,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
> >  #endif
> >       spin_unlock_irqrestore(&pool->lock, flags);
> >
> > -     if (mem_flags & __GFP_ZERO)
> > +     if (want_init_on_alloc(mem_flags))
> >               memset(retval, 0, pool->size);
> >
> >       return retval;
>
> Don't you miss dma_pool_free and want_init_on_free?
Agreed.
I'll fix this and add tests for DMA pools as well.
> --
> Michal Hocko
> SUSE Labs



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply

* Re: [PATCH v7 1/2] mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
From: Michal Hocko @ 2019-06-21  7:09 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Andrew Morton, Christoph Lameter, Kees Cook, Masahiro Yamada,
	James Morris, Serge E. Hallyn, Nick Desaulniers,
	Kostya Serebryany, Dmitry Vyukov, Sandeep Patil, Laura Abbott,
	Randy Dunlap, Jann Horn, Mark Rutland, Marco Elver, linux-mm,
	linux-security-module, kernel-hardening
In-Reply-To: <20190617151050.92663-2-glider@google.com>

On Mon 17-06-19 17:10:49, Alexander Potapenko wrote:
> The new options are needed to prevent possible information leaks and
> make control-flow bugs that depend on uninitialized values more
> deterministic.
> 
> init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
> objects with zeroes. Initialization is done at allocation time at the
> places where checks for __GFP_ZERO are performed.
> 
> init_on_free=1 makes the kernel initialize freed pages and heap objects
> with zeroes upon their deletion. This helps to ensure sensitive data
> doesn't leak via use-after-free accesses.
> 
> Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
> returns zeroed memory. The two exceptions are slab caches with
> constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
> zero-initialized to preserve their semantics.
> 
> Both init_on_alloc and init_on_free default to zero, but those defaults
> can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
> CONFIG_INIT_ON_FREE_DEFAULT_ON.
> 
> Slowdown for the new features compared to init_on_free=0,
> init_on_alloc=0:
> 
> hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
> hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
> 
> Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
> Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
> Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
> Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
> 
> The slowdown for init_on_free=0, init_on_alloc=0 compared to the
> baseline is within the standard error.
> 
> The new features are also going to pave the way for hardware memory
> tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
> hooks to set the tags for heap objects. With MTE, tagging will have the
> same cost as memory initialization.
> 
> Although init_on_free is rather costly, there are paranoid use-cases where
> in-memory data lifetime is desired to be minimized. There are various
> arguments for/against the realism of the associated threat models, but
> given that we'll need the infrastructre for MTE anyway, and there are
> people who want wipe-on-free behavior no matter what the performance cost,
> it seems reasonable to include it in this series.

Thanks for reworking the original implemenation. This looks much better!

> Signed-off-by: Alexander Potapenko <glider@google.com>
> Acked-by: Kees Cook <keescook@chromium.org>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Christoph Lameter <cl@linux.com>
> To: Kees Cook <keescook@chromium.org>
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: James Morris <jmorris@namei.org>
> Cc: "Serge E. Hallyn" <serge@hallyn.com>
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Kostya Serebryany <kcc@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Sandeep Patil <sspatil@android.com>
> Cc: Laura Abbott <labbott@redhat.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Marco Elver <elver@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-security-module@vger.kernel.org
> Cc: kernel-hardening@lists.openwall.com

Acked-by: Michal Hocko <mhocko@suse.cz> # page allocator parts.

kmalloc based parts look good to me as well but I am not sure I fill
qualified to give my ack there without much more digging and I do not
have much time for that now.

[...]
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index fd5c95ff9251..2f75dd0d0d81 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -315,7 +315,7 @@ static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
>  		arch_kexec_post_alloc_pages(page_address(pages), count,
>  					    gfp_mask);
>  
> -		if (gfp_mask & __GFP_ZERO)
> +		if (want_init_on_alloc(gfp_mask))
>  			for (i = 0; i < count; i++)
>  				clear_highpage(pages + i);
>  	}

I am not really sure I follow here. Why do we want to handle
want_init_on_alloc here? The allocated memory comes from the page
allocator and so it will get zeroed there. arch_kexec_post_alloc_pages
might touch the content there but is there any actual risk of any kind
of leak?

> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index 8c94c89a6f7e..e164012d3491 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -378,7 +378,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
>  #endif
>  	spin_unlock_irqrestore(&pool->lock, flags);
>  
> -	if (mem_flags & __GFP_ZERO)
> +	if (want_init_on_alloc(mem_flags))
>  		memset(retval, 0, pool->size);
>  
>  	return retval;

Don't you miss dma_pool_free and want_init_on_free?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH V31 07/25] kexec_file: Restrict at runtime if the kernel is locked down
From: Dave Young @ 2019-06-21  6:43 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, Jiri Bohac, linux-api, kexec, linux-kernel,
	Matthew Garrett, dhowells, linux-security-module, luto
In-Reply-To: <20190326182742.16950-8-matthewgarrett@google.com>

On 03/26/19 at 11:27am, Matthew Garrett wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> 
> When KEXEC_SIG is not enabled, kernel should not load images through
> kexec_file systemcall if the kernel is locked down.
> 
> [Modified by David Howells to fit with modifications to the previous patch
>  and to return -EPERM if the kernel is locked down for consistency with
>  other lockdowns. Modified by Matthew Garrett to remove the IMA
>  integration, which will be replaced by integrating with the IMA
>  architecture policy patches.]
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Jiri Bohac <jbohac@suse.cz>
> cc: kexec@lists.infradead.org
> ---
>  kernel/kexec_file.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 67f3a866eabe..a1cc37c8b43b 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -239,6 +239,12 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
>  		}
>  
>  		ret = 0;
> +
> +		if (kernel_is_locked_down(reason, LOCKDOWN_INTEGRITY)) {
> +			ret = -EPERM;
> +			goto out;
> +		}
> +

Checking here is late, it would be good to move the check to earlier
code around below code:
        /* We only trust the superuser with rebooting the system. */
        if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
                return -EPERM;

>  		break;
>  
>  		/* All other errors are fatal, including nomem, unparseable
> -- 
> 2.21.0.392.gf8f6787159e-goog
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave

^ permalink raw reply

* Re: [PATCH V31 06/25] kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE
From: Dave Young @ 2019-06-21  6:34 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, Jiri Bohac, linux-api, kexec, linux-kernel,
	Matthew Garrett, dhowells, linux-security-module, luto
In-Reply-To: <20190326182742.16950-7-matthewgarrett@google.com>

On 03/26/19 at 11:27am, Matthew Garrett wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> 
> This is a preparatory patch for kexec_file_load() lockdown.  A locked down
> kernel needs to prevent unsigned kernel images from being loaded with
> kexec_file_load().  Currently, the only way to force the signature
> verification is compiling with KEXEC_VERIFY_SIG.  This prevents loading
> usigned images even when the kernel is not locked down at runtime.
> 
> This patch splits KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE.
> Analogous to the MODULE_SIG and MODULE_SIG_FORCE for modules, KEXEC_SIG
> turns on the signature verification but allows unsigned images to be
> loaded.  KEXEC_SIG_FORCE disallows images without a valid signature.
> 
> [Modified by David Howells such that:
> 
>  (1) verify_pefile_signature() differentiates between no-signature and
>      sig-didn't-match in its returned errors.
> 
>  (2) kexec fails with EKEYREJECTED and logs an appropriate message if
>      signature checking is enforced and an signature is not found, uses
>      unsupported crypto or has no matching key.
> 
>  (3) kexec fails with EKEYREJECTED if there is a signature for which we
>      have a key, but signature doesn't match - even if in non-forcing mode.
> 
>  (4) kexec fails with EBADMSG or some other error if there is a signature
>      which cannot be parsed - even if in non-forcing mode.
> 
>  (5) kexec fails with ELIBBAD if the PE file cannot be parsed to extract
>      the signature - even if in non-forcing mode.
> 
> ]
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Jiri Bohac <jbohac@suse.cz>
> cc: kexec@lists.infradead.org
> ---
>  arch/x86/Kconfig                       | 20 ++++++++---
>  crypto/asymmetric_keys/verify_pefile.c |  4 ++-
>  include/linux/kexec.h                  |  4 +--
>  kernel/kexec_file.c                    | 48 ++++++++++++++++++++++----
>  4 files changed, 61 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4b4a7f32b68e..735d04a4b18f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2016,20 +2016,30 @@ config KEXEC_FILE
>  config ARCH_HAS_KEXEC_PURGATORY
>  	def_bool KEXEC_FILE
>  
> -config KEXEC_VERIFY_SIG
> +config KEXEC_SIG
>  	bool "Verify kernel signature during kexec_file_load() syscall"
>  	depends on KEXEC_FILE
>  	---help---
> -	  This option makes kernel signature verification mandatory for
> -	  the kexec_file_load() syscall.
>  
> -	  In addition to that option, you need to enable signature
> +	  This option makes the kexec_file_load() syscall check for a valid
> +	  signature of the kernel image.  The image can still be loaded without
> +	  a valid signature unless you also enable KEXEC_SIG_FORCE, though if
> +	  there's a signature that we can check, then it must be valid.
> +
> +	  In addition to this option, you need to enable signature
>  	  verification for the corresponding kernel image type being
>  	  loaded in order for this to work.
>  
> +config KEXEC_SIG_FORCE
> +	bool "Require a valid signature in kexec_file_load() syscall"
> +	depends on KEXEC_SIG
> +	---help---
> +	  This option makes kernel signature verification mandatory for
> +	  the kexec_file_load() syscall.
> +
>  config KEXEC_BZIMAGE_VERIFY_SIG
>  	bool "Enable bzImage signature verification support"
> -	depends on KEXEC_VERIFY_SIG
> +	depends on KEXEC_SIG
>  	depends on SIGNED_PE_FILE_VERIFICATION
>  	select SYSTEM_TRUSTED_KEYRING
>  	---help---
> diff --git a/crypto/asymmetric_keys/verify_pefile.c b/crypto/asymmetric_keys/verify_pefile.c
> index d178650fd524..4473cea1e877 100644
> --- a/crypto/asymmetric_keys/verify_pefile.c
> +++ b/crypto/asymmetric_keys/verify_pefile.c
> @@ -100,7 +100,7 @@ static int pefile_parse_binary(const void *pebuf, unsigned int pelen,
>  
>  	if (!ddir->certs.virtual_address || !ddir->certs.size) {
>  		pr_debug("Unsigned PE binary\n");
> -		return -EKEYREJECTED;
> +		return -ENODATA;
>  	}
>  
>  	chkaddr(ctx->header_size, ddir->certs.virtual_address,
> @@ -408,6 +408,8 @@ static int pefile_digest_pe(const void *pebuf, unsigned int pelen,
>   *  (*) 0 if at least one signature chain intersects with the keys in the trust
>   *	keyring, or:
>   *
> + *  (*) -ENODATA if there is no signature present.
> + *
>   *  (*) -ENOPKG if a suitable crypto module couldn't be found for a check on a
>   *	chain.
>   *
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index b9b1bc5f9669..58b27c7bdc2b 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -125,7 +125,7 @@ typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
>  			     unsigned long cmdline_len);
>  typedef int (kexec_cleanup_t)(void *loader_data);
>  
> -#ifdef CONFIG_KEXEC_VERIFY_SIG
> +#ifdef CONFIG_KEXEC_SIG
>  typedef int (kexec_verify_sig_t)(const char *kernel_buf,
>  				 unsigned long kernel_len);
>  #endif
> @@ -134,7 +134,7 @@ struct kexec_file_ops {
>  	kexec_probe_t *probe;
>  	kexec_load_t *load;
>  	kexec_cleanup_t *cleanup;
> -#ifdef CONFIG_KEXEC_VERIFY_SIG
> +#ifdef CONFIG_KEXEC_SIG
>  	kexec_verify_sig_t *verify_sig;
>  #endif
>  };
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index f1d0e00a3971..67f3a866eabe 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -90,7 +90,7 @@ int __weak arch_kimage_file_post_load_cleanup(struct kimage *image)
>  	return kexec_image_post_load_cleanup_default(image);
>  }
>  
> -#ifdef CONFIG_KEXEC_VERIFY_SIG
> +#ifdef CONFIG_KEXEC_SIG
>  static int kexec_image_verify_sig_default(struct kimage *image, void *buf,
>  					  unsigned long buf_len)
>  {
> @@ -188,7 +188,8 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
>  			     const char __user *cmdline_ptr,
>  			     unsigned long cmdline_len, unsigned flags)
>  {
> -	int ret = 0;
> +	const char *reason;
> +	int ret;
>  	void *ldata;
>  	loff_t size;
>  
> @@ -207,15 +208,48 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
>  	if (ret)
>  		goto out;
>  
> -#ifdef CONFIG_KEXEC_VERIFY_SIG
> +#ifdef CONFIG_KEXEC_SIG
>  	ret = arch_kexec_kernel_verify_sig(image, image->kernel_buf,
>  					   image->kernel_buf_len);
> -	if (ret) {
> -		pr_debug("kernel signature verification failed.\n");
> +#else
> +	ret = -ENODATA;
> +#endif
> +
> +	switch (ret) {
> +	case 0:
> +		break;
> +
> +		/* Certain verification errors are non-fatal if we're not
> +		 * checking errors, provided we aren't mandating that there
> +		 * must be a valid signature.
> +		 */
> +	case -ENODATA:
> +		reason = "kexec of unsigned image";
> +		goto decide;
> +	case -ENOPKG:
> +		reason = "kexec of image with unsupported crypto";
> +		goto decide;
> +	case -ENOKEY:
> +		reason = "kexec of image with unavailable key";
> +	decide:
> +		if (IS_ENABLED(CONFIG_KEXEC_SIG_FORCE)) {
> +			pr_notice("%s rejected\n", reason);
> +			ret = -EKEYREJECTED;

Force use -EKEYREJECTED is odd,  why not just use original "ret"?

> +			goto out;
> +		}
> +
> +		ret = 0;
> +		break;
> +
> +		/* All other errors are fatal, including nomem, unparseable
> +		 * signatures and signature check failures - even if signatures
> +		 * aren't required.
> +		 */
> +	default:
> +		pr_notice("kernel signature verification failed (%d).\n", ret);
>  		goto out;
>  	}
> -	pr_debug("kernel signature verification successful.\n");
> -#endif
> +
>  	/* It is possible that there no initramfs is being loaded */
>  	if (!(flags & KEXEC_FILE_NO_INITRAMFS)) {
>  		ret = kernel_read_file_from_fd(initrd_fd, &image->initrd_buf,
> -- 
> 2.21.0.392.gf8f6787159e-goog
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave

^ permalink raw reply

* Re: [RFC PATCH v4 07/12] LSM: x86/sgx: Introduce ->enclave_map() hook for Intel SGX
From: Jarkko Sakkinen @ 2019-06-21  2:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-8-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:56PM -0700, Sean Christopherson wrote:
> enclave_map() is an SGX specific variant of file_mprotect() and
> mmap_file(), and is provided so that LSMs can apply W^X restrictions to
> enclaves.
> 
> Due to the nature of SGX and its Enclave Page Cache (EPC), all enclave
> VMAs are backed by a single file, i.e. /dev/sgx/enclave, that must be
> MAP_SHARED.  Furthermore, all enclaves need read, write and execute
> VMAs.  As a result, applying W^X restrictions on /dev/sgx/enclave using
> existing LSM hooks is for all intents and purposes impossible, e.g.
> denying either W or X would deny access to any enclave.
> 
> Note, extensive discussion yielded no sane alternative to some form of
> SGX specific LSM hook[1].
> 
> [1] https://lkml.kernel.org/r/CALCETrXf8mSK45h7sTK5Wf+pXLVn=Bjsc_RLpgO-h-qdzBRo5Q@mail.gmail.com
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

All the non-LSM changes are almost cleared from my part.

I would suggest that we scrape v21 together as soon as you return from
your vacation discluding the LSM hooks.

There is no any particular reason to get the LSM changes to the mainline
before the SGX foundations so now is the right time close things
underlying them.

I'm now in the same boat with your changes to the ioctl API, which means
that we are ready to go. I feel a tiny bit bad that it took me so long
time with [1] but I'm a simple minded person so what I can do :-)

Once you can come back please deal with the suggestions that I made
and provide a "pure" SRCU patch (apologies for repeating myself). I
will the squash them to the existing patch set.

After that is fully done we can make v21 scope decision when it comes
to the enclave life-cycle.

Even if the LSM changes  would not be upstreamed as part of the
foundations I can start holding versions of them in my tree but only
after v21 is out.

Can you cope with this plan?

[1] https://patchwork.kernel.org/patch/11005431/

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 05/12] x86/sgx: Enforce noexec filesystem restriction for enclaves
From: Jarkko Sakkinen @ 2019-06-21  1:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-6-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:54PM -0700, Sean Christopherson wrote:
> Do not allow an enclave page to be mapped with PROT_EXEC if the source
> vma does not have VM_MAYEXEC.  This effectively enforces noexec as
> do_mmap() clears VM_MAYEXEC if the vma is being loaded from a noexec
> path, i.e. prevents executing a file by loading it into an enclave.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/driver/ioctl.c | 42 +++++++++++++++++++++++---
>  1 file changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/driver/ioctl.c b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
> index e18d2afd2aad..1fca70a36ce3 100644
> --- a/arch/x86/kernel/cpu/sgx/driver/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
> @@ -564,6 +564,39 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long addr,
>  	return ret;
>  }
>  
> +static int sgx_encl_page_copy(void *dst, unsigned long src, unsigned long prot)

I will probably forget the context with this name after this has been
merged :-) So many functions dealing with enclave pages. Two
alternatives that come up to my mind:

1. sgx_encl_page_user_import()
2. sgx_encl_page_user_copy_from()

Not saying that they are beatiful names but at least you immediately
know the context.

> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	/* Hold mmap_sem across copy_from_user() to avoid a TOCTOU race. */
> +	down_read(&current->mm->mmap_sem);
> +
> +	/* Query vma's VM_MAYEXEC as an indirect path_noexec() check. */
> +	if (prot & PROT_EXEC) {
> +		vma = find_vma(current->mm, src);
> +		if (!vma) {
> +			ret = -EFAULT;
> +			goto out;

Should this be -EINVAL instead?

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 00/12] security: x86/sgx: SGX vs. LSM
From: Jarkko Sakkinen @ 2019-06-21  1:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-1-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:49PM -0700, Sean Christopherson wrote:
> [SNAP]

The reason for delaying my proposal for encl->refcount (which I will
post as RFC and won't merge it before getting screened from you) was
that I really focused making the most well thought comments to this.

Generally I think that we need to get your SRCU changes in before
anything in the form that the patch does not anything else but only
that. Without doing that comparing changes dealing with the life-cycle
of an enclave is way too much mind bending.

/Jarkko

^ permalink raw reply

* Re: [PATCH v6 1/3] mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
From: Kees Cook @ 2019-06-21  1:37 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Andrew Morton, Christoph Lameter, Masahiro Yamada, Michal Hocko,
	James Morris, Serge E. Hallyn, Nick Desaulniers,
	Kostya Serebryany, Dmitry Vyukov, Sandeep Patil, Laura Abbott,
	Randy Dunlap, Jann Horn, Mark Rutland, Marco Elver, linux-mm,
	linux-security-module, kernel-hardening
In-Reply-To: <201906070841.4680E54@keescook>

On Fri, Jun 07, 2019 at 08:42:27AM -0700, Kees Cook wrote:
> On Thu, Jun 06, 2019 at 06:48:43PM +0200, Alexander Potapenko wrote:
> > [...]
> > diff --git a/mm/slub.c b/mm/slub.c
> > index cd04dbd2b5d0..9c4a8b9a955c 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > [...]
> > @@ -2741,8 +2758,14 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> >  		prefetch_freepointer(s, next_object);
> >  		stat(s, ALLOC_FASTPATH);
> >  	}
> > +	/*
> > +	 * If the object has been wiped upon free, make sure it's fully
> > +	 * initialized by zeroing out freelist pointer.
> > +	 */
> > +	if (unlikely(slab_want_init_on_free(s)) && object)
> > +		*(void **)object = NULL;

In looking at metadata again, I noticed that I don't think this is
correct, as it needs to be using s->offset to find the location of the
freelist pointer:

	memset(object + s->offset, 0, sizeof(void *));

> >  
> > -	if (unlikely(gfpflags & __GFP_ZERO) && object)
> > +	if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
> >  		memset(object, 0, s->object_size);

init_on_alloc is using "object_size" but init_on_free is using "size". I
assume the "alloc" wipe is smaller because metadata was just written
for the allocation?

-- 
Kees Cook

^ permalink raw reply

* Re: [RFC PATCH v4 06/12] mm: Introduce vm_ops->may_mprotect()
From: Jarkko Sakkinen @ 2019-06-21  1:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-7-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:55PM -0700, Sean Christopherson wrote:
> SGX will use ->may_mprotect() to invoke an SGX variant of the existing
> file_mprotect() and mmap_file() LSM hooks.
> 
> The name may_mprotect() is intended to reflect the hook's purpose as a
> way to restrict mprotect() as opposed to a wholesale replacement.
> 
> Due to the nature of SGX and its Enclave Page Cache (EPC), all enclave
> VMAs are backed by a single file, i.e. /dev/sgx/enclave, that must be
> MAP_SHARED.  Furthermore, all enclaves need read, write and execute
> VMAs.  As a result, applying W^X restrictions on /dev/sgx/enclave using
> existing LSM hooks is for all intents and purposes impossible, e.g.
> denying either W or X would deny access to *any* enclave.
> 
> By hooking mprotect(), SGX can invoke an SGX specific LSM hook, which in
> turn allows LSMs to enforce W^X policies.
> 
> Alternatively, SGX could provide a helper to identify enclaves given a
> vma or file.  LSMs could then check if a mapping is for enclave and take
> action according.
> 
> A second alternative would be to have SGX implement its own LSM hooks
> for file_mprotect() and mmap_file(), using them to "forward" the call to
> the SGX specific hook.
> 
> The major con to both alternatives is that they provide zero flexibility
> for the SGX specific LSM hook.  The "is_sgx_enclave()" helper doesn't
> allow SGX can't supply any additional information whatsoever, and the
> mmap_file() hook is called before the final address is known, e.g. SGX
> can't provide any information about the specific enclave being mapped.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Absolutely nothing to say about this one. We can take it as part of the
main patch set as it is. Not going to apply it though before the things
have been sorted out.

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 04/12] x86/sgx: Require userspace to define enclave pages' protection bits
From: Jarkko Sakkinen @ 2019-06-21  1:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190621010753.GG20474@linux.intel.com>

On Fri, Jun 21, 2019 at 04:07:53AM +0300, Jarkko Sakkinen wrote:
> /**
>  * sgx_calc_vma_prot_intersection() - Calculate intersection of the permissions
>  *				      for a VMA
>  * @encl:	an enclave
>  * @vma:	a VMA inside the enclave
>  *
>  * Iterate through the page addresses inside the VMA and calculate a bitmask
>  * of permissions that all pages have in common. Page addresses that do
>  * not have an associated enclave page are interpreted to zero
>  * permissions.
>  */
> 
> > +static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
> > +				     struct vm_area_struct *vma)
> 
> Suggestion for the name: sgx_calc_vma_prot_intersection()

And have you thought off caching these results?

I.e. hold the result for each VMA and only recalculate when the old
value is dirty. Just a random thought, zero analysis but though that
good to mention anyway.

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 04/12] x86/sgx: Require userspace to define enclave pages' protection bits
From: Jarkko Sakkinen @ 2019-06-21  1:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-5-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:53PM -0700, Sean Christopherson wrote:
>  arch/x86/include/uapi/asm/sgx.h        |  6 ++--
>  arch/x86/kernel/cpu/sgx/driver/ioctl.c | 15 +++++---
>  arch/x86/kernel/cpu/sgx/driver/main.c  | 49 ++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/encl.h         |  1 +
>  tools/testing/selftests/x86/sgx/main.c | 32 +++++++++++++++--

Please split the kselftest change to a separate patch.

>  5 files changed, 94 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
> index 6dba9f282232..67a3babbb24d 100644
> --- a/arch/x86/include/uapi/asm/sgx.h
> +++ b/arch/x86/include/uapi/asm/sgx.h
> @@ -35,15 +35,17 @@ struct sgx_enclave_create  {
>   * @src:	address for the page data
>   * @secinfo:	address for the SECINFO data
>   * @mrmask:	bitmask for the measured 256 byte chunks
> + * @prot:	maximal PROT_{READ,WRITE,EXEC} protections for the page
>   */
>  struct sgx_enclave_add_page {
>  	__u64	addr;
>  	__u64	src;
>  	__u64	secinfo;
> -	__u64	mrmask;
> +	__u16	mrmask;
> +	__u8	prot;
> +	__u8	pad;

__u8 pad[7];

> +/*
> + * Returns the AND of VM_{READ,WRITE,EXEC} permissions across all pages
> + * covered by the specific VMA.  A non-existent (or yet to be added) enclave
> + * page is considered to have no RWX permissions, i.e. is inaccessible.
> + */

That was a bit hard to grasp (at least for me). I would rephrase it like:

/**
 * sgx_calc_vma_prot_intersection() - Calculate intersection of the permissions
 *				      for a VMA
 * @encl:	an enclave
 * @vma:	a VMA inside the enclave
 *
 * Iterate through the page addresses inside the VMA and calculate a bitmask
 * of permissions that all pages have in common. Page addresses that do
 * not have an associated enclave page are interpreted to zero
 * permissions.
 */

> +static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
> +				     struct vm_area_struct *vma)

Suggestion for the name: sgx_calc_vma_prot_intersection()

> +{
> +	unsigned long allowed_rwx = VM_READ | VM_WRITE | VM_EXEC;
> +	unsigned long idx, idx_start, idx_end;
> +	struct sgx_encl_page *page;
> +
> +	idx_start = PFN_DOWN(vma->vm_start);
> +	idx_end = PFN_DOWN(vma->vm_end - 1);

Suggestion: just open code these to the for-statement.

> +
> +	for (idx = idx_start; idx <= idx_end; ++idx) {
> +		/*
> +		 * No need to take encl->lock, vm_prot_bits is set prior to
> +		 * insertion and never changes, and racing with adding pages is
> +		 * a userspace bug.
> +		 */
> +		rcu_read_lock();
> +		page = radix_tree_lookup(&encl->page_tree, idx);
> +		rcu_read_unlock();
> +
> +		/* Do not allow R|W|X to a non-existent page. */
> +		if (!page)
> +			allowed_rwx = 0;
> +		else
> +			allowed_rwx &= page->vm_prot_bits;

This would be a more clean way to express the same:

	if (!page)
		return 0;

	allowed_rwx &= page->vm_prot_bits;

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 02/12] x86/sgx: Do not naturally align MAP_FIXED address
From: Jarkko Sakkinen @ 2019-06-20 22:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190620210851.GG15383@linux.intel.com>

On Fri, Jun 21, 2019 at 12:08:51AM +0300, Jarkko Sakkinen wrote:
> On Wed, Jun 19, 2019 at 03:23:51PM -0700, Sean Christopherson wrote:
> > SGX enclaves have an associated Enclave Linear Range (ELRANGE) that is
> > tracked and enforced by the CPU using a base+mask approach, similar to
> > how hardware range registers such as the variable MTRRs.  As a result,
> > the ELRANGE must be naturally sized and aligned.
> > 
> > To reduce boilerplate code that would be needed in every userspace
> > enclave loader, the SGX driver naturally aligns the mmap() address and
> > also requires the range to be naturally sized.  Unfortunately, SGX fails
> > to grant a waiver to the MAP_FIXED case, e.g. incorrectly rejects mmap()
> > if userspace is attempting to map a small slice of an existing enclave.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/driver/main.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
> > index 07aa5f91b2dd..29384cdd0842 100644
> > --- a/arch/x86/kernel/cpu/sgx/driver/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/driver/main.c
> > @@ -115,7 +115,13 @@ static unsigned long sgx_get_unmapped_area(struct file *file,
> >  					   unsigned long pgoff,
> >  					   unsigned long flags)
> >  {
> > -	if (len < 2 * PAGE_SIZE || len & (len - 1) || flags & MAP_PRIVATE)
> > +	if (flags & MAP_PRIVATE)
> > +		return -EINVAL;
> > +
> > +	if (flags & MAP_FIXED)
> > +		return addr;
> > +
> > +	if (len < 2 * PAGE_SIZE || len & (len - 1))
> >  		return -EINVAL;
> 
> Why the last check is needed given that mmap() no longer does not
> associate with the layout of the enclave? I'd just wipe it away.

In any case, I squashed this one.

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 03/12] selftests: x86/sgx: Mark the enclave loader as not needing an exec stack
From: Jarkko Sakkinen @ 2019-06-20 21:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-4-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:52PM -0700, Sean Christopherson wrote:
> The SGX enclave loader doesn't need an executable stack, but linkers
> will assume it does due to the lack of .note.GNU-stack sections in the
> loader's assembly code.  As a result, the kernel tags the loader as
> having "read implies exec", and so adds PROT_EXEC to all mmap()s, even
> those for mapping EPC regions.  This will cause problems in the future
> when userspace needs to explicit state a page's protection bits when the
> page is added to an enclave, e.g. adding TCS pages as R+W will cause
> mmap() to fail when the kernel tacks on +X.
> 
> Explicitly tell the linker that an executable stack is not needed.
> Alternatively, each .S file could add .note.GNU-stack, but the loader
> should never need an executable stack so zap it in one fell swoop.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

OK, this one is squashed now. Thanks.

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 02/12] x86/sgx: Do not naturally align MAP_FIXED address
From: Jarkko Sakkinen @ 2019-06-20 21:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-3-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:51PM -0700, Sean Christopherson wrote:
> SGX enclaves have an associated Enclave Linear Range (ELRANGE) that is
> tracked and enforced by the CPU using a base+mask approach, similar to
> how hardware range registers such as the variable MTRRs.  As a result,
> the ELRANGE must be naturally sized and aligned.
> 
> To reduce boilerplate code that would be needed in every userspace
> enclave loader, the SGX driver naturally aligns the mmap() address and
> also requires the range to be naturally sized.  Unfortunately, SGX fails
> to grant a waiver to the MAP_FIXED case, e.g. incorrectly rejects mmap()
> if userspace is attempting to map a small slice of an existing enclave.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/driver/main.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
> index 07aa5f91b2dd..29384cdd0842 100644
> --- a/arch/x86/kernel/cpu/sgx/driver/main.c
> +++ b/arch/x86/kernel/cpu/sgx/driver/main.c
> @@ -115,7 +115,13 @@ static unsigned long sgx_get_unmapped_area(struct file *file,
>  					   unsigned long pgoff,
>  					   unsigned long flags)
>  {
> -	if (len < 2 * PAGE_SIZE || len & (len - 1) || flags & MAP_PRIVATE)
> +	if (flags & MAP_PRIVATE)
> +		return -EINVAL;
> +
> +	if (flags & MAP_FIXED)
> +		return addr;
> +
> +	if (len < 2 * PAGE_SIZE || len & (len - 1))
>  		return -EINVAL;

Why the last check is needed given that mmap() no longer does not
associate with the layout of the enclave? I'd just wipe it away.

/Jarkko

^ permalink raw reply

* Re: [RFC PATCH v4 01/12] x86/sgx: Use mmu_notifier.release() instead of per-vma refcounting
From: Jarkko Sakkinen @ 2019-06-20 21:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-2-sean.j.christopherson@intel.com>

On Wed, Jun 19, 2019 at 03:23:50PM -0700, Sean Christopherson wrote:
> Using per-vma refcounting to track mm_structs associated with an enclave
> requires hooking .vm_close(), which in turn prevents the mm from merging
> vmas (precisely to allow refcounting).

Why having sgx_vma_close() prevents that? I do not understand the
problem statement.

> Avoid refcounting encl_mm altogether by registering an mmu_notifier at
> .mmap(), removing the dying encl_mm at mmu_notifier.release() and
> protecting mm_list during reclaim via a per-enclave SRCU.

Right, there is the potential collision with my changes:

1. Your patch: enclave life-cycle equals life-cycle of all processes
   that are associated with the enclave.
2. My (yet be sent) patch: enclave life-cycle equals the life cycle.

I won't rush with my patch. I rather merge neither at this point and
you can review mine after you come back from your vacation.

> Removing refcounting/vm_close() allows merging of enclave vmas, at the
> cost of delaying removal of encl_mm structs from mm_list, i.e. an mm is
> disassociated from an enclave when the mm exits or the enclave dies, as
> opposed to when the last vma (in a given mm) is closed.
> 
> The impact of delying encl_mm removal is its memory footprint and
> whatever overhead is incurred during EPC reclaim (to walk an mm's vmas).
> Practically speaking, a stale encl_mm will exist for a meaningful amount
> of time if and only if the enclave is mapped in a long-lived process and
> then passed off to another long-lived process.  It is expected that the
> vast majority of use cases will not encounter this condition, e.g. even
> using a daemon to build enclaves should not result in a stale encl_mm as
> the builder should never need to mmap() the enclave.

This paragraph speaks only about "well behaving" software.

> Even if there are scenarios that lead to defunct encl_mms, the cost is
> likely far outweighed by the benefits of reducing the number of vmas
> across all enclaves.
> 
> Note, using SRCU to protect mm_list is not strictly necessary, i.e. the
> existing walker with encl_mm refcounting could be massaged to work with
> mmu_notifier.release(), but the resulting code is subtle and fragile (I
> never actually got it working).  The primary issue is that an encl_mm
> can't be moved off the list until its refcount goes to zero, otherwise
> the custom walker goes off into the weeds.  The refcount requirement
> then prevents using mm_list to identify if an mmu_notifier.release()
> has fired, i.e. another mechanism is needed to guard against races
> between exit_mmap() and sgx_release().

Is it really impossible to send a separate SRCU patch?

I fully agree with the SRCU whereas rest of this patch is still
under debate.

If you could do that, I can merge it in no time. It is a small
step into better direction.

> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Needs to be rebased because the master missing your earlier bug fix.

> ---
>  arch/x86/Kconfig                       |   2 +
>  arch/x86/kernel/cpu/sgx/driver/ioctl.c |  14 --
>  arch/x86/kernel/cpu/sgx/driver/main.c  |  38 ++++
>  arch/x86/kernel/cpu/sgx/encl.c         | 234 +++++++++++--------------
>  arch/x86/kernel/cpu/sgx/encl.h         |  19 +-
>  arch/x86/kernel/cpu/sgx/reclaim.c      |  71 +++-----
>  6 files changed, 182 insertions(+), 196 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a0fd17c32521..940c52762f24 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1918,6 +1918,8 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
>  config INTEL_SGX
>  	bool "Intel SGX core functionality"
>  	depends on X86_64 && CPU_SUP_INTEL
> +	select MMU_NOTIFIER
> +	select SRCU
>  	---help---
>  	  Intel(R) SGX is a set of CPU instructions that can be used by
>  	  applications to set aside private regions of code and data, referred
> diff --git a/arch/x86/kernel/cpu/sgx/driver/ioctl.c b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
> index d17c60dca114..3552d642b26f 100644
> --- a/arch/x86/kernel/cpu/sgx/driver/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
> @@ -276,7 +276,6 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
>  {
>  	unsigned long encl_size = secs->size + PAGE_SIZE;
>  	struct sgx_epc_page *secs_epc;
> -	struct sgx_encl_mm *encl_mm;
>  	unsigned long ssaframesize;
>  	struct sgx_pageinfo pginfo;
>  	struct sgx_secinfo secinfo;
> @@ -311,12 +310,6 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
>  
>  	INIT_WORK(&encl->work, sgx_add_page_worker);
>  
> -	encl_mm = sgx_encl_mm_add(encl, current->mm);
> -	if (IS_ERR(encl_mm)) {
> -		ret = PTR_ERR(encl_mm);
> -		goto err_out;
> -	}
> -
>  	secs_epc = sgx_alloc_page(&encl->secs, true);
>  	if (IS_ERR(secs_epc)) {
>  		ret = PTR_ERR(secs_epc);
> @@ -369,13 +362,6 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
>  		encl->backing = NULL;
>  	}
>  
> -	if (!list_empty(&encl->mm_list)) {
> -		encl_mm = list_first_entry(&encl->mm_list, struct sgx_encl_mm,
> -					   list);
> -		list_del(&encl_mm->list);
> -		kfree(encl_mm);
> -	}
> -
>  	mutex_unlock(&encl->lock);
>  	return ret;
>  }
> diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
> index 0c831ee5e2de..07aa5f91b2dd 100644
> --- a/arch/x86/kernel/cpu/sgx/driver/main.c
> +++ b/arch/x86/kernel/cpu/sgx/driver/main.c
> @@ -25,6 +25,7 @@ u32 sgx_xsave_size_tbl[64];
>  static int sgx_open(struct inode *inode, struct file *file)
>  {
>  	struct sgx_encl *encl;
> +	int ret;
>  
>  	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
>  	if (!encl)
> @@ -38,6 +39,12 @@ static int sgx_open(struct inode *inode, struct file *file)
>  	INIT_LIST_HEAD(&encl->mm_list);
>  	spin_lock_init(&encl->mm_lock);
>  
> +	ret = init_srcu_struct(&encl->srcu);
> +	if (ret) {
> +		kfree(encl);
> +		return ret;
> +	}
> +
>  	file->private_data = encl;
>  
>  	return 0;
> @@ -46,6 +53,32 @@ static int sgx_open(struct inode *inode, struct file *file)
>  static int sgx_release(struct inode *inode, struct file *file)
>  {
>  	struct sgx_encl *encl = file->private_data;
> +	struct sgx_encl_mm *encl_mm;
> +
> +	/*
> +	 * Objects can't be *moved* off an RCU protected list (deletion is ok),
> +	 * nor can the object be freed until after synchronize_srcu().
> +	 */
> +restart:
> +	spin_lock(&encl->mm_lock);
> +	if (list_empty(&encl->mm_list)) {
> +		encl_mm = NULL;
> +	} else {
> +		encl_mm = list_first_entry(&encl->mm_list, struct sgx_encl_mm,
> +					   list);
> +		list_del_rcu(&encl_mm->list);
> +	}
> +	spin_unlock(&encl->mm_lock);
> +
> +	if (encl_mm) {
> +		synchronize_srcu(&encl->srcu);
> +
> +		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> +
> +		sgx_encl_mm_release(encl_mm);
> +
> +		goto restart;
> +	}
>  
>  	kref_put(&encl->refcount, sgx_encl_release);
>  
> @@ -63,6 +96,11 @@ static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,
>  static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct sgx_encl *encl = file->private_data;
> +	int ret;
> +
> +	ret = sgx_encl_mm_add(encl, vma->vm_mm);
> +	if (ret)
> +		return ret;
>  
>  	vma->vm_ops = &sgx_vm_ops;
>  	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 9566eb72d417..c6436bbd4a68 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -132,103 +132,125 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>  	return entry;
>  }
>  
> -struct sgx_encl_mm *sgx_encl_mm_add(struct sgx_encl *encl,
> -				    struct mm_struct *mm)
> +static void sgx_encl_mm_release_wq(struct work_struct *work)
> +{
> +	struct sgx_encl_mm *encl_mm =
> +		container_of(work, struct sgx_encl_mm, release_work);
> +
> +	sgx_encl_mm_release(encl_mm);
> +}
> +
> +/*
> + * Being a call_srcu() callback, this needs to be short, and sgx_encl_release()
> + * is anything but short.  Do the final freeing in yet another async callback.
> + */
> +static void sgx_encl_mm_release_delayed(struct rcu_head *rcu)

Would rename this either as *_tail() or *_deferred().

> +{
> +	struct sgx_encl_mm *encl_mm =
> +		container_of(rcu, struct sgx_encl_mm, rcu);
> +
> +	INIT_WORK(&encl_mm->release_work, sgx_encl_mm_release_wq);
> +	schedule_work(&encl_mm->release_work);
> +}
> +
> +static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
> +				     struct mm_struct *mm)
> +{
> +	struct sgx_encl_mm *encl_mm =
> +		container_of(mn, struct sgx_encl_mm, mmu_notifier);
> +	struct sgx_encl_mm *tmp = NULL;
> +
> +	/*
> +	 * The enclave itself can remove encl_mm.  Note, objects can't be moved
> +	 * off an RCU protected list, but deletion is ok.
> +	 */
> +	spin_lock(&encl_mm->encl->mm_lock);
> +	list_for_each_entry(tmp, &encl_mm->encl->mm_list, list) {
> +		if (tmp == encl_mm) {
> +			list_del_rcu(&encl_mm->list);
> +			break;
> +		}
> +	}
> +	spin_unlock(&encl_mm->encl->mm_lock);
> +
> +	if (tmp == encl_mm) {
> +		synchronize_srcu(&encl_mm->encl->srcu);
> +
> +		/*
> +		 * Delay freeing encl_mm until after mmu_notifier releases any
> +		 * SRCU locks.  synchronize_srcu() must be called from process
> +		 * context, i.e. we can't throw mmu_notifier_unregister() in a
> +		 * work queue and be done with it.
> +		 */
> +		mmu_notifier_unregister_no_release(mn, mm);
> +		mmu_notifier_call_srcu(&encl_mm->rcu,
> +				       &sgx_encl_mm_release_delayed);
> +	}
> +}
> +
> +static const struct mmu_notifier_ops sgx_mmu_notifier_ops = {
> +	.release		= sgx_mmu_notifier_release,
> +};
> +
> +static struct sgx_encl_mm *sgx_encl_find_mm(struct sgx_encl *encl,
> +					    struct mm_struct *mm)
> +{
> +	struct sgx_encl_mm *encl_mm = NULL;
> +	struct sgx_encl_mm *tmp;
> +	int idx;
> +
> +	idx = srcu_read_lock(&encl->srcu);
> +
> +	list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
> +		if (tmp->mm == mm) {
> +			encl_mm = tmp;
> +			break;
> +		}
> +	}
> +
> +	srcu_read_unlock(&encl->srcu, idx);
> +
> +	return encl_mm;
> +}
> +
> +int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
>  {
>  	struct sgx_encl_mm *encl_mm;
> +	int ret;
> +
> +	lockdep_assert_held_exclusive(&mm->mmap_sem);
> +
> +	/*
> +	 * mm_structs are kept on mm_list until the mm or the enclave dies,
> +	 * i.e. once an mm is off the list, it's gone for good, therefore it's
> +	 * impossible to get a false positive on @mm due to a stale mm_list.
> +	 */
> +	if (sgx_encl_find_mm(encl, mm))
> +		return 0;
>  
>  	encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
>  	if (!encl_mm)
> -		return ERR_PTR(-ENOMEM);
> +		return -ENOMEM;
>  
>  	encl_mm->encl = encl;
>  	encl_mm->mm = mm;
> -	kref_init(&encl_mm->refcount);
> +	encl_mm->mmu_notifier.ops = &sgx_mmu_notifier_ops;
> +
> +	ret = __mmu_notifier_register(&encl_mm->mmu_notifier, mm);
> +	if (ret) {
> +		kfree(encl_mm);
> +		return ret;
> +	}
> +
> +	kref_get(&encl->refcount);
>  
>  	spin_lock(&encl->mm_lock);
> -	list_add(&encl_mm->list, &encl->mm_list);
> +	list_add_rcu(&encl_mm->list, &encl->mm_list);
>  	spin_unlock(&encl->mm_lock);
>  
> -	return encl_mm;
> -}
> +	synchronize_srcu(&encl->srcu);
>  
> -void sgx_encl_mm_release(struct kref *ref)
> -{
> -	struct sgx_encl_mm *encl_mm =
> -		container_of(ref, struct sgx_encl_mm, refcount);
> -
> -	spin_lock(&encl_mm->encl->mm_lock);
> -	list_del(&encl_mm->list);
> -	spin_unlock(&encl_mm->encl->mm_lock);
> -
> -	kfree(encl_mm);
> -}
> -
> -static struct sgx_encl_mm *sgx_encl_get_mm(struct sgx_encl *encl,
> -					   struct mm_struct *mm)
> -{
> -	struct sgx_encl_mm *encl_mm = NULL;
> -	struct sgx_encl_mm *prev_mm = NULL;
> -	int iter;
> -
> -	while (true) {
> -		encl_mm = sgx_encl_next_mm(encl, prev_mm, &iter);
> -		if (prev_mm)
> -			kref_put(&prev_mm->refcount, sgx_encl_mm_release);
> -		prev_mm = encl_mm;
> -
> -		if (iter == SGX_ENCL_MM_ITER_DONE)
> -			break;
> -
> -		if (iter == SGX_ENCL_MM_ITER_RESTART)
> -			continue;
> -
> -		if (mm == encl_mm->mm)
> -			return encl_mm;
> -	}
> -
> -	return NULL;
> -}
> -
> -static void sgx_vma_open(struct vm_area_struct *vma)
> -{
> -	struct sgx_encl *encl = vma->vm_private_data;
> -	struct sgx_encl_mm *encl_mm;
> -
> -	if (!encl)
> -		return;
> -
> -	if (encl->flags & SGX_ENCL_DEAD)
> -		goto error;
> -
> -	encl_mm = sgx_encl_get_mm(encl, vma->vm_mm);
> -	if (!encl_mm) {
> -		encl_mm = sgx_encl_mm_add(encl, vma->vm_mm);
> -		if (IS_ERR(encl_mm))
> -			goto error;
> -	}
> -
> -	return;
> -
> -error:
> -	vma->vm_private_data = NULL;
> -}
> -
> -static void sgx_vma_close(struct vm_area_struct *vma)
> -{
> -	struct sgx_encl *encl = vma->vm_private_data;
> -	struct sgx_encl_mm *encl_mm;
> -
> -	if (!encl)
> -		return;
> -
> -	encl_mm = sgx_encl_get_mm(encl, vma->vm_mm);
> -	if (encl_mm) {
> -		kref_put(&encl_mm->refcount, sgx_encl_mm_release);
> -
> -		/* Release kref for the VMA. */
> -		kref_put(&encl_mm->refcount, sgx_encl_mm_release);
> -	}
> +	return 0;
>  }
>  
>  static unsigned int sgx_vma_fault(struct vm_fault *vmf)
> @@ -366,8 +388,6 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
>  }
>  
>  const struct vm_operations_struct sgx_vm_ops = {
> -	.close = sgx_vma_close,
> -	.open = sgx_vma_open,
>  	.fault = sgx_vma_fault,
>  	.access = sgx_vma_access,
>  };
> @@ -465,7 +485,7 @@ void sgx_encl_release(struct kref *ref)
>  	if (encl->backing)
>  		fput(encl->backing);
>  
> -	WARN(!list_empty(&encl->mm_list), "sgx: mm_list non-empty");
> +	WARN_ONCE(!list_empty(&encl->mm_list), "sgx: mm_list non-empty");
>  
>  	kfree(encl);
>  }
> @@ -503,46 +523,6 @@ struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, pgoff_t index)
>  	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
>  }
>  
> -/**
> - * sgx_encl_next_mm() - Iterate to the next mm
> - * @encl:	an enclave
> - * @mm:		an mm list entry
> - * @iter:	iterator status
> - *
> - * Return: the enclave mm or NULL
> - */
> -struct sgx_encl_mm *sgx_encl_next_mm(struct sgx_encl *encl,
> -				     struct sgx_encl_mm *encl_mm, int *iter)
> -{
> -	struct list_head *entry;
> -
> -	WARN(!encl, "%s: encl is NULL", __func__);
> -	WARN(!iter, "%s: iter is NULL", __func__);
> -
> -	spin_lock(&encl->mm_lock);
> -
> -	entry = encl_mm ? encl_mm->list.next : encl->mm_list.next;
> -	WARN(!entry, "%s: entry is NULL", __func__);
> -
> -	if (entry == &encl->mm_list) {
> -		spin_unlock(&encl->mm_lock);
> -		*iter = SGX_ENCL_MM_ITER_DONE;
> -		return NULL;
> -	}
> -
> -	encl_mm = list_entry(entry, struct sgx_encl_mm, list);
> -
> -	if (!kref_get_unless_zero(&encl_mm->refcount)) {
> -		spin_unlock(&encl->mm_lock);
> -		*iter = SGX_ENCL_MM_ITER_RESTART;
> -		return NULL;
> -	}
> -
> -	spin_unlock(&encl->mm_lock);
> -	*iter = SGX_ENCL_MM_ITER_NEXT;
> -	return encl_mm;
> -}
> -
>  static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, pgtable_t token,
>  					    unsigned long addr, void *data)
>  {
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index c557f0374d74..0904b3c20ed0 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -9,9 +9,11 @@
>  #include <linux/kref.h>
>  #include <linux/list.h>
>  #include <linux/mm_types.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/mutex.h>
>  #include <linux/notifier.h>
>  #include <linux/radix-tree.h>
> +#include <linux/srcu.h>
>  #include <linux/workqueue.h>
>  
>  /**
> @@ -57,8 +59,10 @@ enum sgx_encl_flags {
>  struct sgx_encl_mm {
>  	struct sgx_encl *encl;
>  	struct mm_struct *mm;
> -	struct kref refcount;
>  	struct list_head list;
> +	struct mmu_notifier mmu_notifier;
> +	struct work_struct release_work;
> +	struct rcu_head rcu;
>  };
>  
>  struct sgx_encl {
> @@ -72,6 +76,7 @@ struct sgx_encl {
>  	spinlock_t mm_lock;
>  	struct file *backing;
>  	struct kref refcount;
> +	struct srcu_struct srcu;
>  	unsigned long base;
>  	unsigned long size;
>  	unsigned long ssaframesize;
> @@ -118,11 +123,13 @@ void sgx_encl_destroy(struct sgx_encl *encl);
>  void sgx_encl_release(struct kref *ref);
>  pgoff_t sgx_encl_get_index(struct sgx_encl *encl, struct sgx_encl_page *page);
>  struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, pgoff_t index);
> -struct sgx_encl_mm *sgx_encl_next_mm(struct sgx_encl *encl,
> -				     struct sgx_encl_mm *encl_mm, int *iter);
> -struct sgx_encl_mm *sgx_encl_mm_add(struct sgx_encl *encl,
> -				    struct mm_struct *mm);
> -void sgx_encl_mm_release(struct kref *ref);
> +int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
> +static inline void sgx_encl_mm_release(struct sgx_encl_mm *encl_mm)
> +{
> +	kref_put(&encl_mm->encl->refcount, sgx_encl_release);
> +
> +	kfree(encl_mm);
> +}

Please just open code this to the two call sites. Makes the code hard to
follow.

Right now I did not find anything else questionable from the code
changes. Repeating myself but if it is by any means possible before
going away, can you construct a pure SRCU patch.

I could then reconstruct my changes on top off that, which would
make evalution of both heck a lot easier.

/Jarkko

^ permalink raw reply

* Re: [PATCH] structleak: disable BYREF_ALL in combination with KASAN_STACK
From: Kees Cook @ 2019-06-20 17:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, kasan-dev,
	Alexander Popov, Ard Biesheuvel, James Morris, Serge E. Hallyn,
	Masahiro Yamada, linux-security-module, linux-kernel
In-Reply-To: <20190618094731.3677294-1-arnd@arndb.de>

On Tue, Jun 18, 2019 at 11:47:13AM +0200, Arnd Bergmann wrote:
> The combination of KASAN_STACK and GCC_PLUGIN_STRUCTLEAK_BYREF_ALL
> leads to much larger kernel stack usage, as seen from the warnings
> about functions that now exceed the 2048 byte limit:

Is the preference that this go into v5.2 (there's not much time left),
or should this be v5.3? (You didn't mark it as Cc: stable?)

> one. I picked the dependency in GCC_PLUGIN_STRUCTLEAK_BYREF_ALL, as
> this option is designed to make uninitialized stack usage less harmful
> when enabled on its own, but it also prevents KASAN from detecting those
> cases in which it was in fact needed.

Right -- there's not much sense in both being enabled. I'd agree with
this rationale.

-- 
Kees Cook

^ permalink raw reply

* Stacked LSMs (was Re: [PATCH v2 00/25] LSM: Module stacking for AppArmor)
From: Kees Cook @ 2019-06-20 17:33 UTC (permalink / raw)
  To: Salvatore Mesoraca, Mickaël Salaün, James Morris
  Cc: Casey Schaufler, casey.schaufler, linux-security-module, selinux,
	john.johansen, penguin-kernel, paul, sds
In-Reply-To: <alpine.LRH.2.21.1906200555400.25930@namei.org>

On Thu, Jun 20, 2019 at 06:08:03AM +1000, James Morris wrote:
> We extended stacking support in March to allow Landlock and SARA to be 
> merged and have not seen anything from them since.

Salvatore and Mickaël, have you had a chance to look at the stacking
changes? I'd love to see work progress on your LSMs now that the
stacking prerequisites have landed.

Thanks!

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH v2 18/25] LSM: Use lsmcontext in security_dentry_init_security
From: Kees Cook @ 2019-06-20 17:25 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: casey.schaufler, jmorris, linux-security-module, selinux,
	john.johansen, penguin-kernel, paul, sds
In-Reply-To: <063644f9-be64-175e-0bf2-cfa1afadc3d7@schaufler-ca.com>

On Wed, Jun 19, 2019 at 10:31:45AM -0700, Casey Schaufler wrote:
> On 6/18/2019 10:41 PM, Kees Cook wrote:
> > I think this is wrong: for NUL-terminated strings, "context.len" isn't
> > currently including the NUL byte (it's set to strlen()).
> >
> > So, if kmemdup() is used here, it means strlen() isn't correct in the
> > context init helper, it should be using the "size" argument, etc.
> 
> Would all be true if the context where being set by lsmcontext_init,
> but it's not. It's coming from the dentry_init_security hook, and
> the one instance of that, selinux_dentry_init_security() sets the
> size to strlen() + 1. The kmemdup() will include the terminating NUL.

Ah-ha! Okay, thanks, yes. I see now. Carry on! :)

> I too wish that the hooks and their use where more consistent.
> My sincere hope is that this revision of the infrastructure will
> help that to some extent.

Once these changes land it should be much much easier to find ways to
refactor for greater sanity. :)

> > Should label be set to NULL here and len reduced to 0?
> 
> It wasn't before, and I'd hate to make too many assumptions
> about what might be fragile in the NFS code.

Gotcha.

-- 
Kees Cook

^ permalink raw reply

* [RFC PATCH v4 09/12] LSM: x86/sgx: Introduce ->enclave_load() hook for Intel SGX
From: Sean Christopherson @ 2019-06-19 22:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-1-sean.j.christopherson@intel.com>

enclave_load() is roughly analogous to the existing file_mprotect().

Due to the nature of SGX and its Enclave Page Cache (EPC), all enclave
VMAs are backed by a single file, i.e. /dev/sgx/enclave, that must be
MAP_SHARED.  Furthermore, all enclaves need read, write and execute
VMAs.  As a result, the existing/standard call to file_mprotect() does
not provide any meaningful security for enclaves since an LSM can only
deny/grant access to the EPC as a whole.

security_enclave_load() is called when SGX is first loading an enclave
page, i.e. copying a page from normal memory into the EPC.  Although
the prototype for enclave_load() is similar to file_mprotect(), e.g.
SGX could theoretically use file_mprotect() and set reqprot=prot, a
separate hook is desirable as the semantics of an enclave's protection
bits are different than those of vmas, e.g. an enclave page tracks the
maximal set of protections, whereas file_mprotect() operates on the
actual protections being provided.  Enclaves also have unique security
properties, e.g. measured code, that LSMs may want to consider.  In
other words, LSMs will likely want to implement different policies for
enclave page protections.

Note, extensive discussion yielded no sane alternative to some form of
SGX specific LSM hook[1].

[1] https://lkml.kernel.org/r/CALCETrXf8mSK45h7sTK5Wf+pXLVn=Bjsc_RLpgO-h-qdzBRo5Q@mail.gmail.com

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kernel/cpu/sgx/driver/ioctl.c | 32 ++++++++++++++------------
 include/linux/lsm_hooks.h              |  8 +++++++
 include/linux/security.h               |  7 ++++++
 security/security.c                    |  5 ++++
 4 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver/ioctl.c b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
index 1fca70a36ce3..ae1b4d69441c 100644
--- a/arch/x86/kernel/cpu/sgx/driver/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
@@ -9,6 +9,7 @@
 #include <linux/highmem.h>
 #include <linux/ratelimit.h>
 #include <linux/sched/signal.h>
+#include <linux/security.h>
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
 #include <linux/suspend.h>
@@ -564,7 +565,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long addr,
 	return ret;
 }
 
-static int sgx_encl_page_copy(void *dst, unsigned long src, unsigned long prot)
+static int sgx_encl_page_copy(void *dst, unsigned long src, unsigned long prot,
+			      u16 mrmask)
 {
 	struct vm_area_struct *vma;
 	int ret;
@@ -572,24 +574,24 @@ static int sgx_encl_page_copy(void *dst, unsigned long src, unsigned long prot)
 	/* Hold mmap_sem across copy_from_user() to avoid a TOCTOU race. */
 	down_read(&current->mm->mmap_sem);
 
+	vma = find_vma(current->mm, src);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+
 	/* Query vma's VM_MAYEXEC as an indirect path_noexec() check. */
-	if (prot & PROT_EXEC) {
-		vma = find_vma(current->mm, src);
-		if (!vma) {
-			ret = -EFAULT;
-			goto out;
-		}
-
-		if (!(vma->vm_flags & VM_MAYEXEC)) {
-			ret = -EACCES;
-			goto out;
-		}
+	if ((prot & PROT_EXEC) && !(vma->vm_flags & VM_MAYEXEC)) {
+		ret = -EACCES;
+		goto out;
 	}
 
+	ret = security_enclave_load(vma, prot, mrmask == 0xffff);
+	if (ret)
+		goto out;
+
 	if (copy_from_user(dst, (void __user *)src, PAGE_SIZE))
 		ret = -EFAULT;
-	else
-		ret = 0;
 
 out:
 	up_read(&current->mm->mmap_sem);
@@ -639,7 +641,7 @@ static long sgx_ioc_enclave_add_page(struct file *filep, void __user *arg)
 
 	prot = addp.prot & (PROT_READ | PROT_WRITE | PROT_EXEC);
 
-	ret = sgx_encl_page_copy(data, addp.src, prot);
+	ret = sgx_encl_page_copy(data, addp.src, prot, addp.mrmask);
 	if (ret)
 		goto out;
 
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 7c1357105e61..3bc92c65f287 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1451,6 +1451,11 @@
  * @enclave_map:
  *	@prot contains the protection that will be applied by the kernel.
  *	Return 0 if permission is granted.
+ *
+ * @enclave_load:
+ *	@vma: the source memory region of the enclave page being loaded.
+ *	@prot: the (maximal) protections of the enclave page.
+ *	Return 0 if permission is granted.
  */
 union security_list_options {
 	int (*binder_set_context_mgr)(struct task_struct *mgr);
@@ -1815,6 +1820,8 @@ union security_list_options {
 
 #ifdef CONFIG_INTEL_SGX
 	int (*enclave_map)(unsigned long prot);
+	int (*enclave_load)(struct vm_area_struct *vma, unsigned long prot,
+			    bool measured);
 #endif /* CONFIG_INTEL_SGX */
 };
 
@@ -2057,6 +2064,7 @@ struct security_hook_heads {
 #endif /* CONFIG_BPF_SYSCALL */
 #ifdef CONFIG_INTEL_SGX
 	struct hlist_head enclave_map;
+	struct hlist_head enclave_load;
 #endif /* CONFIG_INTEL_SGX */
 } __randomize_layout;
 
diff --git a/include/linux/security.h b/include/linux/security.h
index 6a1f54ba6794..572ddfc53039 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1832,11 +1832,18 @@ static inline void security_bpf_prog_free(struct bpf_prog_aux *aux)
 #ifdef CONFIG_INTEL_SGX
 #ifdef CONFIG_SECURITY
 int security_enclave_map(unsigned long prot);
+int security_enclave_load(struct vm_area_struct *vma, unsigned long prot,
+			  bool measured);
 #else
 static inline int security_enclave_map(unsigned long prot)
 {
 	return 0;
 }
+static inline int security_enclave_load(struct vm_area_struct *vma,
+					unsigned long prot, bool measured)
+{
+	return 0;
+}
 #endif /* CONFIG_SECURITY */
 #endif /* CONFIG_INTEL_SGX */
 
diff --git a/security/security.c b/security/security.c
index 03951e08bdfc..00f483beb1cc 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2365,4 +2365,9 @@ int security_enclave_map(unsigned long prot)
 {
 	return call_int_hook(enclave_map, 0, prot);
 }
+int security_enclave_load(struct vm_area_struct *vma, unsigned long prot,
+			  bool measured)
+{
+	return call_int_hook(enclave_load, 0, vma, prot, measured);
+}
 #endif /* CONFIG_INTEL_SGX */
-- 
2.21.0


^ permalink raw reply related

* [RFC PATCH v4 01/12] x86/sgx: Use mmu_notifier.release() instead of per-vma refcounting
From: Sean Christopherson @ 2019-06-19 22:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-1-sean.j.christopherson@intel.com>

Using per-vma refcounting to track mm_structs associated with an enclave
requires hooking .vm_close(), which in turn prevents the mm from merging
vmas (precisely to allow refcounting).

Avoid refcounting encl_mm altogether by registering an mmu_notifier at
.mmap(), removing the dying encl_mm at mmu_notifier.release() and
protecting mm_list during reclaim via a per-enclave SRCU.

Removing refcounting/vm_close() allows merging of enclave vmas, at the
cost of delaying removal of encl_mm structs from mm_list, i.e. an mm is
disassociated from an enclave when the mm exits or the enclave dies, as
opposed to when the last vma (in a given mm) is closed.

The impact of delying encl_mm removal is its memory footprint and
whatever overhead is incurred during EPC reclaim (to walk an mm's vmas).
Practically speaking, a stale encl_mm will exist for a meaningful amount
of time if and only if the enclave is mapped in a long-lived process and
then passed off to another long-lived process.  It is expected that the
vast majority of use cases will not encounter this condition, e.g. even
using a daemon to build enclaves should not result in a stale encl_mm as
the builder should never need to mmap() the enclave.

Even if there are scenarios that lead to defunct encl_mms, the cost is
likely far outweighed by the benefits of reducing the number of vmas
across all enclaves.

Note, using SRCU to protect mm_list is not strictly necessary, i.e. the
existing walker with encl_mm refcounting could be massaged to work with
mmu_notifier.release(), but the resulting code is subtle and fragile (I
never actually got it working).  The primary issue is that an encl_mm
can't be moved off the list until its refcount goes to zero, otherwise
the custom walker goes off into the weeds.  The refcount requirement
then prevents using mm_list to identify if an mmu_notifier.release()
has fired, i.e. another mechanism is needed to guard against races
between exit_mmap() and sgx_release().

Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/Kconfig                       |   2 +
 arch/x86/kernel/cpu/sgx/driver/ioctl.c |  14 --
 arch/x86/kernel/cpu/sgx/driver/main.c  |  38 ++++
 arch/x86/kernel/cpu/sgx/encl.c         | 234 +++++++++++--------------
 arch/x86/kernel/cpu/sgx/encl.h         |  19 +-
 arch/x86/kernel/cpu/sgx/reclaim.c      |  71 +++-----
 6 files changed, 182 insertions(+), 196 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a0fd17c32521..940c52762f24 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1918,6 +1918,8 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 config INTEL_SGX
 	bool "Intel SGX core functionality"
 	depends on X86_64 && CPU_SUP_INTEL
+	select MMU_NOTIFIER
+	select SRCU
 	---help---
 	  Intel(R) SGX is a set of CPU instructions that can be used by
 	  applications to set aside private regions of code and data, referred
diff --git a/arch/x86/kernel/cpu/sgx/driver/ioctl.c b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
index d17c60dca114..3552d642b26f 100644
--- a/arch/x86/kernel/cpu/sgx/driver/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
@@ -276,7 +276,6 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 {
 	unsigned long encl_size = secs->size + PAGE_SIZE;
 	struct sgx_epc_page *secs_epc;
-	struct sgx_encl_mm *encl_mm;
 	unsigned long ssaframesize;
 	struct sgx_pageinfo pginfo;
 	struct sgx_secinfo secinfo;
@@ -311,12 +310,6 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 
 	INIT_WORK(&encl->work, sgx_add_page_worker);
 
-	encl_mm = sgx_encl_mm_add(encl, current->mm);
-	if (IS_ERR(encl_mm)) {
-		ret = PTR_ERR(encl_mm);
-		goto err_out;
-	}
-
 	secs_epc = sgx_alloc_page(&encl->secs, true);
 	if (IS_ERR(secs_epc)) {
 		ret = PTR_ERR(secs_epc);
@@ -369,13 +362,6 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 		encl->backing = NULL;
 	}
 
-	if (!list_empty(&encl->mm_list)) {
-		encl_mm = list_first_entry(&encl->mm_list, struct sgx_encl_mm,
-					   list);
-		list_del(&encl_mm->list);
-		kfree(encl_mm);
-	}
-
 	mutex_unlock(&encl->lock);
 	return ret;
 }
diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
index 0c831ee5e2de..07aa5f91b2dd 100644
--- a/arch/x86/kernel/cpu/sgx/driver/main.c
+++ b/arch/x86/kernel/cpu/sgx/driver/main.c
@@ -25,6 +25,7 @@ u32 sgx_xsave_size_tbl[64];
 static int sgx_open(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl;
+	int ret;
 
 	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
 	if (!encl)
@@ -38,6 +39,12 @@ static int sgx_open(struct inode *inode, struct file *file)
 	INIT_LIST_HEAD(&encl->mm_list);
 	spin_lock_init(&encl->mm_lock);
 
+	ret = init_srcu_struct(&encl->srcu);
+	if (ret) {
+		kfree(encl);
+		return ret;
+	}
+
 	file->private_data = encl;
 
 	return 0;
@@ -46,6 +53,32 @@ static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
+	struct sgx_encl_mm *encl_mm;
+
+	/*
+	 * Objects can't be *moved* off an RCU protected list (deletion is ok),
+	 * nor can the object be freed until after synchronize_srcu().
+	 */
+restart:
+	spin_lock(&encl->mm_lock);
+	if (list_empty(&encl->mm_list)) {
+		encl_mm = NULL;
+	} else {
+		encl_mm = list_first_entry(&encl->mm_list, struct sgx_encl_mm,
+					   list);
+		list_del_rcu(&encl_mm->list);
+	}
+	spin_unlock(&encl->mm_lock);
+
+	if (encl_mm) {
+		synchronize_srcu(&encl->srcu);
+
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+
+		sgx_encl_mm_release(encl_mm);
+
+		goto restart;
+	}
 
 	kref_put(&encl->refcount, sgx_encl_release);
 
@@ -63,6 +96,11 @@ static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,
 static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct sgx_encl *encl = file->private_data;
+	int ret;
+
+	ret = sgx_encl_mm_add(encl, vma->vm_mm);
+	if (ret)
+		return ret;
 
 	vma->vm_ops = &sgx_vm_ops;
 	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 9566eb72d417..c6436bbd4a68 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -132,103 +132,125 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 	return entry;
 }
 
-struct sgx_encl_mm *sgx_encl_mm_add(struct sgx_encl *encl,
-				    struct mm_struct *mm)
+static void sgx_encl_mm_release_wq(struct work_struct *work)
+{
+	struct sgx_encl_mm *encl_mm =
+		container_of(work, struct sgx_encl_mm, release_work);
+
+	sgx_encl_mm_release(encl_mm);
+}
+
+/*
+ * Being a call_srcu() callback, this needs to be short, and sgx_encl_release()
+ * is anything but short.  Do the final freeing in yet another async callback.
+ */
+static void sgx_encl_mm_release_delayed(struct rcu_head *rcu)
+{
+	struct sgx_encl_mm *encl_mm =
+		container_of(rcu, struct sgx_encl_mm, rcu);
+
+	INIT_WORK(&encl_mm->release_work, sgx_encl_mm_release_wq);
+	schedule_work(&encl_mm->release_work);
+}
+
+static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
+				     struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm =
+		container_of(mn, struct sgx_encl_mm, mmu_notifier);
+	struct sgx_encl_mm *tmp = NULL;
+
+	/*
+	 * The enclave itself can remove encl_mm.  Note, objects can't be moved
+	 * off an RCU protected list, but deletion is ok.
+	 */
+	spin_lock(&encl_mm->encl->mm_lock);
+	list_for_each_entry(tmp, &encl_mm->encl->mm_list, list) {
+		if (tmp == encl_mm) {
+			list_del_rcu(&encl_mm->list);
+			break;
+		}
+	}
+	spin_unlock(&encl_mm->encl->mm_lock);
+
+	if (tmp == encl_mm) {
+		synchronize_srcu(&encl_mm->encl->srcu);
+
+		/*
+		 * Delay freeing encl_mm until after mmu_notifier releases any
+		 * SRCU locks.  synchronize_srcu() must be called from process
+		 * context, i.e. we can't throw mmu_notifier_unregister() in a
+		 * work queue and be done with it.
+		 */
+		mmu_notifier_unregister_no_release(mn, mm);
+		mmu_notifier_call_srcu(&encl_mm->rcu,
+				       &sgx_encl_mm_release_delayed);
+	}
+}
+
+static const struct mmu_notifier_ops sgx_mmu_notifier_ops = {
+	.release		= sgx_mmu_notifier_release,
+};
+
+static struct sgx_encl_mm *sgx_encl_find_mm(struct sgx_encl *encl,
+					    struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm = NULL;
+	struct sgx_encl_mm *tmp;
+	int idx;
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
+		if (tmp->mm == mm) {
+			encl_mm = tmp;
+			break;
+		}
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	return encl_mm;
+}
+
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 {
 	struct sgx_encl_mm *encl_mm;
+	int ret;
+
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
+
+	/*
+	 * mm_structs are kept on mm_list until the mm or the enclave dies,
+	 * i.e. once an mm is off the list, it's gone for good, therefore it's
+	 * impossible to get a false positive on @mm due to a stale mm_list.
+	 */
+	if (sgx_encl_find_mm(encl, mm))
+		return 0;
 
 	encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
 	if (!encl_mm)
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 
 	encl_mm->encl = encl;
 	encl_mm->mm = mm;
-	kref_init(&encl_mm->refcount);
+	encl_mm->mmu_notifier.ops = &sgx_mmu_notifier_ops;
+
+	ret = __mmu_notifier_register(&encl_mm->mmu_notifier, mm);
+	if (ret) {
+		kfree(encl_mm);
+		return ret;
+	}
+
+	kref_get(&encl->refcount);
 
 	spin_lock(&encl->mm_lock);
-	list_add(&encl_mm->list, &encl->mm_list);
+	list_add_rcu(&encl_mm->list, &encl->mm_list);
 	spin_unlock(&encl->mm_lock);
 
-	return encl_mm;
-}
+	synchronize_srcu(&encl->srcu);
 
-void sgx_encl_mm_release(struct kref *ref)
-{
-	struct sgx_encl_mm *encl_mm =
-		container_of(ref, struct sgx_encl_mm, refcount);
-
-	spin_lock(&encl_mm->encl->mm_lock);
-	list_del(&encl_mm->list);
-	spin_unlock(&encl_mm->encl->mm_lock);
-
-	kfree(encl_mm);
-}
-
-static struct sgx_encl_mm *sgx_encl_get_mm(struct sgx_encl *encl,
-					   struct mm_struct *mm)
-{
-	struct sgx_encl_mm *encl_mm = NULL;
-	struct sgx_encl_mm *prev_mm = NULL;
-	int iter;
-
-	while (true) {
-		encl_mm = sgx_encl_next_mm(encl, prev_mm, &iter);
-		if (prev_mm)
-			kref_put(&prev_mm->refcount, sgx_encl_mm_release);
-		prev_mm = encl_mm;
-
-		if (iter == SGX_ENCL_MM_ITER_DONE)
-			break;
-
-		if (iter == SGX_ENCL_MM_ITER_RESTART)
-			continue;
-
-		if (mm == encl_mm->mm)
-			return encl_mm;
-	}
-
-	return NULL;
-}
-
-static void sgx_vma_open(struct vm_area_struct *vma)
-{
-	struct sgx_encl *encl = vma->vm_private_data;
-	struct sgx_encl_mm *encl_mm;
-
-	if (!encl)
-		return;
-
-	if (encl->flags & SGX_ENCL_DEAD)
-		goto error;
-
-	encl_mm = sgx_encl_get_mm(encl, vma->vm_mm);
-	if (!encl_mm) {
-		encl_mm = sgx_encl_mm_add(encl, vma->vm_mm);
-		if (IS_ERR(encl_mm))
-			goto error;
-	}
-
-	return;
-
-error:
-	vma->vm_private_data = NULL;
-}
-
-static void sgx_vma_close(struct vm_area_struct *vma)
-{
-	struct sgx_encl *encl = vma->vm_private_data;
-	struct sgx_encl_mm *encl_mm;
-
-	if (!encl)
-		return;
-
-	encl_mm = sgx_encl_get_mm(encl, vma->vm_mm);
-	if (encl_mm) {
-		kref_put(&encl_mm->refcount, sgx_encl_mm_release);
-
-		/* Release kref for the VMA. */
-		kref_put(&encl_mm->refcount, sgx_encl_mm_release);
-	}
+	return 0;
 }
 
 static unsigned int sgx_vma_fault(struct vm_fault *vmf)
@@ -366,8 +388,6 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 }
 
 const struct vm_operations_struct sgx_vm_ops = {
-	.close = sgx_vma_close,
-	.open = sgx_vma_open,
 	.fault = sgx_vma_fault,
 	.access = sgx_vma_access,
 };
@@ -465,7 +485,7 @@ void sgx_encl_release(struct kref *ref)
 	if (encl->backing)
 		fput(encl->backing);
 
-	WARN(!list_empty(&encl->mm_list), "sgx: mm_list non-empty");
+	WARN_ONCE(!list_empty(&encl->mm_list), "sgx: mm_list non-empty");
 
 	kfree(encl);
 }
@@ -503,46 +523,6 @@ struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, pgoff_t index)
 	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
 }
 
-/**
- * sgx_encl_next_mm() - Iterate to the next mm
- * @encl:	an enclave
- * @mm:		an mm list entry
- * @iter:	iterator status
- *
- * Return: the enclave mm or NULL
- */
-struct sgx_encl_mm *sgx_encl_next_mm(struct sgx_encl *encl,
-				     struct sgx_encl_mm *encl_mm, int *iter)
-{
-	struct list_head *entry;
-
-	WARN(!encl, "%s: encl is NULL", __func__);
-	WARN(!iter, "%s: iter is NULL", __func__);
-
-	spin_lock(&encl->mm_lock);
-
-	entry = encl_mm ? encl_mm->list.next : encl->mm_list.next;
-	WARN(!entry, "%s: entry is NULL", __func__);
-
-	if (entry == &encl->mm_list) {
-		spin_unlock(&encl->mm_lock);
-		*iter = SGX_ENCL_MM_ITER_DONE;
-		return NULL;
-	}
-
-	encl_mm = list_entry(entry, struct sgx_encl_mm, list);
-
-	if (!kref_get_unless_zero(&encl_mm->refcount)) {
-		spin_unlock(&encl->mm_lock);
-		*iter = SGX_ENCL_MM_ITER_RESTART;
-		return NULL;
-	}
-
-	spin_unlock(&encl->mm_lock);
-	*iter = SGX_ENCL_MM_ITER_NEXT;
-	return encl_mm;
-}
-
 static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, pgtable_t token,
 					    unsigned long addr, void *data)
 {
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index c557f0374d74..0904b3c20ed0 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -9,9 +9,11 @@
 #include <linux/kref.h>
 #include <linux/list.h>
 #include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
 #include <linux/radix-tree.h>
+#include <linux/srcu.h>
 #include <linux/workqueue.h>
 
 /**
@@ -57,8 +59,10 @@ enum sgx_encl_flags {
 struct sgx_encl_mm {
 	struct sgx_encl *encl;
 	struct mm_struct *mm;
-	struct kref refcount;
 	struct list_head list;
+	struct mmu_notifier mmu_notifier;
+	struct work_struct release_work;
+	struct rcu_head rcu;
 };
 
 struct sgx_encl {
@@ -72,6 +76,7 @@ struct sgx_encl {
 	spinlock_t mm_lock;
 	struct file *backing;
 	struct kref refcount;
+	struct srcu_struct srcu;
 	unsigned long base;
 	unsigned long size;
 	unsigned long ssaframesize;
@@ -118,11 +123,13 @@ void sgx_encl_destroy(struct sgx_encl *encl);
 void sgx_encl_release(struct kref *ref);
 pgoff_t sgx_encl_get_index(struct sgx_encl *encl, struct sgx_encl_page *page);
 struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, pgoff_t index);
-struct sgx_encl_mm *sgx_encl_next_mm(struct sgx_encl *encl,
-				     struct sgx_encl_mm *encl_mm, int *iter);
-struct sgx_encl_mm *sgx_encl_mm_add(struct sgx_encl *encl,
-				    struct mm_struct *mm);
-void sgx_encl_mm_release(struct kref *ref);
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
+static inline void sgx_encl_mm_release(struct sgx_encl_mm *encl_mm)
+{
+	kref_put(&encl_mm->encl->refcount, sgx_encl_release);
+
+	kfree(encl_mm);
+}
 int sgx_encl_test_and_clear_young(struct mm_struct *mm,
 				  struct sgx_encl_page *page);
 struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
diff --git a/arch/x86/kernel/cpu/sgx/reclaim.c b/arch/x86/kernel/cpu/sgx/reclaim.c
index f192ade93245..e9427220415b 100644
--- a/arch/x86/kernel/cpu/sgx/reclaim.c
+++ b/arch/x86/kernel/cpu/sgx/reclaim.c
@@ -140,23 +140,13 @@ static bool sgx_reclaimer_evict(struct sgx_epc_page *epc_page)
 {
 	struct sgx_encl_page *page = epc_page->owner;
 	struct sgx_encl *encl = page->encl;
-	struct sgx_encl_mm *encl_mm = NULL;
-	struct sgx_encl_mm *prev_mm = NULL;
+	struct sgx_encl_mm *encl_mm;
 	bool ret = true;
-	int iter;
+	int idx;
 
-	while (true) {
-		encl_mm = sgx_encl_next_mm(encl, prev_mm, &iter);
-		if (prev_mm)
-			kref_put(&prev_mm->refcount, sgx_encl_mm_release);
-		prev_mm = encl_mm;
-
-		if (iter == SGX_ENCL_MM_ITER_DONE)
-			break;
-
-		if (iter == SGX_ENCL_MM_ITER_RESTART)
-			continue;
+	idx = srcu_read_lock(&encl->srcu);
 
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
 		if (!mmget_not_zero(encl_mm->mm))
 			continue;
 
@@ -164,14 +154,14 @@ static bool sgx_reclaimer_evict(struct sgx_epc_page *epc_page)
 		ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
 		up_read(&encl_mm->mm->mmap_sem);
 
-		mmput(encl_mm->mm);
+		mmput_async(encl_mm->mm);
 
-		if (!ret || (encl->flags & SGX_ENCL_DEAD)) {
-			kref_put(&encl_mm->refcount, sgx_encl_mm_release);
+		if (!ret || (encl->flags & SGX_ENCL_DEAD))
 			break;
-		}
 	}
 
+	srcu_read_unlock(&encl->srcu, idx);
+
 	/*
 	 * Do not reclaim this page if it has been recently accessed by any
 	 * mm_struct *and* if the enclave is still alive.  No need to take
@@ -195,24 +185,13 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 	struct sgx_encl_page *page = epc_page->owner;
 	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
 	struct sgx_encl *encl = page->encl;
-	struct sgx_encl_mm *encl_mm = NULL;
-	struct sgx_encl_mm *prev_mm = NULL;
+	struct sgx_encl_mm *encl_mm;
 	struct vm_area_struct *vma;
-	int iter;
-	int ret;
+	int idx, ret;
 
-	while (true) {
-		encl_mm = sgx_encl_next_mm(encl, prev_mm, &iter);
-		if (prev_mm)
-			kref_put(&prev_mm->refcount, sgx_encl_mm_release);
-		prev_mm = encl_mm;
-
-		if (iter == SGX_ENCL_MM_ITER_DONE)
-			break;
-
-		if (iter == SGX_ENCL_MM_ITER_RESTART)
-			continue;
+	idx = srcu_read_lock(&encl->srcu);
 
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
 		if (!mmget_not_zero(encl_mm->mm))
 			continue;
 
@@ -224,9 +203,11 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 
 		up_read(&encl_mm->mm->mmap_sem);
 
-		mmput(encl_mm->mm);
+		mmput_async(encl_mm->mm);
 	}
 
+	srcu_read_unlock(&encl->srcu, idx);
+
 	mutex_lock(&encl->lock);
 
 	if (!(encl->flags & SGX_ENCL_DEAD)) {
@@ -289,32 +270,24 @@ static void sgx_ipi_cb(void *info)
 static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
 {
 	cpumask_t *cpumask = &encl->cpumask;
-	struct sgx_encl_mm *encl_mm = NULL;
-	struct sgx_encl_mm *prev_mm = NULL;
-	int iter;
+	struct sgx_encl_mm *encl_mm;
+	int idx;
 
 	cpumask_clear(cpumask);
 
-	while (true) {
-		encl_mm = sgx_encl_next_mm(encl, prev_mm, &iter);
-		if (prev_mm)
-			kref_put(&prev_mm->refcount, sgx_encl_mm_release);
-		prev_mm = encl_mm;
-
-		if (iter == SGX_ENCL_MM_ITER_DONE)
-			break;
-
-		if (iter == SGX_ENCL_MM_ITER_RESTART)
-			continue;
+	idx = srcu_read_lock(&encl->srcu);
 
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
 		if (!mmget_not_zero(encl_mm->mm))
 			continue;
 
 		cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
 
-		mmput(encl_mm->mm);
+		mmput_async(encl_mm->mm);
 	}
 
+	srcu_read_unlock(&encl->srcu, idx);
+
 	return cpumask;
 }
 
-- 
2.21.0


^ permalink raw reply related

* [RFC PATCH v4 12/12] LSM: x86/sgx: Show line of sight to LSM support SGX2's EAUG
From: Sean Christopherson @ 2019-06-19 22:24 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-1-sean.j.christopherson@intel.com>

Wire up a theoretical EAUG flag to show that the proposed LSM model is
extensible to SGX2, i.e. that SGX can communicate to LSMs that an EAUG'd
page is being mapped executable, as opposed to having to require
userspace to state that an EAUG'd page *may* be mapped executable in the
future.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kernel/cpu/sgx/driver/main.c | 10 +++++---
 arch/x86/kernel/cpu/sgx/encl.c        | 33 ++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h        |  2 ++
 include/linux/lsm_hooks.h             |  2 +-
 include/linux/security.h              |  4 ++--
 security/security.c                   |  4 ++--
 security/selinux/hooks.c              |  4 +++-
 7 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
index 4379a2fb1f82..b478c0f45279 100644
--- a/arch/x86/kernel/cpu/sgx/driver/main.c
+++ b/arch/x86/kernel/cpu/sgx/driver/main.c
@@ -99,7 +99,8 @@ static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,
  * page is considered to have no RWX permissions, i.e. is inaccessible.
  */
 static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
-				     struct vm_area_struct *vma)
+				     struct vm_area_struct *vma,
+				     bool *eaug)
 {
 	unsigned long allowed_rwx = VM_READ | VM_WRITE | VM_EXEC;
 	unsigned long idx, idx_start, idx_end;
@@ -123,6 +124,8 @@ static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
 			allowed_rwx = 0;
 		else
 			allowed_rwx &= page->vm_prot_bits;
+		if (page->vm_prot_bits & SGX_VM_EAUG)
+			*eaug = true;
 		if (!allowed_rwx)
 			break;
 	}
@@ -134,16 +137,17 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct sgx_encl *encl = file->private_data;
 	unsigned long allowed_rwx, prot;
+	bool eaug = false;
 	int ret;
 
-	allowed_rwx = sgx_allowed_rwx(encl, vma);
+	allowed_rwx = sgx_allowed_rwx(encl, vma, &eaug);
 	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) & ~allowed_rwx)
 		return -EACCES;
 
 	prot = _calc_vm_trans(vma->vm_flags, VM_READ, PROT_READ) |
 	       _calc_vm_trans(vma->vm_flags, VM_WRITE, PROT_WRITE) |
 	       _calc_vm_trans(vma->vm_flags, VM_EXEC, PROT_EXEC);
-	ret = security_enclave_map(prot);
+	ret = security_enclave_map(prot, eaug);
 	if (ret)
 		return ret;
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 059d90dcaa27..2e64676a8144 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -389,10 +389,41 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 }
 
 #ifdef CONFIG_SECURITY
+static bool is_eaug_range(struct sgx_encl *encl, unsigned long start,
+			  unsigned long end)
+{
+	unsigned long idx, idx_start, idx_end;
+	struct sgx_encl_page *page;
+
+	/* Enclave is dead or inaccessible. */
+	if (!encl)
+		return false;
+
+	idx_start = PFN_DOWN(start);
+	idx_end = PFN_DOWN(end - 1);
+
+	for (idx = idx_start; idx <= idx_end; ++idx) {
+		/*
+		 * No need to take encl->lock, vm_prot_bits is set prior to
+		 * insertion and never changes, and racing with adding pages is
+		 * a userspace bug.
+		 */
+		rcu_read_lock();
+		page = radix_tree_lookup(&encl->page_tree, idx);
+		rcu_read_unlock();
+
+		/* Non-existent page can only be PROT_NONE, bail early. */
+		if (!page || page->vm_prot_bits & SGX_VM_EAUG)
+			return true;
+	}
+
+	return false;
+}
 static int sgx_vma_mprotect(struct vm_area_struct *vma, unsigned long start,
 			    unsigned long end, unsigned long prot)
 {
-	return security_enclave_map(prot);
+	return security_enclave_map(prot,
+		is_eaug_range(vma->vm_private_data, start, end));
 }
 #endif
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 5ad018c8d74c..dae1a22dc87c 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -41,6 +41,8 @@ enum sgx_encl_page_desc {
 #define SGX_ENCL_PAGE_VA_OFFSET(encl_page) \
 	((encl_page)->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK)
 
+#define SGX_VM_EAUG	BIT(3)
+
 struct sgx_encl_page {
 	unsigned long desc;
 	unsigned long vm_prot_bits;
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 3bc92c65f287..d7da732cf56e 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1819,7 +1819,7 @@ union security_list_options {
 #endif /* CONFIG_BPF_SYSCALL */
 
 #ifdef CONFIG_INTEL_SGX
-	int (*enclave_map)(unsigned long prot);
+	int (*enclave_map)(unsigned long prot, bool eaug);
 	int (*enclave_load)(struct vm_area_struct *vma, unsigned long prot,
 			    bool measured);
 #endif /* CONFIG_INTEL_SGX */
diff --git a/include/linux/security.h b/include/linux/security.h
index 572ddfc53039..c55e14d776c8 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1831,11 +1831,11 @@ static inline void security_bpf_prog_free(struct bpf_prog_aux *aux)
 
 #ifdef CONFIG_INTEL_SGX
 #ifdef CONFIG_SECURITY
-int security_enclave_map(unsigned long prot);
+int security_enclave_map(unsigned long prot, bool eaug);
 int security_enclave_load(struct vm_area_struct *vma, unsigned long prot,
 			  bool measured);
 #else
-static inline int security_enclave_map(unsigned long prot)
+static inline int security_enclave_map(unsigned long prot, bool eaug)
 {
 	return 0;
 }
diff --git a/security/security.c b/security/security.c
index 00f483beb1cc..f276f05341f2 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2361,9 +2361,9 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux)
 #endif /* CONFIG_BPF_SYSCALL */
 
 #ifdef CONFIG_INTEL_SGX
-int security_enclave_map(unsigned long prot)
+int security_enclave_map(unsigned long prot, bool eaug)
 {
-	return call_int_hook(enclave_map, 0, prot);
+	return call_int_hook(enclave_map, 0, prot, eaug);
 }
 int security_enclave_load(struct vm_area_struct *vma, unsigned long prot,
 			  bool measured)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 8a431168e454..f349419d4c12 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -6733,7 +6733,7 @@ static inline int sgx_has_perm(u32 sid, u32 requested)
 			    SECCLASS_PROCESS2, requested, NULL);
 }
 
-static int selinux_enclave_map(unsigned long prot)
+static int selinux_enclave_map(unsigned long prot, bool eaug)
 {
 	const struct cred *cred = current_cred();
 	u32 sid = cred_sid(cred);
@@ -6743,6 +6743,8 @@ static int selinux_enclave_map(unsigned long prot)
 
 	if ((prot & PROT_EXEC) && (prot & PROT_WRITE))
 		return sgx_has_perm(sid, PROCESS2__SGX_MAPWX);
+	else if (eaug && (prot & PROT_EXEC))
+		return sgx_has_perm(sid, PROCESS2__SGX_EXECDIRTY);
 
 	return 0;
 }
-- 
2.21.0


^ permalink raw reply related

* [RFC PATCH v4 06/12] mm: Introduce vm_ops->may_mprotect()
From: Sean Christopherson @ 2019-06-19 22:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-1-sean.j.christopherson@intel.com>

SGX will use ->may_mprotect() to invoke an SGX variant of the existing
file_mprotect() and mmap_file() LSM hooks.

The name may_mprotect() is intended to reflect the hook's purpose as a
way to restrict mprotect() as opposed to a wholesale replacement.

Due to the nature of SGX and its Enclave Page Cache (EPC), all enclave
VMAs are backed by a single file, i.e. /dev/sgx/enclave, that must be
MAP_SHARED.  Furthermore, all enclaves need read, write and execute
VMAs.  As a result, applying W^X restrictions on /dev/sgx/enclave using
existing LSM hooks is for all intents and purposes impossible, e.g.
denying either W or X would deny access to *any* enclave.

By hooking mprotect(), SGX can invoke an SGX specific LSM hook, which in
turn allows LSMs to enforce W^X policies.

Alternatively, SGX could provide a helper to identify enclaves given a
vma or file.  LSMs could then check if a mapping is for enclave and take
action according.

A second alternative would be to have SGX implement its own LSM hooks
for file_mprotect() and mmap_file(), using them to "forward" the call to
the SGX specific hook.

The major con to both alternatives is that they provide zero flexibility
for the SGX specific LSM hook.  The "is_sgx_enclave()" helper doesn't
allow SGX can't supply any additional information whatsoever, and the
mmap_file() hook is called before the final address is known, e.g. SGX
can't provide any information about the specific enclave being mapped.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/mprotect.c      | 15 +++++++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..b11ec420c8d7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -458,6 +458,8 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*split)(struct vm_area_struct * area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct * area);
+	int (*may_mprotect)(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, unsigned long prot);
 	vm_fault_t (*fault)(struct vm_fault *vmf);
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
 			enum page_entry_size pe_size);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index bf38dfbbb4b4..18732543b295 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -547,13 +547,20 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 			goto out;
 		}
 
-		error = security_file_mprotect(vma, reqprot, prot);
-		if (error)
-			goto out;
-
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
+
+		if (vma->vm_ops && vma->vm_ops->may_mprotect) {
+			error = vma->vm_ops->may_mprotect(vma, nstart, tmp, prot);
+			if (error)
+				goto out;
+		}
+
+		error = security_file_mprotect(vma, reqprot, prot);
+		if (error)
+			goto out;
+
 		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
-- 
2.21.0


^ permalink raw reply related

* [RFC PATCH v4 08/12] security/selinux: Require SGX_MAPWX to map enclave page WX
From: Sean Christopherson @ 2019-06-19 22:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-1-sean.j.christopherson@intel.com>

Hook enclave_map() to require a new per-process capability, SGX_MAPWX,
when mapping an enclave as simultaneously writable and executable.
Note, @prot contains the actual protection bits that will be set by the
kernel, not the maximal protection bits specified by userspace when the
page was first loaded into the enclave.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 security/selinux/hooks.c            | 21 +++++++++++++++++++++
 security/selinux/include/classmap.h |  3 ++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 3ec702cf46ca..fc239e541b62 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -6726,6 +6726,23 @@ static void selinux_bpf_prog_free(struct bpf_prog_aux *aux)
 }
 #endif
 
+#ifdef CONFIG_INTEL_SGX
+static int selinux_enclave_map(unsigned long prot)
+{
+	const struct cred *cred = current_cred();
+	u32 sid = cred_sid(cred);
+
+	/* SGX is supported only in 64-bit kernels. */
+	WARN_ON_ONCE(!default_noexec);
+
+	if ((prot & PROT_EXEC) && (prot & PROT_WRITE))
+		return avc_has_perm(&selinux_state, sid, sid,
+				    SECCLASS_PROCESS2, PROCESS2__SGX_MAPWX,
+				    NULL);
+	return 0;
+}
+#endif
+
 struct lsm_blob_sizes selinux_blob_sizes __lsm_ro_after_init = {
 	.lbs_cred = sizeof(struct task_security_struct),
 	.lbs_file = sizeof(struct file_security_struct),
@@ -6968,6 +6985,10 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(bpf_map_free_security, selinux_bpf_map_free),
 	LSM_HOOK_INIT(bpf_prog_free_security, selinux_bpf_prog_free),
 #endif
+
+#ifdef CONFIG_INTEL_SGX
+	LSM_HOOK_INIT(enclave_map, selinux_enclave_map),
+#endif
 };
 
 static __init int selinux_init(void)
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 201f7e588a29..cfd91e879bdf 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -51,7 +51,8 @@ struct security_class_mapping secclass_map[] = {
 	    "execmem", "execstack", "execheap", "setkeycreate",
 	    "setsockcreate", "getrlimit", NULL } },
 	{ "process2",
-	  { "nnp_transition", "nosuid_transition", NULL } },
+	  { "nnp_transition", "nosuid_transition",
+	    "sgx_mapwx", NULL } },
 	{ "system",
 	  { "ipc_info", "syslog_read", "syslog_mod",
 	    "syslog_console", "module_request", "module_load", NULL } },
-- 
2.21.0


^ permalink raw reply related

* [RFC PATCH v4 10/12] security/selinux: Add enclave_load() implementation
From: Sean Christopherson @ 2019-06-19 22:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-sgx, linux-security-module, selinux, Bill Roberts,
	Casey Schaufler, James Morris, Dave Hansen, Cedric Xing,
	Andy Lutomirski, Jethro Beekman, Dr . Greg Wettstein,
	Stephen Smalley
In-Reply-To: <20190619222401.14942-1-sean.j.christopherson@intel.com>

The goal of selinux_enclave_load() is to provide a facsimile of the
existing selinux_file_mprotect() and file_map_prot_check() policies,
but tailored to the unique properties of SGX.

For example, an enclave page is technically backed by a MAP_SHARED file,
but the "file" is essentially shared memory that is never persisted
anywhere and also requires execute permissions (for some pages).

Enclaves are also less priveleged than normal user code, e.g. SYSCALL
instructions #UD if attempted in an enclave.  For this reason, add SGX
specific permissions instead of reusing existing permissions such as
FILE__EXECUTE so that policies can allow running code in an enclave, or
allow dynamically loading code in an enclave without having to grant the
same capability to normal user code outside of the enclave.

Intended use of each permission:

  - SGX_EXECDIRTY: dynamically load code within the enclave itself
  - SGX_EXECUNMR: load unmeasured code into the enclave, e.g. Graphene
  - SGX_EXECANON: load code from anonymous memory (likely Graphene)
  - SGX_EXECUTE: load an enclave from a file, i.e. normal behavior

Note, equivalents to FILE__READ and FILE__WRITE are intentionally never
required.  Writes to the enclave page are contained to the EPC, i.e.
never hit the original file, and read permissions have already been
vetted (or the VMA doesn't have PROT_READ, in which case loading the
page into the enclave will fail).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 security/selinux/hooks.c            | 55 +++++++++++++++++++++++++++--
 security/selinux/include/classmap.h |  5 +--
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index fc239e541b62..8a431168e454 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -6727,6 +6727,12 @@ static void selinux_bpf_prog_free(struct bpf_prog_aux *aux)
 #endif
 
 #ifdef CONFIG_INTEL_SGX
+static inline int sgx_has_perm(u32 sid, u32 requested)
+{
+	return avc_has_perm(&selinux_state, sid, sid,
+			    SECCLASS_PROCESS2, requested, NULL);
+}
+
 static int selinux_enclave_map(unsigned long prot)
 {
 	const struct cred *cred = current_cred();
@@ -6736,11 +6742,53 @@ static int selinux_enclave_map(unsigned long prot)
 	WARN_ON_ONCE(!default_noexec);
 
 	if ((prot & PROT_EXEC) && (prot & PROT_WRITE))
-		return avc_has_perm(&selinux_state, sid, sid,
-				    SECCLASS_PROCESS2, PROCESS2__SGX_MAPWX,
-				    NULL);
+		return sgx_has_perm(sid, PROCESS2__SGX_MAPWX);
+
 	return 0;
 }
+
+static int selinux_enclave_load(struct vm_area_struct *vma, unsigned long prot,
+				bool measured)
+{
+	const struct cred *cred = current_cred();
+	u32 sid = cred_sid(cred);
+	int ret;
+
+	/* SGX is supported only in 64-bit kernels. */
+	WARN_ON_ONCE(!default_noexec);
+
+	/* Only executable enclave pages are restricted in any way. */
+	if (!(prot & PROT_EXEC))
+		return 0;
+
+	/*
+	 * WX at load time only requires EXECDIRTY, e.g. to allow W->X.  Actual
+	 * WX mappings require MAPWX (see selinux_enclave_map()).
+	 */
+	if (prot & PROT_WRITE) {
+		ret = sgx_has_perm(sid, PROCESS2__SGX_EXECDIRTY);
+		if (ret)
+			goto out;
+	}
+	if (!measured) {
+		ret = sgx_has_perm(sid, PROCESS2__SGX_EXECUNMR);
+		if (ret)
+			goto out;
+	}
+
+	if (!vma->vm_file || IS_PRIVATE(file_inode(vma->vm_file)) ||
+	    vma->anon_vma)
+		/*
+		 * Loading enclave code from an anonymous mapping or from a
+		 * modified private file mapping.
+		 */
+		ret = sgx_has_perm(sid, PROCESS2__SGX_EXECANON);
+	else
+		/* Loading from a shared or unmodified private file mapping. */
+		ret = file_has_perm(cred, vma->vm_file, FILE__SGX_EXECUTE);
+out:
+	return ret;
+}
 #endif
 
 struct lsm_blob_sizes selinux_blob_sizes __lsm_ro_after_init = {
@@ -6988,6 +7036,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 
 #ifdef CONFIG_INTEL_SGX
 	LSM_HOOK_INIT(enclave_map, selinux_enclave_map),
+	LSM_HOOK_INIT(enclave_load, selinux_enclave_load),
 #endif
 };
 
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index cfd91e879bdf..baa1757be46a 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -7,7 +7,7 @@
 
 #define COMMON_FILE_PERMS COMMON_FILE_SOCK_PERMS, "unlink", "link", \
     "rename", "execute", "quotaon", "mounton", "audit_access", \
-    "open", "execmod"
+    "open", "execmod", "sgx_execute"
 
 #define COMMON_SOCK_PERMS COMMON_FILE_SOCK_PERMS, "bind", "connect", \
     "listen", "accept", "getopt", "setopt", "shutdown", "recvfrom",  \
@@ -52,7 +52,8 @@ struct security_class_mapping secclass_map[] = {
 	    "setsockcreate", "getrlimit", NULL } },
 	{ "process2",
 	  { "nnp_transition", "nosuid_transition",
-	    "sgx_mapwx", NULL } },
+	    "sgx_mapwx", "sgx_execdirty", "sgx_execanon", "sgx_execunmr",
+	    NULL } },
 	{ "system",
 	  { "ipc_info", "syslog_read", "syslog_mod",
 	    "syslog_console", "module_request", "module_load", NULL } },
-- 
2.21.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox