Linux userland API discussions


* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-07 16:23 UTC
  To: Peter Zijlstra
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190607080832.GT3419@hirez.programming.kicks-ass.net>

On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote:
> On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote:
> > Indirect Branch Tracking (IBT) provides an optional legacy code bitmap
> > that allows execution of legacy, non-IBT compatible library by an
> > IBT-enabled application.  When set, each bit in the bitmap indicates
> > one page of legacy code.
> > 
> > The bitmap is allocated and setup from the application.
> > +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size)
> > +{
> > +	u64 r;
> > +
> > +	if (!current->thread.cet.ibt_enabled)
> > +		return -EINVAL;
> > +
> > +	if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX))
> > +		return -EINVAL;
> > +
> > +	current->thread.cet.ibt_bitmap_addr = bitmap;
> > +	current->thread.cet.ibt_bitmap_size = size;
> > +
> > +	/*
> > +	 * Turn on IBT legacy bitmap.
> > +	 */
> > +	modify_fpu_regs_begin();
> > +	rdmsrl(MSR_IA32_U_CET, r);
> > +	r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> > +	wrmsrl(MSR_IA32_U_CET, r);
> > +	modify_fpu_regs_end();
> > +
> > +	return 0;
> > +}
> 
> So you just program a random user supplied address into the hardware.
> What happens if there's not actually anything at that address or the
> user munmap()s the data after doing this?

This function checks the bitmap's alignment and size, and anything else is the
app's responsibility.  What else do you think the kernel should check?

Yu-cheng


* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-07 16:35 UTC
  To: Yu-cheng Yu
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <aa8a92ef231d512b5c9855ef416db050b5ab59a6.camel@intel.com>



> On Jun 7, 2019, at 9:23 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> 
>> On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote:
>>> On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote:
>>> Indirect Branch Tracking (IBT) provides an optional legacy code bitmap
>>> that allows execution of legacy, non-IBT compatible library by an
>>> IBT-enabled application.  When set, each bit in the bitmap indicates
>>> one page of legacy code.
>>> 
>>> The bitmap is allocated and setup from the application.
>>> +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size)
>>> +{
>>> +    u64 r;
>>> +
>>> +    if (!current->thread.cet.ibt_enabled)
>>> +        return -EINVAL;
>>> +
>>> +    if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX))
>>> +        return -EINVAL;
>>> +
>>> +    current->thread.cet.ibt_bitmap_addr = bitmap;
>>> +    current->thread.cet.ibt_bitmap_size = size;
>>> +
>>> +    /*
>>> +     * Turn on IBT legacy bitmap.
>>> +     */
>>> +    modify_fpu_regs_begin();
>>> +    rdmsrl(MSR_IA32_U_CET, r);
>>> +    r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
>>> +    wrmsrl(MSR_IA32_U_CET, r);
>>> +    modify_fpu_regs_end();
>>> +
>>> +    return 0;
>>> +}
>> 
>> So you just program a random user supplied address into the hardware.
>> What happens if there's not actually anything at that address or the
>> user munmap()s the data after doing this?
> 
> This function checks the bitmap's alignment and size, and anything else is the
> app's responsibility.  What else do you think the kernel should check?
> 

One might reasonably wonder why this state is privileged in the first place and, given that, why we’re allowing it to be written like this.

Arguably we should have another prctl to lock these values (until exec) as a hardening measure.


* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-07 16:39 UTC
  To: Andy Lutomirski, Yu-cheng Yu
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <76B7B1AE-3AEA-4162-B539-990EF3CCE2C2@amacapital.net>

On 6/7/19 9:35 AM, Andy Lutomirski wrote:
> One might reasonably wonder why this state is privileged in the first
> place and, given that, why we’re allowing it to be written like
> this.

I think it's generally a good architectural practice to make things like
this privileged.  They're infrequent so can survive the cost of a trip
in/out of the kernel and are a great choke point to make sure the OS is
involved.  I wish we had the same for MPX or pkeys per-task "setup".


* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-07 16:45 UTC
  To: Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <76B7B1AE-3AEA-4162-B539-990EF3CCE2C2@amacapital.net>

On Fri, 2019-06-07 at 09:35 -0700, Andy Lutomirski wrote:
> > On Jun 7, 2019, at 9:23 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > 
> > > On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote:
> > > > On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote:
> > > > Indirect Branch Tracking (IBT) provides an optional legacy code bitmap
> > > > that allows execution of legacy, non-IBT compatible library by an
> > > > IBT-enabled application.  When set, each bit in the bitmap indicates
> > > > one page of legacy code.
> > > > 
> > > > The bitmap is allocated and setup from the application.
> > > > +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size)
> > > > +{
> > > > +    u64 r;
> > > > +
> > > > +    if (!current->thread.cet.ibt_enabled)
> > > > +        return -EINVAL;
> > > > +
> > > > +    if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX))
> > > > +        return -EINVAL;
> > > > +
> > > > +    current->thread.cet.ibt_bitmap_addr = bitmap;
> > > > +    current->thread.cet.ibt_bitmap_size = size;
> > > > +
> > > > +    /*
> > > > +     * Turn on IBT legacy bitmap.
> > > > +     */
> > > > +    modify_fpu_regs_begin();
> > > > +    rdmsrl(MSR_IA32_U_CET, r);
> > > > +    r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> > > > +    wrmsrl(MSR_IA32_U_CET, r);
> > > > +    modify_fpu_regs_end();
> > > > +
> > > > +    return 0;
> > > > +}
> > > 
> > > So you just program a random user supplied address into the hardware.
> > > What happens if there's not actually anything at that address or the
> > > user munmap()s the data after doing this?
> > 
> > This function checks the bitmap's alignment and size, and anything else is
> > the app's responsibility.  What else do you think the kernel should check?
> > 
> 
> One might reasonably wonder why this state is privileged in the first place
> and, given that, why we’re allowing it to be written like this.
> 
> Arguably we should have another prctl to lock these values (until exec) as a
> hardening measure.

We can prevent the bitmap from being set more than once.  I will test it.

Yu-cheng


* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-07 17:05 UTC
  To: Yu-cheng Yu
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <ac8827d7b516f4b58e1df20f45b94998d36c418c.camel@intel.com>




> On Jun 7, 2019, at 9:45 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> 
> On Fri, 2019-06-07 at 09:35 -0700, Andy Lutomirski wrote:
>>> On Jun 7, 2019, at 9:23 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>> 
>>>>> On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote:
>>>>> On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote:
>>>>> Indirect Branch Tracking (IBT) provides an optional legacy code bitmap
>>>>> that allows execution of legacy, non-IBT compatible library by an
>>>>> IBT-enabled application.  When set, each bit in the bitmap indicates
>>>>> one page of legacy code.
>>>>> 
>>>>> The bitmap is allocated and setup from the application.
>>>>> +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size)
>>>>> +{
>>>>> +    u64 r;
>>>>> +
>>>>> +    if (!current->thread.cet.ibt_enabled)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX))
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    current->thread.cet.ibt_bitmap_addr = bitmap;
>>>>> +    current->thread.cet.ibt_bitmap_size = size;
>>>>> +
>>>>> +    /*
>>>>> +     * Turn on IBT legacy bitmap.
>>>>> +     */
>>>>> +    modify_fpu_regs_begin();
>>>>> +    rdmsrl(MSR_IA32_U_CET, r);
>>>>> +    r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
>>>>> +    wrmsrl(MSR_IA32_U_CET, r);
>>>>> +    modify_fpu_regs_end();
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>> 
>>>> So you just program a random user supplied address into the hardware.
>>>> What happens if there's not actually anything at that address or the
>>>> user munmap()s the data after doing this?
>>> 
>>> This function checks the bitmap's alignment and size, and anything else is
>>> the app's responsibility.  What else do you think the kernel should check?
>>> 
>> 
>> One might reasonably wonder why this state is privileged in the first place
>> and, given that, why we’re allowing it to be written like this.
>> 
>> Arguably we should have another prctl to lock these values (until exec) as a
>> hardening measure.
> 
> We can prevent the bitmap from being set more than once.  I will test it.
> 

I think it would be better to make locking an explicit opt-in.


* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Peter Zijlstra @ 2019-06-07 17:43 UTC
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <aa8a92ef231d512b5c9855ef416db050b5ab59a6.camel@intel.com>

On Fri, Jun 07, 2019 at 09:23:43AM -0700, Yu-cheng Yu wrote:
> On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote:
> > > Indirect Branch Tracking (IBT) provides an optional legacy code bitmap
> > > that allows execution of legacy, non-IBT compatible library by an
> > > IBT-enabled application.  When set, each bit in the bitmap indicates
> > > one page of legacy code.
> > > 
> > > The bitmap is allocated and setup from the application.
> > > +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size)
> > > +{
> > > +	u64 r;
> > > +
> > > +	if (!current->thread.cet.ibt_enabled)
> > > +		return -EINVAL;
> > > +
> > > +	if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX))
> > > +		return -EINVAL;
> > > +
> > > +	current->thread.cet.ibt_bitmap_addr = bitmap;
> > > +	current->thread.cet.ibt_bitmap_size = size;
> > > +
> > > +	/*
> > > +	 * Turn on IBT legacy bitmap.
> > > +	 */
> > > +	modify_fpu_regs_begin();
> > > +	rdmsrl(MSR_IA32_U_CET, r);
> > > +	r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> > > +	wrmsrl(MSR_IA32_U_CET, r);
> > > +	modify_fpu_regs_end();
> > > +
> > > +	return 0;
> > > +}
> > 
> > So you just program a random user supplied address into the hardware.
> > What happens if there's not actually anything at that address or the
> > user munmap()s the data after doing this?
> 
> This function checks the bitmap's alignment and size, and anything else is the
> app's responsibility.  What else do you think the kernel should check?

I've no idea what the kernel should do; since you failed to answer the
question what happens when you point this to garbage.

Does it then fault or what?


* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-07 17:59 UTC
  To: Peter Zijlstra, Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190607174336.GM3436@hirez.programming.kicks-ass.net>

On 6/7/19 10:43 AM, Peter Zijlstra wrote:
> I've no idea what the kernel should do; since you failed to answer the
> question what happens when you point this to garbage.
> 
> Does it then fault or what?

Yeah, I think you'll fault with a rather mysterious CR2 value since
you'll go look at the instruction that faulted and not see any
references to the CR2 value.

I think this new MSR probably needs to get included in oops output when
CET is enabled.

Why don't we require that a VMA be in place for the entire bitmap?
Don't we need a "get" prctl function too in case something like a JIT is
running and needs to find the location of this bitmap to set bits itself?

Or, do we just go whole-hog and have the kernel manage the bitmap
itself. Our interface here could be:

	prctl(PR_MARK_CODE_AS_LEGACY, start, size);

and then have the kernel allocate and set the bitmap for those code
locations.
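
As a rough userspace sketch of the bookkeeping such a kernel-managed
interface would have to do (one bit per page, as the patch description
says), something along these lines might work.  The 4 KiB page size, the
bit order within each byte, and the names legacy_bitmap_bytes() and
mark_code_as_legacy() are all assumptions made for illustration; none of
this is taken from the patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Bytes of bitmap needed to cover addr_space_size bytes, one bit per page. */
static size_t legacy_bitmap_bytes(uint64_t addr_space_size)
{
	uint64_t pages = addr_space_size >> SKETCH_PAGE_SHIFT;

	return (size_t)((pages + 7) / 8);
}

/*
 * Set the bit of every page overlapping [start, start + size).
 * Assumed layout: page N lives in byte N / 8, bit N % 8.
 * Returns 0 on success, -1 if the range is empty, wraps, or does not
 * fit in the bitmap.
 */
static int mark_code_as_legacy(uint8_t *bitmap, size_t bitmap_bytes,
			       uint64_t start, uint64_t size)
{
	uint64_t first, last, page;

	assert(bitmap != NULL);

	/* Reject empty ranges and ranges that wrap the address space. */
	if (size == 0 || start > UINT64_MAX - (size - 1))
		return -1;

	first = start >> SKETCH_PAGE_SHIFT;
	last = (start + size - 1) >> SKETCH_PAGE_SHIFT;

	/* The highest page touched must still be inside the bitmap. */
	if (last / 8 >= bitmap_bytes)
		return -1;

	for (page = first; page <= last; page++)
		bitmap[page / 8] |= (uint8_t)(1u << (page % 8));

	return 0;
}
```

With this layout the kernel, not the application, decides where the
bitmap lives and how large it is, which sidesteps the question of what
happens when the MSR points at an unmapped address.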


* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-06-07 18:01 UTC
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200646.3951-23-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:06:41PM -0700, Yu-cheng Yu wrote:
> An ELF file's .note.gnu.property indicates features the executable file
> can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> GNU_PROPERTY_X86_FEATURE_1_SHSTK.
> 
> With this patch, if an arch needs to setup features from ELF properties,
> it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and a specific
> arch_setup_property().
> 
> For example, for X86_64:
> 
> int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
> {
> 	int r;
> 	uint32_t property;
> 
> 	r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
> 			     &property);
> 	...
> }

Although this code works for the simple case, I have some concerns about
some aspects of the implementation here.  There appear to be some bounds
checking / buffer overrun issues, and the code seems quite complex.

Maybe this patch tries too hard to be compatible with toolchains that do
silly things such as embedding huge notes in an executable, or mixing
NT_GNU_PROPERTY_TYPE_0 in a single PT_NOTE with a load of junk not
relevant to the loader.  I wonder whether Linux can dictate what
interpretation(s) of the ELF specs it is prepared to support, rather than
trying to support absolutely anything.


I've commented on some potential issues below, but my review isn't
exhaustive -- I may also have simply not understood the code in some
cases, so I apologise in advance for that!

I've also marked a few coding style nits that make the code harder to
read than necessary (but this is partly a matter of taste).

Comments below.

Cheers
---Dave

> 
> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  fs/Kconfig.binfmt        |   3 +
>  fs/Makefile              |   1 +
>  fs/binfmt_elf.c          |  13 ++
>  fs/gnu_property.c        | 351 +++++++++++++++++++++++++++++++++++++++
>  include/linux/elf.h      |  12 ++
>  include/uapi/linux/elf.h |  14 ++
>  6 files changed, 394 insertions(+)
>  create mode 100644 fs/gnu_property.c
> 
> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
> index f87ddd1b6d72..397138ab305b 100644
> --- a/fs/Kconfig.binfmt
> +++ b/fs/Kconfig.binfmt
> @@ -36,6 +36,9 @@ config COMPAT_BINFMT_ELF
>  config ARCH_BINFMT_ELF_STATE
>  	bool
>  
> +config ARCH_USE_GNU_PROPERTY
> +	bool
> +
>  config BINFMT_ELF_FDPIC
>  	bool "Kernel support for FDPIC ELF binaries"
>  	default y if !BINFMT_ELF
> diff --git a/fs/Makefile b/fs/Makefile
> index c9aea23aba56..b69f18c14e09 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -44,6 +44,7 @@ obj-$(CONFIG_BINFMT_ELF)	+= binfmt_elf.o
>  obj-$(CONFIG_COMPAT_BINFMT_ELF)	+= compat_binfmt_elf.o
>  obj-$(CONFIG_BINFMT_ELF_FDPIC)	+= binfmt_elf_fdpic.o
>  obj-$(CONFIG_BINFMT_FLAT)	+= binfmt_flat.o
> +obj-$(CONFIG_ARCH_USE_GNU_PROPERTY) += gnu_property.o
>  
>  obj-$(CONFIG_FS_MBCACHE)	+= mbcache.o
>  obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.o
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 8264b468f283..c3ea73787e93 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1080,6 +1080,19 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  		goto out_free_dentry;
>  	}
>  
> +	if (interpreter) {
> +		retval = arch_setup_property(&loc->interp_elf_ex,
> +					     interp_elf_phdata,
> +					     interpreter, true);
> +	} else {
> +		retval = arch_setup_property(&loc->elf_ex,
> +					     elf_phdata,
> +					     bprm->file, false);
> +	}
> +
> +	if (retval < 0)
> +		goto out_free_dentry;
> +
>  	if (interpreter) {
>  		unsigned long interp_map_addr = 0;
>  
> diff --git a/fs/gnu_property.c b/fs/gnu_property.c
> new file mode 100644
> index 000000000000..9c4d1d5ebf00
> --- /dev/null
> +++ b/fs/gnu_property.c
> @@ -0,0 +1,351 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Extract an ELF file's .note.gnu.property.
> + *
> + * The path from the ELF header to the note section is the following:
> + * elfhdr->elf_phdr->elf_note->property[].
> + */
> +
> +#include <uapi/linux/elf-em.h>
> +#include <linux/processor.h>
> +#include <linux/binfmts.h>
> +#include <linux/elf.h>
> +#include <linux/slab.h>
> +#include <linux/fs.h>
> +#include <linux/uaccess.h>
> +#include <linux/string.h>
> +#include <linux/compat.h>
> +
> +/*
> + * The .note.gnu.property layout:
> + *
> + *	struct elf_note {
> + *		u32 n_namesz; --> sizeof(n_name[]); always (4)
> + *		u32 n_ndescsz;--> sizeof(property[])
> + *		u32 n_type;   --> always NT_GNU_PROPERTY_TYPE_0
> + *	};
> + *	char n_name[4]; --> always 'GNU\0'
> + *
> + *	struct {
> + *		struct gnu_property {
> + *			u32 pr_type;
> + *			u32 pr_datasz;
> + *		};
> + *		u8 pr_data[pr_datasz];
> + *	}[];
> + */
> +
> +#define BUF_SIZE (PAGE_SIZE / 4)

Nit: magic number in disguise.  What does the size of ELF notes have
to do with the page size?

> +
> +typedef bool (test_item_fn)(void *buf, u32 *arg, u32 type);
> +typedef void *(next_item_fn)(void *buf, u32 *arg, u32 type);
> +
> +static inline bool test_note_type(void *buf, u32 *align, u32 note_type)
> +{
> +	struct elf_note *n = buf;
> +
> +	return ((n->n_type == note_type) && (n->n_namesz == 4) &&
> +		(memcmp(n + 1, "GNU", 4) == 0));
> +}
> +
> +static inline void *next_note(void *buf, u32 *align, u32 note_type)
> +{
> +	struct elf_note *n = buf;
> +	u64 size;
> +
> +	if (check_add_overflow((u64)sizeof(*n), (u64)n->n_namesz, &size))
> +		return NULL;

sizeof(*n) is a small integer under our control, and n->n_namesz is a
u32.

So, I'm not sure how we would overflow 64 bits here, although if we can
get arbitrarily close to ~(u64)0 then:

> +
> +	size = round_up(size, *align);

this can overflow too.

> +
> +	if (check_add_overflow(size, (u64)n->n_descsz, &size))
> +		return NULL;
> +
> +	size = round_up(size, *align);

Similarly here.
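
To make the concern concrete, here is a minimal userspace sketch in which
the rounding steps are overflow-checked as well as the additions.  It
uses the GCC/Clang builtin __builtin_add_overflow() directly;
round_up_checked() and note_size() are hypothetical helpers written for
this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Round v up to a multiple of align, which must be a power of two.
 * Returns false instead of wrapping if the result does not fit in 64 bits.
 */
static bool round_up_checked(uint64_t v, uint64_t align, uint64_t *out)
{
	uint64_t tmp;

	assert(align != 0 && (align & (align - 1)) == 0);

	if (__builtin_add_overflow(v, align - 1, &tmp))
		return false;

	*out = tmp & ~(align - 1);
	return true;
}

/*
 * Total size of one note: header plus name, padded to align, plus the
 * descriptor, padded again.  Every step that could wrap is checked.
 */
static bool note_size(uint32_t hdr_size, uint32_t namesz, uint32_t descsz,
		      uint32_t align, uint64_t *out)
{
	/* Two 32-bit values cannot overflow a 64-bit sum. */
	uint64_t size = (uint64_t)hdr_size + namesz;

	if (!round_up_checked(size, align, &size))
		return false;

	if (__builtin_add_overflow(size, (uint64_t)descsz, &size))
		return false;

	return round_up_checked(size, align, out);
}
```

With u32 inputs none of these checks can actually fire, which suggests
the simpler fix is to keep the operands narrow rather than to check
after the fact.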

> +
> +	if (buf + size < buf)

Isn't this undefined behaviour if it overflows?  If so, the compiler can
probably delete the check entirely, making it useless.  Does UBSAN warn
about it?

> +		return NULL;
> +	else
> +		return (buf + size);

Nit: Unnecessary ()  (There are surplus () all over this patch; I won't
comment on them all.)

> +}
> +
> +static inline bool test_property(void *buf, u32 *max_type, u32 pr_type)
> +{
> +	struct gnu_property *pr = buf;
> +
> +	/*
> +	 * Property types must be in ascending order.
> +	 * Keep track of the max when testing each.
> +	 */
> +	if (pr->pr_type > *max_type)
> +		*max_type = pr->pr_type;

Is this worthwhile?  In general we don't try very hard to check that the
ELF file is well-formed.

Ideally we could search by binary chop, but the property size is
variable, so the sortedness is useless to us (yay).

> +
> +	return (pr->pr_type == pr_type);
> +}
> +
> +static inline void *next_property(void *buf, u32 *max_type, u32 pr_type)

Nit: does this need to be inline?  The compiler's guess is usually good
enough...

> +{
> +	struct gnu_property *pr = buf;
> +
> +	if ((buf + sizeof(*pr) +  pr->pr_datasz < buf) ||

Nit: random extra space, redundant (), etc.

> +	    (pr->pr_type > pr_type) ||
> +	    (pr->pr_type > *max_type))
> +		return NULL;
> +	else
> +		return (buf + sizeof(*pr) + pr->pr_datasz);

We can exceed the underlying buffer bounds here, which is technically
undefined behaviour.

I suspect we may be relying on similar tricks all over the kernel, but
it may be best avoided anyway.


If we always pass in the buffer base pointer and the size of the buffer, say

	static int next_property(void *buf, size_t *offset,
						size_t bufsz, ...)

then we may be able to use direct comparisons that can't overflow
rather than relying on potentially undefined behaviour.  For example:

	size_t o = *offset;

	if (o > bufsz || sizeof (*pr) > bufsz - o)
		return -1;

	pr = buf + o;
	if (pr->pr_type > pr_type || pr->pr_type > *max_type)
		return -1;

	if (pr->pr_datasz > bufsz - o - sizeof (*pr))
		return -1;

	*offset = o + sizeof (*pr) + pr->pr_datasz;
	return 0;

(There may be neater ways to do this.)
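
For reference, the fragment above can be filled out into a compilable
userspace sketch.  This is only an illustration under assumptions: the
name next_property_off() is made up, the pr_type ordering checks are
left out to keep the focus on bounds, memcpy() is used so the buffer
need not be aligned, and padding between properties is ignored just as
in the fragment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the struct gnu_property added by the patch. */
struct gnu_property {
	uint32_t pr_type;
	uint32_t pr_datasz;
};

/*
 * Advance *offset past the property located at buf + *offset.
 * Only offsets and sizes are compared, so no out-of-bounds pointer is
 * ever formed.  Returns 0 on success, or -1 (leaving *offset untouched)
 * if the header or its data would extend past bufsz.
 */
static int next_property_off(const uint8_t *buf, size_t *offset, size_t bufsz)
{
	struct gnu_property pr;
	size_t o = *offset;

	assert(buf != NULL);

	/* Is there room for the fixed-size header? */
	if (o > bufsz || sizeof(pr) > bufsz - o)
		return -1;

	memcpy(&pr, buf + o, sizeof(pr));

	/* Is there room for the variable-size data that follows it? */
	if (pr.pr_datasz > bufsz - o - sizeof(pr))
		return -1;

	*offset = o + sizeof(pr) + pr.pr_datasz;
	return 0;
}
```

Each subtraction is guarded by the comparison before it, so none of
them can wrap.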

> +}
> +
> +/*
> + * Scan 'buf' for a pattern; return true if found.
> + * *pos is the distance from the beginning of buf to where
> + * the searched item or the next item is located.
> + */
> +static int scan(u8 *buf, u32 buf_size, int item_size, test_item_fn test_item,
> +		next_item_fn next_item, u32 *arg, u32 type, u32 *pos)
> +{
> +	int found = 0;
> +	u8 *p, *max;
> +
> +	max = buf + buf_size;
> +	if (max < buf)

See comment about undefined behaviour above.

Also, I'm not sure this check adds anything.  We know buf_size is
<= BUF_SIZE (though we could stick a WARN_ON() here and bail out if we
want to make absolutely sure).

If buf is always the base pointer returned by kmalloc(BUF_SIZE), then
I think buf_size can never go outside its bounds?

> +		return 0;
> +
> +	p = buf;
> +
> +	while ((p + item_size < max) && (p + item_size > buf)) {

                           ^ <= ?                   ^ undefined behaviour?

> +		if (test_item(p, arg, type)) {
> +			found = 1;
> +			break;
> +		}
> +
> +		p = next_item(p, arg, type);
> +	}
> +
> +	*pos = (p + item_size <= buf) ? 0 : (u32)(p - buf);

Can this be written more simply, say:

	if (p + item_size > buf)
		*pos += p - buf;

Also, since next_property() adds pr_datasz onto buf, could we get
unlucky and wrap past (void *)~0UL?  Then (u32)(p - buf) may be giant.
Not sure whether this breaks code elsewhere.

> +	return found;
> +}
> +
> +/*
> + * Search an NT_GNU_PROPERTY_TYPE_0 for the property of 'pr_type'.
> + */
> +static int find_property(struct file *file, unsigned long desc_size,
> +			 loff_t file_offset, u8 *buf,
> +			 u32 pr_type, u32 *property)
> +{
> +	u32 buf_pos;
> +	unsigned long read_size;
> +	unsigned long done;
> +	int found = 0;
> +	int ret = 0;
> +	u32 last_pr = 0;
> +
> +	*property = 0;
> +	buf_pos = 0;
> +
> +	for (done = 0; done < desc_size; done += buf_pos) {
> +		read_size = desc_size - done;
> +		if (read_size > BUF_SIZE)
> +			read_size = BUF_SIZE;
> +
> +		ret = kernel_read(file, buf, read_size, &file_offset);
> +
> +		if (ret != read_size)
> +			return (ret < 0) ? ret : -EIO;
> +
> +		ret = 0;
> +		found = scan(buf, read_size, sizeof(struct gnu_property),
> +			     test_property, next_property,
> +			     &last_pr, pr_type, &buf_pos);
> +
> +		if ((!buf_pos) || found)
> +			break;
> +
> +		file_offset += buf_pos - read_size;
> +	}
> +
> +	if (found) {
> +		struct gnu_property *pr =
> +			(struct gnu_property *)(buf + buf_pos);
> +
> +		if (pr->pr_datasz == 4) {
> +			u32 *max =  (u32 *)(buf + read_size);
> +			u32 *data = (u32 *)((u8 *)pr + sizeof(*pr));
> +
> +			if (data + 1 <= max) {
> +				*property = *data;
> +			} else {
> +				file_offset += buf_pos - read_size;
> +				file_offset += sizeof(*pr);
> +				ret = kernel_read(file, property, 4,
> +						  &file_offset);
> +			}
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Search a PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
> + */
> +static int find_note_type_0(struct file *file, loff_t file_offset,
> +			    unsigned long note_size, u32 align,
> +			    u32 pr_type, u32 *property)
> +{
> +	u8 *buf;
> +	u32 buf_pos;
> +	unsigned long read_size;
> +	unsigned long done;
> +	int found = 0;
> +	int ret = 0;
> +
> +	buf = kmalloc(BUF_SIZE, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;

Do we really need to alloc/free this once per note?

> +
> +	*property = 0;
> +	buf_pos = 0;
> +
> +	for (done = 0; done < note_size; done += buf_pos) {
> +		read_size = note_size - done;
> +		if (read_size > BUF_SIZE)
> +			read_size = BUF_SIZE;
> +
> +		ret = kernel_read(file, buf, read_size, &file_offset);
> +
> +		if (ret != read_size) {
> +			ret = (ret < 0) ? ret : -EIO;
> +			kfree(buf);
> +			return ret;
> +		}
> +
> +		/*
> +		 * item_size = sizeof(struct elf_note) + elf_note.n_namesz.
> +		 * n_namesz is 4 for the note type we look for.
> +		 */
> +		ret = scan(buf, read_size, sizeof(struct elf_note) + 4,
> +			      test_note_type, next_note,
> +			      &align, NT_GNU_PROPERTY_TYPE_0, &buf_pos);
> +
> +		file_offset += buf_pos - read_size;
> +
> +		if (ret && !found) {
> +			struct elf_note *n =
> +				(struct elf_note *)(buf + buf_pos);
> +			u64 start = round_up(sizeof(*n) + n->n_namesz, align);
> +			u64 total = 0;
> +
> +			if (check_add_overflow(start, (u64)n->n_descsz, &total)) {
> +				ret = -EINVAL;
> +				break;
> +			}
> +			total = round_up(total, align);
> +
> +			ret = find_property(file, n->n_descsz,
> +					    file_offset + start,
> +					    buf, pr_type, property);
> +			found++;
> +			file_offset += total;
> +			buf_pos += total;
> +		} else if (!buf_pos || ret) {
> +			ret = 0;
> +			*property = 0;
> +			break;
> +		}
> +	}

Do we really need this complexity?  How big are the notes realistically
going to be?

Since a file with bloated notes is going to be inefficient to exec
anyway if we have to scan all the way through them, would it be better
just to choke on it and force the toolchain to do something more
sensible?

This is one reason why it would be good for the kernel to require
PT_GNU_PROPERTY if possible, so we know the precise offset and size
without having to search...
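
A minimal userspace sketch of what that lookup could look like, assuming
a toolchain that emits a PT_GNU_PROPERTY program header.  It uses the
Elf64_Phdr type from the C library's <elf.h>, and
find_gnu_property_phdr() is a hypothetical helper, not something in the
patch.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Older <elf.h> versions may not define this segment type yet. */
#ifndef PT_GNU_PROPERTY
#define PT_GNU_PROPERTY 0x6474e553
#endif

/*
 * Return the first PT_GNU_PROPERTY program header, or NULL if the file
 * has none.  Its p_offset and p_filesz give the location and size of
 * the property note directly, so no PT_NOTE contents need scanning.
 */
static const Elf64_Phdr *find_gnu_property_phdr(const Elf64_Phdr *phdr,
						size_t phnum)
{
	size_t i;

	assert(phdr != NULL || phnum == 0);

	for (i = 0; i < phnum; i++) {
		if (phdr[i].p_type == PT_GNU_PROPERTY)
			return &phdr[i];
	}

	return NULL;
}
```

The cost is a single pass over the program headers, which the ELF
loader already walks anyway.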

> +
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/*
> + * Look at an ELF file's PT_NOTE segments, then NT_GNU_PROPERTY_TYPE_0, then
> + * the property of pr_type.
> + *
> + * Input:
> + *	file: the file to search;
> + *	phdr: the file's elf header;
> + *	phnum: number of entries in phdr;
> + *	pr_type: the property type.
> + *
> + * Output:
> + *	The property found.
> + *
> + * Return:
> + *	Zero or error.
> + */
> +static int scan_segments_64(struct file *file, struct elf64_phdr *phdr,
> +			    int phnum, u32 pr_type, u32 *property)
> +{
> +	int i;
> +	int err = 0;
> +
> +	for (i = 0; i < phnum; i++, phdr++) {
> +		if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 8))
> +			continue;
> +
> +		/*
> +		 * Search the PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
> +		 */
> +		err = find_note_type_0(file, phdr->p_offset, phdr->p_filesz,
> +				       phdr->p_align, pr_type, property);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static int scan_segments_32(struct file *file, struct elf32_phdr *phdr,
> +			    int phnum, u32 pr_type, u32 *property)
> +{
> +	int i;
> +	int err = 0;
> +
> +	for (i = 0; i < phnum; i++, phdr++) {
> +		if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 4))
> +			continue;
> +
> +		/*
> +		 * Search the PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
> +		 */
> +		err = find_note_type_0(file, phdr->p_offset, phdr->p_filesz,
> +				       phdr->p_align, pr_type, property);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
> +int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> +		     u32 pr_type, u32 *property)
> +{
> +	struct elf64_hdr *ehdr64 = ehdr_p;
> +	int err = 0;
> +
> +	*property = 0;
> +
> +	if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
> +		struct elf64_phdr *phdr64 = phdr_p;
> +
> +		err = scan_segments_64(f, phdr64, ehdr64->e_phnum,
> +				       pr_type, property);
> +		if (err < 0)
> +			goto out;
> +	} else {
> +		struct elf32_hdr *ehdr32 = ehdr_p;
> +
> +		if (ehdr32->e_ident[EI_CLASS] == ELFCLASS32) {
> +			struct elf32_phdr *phdr32 = phdr_p;
> +
> +			err = scan_segments_32(f, phdr32, ehdr32->e_phnum,
> +					       pr_type, property);
> +			if (err < 0)
> +				goto out;
> +		}
> +	}
> +
> +out:
> +	return err;
> +}
> diff --git a/include/linux/elf.h b/include/linux/elf.h
> index e3649b3e970e..c15febebe7f2 100644
> --- a/include/linux/elf.h
> +++ b/include/linux/elf.h
> @@ -56,4 +56,16 @@ static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) {
>  extern int elf_coredump_extra_notes_size(void);
>  extern int elf_coredump_extra_notes_write(struct coredump_params *cprm);
>  #endif
> +
> +#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
> +extern int arch_setup_property(void *ehdr, void *phdr, struct file *f,
> +			       bool interp);
> +extern int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> +			    u32 pr_type, u32 *feature);
> +#else
> +static inline int arch_setup_property(void *ehdr, void *phdr, struct file *f,
> +				      bool interp) { return 0; }
> +static inline int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> +				   u32 pr_type, u32 *feature) { return 0; }
> +#endif
>  #endif /* _LINUX_ELF_H */
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index 34c02e4290fe..316177ce9e76 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -372,6 +372,7 @@ typedef struct elf64_shdr {
>  #define NT_PRFPREG	2
>  #define NT_PRPSINFO	3
>  #define NT_TASKSTRUCT	4
> +#define NT_GNU_PROPERTY_TYPE_0 5

Should this be in a separate block?  This requires n_name = "GNU",
whereas the rest are "LINUX" notes AFAIK: it's really a separate
namespace.

I think the gap between 4 and 6 may be just coincidence: glibc's elf.h
already has NT_PLATFORM here (whatever that is).
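
To make the namespace point concrete, here is a minimal sketch of how a note reader might tell a GNU property note apart from another note that happens to use type 5.  The ex_/EX_ names are hypothetical and only the three-word header layout is taken from the ELF note format:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative constant; the EX_ prefix marks it as not from any header. */
#define EX_NT_GNU_PROPERTY_TYPE_0 5u

/* Same layout as Elf64_Nhdr / Elf32_Nhdr: three 32-bit words. */
struct ex_nhdr {
	uint32_t n_namesz;	/* length of the owner name, including the NUL */
	uint32_t n_descsz;	/* length of the descriptor */
	uint32_t n_type;	/* type, meaningful only within the owner name */
};

/*
 * Return 1 if the note is a GNU property note.  The type value 5 only
 * counts when the owner name is "GNU"; the same number under another
 * owner name means something else.
 */
int ex_is_gnu_property_note(const struct ex_nhdr *nhdr, const char *name)
{
	if (nhdr->n_namesz != sizeof("GNU"))
		return 0;
	if (memcmp(name, "GNU", sizeof("GNU")) != 0)
		return 0;
	return nhdr->n_type == EX_NT_GNU_PROPERTY_TYPE_0;
}
```

Since the type is only looked at after the owner name matches, nothing forces the GNU value to share a block with the core-dump note types.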

>  #define NT_AUXV		6
>  /*
>   * Note to userspace developers: size of NT_SIGINFO note may increase
> @@ -443,4 +444,17 @@ typedef struct elf64_note {
>    Elf64_Word n_type;	/* Content type */
>  } Elf64_Nhdr;
>  
> +/* NT_GNU_PROPERTY_TYPE_0 header */
> +struct gnu_property {
> +  __u32 pr_type;
> +  __u32 pr_datasz;
> +};
> +
> +/* .note.gnu.property types */
> +#define GNU_PROPERTY_X86_FEATURE_1_AND		(0xc0000002)
> +
> +/* Bits of GNU_PROPERTY_X86_FEATURE_1_AND */
> +#define GNU_PROPERTY_X86_FEATURE_1_IBT		(0x00000001)
> +#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	(0x00000002)
> +

Redundant ().  The rest of the file doesn't have them; can we conform to
the prevailing style there?
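
For readers following along, a rough sketch of how a consumer could walk the property array inside the note descriptor using the header above.  This is not the patch's scan_segments code; the ex_/EX_ names are made up, and the padding rule (8 bytes for ELFCLASS64, 4 for ELFCLASS32) is my understanding of the format, passed in as a parameter:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_GNU_PROPERTY_X86_FEATURE_1_AND	0xc0000002u
#define EX_FEATURE_1_IBT			0x00000001u
#define EX_FEATURE_1_SHSTK			0x00000002u

/* Mirrors struct gnu_property from the patch. */
struct ex_gnu_property {
	uint32_t pr_type;
	uint32_t pr_datasz;
};

/*
 * Scan a note descriptor for the x86 feature word.  align must be a
 * power of two.  Returns 0 and fills *features when found, -1 when the
 * property is absent or the descriptor is truncated.
 */
int ex_find_x86_feature_1(const unsigned char *desc, size_t descsz,
			  size_t align, uint32_t *features)
{
	size_t off = 0;

	*features = 0;
	while (descsz - off >= sizeof(struct ex_gnu_property)) {
		struct ex_gnu_property pr;
		size_t step;

		memcpy(&pr, desc + off, sizeof(pr));
		off += sizeof(pr);
		if (pr.pr_datasz > descsz - off)
			return -1;	/* data runs past the descriptor */

		if (pr.pr_type == EX_GNU_PROPERTY_X86_FEATURE_1_AND &&
		    pr.pr_datasz == sizeof(*features)) {
			memcpy(features, desc + off, sizeof(*features));
			return 0;
		}

		/* Skip the data, rounded up to the property alignment. */
		step = ((size_t)pr.pr_datasz + align - 1) & ~(align - 1);
		if (step > descsz - off)
			return -1;
		off += step;
	}
	return -1;
}
```

A caller would then test the returned word against the IBT and SHSTK bits.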

>  #endif /* _UAPI_LINUX_ELF_H */
> -- 
> 2.17.1
> 

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Linus Torvalds @ 2019-06-07 18:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <87tvd2j9ye.fsf@oldenburg2.str.redhat.com>

On Thu, Jun 6, 2019 at 9:28 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> This regression fix still hasn't been merged into Linus' tree.  What is
> going on here?

.. it was never sent to me.

That said, now that I see the patch, I wonder why we'd have that
#ifdef __KERNEL__ in here:

 typedef struct {
+#ifdef __KERNEL__
        int     val[2];
+#else
+       int     __kernel_val[2];
+#endif
 } __kernel_fsid_t;

and not just unconditionally do

    int   __fsid_val[2]

If we're changing kernel header files, it's easy enough to change the
kernel users. I'd be more worried about user space that *uses* that
thing, and currently accesses 'val[]' by name.

So the patch looks a bit odd to me. How are people supposed to use
fsid_t if they can't look at it?

The man-page makes it pretty clear that fsid_t is complete garbage,
but it's *documented* garbage:

   The f_fsid field
       Solaris, Irix and POSIX have a system call statvfs(2) that
       returns a struct statvfs (defined in <sys/statvfs.h>) containing
       an unsigned long f_fsid.  Linux, SunOS, HP-UX, 4.4BSD have a
       system call statfs() that returns a struct statfs (defined in
       <sys/vfs.h>) containing a fsid_t f_fsid, where fsid_t is defined
       as struct { int val[2]; }.  The same holds for FreeBSD, except
       that it uses the include file <sys/mount.h>.

so that "val[]" name does seem to be pretty much required.

In other words, I don't think the patch is acceptable. User space sees
"val[]" and _needs_ to see it. Otherwise the type is entirely
pointless.
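
To illustrate the tension, here is a small sketch with a made-up type (ex_fsid_t is not a real header type).  An application may define a macro named "val", which is why a plain "val" member in a widely included header is a namespace problem; a reserved spelling avoids that, but code that writes ".val[0]" stops building:

```c
/*
 * An application may legally define a macro called "val"; the name is
 * not reserved for the implementation.
 */
#define val 42

/*
 * Hypothetical header type.  A member spelled "val" here would be
 * rewritten by the macro above into "int 42[2];" and fail to compile.
 * A reserved-namespace spelling is immune to that, but any existing
 * caller that accesses ".val[0]" would no longer build.
 */
typedef struct {
	int __ex_fsid_val[2];
} ex_fsid_t;

/* Fold the two words into one value, as a caller comparing IDs might. */
unsigned long long ex_fsid_fold(const ex_fsid_t *fsid)
{
	return ((unsigned long long)(unsigned int)fsid->__ex_fsid_val[0] << 32) |
	       (unsigned int)fsid->__ex_fsid_val[1];
}
```

So the rename fixes the header for one kind of user and breaks it for the other, which is the objection here.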

The proper fix is presumably to make sure the fsid_t type definitions
aren't visible to user space at all in this context, and are only
visible in <sys/statvfs.h>.

So now that I _do_ see the patch, there's no way I'll apply it.

               Linus

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-07 18:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <b3de4110-5366-fdc7-a960-71dea543a42f@intel.com>

> On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 6/7/19 10:43 AM, Peter Zijlstra wrote:
>> I've no idea what the kernel should do; since you failed to answer the
>> question what happens when you point this to garbage.
>> 
>> Does it then fault or what?
> 
> Yeah, I think you'll fault with a rather mysterious CR2 value since
> you'll go look at the instruction that faulted and not see any
> references to the CR2 value.
> 
> I think this new MSR probably needs to get included in oops output when
> CET is enabled.

This shouldn’t be able to OOPS because it only happens at CPL 3, right?  We should put it into core dumps, though.

> 
> Why don't we require that a VMA be in place for the entire bitmap?
> Don't we need a "get" prctl function too in case something like a JIT is
> running and needs to find the location of this bitmap to set bits itself?
> 
> Or, do we just go whole-hog and have the kernel manage the bitmap
> itself. Our interface here could be:
> 
>    prctl(PR_MARK_CODE_AS_LEGACY, start, size);
> 
> and then have the kernel allocate and set the bitmap for those code
> locations.

Given that the format depends on the VA size, this might be a good idea.  I bet we can reuse the special mapping infrastructure for this — the VMA could
be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and we can even make special rules to core dump it intelligently if needed.  And we can make mremap() on it work correctly if anyone (CRIU?) cares.

Hmm.  Can we be creative and skip populating it with zeros?  The CPU should only ever touch a page if we miss an ENDBR on it, so, in normal operation, we don’t need anything to be there.  We could try to prevent anyone from *reading* it outside of ENDBR tracking if we want to avoid people accidentally wasting lots of memory by forcing it to be fully populated when they read it.

The one downside is this forces it to be per-mm, but that seems like a generally reasonable model anyway.

This also gives us an excellent opportunity to make it read-only as seen from userspace to prevent exploits from just poking it full of ones before redirecting execution.
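
For concreteness, a rough sketch of the bit-setting a kernel-managed prctl(PR_MARK_CODE_AS_LEGACY, start, size) would have to do internally.  This is only an illustration: the function name is made up, 4 KiB pages are assumed, and so is the bit order within a byte (lowest bit = lowest page):

```c
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/*
 * Set one bit per page for every page overlapping [start, start + size).
 * Returns 0 on success, -1 if the range is empty, wraps around, or
 * falls outside the bitmap.
 */
int ex_mark_code_as_legacy(uint8_t *bitmap, size_t bitmap_bytes,
			   uint64_t start, uint64_t size)
{
	uint64_t first, last, page;

	if (size == 0 || start + size < start)
		return -1;

	first = start >> EX_PAGE_SHIFT;
	last = (start + size - 1) >> EX_PAGE_SHIFT;
	if (last / 8 >= bitmap_bytes)
		return -1;

	for (page = first; page <= last; page++)
		bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
	return 0;
}
```

With the kernel owning this step, userspace never needs to know the bitmap layout or be able to write to it.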

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Florian Weimer @ 2019-06-07 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <CAHk-=wio1e4=WUUwmo-Ph55BEgH_X3oXzBpvPyLQg2TxzfGYuw@mail.gmail.com>

* Linus Torvalds:

> If we're changing kernel header files, it's easy enough to change the
> kernel users. I'd be more worried about user space that *uses* that
> thing, and currently accesses 'val[]' by name.
>
> So the patch looks a bit odd to me. How are people supposed to use
> fsid_t if they can't look at it?

The problem is that the header was previously not used pervasively in
userspace headers.  See commit a623a7a1a5670c25a16881f5078072d272d96b71
("y2038: fix socket.h header inclusion").  Very little code needed it
before.

On the glibc side, we nowadays deal with this by splitting headers
further.  (We used to suppress definitions with macros, but that tended
to become convoluted.)  In this case, moving the definition of
__kernel_long_t to its own header, so that
include/uapi/asm-generic/socket.h can include that should fix it.

> So now that I _do_ see the patch, there's no way I'll apply it.

Fair enough.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Linus Torvalds @ 2019-06-07 18:56 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <871s05fd8o.fsf@oldenburg2.str.redhat.com>

On Fri, Jun 7, 2019 at 11:43 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> On the glibc side, we nowadays deal with this by splitting headers
> further.  (We used to suppress definitions with macros, but that tended
> to become convoluted.)  In this case, moving the definition of
> __kernel_long_t to its own header, so that
> include/uapi/asm-generic/socket.h can include that should fix it.

I think we should strive to do that on the kernel side too, since
clearly we shouldn't expose that "val[]" thing in the core posix types
due to namespace rules, but at the same time I think the patch to
rename val[] is fundamentally broken too.

Can you describe how you split things (perhaps even with a patch ;)?
Is this literally the only issue you currently have? Because I'd
expect similar issues to show up elsewhere too, but who knows.. You
presumably do.

                Linus

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-07 18:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <34E0D316-552A-401C-ABAA-5584B5BC98C5@amacapital.net>

On 6/7/19 11:29 AM, Andy Lutomirski wrote:
...
>> I think this new MSR probably needs to get included in oops output when
>> CET is enabled.
> 
> This shouldn’t be able to OOPS because it only happens at CPL 3,
> right?  We should put it into core dumps, though.

Good point.

Yu-cheng, can you just confirm that the bitmap can't be referenced in
ring-0, no matter what?  We should also make sure that no funny business
happens if we put an address in the bitmap that faults, or is
non-canonical.  Do we have any self-tests for that?

Let's say userspace gets a fault on this.  Do they have the
introspection capability to figure out why they faulted, say in their
signal handler?

>> Why don't we require that a VMA be in place for the entire bitmap?
>> Don't we need a "get" prctl function too in case something like a JIT is
>> running and needs to find the location of this bitmap to set bits itself?
>>
>> Or, do we just go whole-hog and have the kernel manage the bitmap
>> itself. Our interface here could be:
>>
>>    prctl(PR_MARK_CODE_AS_LEGACY, start, size);
>>
>> and then have the kernel allocate and set the bitmap for those code
>> locations.
> 
> Given that the format depends on the VA size, this might be a good
> idea.

Yeah, making userspace know how large the address space is or could be
is rather nasty, especially if we ever get any fancy CPU features that
eat up address bits (a la ARM top-byte-ignore or SPARC ADI).

> Hmm.  Can we be creative and skip populating it with zeros?  The CPU
> should only ever touch a page if we miss an ENDBR on it, so, in normal
> operation, we don’t need anything to be there.  We could try to prevent
> anyone from *reading* it outside of ENDBR tracking if we want to avoid
> people accidentally wasting lots of memory by forcing it to be fully
> populated when they read it.

Won't reads on a big, contiguous private mapping get the huge zero page
anyway?

> The one downside is this forces it to be per-mm, but that seems like
> a generally reasonable model anyway.

Yeah, practically, you could only make it shared if you shared the
layout of all code in the address space.  I'm sure the big database(s)
do that cross-process, but I bet nobody else does.  User ASLR
practically guarantees that nobody can do this.

> This also gives us an excellent opportunity to make it read-only as
> seen from userspace to prevent exploits from just poking it full of
> ones before redirecting execution.

That would be fun.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-07 19:03 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <20190606200926.4029-4-yu-cheng.yu@intel.com>

On 6/6/19 1:09 PM, Yu-cheng Yu wrote:
> +	modify_fpu_regs_begin();
> +	rdmsrl(MSR_IA32_U_CET, r);
> +	r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> +	wrmsrl(MSR_IA32_U_CET, r);
> +	modify_fpu_regs_end();

Isn't there a bunch of other stuff in this MSR?  It seems like the
bitmap value would allow overwriting lots of bits in the MSR that have
nothing to do with the bitmap... in a prctl() that's supposed to only be
dealing with the bitmap.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-07 19:23 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <c5c21778-f10f-cef8-c937-1e8ad1e2a7cf@intel.com>

On Fri, 2019-06-07 at 12:03 -0700, Dave Hansen wrote:
> On 6/6/19 1:09 PM, Yu-cheng Yu wrote:
> > +	modify_fpu_regs_begin();
> > +	rdmsrl(MSR_IA32_U_CET, r);
> > +	r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> > +	wrmsrl(MSR_IA32_U_CET, r);
> > +	modify_fpu_regs_end();
> 
> Isn't there a bunch of other stuff in this MSR?  It seems like the
> bitmap value would allow overwriting lots of bits in the MSR that have
> nothing to do with the bitmap... in a prctl() that's supposed to only be
> dealing with the bitmap.

Yes, the bitmap address should have been masked, although it is checked for page
alignment (which has the same effect).  I will fix it.

Yu-cheng
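
A minimal sketch of the masked update under discussion, written as a pure function so it can be checked outside the kernel.  The EX_ constants are assumptions (enable flag in bit 3, bitmap base in bits 63:12), not values copied from the patch:

```c
#include <stdint.h>

/* Assumed layout: enable flag in bit 3, bitmap base in bits 63:12. */
#define EX_CET_LEG_IW_EN	0x0000000000000008ULL
#define EX_CET_BITMAP_MASK	0xfffffffffffff000ULL

/*
 * Compute the new MSR value from the old one.  Bits outside the bitmap
 * field are preserved and the enable flag is set.  Any previous base
 * address is cleared first, so a second call cannot OR two addresses
 * together.
 */
uint64_t ex_cet_set_legacy_bitmap(uint64_t old_msr, uint64_t bitmap)
{
	uint64_t r = old_msr & ~EX_CET_BITMAP_MASK;

	r |= bitmap & EX_CET_BITMAP_MASK;
	r |= EX_CET_LEG_IW_EN;
	return r;
}
```

Note that a plain "r |= bitmap" also keeps whatever address was already in the register, which the alignment check alone does not cover.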

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-07 19:49 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <34E0D316-552A-401C-ABAA-5584B5BC98C5@amacapital.net>

On Fri, 2019-06-07 at 11:29 -0700, Andy Lutomirski wrote:
> > On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > > On 6/7/19 10:43 AM, Peter Zijlstra wrote:
> > > I've no idea what the kernel should do; since you failed to answer the
> > > question what happens when you point this to garbage.
> > > 
> > > Does it then fault or what?
> > 
> > Yeah, I think you'll fault with a rather mysterious CR2 value since
> > you'll go look at the instruction that faulted and not see any
> > references to the CR2 value.
> > 
> > I think this new MSR probably needs to get included in oops output when
> > CET is enabled.
> 
> This shouldn’t be able to OOPS because it only happens at CPL 3, right?  We
> should put it into core dumps, though.
> 
> > 
> > Why don't we require that a VMA be in place for the entire bitmap?
> > Don't we need a "get" prctl function too in case something like a JIT is
> > running and needs to find the location of this bitmap to set bits itself?
> > 
> > Or, do we just go whole-hog and have the kernel manage the bitmap
> > itself. Our interface here could be:
> > 
> >    prctl(PR_MARK_CODE_AS_LEGACY, start, size);
> > 
> > and then have the kernel allocate and set the bitmap for those code
> > locations.
> 
> Given that the format depends on the VA size, this might be a good idea.  I
> bet we can reuse the special mapping infrastructure for this — the VMA could
> be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and we
> can even make special rules to core dump it intelligently if needed.  And we
> can make mremap() on it work correctly if anyone (CRIU?) cares.
> 
> Hmm.  Can we be creative and skip populating it with zeros?  The CPU should
> only ever touch a page if we miss an ENDBR on it, so, in normal operation, we
> don’t need anything to be there.  We could try to prevent anyone from
> *reading* it outside of ENDBR tracking if we want to avoid people accidentally
> wasting lots of memory by forcing it to be fully populated when they read it.
> 
> The one downside is this forces it to be per-mm, but that seems like a
> generally reasonable model anyway.
> 
> This also gives us an excellent opportunity to make it read-only as seen from
> userspace to prevent exploits from just poking it full of ones before
> redirecting execution.

GLIBC sets bits only for legacy code, and then makes the bitmap read-only.  That
avoids most issues:

  To populate bitmap pages, mprotect() is required.
  Reading zero bitmap pages would not waste more physical memory, right?

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-07 19:56 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <352e6172-938d-f8e4-c195-9fd1b881bdee@intel.com>

On Fri, 2019-06-07 at 11:58 -0700, Dave Hansen wrote:
> On 6/7/19 11:29 AM, Andy Lutomirski wrote:
> ...
> > > I think this new MSR probably needs to get included in oops output when
> > > CET is enabled.
> > 
> > This shouldn’t be able to OOPS because it only happens at CPL 3,
> > right?  We should put it into core dumps, though.
> 
> Good point.
> 
> Yu-cheng, can you just confirm that the bitmap can't be referenced in
> ring-0, no matter what?  We should also make sure that no funny business
> happens if we put an address in the bitmap that faults, or is
> non-canonical.  Do we have any self-tests for that?

Yes, the bitmap is user memory, but the kernel can still get to it (e.g.
copy_from_user()).  We can do more checks on the address.
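
As a sketch of what such checks might look like (hypothetical helper, not from the patch; user_limit stands in for TASK_SIZE_MAX and 4 KiB pages are assumed):

```c
#include <stdint.h>

#define EX_PAGE_SIZE 4096ULL

/*
 * Return 0 if [addr, addr + size) is a non-empty, page-aligned range
 * that lies entirely below user_limit, else -1.  This says nothing
 * about whether anything is mapped there.
 */
int ex_check_bitmap_range(uint64_t addr, uint64_t size, uint64_t user_limit)
{
	if (addr & (EX_PAGE_SIZE - 1))
		return -1;		/* not page aligned */
	if (size == 0 || size > user_limit)
		return -1;		/* empty or larger than user space */
	if (addr > user_limit - size)
		return -1;		/* end would pass the limit or wrap */
	return 0;
}
```

A range that passes is below the user limit and therefore canonical, but it can still be unmapped later, so this does not answer the munmap() question.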

> 
> Let's say userspace gets a fault on this.  Do they have the
> introspection capability to figure out why they faulted, say in their
> signal handler?

The bitmap address is kept by the application; the kernel won't provide it again
to user-space.  In the signal handler, the app can find out from its own record.

[...]

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-07 20:00 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <7e0b97bf1fbe6ff20653a8e4e147c6285cc5552d.camel@intel.com>

On 6/7/19 12:49 PM, Yu-cheng Yu wrote:
>>
>> This also gives us an excellent opportunity to make it read-only as seen from
>> userspace to prevent exploits from just poking it full of ones before
>> redirecting execution.
> GLIBC sets bits only for legacy code, and then makes the bitmap read-only.  That
> avoids most issues:
> 
>   To populate bitmap pages, mprotect() is required.
>   Reading zero bitmap pages would not waste more physical memory, right?

Huh, how does glibc know about all possible past and future legacy code
in the application?

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-07 20:06 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <4b448cde-ee4e-1c95-0f7f-4fe694be7db6@intel.com>

On Fri, 2019-06-07 at 13:00 -0700, Dave Hansen wrote:
> On 6/7/19 12:49 PM, Yu-cheng Yu wrote:
> > > 
> > > This also gives us an excellent opportunity to make it read-only as seen
> > > from
> > > userspace to prevent exploits from just poking it full of ones before
> > > redirecting execution.
> > 
> > GLIBC sets bits only for legacy code, and then makes the bitmap read-
> > only.  That
> > avoids most issues:
> > 
> >   To populate bitmap pages, mprotect() is required.
> >   Reading zero bitmap pages would not waste more physical memory, right?
> 
> Huh, how does glibc know about all possible past and future legacy code
> in the application?

When dlopen() gets a legacy binary and the policy allows that, it will manage
the bitmap:

  If a bitmap has not been created, create one.
  Set bits for the legacy code being loaded.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-07 20:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <352e6172-938d-f8e4-c195-9fd1b881bdee@intel.com>



> On Jun 7, 2019, at 11:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/7/19 11:29 AM, Andy Lutomirski wrote:
> ...
>>> I think this new MSR probably needs to get included in oops output when
>>> CET is enabled.
>> 
>> This shouldn’t be able to OOPS because it only happens at CPL 3,
>> right?  We should put it into core dumps, though.
> 
> Good point.
> 
> Yu-cheng, can you just confirm that the bitmap can't be referenced in
> ring-0, no matter what?  We should also make sure that no funny business
> happens if we put an address in the bitmap that faults, or is
> non-canonical.  Do we have any self-tests for that?
> 
> Let's say userspace gets a fault on this.  Do they have the
> introspection capability to figure out why they faulted, say in their
> signal handler?

We need to stick the tracker state in the sigcontext somewhere.

Did we end up defining a signal frame shadow stack token?

> 
>>> Why don't we require that a VMA be in place for the entire bitmap?
>>> Don't we need a "get" prctl function too in case something like a JIT is
>>> running and needs to find the location of this bitmap to set bits itself?
>>> 
>>> Or, do we just go whole-hog and have the kernel manage the bitmap
>>> itself. Our interface here could be:
>>> 
>>>   prctl(PR_MARK_CODE_AS_LEGACY, start, size);
>>> 
>>> and then have the kernel allocate and set the bitmap for those code
>>> locations.
>> 
>> Given that the format depends on the VA size, this might be a good
>> idea.
> 
> Yeah, making userspace know how large the address space is or could be
> is rather nasty, especially if we ever get any fancy CPU features that
> eat up address bits (a la ARM top-byte-ignore or SPARC ADI).

That gets extra bad if we ever grow user code that uses it but is unaware. It could poke the wrong part of the bitmap.

> 
>> Hmm.  Can we be creative and skip populating it with zeros?  The CPU
>> should only ever touch a page if we miss an ENDBR on it, so, in normal
>> operation, we don’t need anything to be there.  We could try to prevent
>> anyone from *reading* it outside of ENDBR tracking if we want to avoid
>> people accidentally wasting lots of memory by forcing it to be fully
>> populated when they read it.
> 
> Won't reads on a big, contiguous private mapping get the huge zero page
> anyway?

The zero pages may be free, but the page tables could be decently large.  Does the core mm code use huge, immense, etc huge zero pages?  Or can it synthesize them by reusing page table pages that map zeros?

> 
>> The one downside is this forces it to be per-mm, but that seems like
>> a generally reasonable model anyway.
> 
> Yeah, practically, you could only make it shared if you shared the
> layout of all code in the address space.  I'm sure the big database(s)
> do that cross-process, but I bet nobody else does.  User ASLR
> practically guarantees that nobody can do this.

I meant per-mm instead of per-task.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-07 20:43 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Dave Hansen, Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <7e0b97bf1fbe6ff20653a8e4e147c6285cc5552d.camel@intel.com>



> On Jun 7, 2019, at 12:49 PM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> 
> On Fri, 2019-06-07 at 11:29 -0700, Andy Lutomirski wrote:
>>> On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>>> On 6/7/19 10:43 AM, Peter Zijlstra wrote:
>>>> I've no idea what the kernel should do; since you failed to answer the
>>>> question what happens when you point this to garbage.
>>>> 
>>>> Does it then fault or what?
>>> 
>>> Yeah, I think you'll fault with a rather mysterious CR2 value since
>>> you'll go look at the instruction that faulted and not see any
>>> references to the CR2 value.
>>> 
>>> I think this new MSR probably needs to get included in oops output when
>>> CET is enabled.
>> 
>> This shouldn’t be able to OOPS because it only happens at CPL 3, right?  We
>> should put it into core dumps, though.
>> 
>>> 
>>> Why don't we require that a VMA be in place for the entire bitmap?
>>> Don't we need a "get" prctl function too in case something like a JIT is
>>> running and needs to find the location of this bitmap to set bits itself?
>>> 
>>> Or, do we just go whole-hog and have the kernel manage the bitmap
>>> itself. Our interface here could be:
>>> 
>>>   prctl(PR_MARK_CODE_AS_LEGACY, start, size);
>>> 
>>> and then have the kernel allocate and set the bitmap for those code
>>> locations.
>> 
>> Given that the format depends on the VA size, this might be a good idea.  I
>> bet we can reuse the special mapping infrastructure for this — the VMA could
>> be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and we
>> can even make special rules to core dump it intelligently if needed.  And we
>> can make mremap() on it work correctly if anyone (CRIU?) cares.
>> 
>> Hmm.  Can we be creative and skip populating it with zeros?  The CPU should
>> only ever touch a page if we miss an ENDBR on it, so, in normal operation, we
>> don’t need anything to be there.  We could try to prevent anyone from
>> *reading* it outside of ENDBR tracking if we want to avoid people accidentally
>> wasting lots of memory by forcing it to be fully populated when they read it.
>> 
>> The one downside is this forces it to be per-mm, but that seems like a
>> generally reasonable model anyway.
>> 
>> This also gives us an excellent opportunity to make it read-only as seen from
>> userspace to prevent exploits from just poking it full of ones before
>> redirecting execution.
> 
> GLIBC sets bits only for legacy code, and then makes the bitmap read-only.  That
> avoids most issues:

How does glibc know the linear address space size?  We don’t want LA64 to break old binaries because the address calculation changed.
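
To put numbers on the size dependence, a back-of-the-envelope helper (hypothetical name, assuming one bit per 4 KiB page of user address space):

```c
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/*
 * Bytes of bitmap needed to hold one bit per page of a 2^va_bits
 * address space.  Returns 0 for widths this sketch does not handle.
 */
uint64_t ex_bitmap_bytes(unsigned int va_bits)
{
	if (va_bits < EX_PAGE_SHIFT + 3 || va_bits > 64)
		return 0;
	return 1ULL << (va_bits - EX_PAGE_SHIFT - 3);
}
```

That gives 4 GiB of bitmap for a 47-bit user space and 2 TiB for a 56-bit one, so a size baked into a binary for one paging mode does not fit the other.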

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-07 21:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <D10B5B59-1BE7-44DC-8E91-C8E4292DC6FB@amacapital.net>

On 6/7/19 1:40 PM, Andy Lutomirski wrote:
>>> Hmm.  Can we be creative and skip populating it with zeros?  The
>>> CPU should only ever touch a page if we miss an ENDBR on it, so, in
>>> normal operation, we don’t need anything to be there.  We could try
>>> to prevent anyone from *reading* it outside of ENDBR tracking if we
>>> want to avoid people accidentally wasting lots of memory by forcing
>>> it to be fully populated when they read it.
>> 
>> Won't reads on a big, contiguous private mapping get the huge zero
>> page anyway?
> 
> The zero pages may be free, but the page tables could be decently
> large.  Does the core mm code use huge, immense, etc huge zero pages?
> Or can it synthesize them by reusing page table pages that map zeros?

IIRC, we only ever fill single PMDs, even though we could gang a pmd
page up and do it for 1GB areas too.

I guess the page table consumption could really suck if we had code all
over the 57-bit address space and that code moved around and the process
ran for a long long time.  Pathologically, we need a ulong/pmd_t for
each 2MB of address space which is 8*2^(57-21)=512GB per process.  Yikes.
 Right now, we'd at least detect the memory consumption and OOM-kill the
process(es) eventually.  But, that's not really _this_ patch's problem.
 It's a general problem, and doesn't even require the zero page to be
mapped all over.
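
The 512GB figure is consistent with one 8-byte pmd_t per 2MB across a full 57-bit space, i.e. 8 * 2^(57-21) bytes.  A small helper to check the arithmetic (the names are illustrative only):

```c
#include <stdint.h>

#define EX_PMD_SHIFT	21	/* one PMD entry maps 2 MiB */
#define EX_PMD_ENTRY	8ULL	/* bytes per pmd_t on x86-64 */

/*
 * Bytes of PMD entries needed to map every 2 MiB chunk of a 2^va_bits
 * address space.  Returns 0 for widths this sketch does not handle.
 */
uint64_t ex_pmd_bytes(unsigned int va_bits)
{
	if (va_bits < EX_PMD_SHIFT || va_bits > 64)
		return 0;
	return EX_PMD_ENTRY << (va_bits - EX_PMD_SHIFT);
}
```

For comparison, the same calculation for a 48-bit space gives 1 GiB.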

Longer-term, I'd much rather see us add some page table reclaim
mechanism that knew how to go after things like excessive page tables in
MAP_NORESERVE areas.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-07 21:09 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <0e505563f7dae3849b57fb327f578f41b760b6f7.camel@intel.com>

On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
>> Huh, how does glibc know about all possible past and future legacy code
>> in the application?
> When dlopen() gets a legacy binary and the policy allows that, it will manage
> the bitmap:
> 
>   If a bitmap has not been created, create one.
>   Set bits for the legacy code being loaded.

I was thinking about code that doesn't go through GLIBC like JITs.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-07 22:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <f6de9073-9939-a20d-2196-25fa223cf3fc@intel.com>


> On Jun 7, 2019, at 2:09 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
>>> Huh, how does glibc know about all possible past and future legacy code
>>> in the application?
>> When dlopen() gets a legacy binary and the policy allows that, it will manage
>> the bitmap:
>> 
>>  If a bitmap has not been created, create one.
>>  Set bits for the legacy code being loaded.
> 
> I was thinking about code that doesn't go through GLIBC like JITs.

CRIU is another consideration: it would be rather annoying if CET programs can’t migrate between LA57 and normal machines.


* Re: [PATCH v3 1/2] fork: add clone3
From: Christian Brauner @ 2019-06-08  8:15 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds
  Cc: jannh, Serge E. Hallyn, keescook, fweimer, oleg, arnd, dhowells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api
In-Reply-To: <20190606214645.GA31599@mail.hallyn.com>

On Thu, Jun 06, 2019 at 04:46:45PM -0500, Serge Hallyn wrote:
> On Tue, Jun 04, 2019 at 06:09:43PM +0200, Christian Brauner wrote:
> > This adds the clone3 system call.
> > 
> > As mentioned several times already (cf. [7], [8]) here's the promised
> > patchset for clone3().
> > 
> > We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> > free flag from clone().
> > 
> > Independent of the CLONE_PIDFD patchset, a time namespace was discussed
> > at the Linux Plumbers Conference last year and has been sent out and reviewed
> > (cf. [5]). It is expected that it will go upstream in the not too distant
> > future. However, it relies on the addition of the CLONE_NEWTIME flag to
> > clone(). The only other good candidate - CLONE_DETACHED - is currently not
> > recyclable as we have identified at least two large or widely used
> > codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> > CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> > blocked. clone3() has the advantage that it will unblock this patchset
> > again. In general, clone3() is extensible and allows for the implementation
> > of new features.
> > 
> > The idea is to keep clone3() very simple and close to the original clone(),
> > specifically, to keep on supporting old clone()-based workloads.
> > We know there have been various creative proposals for what a new process
> > creation syscall or even API should look like. Some people even go so far
> > as to argue that the traditional fork()+exec() split should be abandoned
> > in favor of an in-kernel version of spawn(). Independent of whether or not
> > we personally think spawn() is a good idea, this patchset has nothing to do
> > with it and does not want to.
> > Our stance is that there's no really good alternative to
> > clone()+exec() and we need and want to support this model going forward,
> > independent of spawn().
> > The following requirements guided clone3():
> > - bump the number of available flags
> > - move arguments that are currently passed as separate arguments
> >   in clone() into a dedicated struct clone_args
> >   - choose a struct layout that is easy to handle on 32 and on 64 bit
> >   - choose a struct layout that is extensible
> >   - give new flags that currently need to abuse another flag's dedicated
> >     return argument in clone() their own dedicated return argument
> >     (e.g. CLONE_PIDFD)
> >   - use a separate kernel internal struct kernel_clone_args that is
> >     properly typed according to current kernel conventions in fork.c and is
> >     different from the uapi struct clone_args
> > - port _do_fork() to use kernel_clone_args so that all process creation
> >   syscalls such as fork(), vfork(), clone(), and clone3() behave identically
> >   (Arnd suggested that we can probably also port do_fork() itself in a
> >    separate patchset.)
> > - ease of transition for userspace from clone() to clone3()
> >   This very much means that we do *not* remove functionality that userspace
> >   currently relies on, as removing it is a good way of creating a syscall
> >   that won't be adopted.
> > - do not try to be clever or complex: keep clone3() as dumb as possible
> > 
> > In accordance with Linus' suggestions (cf. [11]), clone3() has the following
> > signature:
> > 
> > /* uapi */
> > struct clone_args {
> >         __aligned_u64 flags;
> >         __aligned_u64 pidfd;
> >         __aligned_u64 child_tid;
> >         __aligned_u64 parent_tid;
> >         __aligned_u64 exit_signal;
> >         __aligned_u64 stack;
> >         __aligned_u64 stack_size;
> >         __aligned_u64 tls;
> > };
> > 
> > /* kernel internal */
> > struct kernel_clone_args {
> >         u64 flags;
> >         int __user *pidfd;
> >         int __user *child_tid;
> >         int __user *parent_tid;
> >         int exit_signal;
> >         unsigned long stack;
> >         unsigned long stack_size;
> >         unsigned long tls;
> > };
> > 
> > long sys_clone3(struct clone_args __user *uargs, size_t size)
> > 
> > clone3() cleanly supports all of the supported flags from clone() and thus
> > all legacy workloads.
> > The advantage of sticking close to the old clone() is the low cost for
> > userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> > pthreads) are based on the clone() syscall. The new clone3() syscall
> > supporting all of the old workloads and opening up the ability to add new
> > features should make switching to it more appealing for userspace. In
> > essence, glibc can just write a simple wrapper to switch from clone() to
> > clone3().
> > 
> > There has been some interest in this patchset already. We have received a
> > patch from the CRIU corner for clone3() that would set the PID/TID of a
> > restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
> > 
> > /* User visible differences to legacy clone() */
> > - CLONE_DETACHED will cause EINVAL with clone3()
> > - CSIGNAL is deprecated
> >   It is superseded by a dedicated "exit_signal" argument in struct
> >   clone_args, freeing up space for additional flags.
> >   This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
> > 
> > /* References */
> > [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> > [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> > [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> > [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> > [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
> > [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
> > [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
> > [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
> > [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
> > [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
> > [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Christian Brauner <christian@brauner.io>
> 
> Acked-by: Serge Hallyn <serge@hallyn.com>

This also carries an Ack by Arnd and there don't seem to be technical
issues anymore.
So I'm going to move this over into my for-next branch targeting 5.3 to
see some testing.

Thanks!
Christian

> 
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: David Howells <dhowells@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Adrian Reber <adrian@lisas.de>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Florian Weimer <fweimer@redhat.com>
> > Cc: linux-api@vger.kernel.org
> > ---
> > v1:
> > - Linus Torvalds <torvalds@linux-foundation.org>:
> >   - redesign based on Linus' proposal
> >   - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> > - Arnd Bergmann <arnd@arndb.de>:
> >   - use a single copy_from_user() instead of multiple get_user() calls
> >     since the latter have a constant overhead on some architectures
> >   - a range of other tweaks and suggestions
> > v2:
> > - Linus Torvalds <torvalds@linux-foundation.org>,
> >   Andrei Vagin <avagin@gmail.com>:
> >   - replace CSIGNAL flag with dedicated exit_signal argument in struct
> >     clone_args
> > - Christian Brauner <christian@brauner.io>:
> >   - improve naming for some struct clone_args members
> > v3:
> > - Arnd Bergmann <arnd@arndb.de>:
> >   - replace memset with constructor for clarity and better object code
> >   - call flag verification function clone3_flags_valid() on
> >     kernel_clone_args instead of clone_args
> >   - remove __ARCH_WANT_SYS_CLONE ifdef around sys_clone3()
> > - Christian Brauner <christian@brauner.io>:
> >   - replace clone3_flags_valid() with clone3_args_valid() and call in
> >     clone3() directly rather than in copy_clone_args_from_user()
> >     This cleanly separates copying the args from userspace from the
> >     verification whether those args are sane.
> > - David Howells <dhowells@redhat.com>:
> >   - align new struct member assignments with tabs
> >   - replace CLONE_MAX with a non-uapi exported CLONE_LEGACY_FLAGS and
> >     define it as 0xffffffffULL for clarity
> >   - make copy_clone_args_from_user() noinline
> >   - avoid assigning to local variables from struct kernel_clone_args
> >     members in cases where it makes sense
> > ---
> >  arch/x86/ia32/sys_ia32.c   |  12 ++-
> >  include/linux/sched/task.h |  17 +++-
> >  include/linux/syscalls.h   |   4 +
> >  include/uapi/linux/sched.h |  16 +++
> >  kernel/fork.c              | 201 ++++++++++++++++++++++++++++---------
> >  5 files changed, 199 insertions(+), 51 deletions(-)
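
To make the user-visible differences listed in the quoted description a bit
more concrete, here is a small sketch of how a caller might translate a
legacy clone() flags word into the new argument struct. The struct mirrors
the uapi layout quoted above but carries a different name so it cannot clash
with real headers. The helper name and the SKETCH_* macros are hypothetical;
the two mask values are the ones CSIGNAL and CLONE_DETACHED have in
include/uapi/linux/sched.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the quoted uapi struct: eight 64-bit fields. */
struct clone_args_sketch {
    uint64_t flags;
    uint64_t pidfd;
    uint64_t child_tid;
    uint64_t parent_tid;
    uint64_t exit_signal;
    uint64_t stack;
    uint64_t stack_size;
    uint64_t tls;
};

/* Every member is 64 bits wide, so 32-bit and 64-bit callers agree on layout. */
_Static_assert(sizeof(struct clone_args_sketch) == 64,
               "eight 64-bit members, no padding");

#define SKETCH_CSIGNAL        0x000000ffULL /* same value as CSIGNAL */
#define SKETCH_CLONE_DETACHED 0x00400000ULL /* same value as CLONE_DETACHED */

/*
 * Hypothetical helper: split a legacy clone() flags word into the
 * clone3()-style struct. The low byte, which legacy clone() uses for
 * the exit signal, moves into its own field; CLONE_DETACHED is
 * rejected, matching the EINVAL behaviour described for clone3().
 */
static int clone3_args_from_legacy(uint64_t legacy_flags,
                                   struct clone_args_sketch *out)
{
    assert(out != NULL);

    if (legacy_flags & SKETCH_CLONE_DETACHED)
        return -EINVAL;

    memset(out, 0, sizeof(*out));
    out->flags = legacy_flags & ~SKETCH_CSIGNAL;
    out->exit_signal = legacy_flags & SKETCH_CSIGNAL;
    return 0;
}
```

On a kernel carrying this patch, the filled-in struct and its size would then
be handed to the new syscall, matching the quoted
sys_clone3(struct clone_args __user *uargs, size_t size) signature. Passing
the size alongside the pointer is what leaves room to extend the struct later.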

