* Re: [PATCH 1/3] kcmp: Move kcmp.h into uapi
From: Arnd Bergmann @ 2014-10-23 7:49 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw, linux-api-u79uwXL29TY76Z2rM5mHXA,
Andrew Morton, gorcunov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1414040834-30209-1-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
On Thursday 23 October 2014 16:07:12 Michael Ellerman wrote:
> --- a/include/linux/kcmp.h
> +++ b/include/linux/kcmp.h
> @@ -1,17 +1,6 @@
> #ifndef _LINUX_KCMP_H
> #define _LINUX_KCMP_H
>
> -/* Comparison type */
> -enum kcmp_type {
> - KCMP_FILE,
> - KCMP_VM,
> - KCMP_FILES,
> - KCMP_FS,
> - KCMP_SIGHAND,
> - KCMP_IO,
> - KCMP_SYSVSEM,
> -
> - KCMP_TYPES,
> -};
> +#include <uapi/linux/kcmp.h>
>
> #endif /* _LINUX_KCMP_H */
>
If the file is empty except for the uapi include, I think it's better to
delete it completely. The include path logic should ensure we pick the
other one up.
Arnd
^ permalink raw reply
* Re: [PATCH v1 3/3] tpm: fix multiple race conditions in tpm_ppi.c
From: Jarkko Sakkinen @ 2014-10-23 7:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
josh.triplett-ral2JQCrhuEAvxtiuMwx3w,
christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/
In-Reply-To: <20141022172646.GD12775-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Thanks for the excellent review comments. I'll do another spin an try to
incorporate most them.
/Jarkko
On Wed, Oct 22, 2014 at 11:26:46AM -0600, Jason Gunthorpe wrote:
> On Wed, Oct 22, 2014 at 07:23:56PM +0300, Jarkko Sakkinen wrote:
> > Traversal of the ACPI device tree was not done right. It should lookup
> > PPI only under the ACPI device that it is associated. Otherwise, it could
> > match to a wrong PPI interface if there are two TPM devices in the device
> > tree.
> >
> > Removed global ACPI handle and version string from tpm_ppi.c as this
> > is racy. Instead they should be associated with the chip.
> >
> > Moved code just a tiny bit towards two-phase allocation to implement
> > fix for the PPI race conditions.
>
> Not this version..
>
> > Added missing copyright platter to tpm_ppi.c.
> >
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> Reviewed-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>
> I like this one the most of the three I've seen :)
>
> Did you also look in tpm_acpi.c to see if it needs to use
> acpi_dev_handle somehow too?
>
> > + union acpi_object *obj;
> > + struct kobject *parent = &chip->dev->kobj;
>
> Nit, this variable is only used once, it would be clearer to inline
>
> > + /* Cache PPI version string. */
> > + obj = acpi_evaluate_dsm_typed(chip->acpi_dev_handle, tpm_ppi_uuid,
> > + TPM_PPI_REVISION_ID, TPM_PPI_FN_VERSION,
> > + NULL, ACPI_TYPE_STRING);
> > + if (obj) {
> > + strlcpy(chip->ppi_version, obj->string.pointer,
> > + PPI_VERSION_LEN + 1);
> > + ACPI_FREE(obj);
> > + } else
> > + return -ENOMEM;
> > +
> > + return chip->acpi_dev_handle ?
> > + sysfs_create_group(parent, &ppi_attr_grp) : 0;
>
> The above sequence can just be:
>
> if (!obj)
> return -ENOMEM;
>
> strlcpy(chip->ppi_version, obj->string.pointer, sizeof(chip->ppi_version));
> ACPI_FREE(obj);
>
> return sysfs_create_group(&chip->dev->kobj, &ppi_attr_grp);
>
> Which is more idiomatic. Also remove TPM_PPI_VERSION_LEN, sizeof is better.
>
> I know nothing about acpi, but is ENOMEM the right code? I would think
> acpi_evalute_dsm_typed would also fail if tpm_ppi_uuid is not found??
>
> > + return chip->acpi_dev_handle ?
> > + sysfs_create_group(parent, &ppi_attr_grp) : 0;
>
> dev_handle is already checked to be non 0
>
> > +void tpm_remove_ppi(struct tpm_chip *chip)
> > + struct kobject *parent = &chip->dev->kobj;
>
> Also used only once
>
> Jason
^ permalink raw reply
* Re: [PATCH v1 2/3] tpm: two-phase chip management functions
From: Jarkko Sakkinen @ 2014-10-23 7:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
josh.triplett-ral2JQCrhuEAvxtiuMwx3w,
christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/
In-Reply-To: <20141022171603.GC12775-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On Wed, Oct 22, 2014 at 11:16:03AM -0600, Jason Gunthorpe wrote:
> On Wed, Oct 22, 2014 at 07:23:55PM +0300, Jarkko Sakkinen wrote:
> > tpm_register_hardware() and tpm_remove_hardware() are called often
> > before initializing the device. This is wrong order since it could
> > be that main TPM driver needs a fully initialized chip to be able to
> > do its job. For example, now it is impossible to move common startup
> > functions such as tpm_do_selftest() to tpm_register_hardware().
>
> > Added tpm_chip_alloc() and tpm_chip_register() where tpm_chip_alloc()
>
> It is called tpmm_chip_alloc() in this version..
>
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> Reviewed-By: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>
> Looks fine, if you have to spin this again you can incorporate the
> nits.
Thanks for the review! I'll try to incorporate most (if not all of
them). I was going to comment some of the points you made but they
would have mostly been "ACK".
> > +
> > +static DECLARE_BITMAP(dev_mask, TPM_NUM_DEVICES);
>
> Not for this patch, but while this area of code is being looked at
> this should probably be an IDR/IDA like other subsytems?
Yes, it should use IDR. I'll put this into my backlog but don't
include into this patch set.
> > + if (try_module_get(pos->dev->driver->owner)) {
> > + chip = pos;
> > + break;
> > + }
>
> Not for this patch, this module stuff should be wiped in favor of chip->ops
> locking.
Definitely, horrible stuff, putting into my backlog (kref + mutex
should be a better solution).
> > +static void tpmm_chip_remove(void *data)
> > +{
> > + struct tpm_chip *chip = (struct tpm_chip *) data;
> > + dev_dbg(chip->dev, "%s\n", __func__);
>
> This print is silent in the default compile, right?
Forgotten clutter, removing it, thanks.
> > + chip->dev_num = find_first_zero_bit(dev_mask, TPM_NUM_DEVICES);
> [..]
> > + set_bit(chip->dev_num, dev_mask);
>
> Not for this patch but somehow there is no locking for dev_mask
> here. I guess it should use the driver_lock spinlock?
I'll add it anyway, thanks.
> > + chip->bios_dir = tpm_bios_log_setup(chip->devname);
>
> Not for this patch, but tpm_bios_log_setup can return NULL if
> securityfs setup fails.
Easy to fix so I'll just fix it.
> > diff --git a/drivers/char/tpm/tpm_atmel.c b/drivers/char/tpm/tpm_atmel.c
> > index 6069d13..8e2576a 100644
> > +++ b/drivers/char/tpm/tpm_atmel.c
> > @@ -138,11 +138,11 @@ static void atml_plat_remove(void)
> > struct tpm_chip *chip = dev_get_drvdata(&pdev->dev);
> >
> > if (chip) {
> > + tpm_chip_unregister(chip);
> > if (chip->vendor.have_region)
> > atmel_release_region(chip->vendor.base,
> > chip->vendor.region_size);
> > atmel_put_base_addr(chip->vendor.iobase);
> > - tpm_remove_hardware(chip->dev);
> > platform_device_unregister(pdev);
>
> Missed this before, the Atmel driver is the same as the TIS driver in
> force mode, ie it isn't going through the driver APIs, but instead
> force creating platform devices. Same comment as for TIS - I'm not
> sure devm works properly like this.
>
> I guess just add the same comment as for TIS?
Yup, makes sense.
> > diff --git a/drivers/char/tpm/tpm_i2c_infineon.c b/drivers/char/tpm/tpm_i2c_infineon.c
> > index 472af4b..6f00bc3 100644
> > +++ b/drivers/char/tpm/tpm_i2c_infineon.c
> > @@ -581,10 +581,9 @@ static int tpm_tis_i2c_init(struct device *dev)
> > int rc = 0;
> > struct tpm_chip *chip;
> >
> > - chip = tpm_register_hardware(dev, &tpm_tis_i2c);
> > - if (!chip) {
> > - dev_err(dev, "could not register hardware\n");
> > - rc = -ENODEV;
> > + chip = tpmm_chip_alloc(dev, &tpm_tis_i2c);
> > + if (IS_ERR(chip)) {
> > + rc = PTR_ERR(chip);
> > goto out_err;
>
> Nit: out_err is synonymous with 'return rc;', so all the goto out_err
> in this driver can just return;
>
> > + rc = tpm_chip_register(chip);
> > + if (rc)
> > + return rc;
> > + return 0;
>
> Nit: could just be return tpm_chip_register(chip);
>
> > @@ -619,10 +619,9 @@ tpm_st33_i2c_probe(struct i2c_client *client, const struct i2c_device_id *id)
> > goto end;
> > }
> >
> > - chip = tpm_register_hardware(&client->dev, &st_i2c_tpm);
> > - if (!chip) {
> > - dev_info(&client->dev, "fail chip\n");
> > - err = -ENODEV;
> > + chip = tpmm_chip_alloc(&client->dev, &st_i2c_tpm);
> > + if (IS_ERR(chip)) {
> > + err = PTR_ERR(chip);
> > goto end;
> > }
>
> Nit: Same comment, 'goto end' can just be 'return rc'
>
> > - dev_info(chip->dev, "TPM I2C Initialized\n");
> > + err = tpm_chip_register(chip);
> > + if (err)
> > + goto _irq_set;
> > +
>
> Nit: Same comment
>
> > end:
> > pr_info("TPM I2C initialisation fail\n");
>
> This pr_info should be deleted
Agreed.
> > @@ -573,6 +580,8 @@ static void tpm_inf_pnp_remove(struct pnp_dev *dev)
> > struct tpm_chip *chip = pnp_get_drvdata(dev);
> >
> > if (chip) {
>
> Nit: This if in the remove callback is an anti-pattern and should be
> globally removed. If remove is being called then probe succeeded,
> there is no way for probe to succed and drvdata to be unset.
>
> > static void tpm_nsc_remove(struct device *dev)
> > {
> > struct tpm_chip *chip = dev_get_drvdata(dev);
> > - if ( chip ) {
> > + if (chip) {
>
> Same comment
>
> > @@ -836,7 +831,7 @@ MODULE_PARM_DESC(hid, "Set additional specific HID for this driver to probe");
> >
> > static struct platform_driver tis_drv = {
> > .driver = {
> > - .name = "tpm_tis",
> > + .name = "tpm_tis",
> > .owner = THIS_MODULE,
> > .pm = &tpm_tis_pm,
> > },
>
> There is no remove method here in this platform_driver, this ties into
> the question if force works or not. The tpm_tis_remove call in the
> cleanup_hardware should be done through the .remove method of this
> driver structure..
I'll try to get this tested.
> Jason
/Jarkko
^ permalink raw reply
* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
From: David Drysdale @ 2014-10-23 6:40 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff,
linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Andrew Morton, Kees Cook, Arnd Bergmann, X86 ML,
linux-arch, Linux API
In-Reply-To: <87k33sgeq9.fsf@x220.int.ebiederm.org>
On Wed, Oct 22, 2014 at 7:07 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> David Drysdale <drysdale@google.com> writes:
>
>> Add a new system execveat(2) syscall. execveat() is to execve() as
>> openat() is to open(): it takes a file descriptor that refers to a
>> directory, and resolves the filename relative to that.
>>
>> In addition, if the filename is empty and AT_EMPTY_PATH is specified,
>> execveat() executes the file to which the file descriptor refers. This
>> replicates the functionality of fexecve(), which is a system call in
>> other UNIXen, but in Linux glibc it depends on opening
>> "/proc/self/fd/<fd>" (and so relies on /proc being mounted).
>>
>> The filename fed to the executed program as argv[0] (or the name of the
>> script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>> (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>> reflecting how the executable was found. This does however mean that
>> execution of a script in a /proc-less environment won't work.
>>
>> Only x86-64, i386 and x32 ABIs are supported in this patch.
>>
>> Based on patches by Meredydd Luff <meredydd@senatehouse.org>
>>
>> Signed-off-by: David Drysdale <drysdale@google.com>
>> ---
>> arch/x86/ia32/audit.c | 1 +
>> arch/x86/ia32/ia32entry.S | 1 +
>> arch/x86/kernel/audit_64.c | 1 +
>> arch/x86/kernel/entry_64.S | 28 ++++++++
>> arch/x86/syscalls/syscall_32.tbl | 1 +
>> arch/x86/syscalls/syscall_64.tbl | 2 +
>> arch/x86/um/sys_call_table_64.c | 1 +
>> fs/exec.c | 130 ++++++++++++++++++++++++++++++++++----
>> fs/namei.c | 2 +-
>> include/linux/compat.h | 3 +
>> include/linux/fs.h | 1 +
>> include/linux/sched.h | 4 ++
>> include/linux/syscalls.h | 4 ++
>> include/uapi/asm-generic/unistd.h | 4 +-
>> kernel/sys_ni.c | 3 +
>> lib/audit.c | 3 +
>> 16 files changed, 173 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
>> index 5d7b381da692..2eccc8932ae6 100644
>> --- a/arch/x86/ia32/audit.c
>> +++ b/arch/x86/ia32/audit.c
>> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
>> case __NR_socketcall:
>> return 4;
>> case __NR_execve:
>> + case __NR_execveat:
>> return 5;
>> default:
>> return 1;
>> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
>> index 4299eb05023c..2516c09743e0 100644
>> --- a/arch/x86/ia32/ia32entry.S
>> +++ b/arch/x86/ia32/ia32entry.S
>> @@ -464,6 +464,7 @@ GLOBAL(\label)
>> PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
>> PTREGSCALL stub32_sigreturn, sys32_sigreturn
>> PTREGSCALL stub32_execve, compat_sys_execve
>> + PTREGSCALL stub32_execveat, compat_sys_execveat
>> PTREGSCALL stub32_fork, sys_fork
>> PTREGSCALL stub32_vfork, sys_vfork
>>
>> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
>> index 06d3e5a14d9d..f3672508b249 100644
>> --- a/arch/x86/kernel/audit_64.c
>> +++ b/arch/x86/kernel/audit_64.c
>> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
>> case __NR_openat:
>> return 3;
>> case __NR_execve:
>> + case __NR_execveat:
>> return 5;
>> default:
>> return 0;
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 2fac1343a90b..00c4526e6ffe 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -665,6 +665,20 @@ ENTRY(stub_execve)
>> CFI_ENDPROC
>> END(stub_execve)
>>
>> +ENTRY(stub_execveat)
>> + CFI_STARTPROC
>> + addq $8, %rsp
>> + PARTIAL_FRAME 0
>> + SAVE_REST
>> + FIXUP_TOP_OF_STACK %r11
>> + call sys_execveat
>> + RESTORE_TOP_OF_STACK %r11
>> + movq %rax,RAX(%rsp)
>> + RESTORE_REST
>> + jmp int_ret_from_sys_call
>> + CFI_ENDPROC
>> +END(stub_execveat)
>> +
>> /*
>> * sigreturn is special because it needs to restore all registers on return.
>> * This cannot be done with SYSRET, so use the IRET return path instead.
>> @@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
>> CFI_ENDPROC
>> END(stub_x32_execve)
>>
>> +ENTRY(stub_x32_execveat)
>> + CFI_STARTPROC
>> + addq $8, %rsp
>> + PARTIAL_FRAME 0
>> + SAVE_REST
>> + FIXUP_TOP_OF_STACK %r11
>> + call compat_sys_execveat
>> + RESTORE_TOP_OF_STACK %r11
>> + movq %rax,RAX(%rsp)
>> + RESTORE_REST
>> + jmp int_ret_from_sys_call
>> + CFI_ENDPROC
>> +END(stub_x32_execveat)
>> +
>> #endif
>>
>> /*
>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> index 028b78168d85..2633e3195455 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -363,3 +363,4 @@
>> 354 i386 seccomp sys_seccomp
>> 355 i386 getrandom sys_getrandom
>> 356 i386 memfd_create sys_memfd_create
>> +357 i386 execveat sys_execveat stub32_execveat
>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> index 35dd922727b9..1af5badd159c 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -327,6 +327,7 @@
>> 318 common getrandom sys_getrandom
>> 319 common memfd_create sys_memfd_create
>> 320 common kexec_file_load sys_kexec_file_load
>> +321 64 execveat stub_execveat
>>
>> #
>> # x32-specific system call numbers start at 512 to avoid cache impact
>> @@ -365,3 +366,4 @@
>> 542 x32 getsockopt compat_sys_getsockopt
>> 543 x32 io_setup compat_sys_io_setup
>> 544 x32 io_submit compat_sys_io_submit
>> +545 x32 execveat stub_x32_execveat
>> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
>> index f2f0723070ca..20c3649d0691 100644
>> --- a/arch/x86/um/sys_call_table_64.c
>> +++ b/arch/x86/um/sys_call_table_64.c
>> @@ -31,6 +31,7 @@
>> #define stub_fork sys_fork
>> #define stub_vfork sys_vfork
>> #define stub_execve sys_execve
>> +#define stub_execveat sys_execveat
>> #define stub_rt_sigreturn sys_rt_sigreturn
>>
>> #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
>> diff --git a/fs/exec.c b/fs/exec.c
>> index a2b42a98c743..92a6e14f096a 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);
>>
>> #endif /* CONFIG_MMU */
>>
>> -static struct file *do_open_exec(struct filename *name)
>> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
>> {
>> struct file *file;
>> int err;
>> @@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
>> .intent = LOOKUP_OPEN,
>> .lookup_flags = LOOKUP_FOLLOW,
>> };
>> + static const struct open_flags open_exec_nofollow_flags = {
>> + .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
>> + .acc_mode = MAY_EXEC | MAY_OPEN,
>> + .intent = LOOKUP_OPEN,
>> + .lookup_flags = 0,
>> + };
>>
>> - file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
>> - if (IS_ERR(file))
>> - goto out;
>> + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
>> + return ERR_PTR(-EINVAL);
>> +
>> + if (name->name[0] != '\0') {
>
> Is it really necessary to special case AT_EMPTY_PATH here. I would
> have thought the existing logic in namei.c would have been fine
> assuning we passed LOOKUP_EMPTY.
Just using do_filp_open() throughout looks mostly plausible on a quick
experiment, but my initial version appears to make O_PATH fds unexpectedly
fexecve()-able (I'm glad I had a test case for that).
I'll look for a way around that, hopefully without an explicit special case.
>> + const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
>> + ? &open_exec_nofollow_flags
>> + : &open_exec_flags);
>> +
>> + file = do_filp_open(fd, name, oflags);
>> + if (IS_ERR(file))
>> + goto out;
>> + } else {
>> + file = fget(fd);
>> + if (!file)
>> + return ERR_PTR(-EBADF);
>> +
>> + err = inode_permission(file->f_path.dentry->d_inode,
>> + open_exec_flags.acc_mode);
>> + if (err)
>> + goto exit;
>> + }
>>
>> err = -EACCES;
>> if (!S_ISREG(file_inode(file)->i_mode))
>> @@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
>> if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
>> goto exit;
>>
>> - fsnotify_open(file);
>> -
>> err = deny_write_access(file);
>> if (err)
>> goto exit;
>>
>> + if (name->name[0] != '\0')
>> + fsnotify_open(file);
>> +
>> out:
>> return file;
>>
>> @@ -786,7 +811,7 @@ exit:
>> struct file *open_exec(const char *name)
>> {
>> struct filename tmp = { .name = name };
>> - return do_open_exec(&tmp);
>> + return do_open_execat(AT_FDCWD, &tmp, 0);
>> }
>> EXPORT_SYMBOL(open_exec);
>>
>> @@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
>> /*
>> * sys_execve() executes a new program.
>> */
>> -static int do_execve_common(struct filename *filename,
>> - struct user_arg_ptr argv,
>> - struct user_arg_ptr envp)
>> +static int do_execveat_common(int fd, struct filename *filename,
>> + struct user_arg_ptr argv,
>> + struct user_arg_ptr envp,
>> + int flags)
>> {
>> + char *pathbuf = NULL;
>> struct linux_binprm *bprm;
>> struct file *file;
>> struct files_struct *displaced;
>> @@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
>> check_unsafe_exec(bprm);
>> current->in_execve = 1;
>>
>> - file = do_open_exec(filename);
>> + file = do_open_execat(fd, filename, flags);
>> retval = PTR_ERR(file);
>> if (IS_ERR(file))
>> goto out_unmark;
>> @@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
>> sched_exec();
>>
>> bprm->file = file;
>> - bprm->filename = bprm->interp = filename->name;
>> + if (fd == AT_FDCWD || filename->name[0] == '/') {
>> + bprm->filename = filename->name;
>> + } else {
>> + /*
>> + * Build a pathname that reflects how we got to the file,
>> + * either "/dev/fd/<fd>" (for an empty filename) or
>> + * "/dev/fd/<fd>/<filename>".
>> + */
>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>> + if (!pathbuf) {
>> + retval = -ENOMEM;
>> + goto out_unmark;
>> + }
>> + bprm->filename = pathbuf;
>> + if (filename->name[0] == '\0')
>> + sprintf(pathbuf, "/dev/fd/%d", fd);
>> + else
>> + snprintf(pathbuf, PATH_MAX,
>> + "/dev/fd/%d/%s", fd, filename->name);
>> + }
>> + bprm->interp = bprm->filename;
>>
>> retval = bprm_mm_init(bprm);
>> if (retval)
>> @@ -1532,6 +1579,7 @@ out_unmark:
>>
>> out_free:
>> free_bprm(bprm);
>> + kfree(pathbuf);
>>
>> out_files:
>> if (displaced)
>> @@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
>> {
>> struct user_arg_ptr argv = { .ptr.native = __argv };
>> struct user_arg_ptr envp = { .ptr.native = __envp };
>> - return do_execve_common(filename, argv, envp);
>> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
>> +}
>> +
>> +int do_execveat(int fd, struct filename *filename,
>> + const char __user *const __user *__argv,
>> + const char __user *const __user *__envp,
>> + int flags)
>> +{
>> + struct user_arg_ptr argv = { .ptr.native = __argv };
>> + struct user_arg_ptr envp = { .ptr.native = __envp };
>> +
>> + return do_execveat_common(fd, filename, argv, envp, flags);
>> }
>>
>> #ifdef CONFIG_COMPAT
>> @@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
>> .is_compat = true,
>> .ptr.compat = __envp,
>> };
>> - return do_execve_common(filename, argv, envp);
>> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
>> +}
>> +
>> +static int compat_do_execveat(int fd, struct filename *filename,
>> + const compat_uptr_t __user *__argv,
>> + const compat_uptr_t __user *__envp,
>> + int flags)
>> +{
>> + struct user_arg_ptr argv = {
>> + .is_compat = true,
>> + .ptr.compat = __argv,
>> + };
>> + struct user_arg_ptr envp = {
>> + .is_compat = true,
>> + .ptr.compat = __envp,
>> + };
>> + return do_execveat_common(fd, filename, argv, envp, flags);
>> }
>> #endif
>>
>> @@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
>> {
>> return do_execve(getname(filename), argv, envp);
>> }
>> +
>> +SYSCALL_DEFINE5(execveat,
>> + int, fd, const char __user *, filename,
>> + const char __user *const __user *, argv,
>> + const char __user *const __user *, envp,
>> + int, flags)
>> +{
>> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
>> +
>> + return do_execveat(fd,
>> + getname_flags(filename, lookup_flags, NULL),
>> + argv, envp, flags);
>> +}
>> +
>> #ifdef CONFIG_COMPAT
>> COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>> const compat_uptr_t __user *, argv,
>> @@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>> {
>> return compat_do_execve(getname(filename), argv, envp);
>> }
>> +
>> +COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
>> + const char __user *, filename,
>> + const compat_uptr_t __user *, argv,
>> + const compat_uptr_t __user *, envp,
>> + int, flags)
>> +{
>> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
>> +
>> + return compat_do_execveat(fd,
>> + getname_flags(filename, lookup_flags, NULL),
>> + argv, envp, flags);
>> +}
>> #endif
>> diff --git a/fs/namei.c b/fs/namei.c
>> index a7b05bf82d31..553c84d3e0cc 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -130,7 +130,7 @@ void final_putname(struct filename *name)
>>
>> #define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename))
>>
>> -static struct filename *
>> +struct filename *
>> getname_flags(const char __user *filename, int flags, int *empty)
>> {
>> struct filename *result, *err;
>> diff --git a/include/linux/compat.h b/include/linux/compat.h
>> index e6494261eaff..7450ca2ac1fc 100644
>> --- a/include/linux/compat.h
>> +++ b/include/linux/compat.h
>> @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>>
>> asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
>> const compat_uptr_t __user *envp);
>> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
>> + const compat_uptr_t __user *argv,
>> + const compat_uptr_t __user *envp, int flags);
>>
>> asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
>> compat_ulong_t __user *outp, compat_ulong_t __user *exp,
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 94187721ad41..e9818574d738 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
>> extern struct file * dentry_open(const struct path *, int, const struct cred *);
>> extern int filp_close(struct file *, fl_owner_t id);
>>
>> +extern struct filename *getname_flags(const char __user *, int, int *);
>> extern struct filename *getname(const char __user *);
>> extern struct filename *getname_kernel(const char *);
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index b867a4dab38a..33e056da7d33 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
>> extern int do_execve(struct filename *,
>> const char __user * const __user *,
>> const char __user * const __user *);
>> +extern int do_execveat(int, struct filename *,
>> + const char __user * const __user *,
>> + const char __user * const __user *,
>> + int);
>> extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>> struct task_struct *fork_idle(int);
>> extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index 0f86d85a9ce4..df5422294deb 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>> asmlinkage long sys_getrandom(char __user *buf, size_t count,
>> unsigned int flags);
>>
>> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
>> + const char __user *const __user *argv,
>> + const char __user *const __user *envp, int flags);
>> +
>> #endif
>> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>> index 11d11bc5c78f..feef07d29663 100644
>> --- a/include/uapi/asm-generic/unistd.h
>> +++ b/include/uapi/asm-generic/unistd.h
>> @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
>> __SYSCALL(__NR_getrandom, sys_getrandom)
>> #define __NR_memfd_create 279
>> __SYSCALL(__NR_memfd_create, sys_memfd_create)
>> +#define __NR_execveat 280
>> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>
>> #undef __NR_syscalls
>> -#define __NR_syscalls 280
>> +#define __NR_syscalls 281
>>
>> /*
>> * All syscalls below here should go away really,
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 391d4ddb6f4b..efb06058ad3e 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
>>
>> /* operate on Secure Computing state */
>> cond_syscall(sys_seccomp);
>> +
>> +/* execveat */
>> +cond_syscall(sys_execveat);
>> diff --git a/lib/audit.c b/lib/audit.c
>> index 1d726a22565b..b8fb5ee81e26 100644
>> --- a/lib/audit.c
>> +++ b/lib/audit.c
>> @@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
>> case __NR_socketcall:
>> return 4;
>> #endif
>> +#ifdef __NR_execveat
>> + case __NR_execveat:
>> +#endif
>> case __NR_execve:
>> return 5;
>> default:
^ permalink raw reply
* Re: [PATCH 3/3] selftests/kcmp: Always try to build the test
From: Cyrill Gorcunov @ 2014-10-23 6:09 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw, linux-api-u79uwXL29TY76Z2rM5mHXA,
Andrew Morton
In-Reply-To: <1414040834-30209-3-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
On Thu, Oct 23, 2014 at 04:07:14PM +1100, Michael Ellerman wrote:
> Don't prevent the test building on non-x86. Just try and build it and
> let the chips fall where they may.
>
> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
kcmp depends on checkpoint/restore config symbol which is known
to work on x86 and (iirc) on arm, that's why x86 was only allowed.
I don't mind to such change but not sure.
^ permalink raw reply
* Re: [PATCH 2/3] selftests/kcmp: Don't include kernel headers
From: Cyrill Gorcunov @ 2014-10-23 6:05 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw, linux-api-u79uwXL29TY76Z2rM5mHXA,
Andrew Morton
In-Reply-To: <1414040834-30209-2-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
On Thu, Oct 23, 2014 at 04:07:13PM +1100, Michael Ellerman wrote:
> The kcmp test mucks with the include path to bring in the kernel
> headers, and x86 headers too for reasons that are not clear.
>
> Now that kcmp.h is exported none of that should be necessary.
>
> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
The reason was to be able to run test without userspace headers generated.
Still this one is better I think.
Acked-by: Cyrill Gorcunov <gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
^ permalink raw reply
* Re: [PATCH 1/3] kcmp: Move kcmp.h into uapi
From: Cyrill Gorcunov @ 2014-10-23 6:04 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw, linux-api-u79uwXL29TY76Z2rM5mHXA,
Andrew Morton
In-Reply-To: <1414040834-30209-1-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
On Thu, Oct 23, 2014 at 04:07:12PM +1100, Michael Ellerman wrote:
> kcmp.h appears to be part of the API, it's documented in kcmp(2), and
> the selftests/kcmp code uses it. So move it to uapi so it's actually
> exported.
>
> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
Acked-by: Cyrill Gorcunov <gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
^ permalink raw reply
* [PATCH 3/3] selftests/kcmp: Always try to build the test
From: Michael Ellerman @ 2014-10-23 5:07 UTC (permalink / raw)
To: linux-kernel; +Cc: shuahkh, linux-api, Andrew Morton, gorcunov
In-Reply-To: <1414040834-30209-1-git-send-email-mpe@ellerman.id.au>
Don't prevent the test building on non-x86. Just try and build it and
let the chips fall where they may.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
tools/testing/selftests/kcmp/Makefile | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/tools/testing/selftests/kcmp/Makefile b/tools/testing/selftests/kcmp/Makefile
index 4f00c0524501..cda9cc4004c9 100644
--- a/tools/testing/selftests/kcmp/Makefile
+++ b/tools/testing/selftests/kcmp/Makefile
@@ -1,21 +1,7 @@
-uname_M := $(shell uname -m 2>/dev/null || echo not)
-ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
-ifeq ($(ARCH),i386)
- ARCH := x86
- CFLAGS := -DCONFIG_X86_32 -D__i386__
-endif
-ifeq ($(ARCH),x86_64)
- ARCH := x86
- CFLAGS := -DCONFIG_X86_64 -D__x86_64__
-endif
CFLAGS += -I../../../../usr/include/
all:
-ifeq ($(ARCH),x86)
gcc $(CFLAGS) kcmp_test.c -o kcmp_test
-else
- echo "Not an x86 target, can't build kcmp selftest"
-endif
run_tests: all
@./kcmp_test || echo "kcmp_test: [FAIL]"
--
1.9.1
^ permalink raw reply related
* [PATCH 2/3] selftests/kcmp: Don't include kernel headers
From: Michael Ellerman @ 2014-10-23 5:07 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: shuahkh-JPH+aEBZ4P+UEJcrhfAQsw, linux-api-u79uwXL29TY76Z2rM5mHXA,
Andrew Morton, gorcunov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1414040834-30209-1-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
The kcmp test mucks with the include path to bring in the kernel
headers, and x86 headers too for reasons that are not clear.
Now that kcmp.h is exported none of that should be necessary.
Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
---
tools/testing/selftests/kcmp/Makefile | 4 ----
1 file changed, 4 deletions(-)
diff --git a/tools/testing/selftests/kcmp/Makefile b/tools/testing/selftests/kcmp/Makefile
index 8aabd82db9e4..4f00c0524501 100644
--- a/tools/testing/selftests/kcmp/Makefile
+++ b/tools/testing/selftests/kcmp/Makefile
@@ -8,11 +8,7 @@ ifeq ($(ARCH),x86_64)
ARCH := x86
CFLAGS := -DCONFIG_X86_64 -D__x86_64__
endif
-
-CFLAGS += -I../../../../arch/x86/include/generated/
-CFLAGS += -I../../../../include/
CFLAGS += -I../../../../usr/include/
-CFLAGS += -I../../../../arch/x86/include/
all:
ifeq ($(ARCH),x86)
--
1.9.1
^ permalink raw reply related
* [PATCH 1/3] kcmp: Move kcmp.h into uapi
From: Michael Ellerman @ 2014-10-23 5:07 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: shuahkh-JPH+aEBZ4P+UEJcrhfAQsw, linux-api-u79uwXL29TY76Z2rM5mHXA,
Andrew Morton, gorcunov-GEFAQzZX7r8dnm+yROfE0A
kcmp.h appears to be part of the API, it's documented in kcmp(2), and
the selftests/kcmp code uses it. So move it to uapi so it's actually
exported.
Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
---
include/linux/kcmp.h | 13 +------------
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/kcmp.h | 17 +++++++++++++++++
3 files changed, 19 insertions(+), 12 deletions(-)
create mode 100644 include/uapi/linux/kcmp.h
diff --git a/include/linux/kcmp.h b/include/linux/kcmp.h
index 2dcd1b3aafc8..9dfb23e1771b 100644
--- a/include/linux/kcmp.h
+++ b/include/linux/kcmp.h
@@ -1,17 +1,6 @@
#ifndef _LINUX_KCMP_H
#define _LINUX_KCMP_H
-/* Comparison type */
-enum kcmp_type {
- KCMP_FILE,
- KCMP_VM,
- KCMP_FILES,
- KCMP_FS,
- KCMP_SIGHAND,
- KCMP_IO,
- KCMP_SYSVSEM,
-
- KCMP_TYPES,
-};
+#include <uapi/linux/kcmp.h>
#endif /* _LINUX_KCMP_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index b70237e8bc37..1cf50d682dbf 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -209,6 +209,7 @@ header-y += ivtvfb.h
header-y += ixjuser.h
header-y += jffs2.h
header-y += joystick.h
+header-y += kcmp.h
header-y += kd.h
header-y += kdev_t.h
header-y += kernel-page-flags.h
diff --git a/include/uapi/linux/kcmp.h b/include/uapi/linux/kcmp.h
new file mode 100644
index 000000000000..84df14b37360
--- /dev/null
+++ b/include/uapi/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _UAPI_LINUX_KCMP_H
+#define _UAPI_LINUX_KCMP_H
+
+/* Comparison type */
+enum kcmp_type {
+ KCMP_FILE,
+ KCMP_VM,
+ KCMP_FILES,
+ KCMP_FS,
+ KCMP_SIGHAND,
+ KCMP_IO,
+ KCMP_SYSVSEM,
+
+ KCMP_TYPES,
+};
+
+#endif /* _UAPI_LINUX_KCMP_H */
--
1.9.1
^ permalink raw reply related
* Re: [PATCH V2] kernel, add bug_on_warn
From: Yasuaki Ishimatsu @ 2014-10-23 0:39 UTC (permalink / raw)
To: Prarit Bhargava, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Jonathan Corbet, Andrew Morton, Rusty Russell, H. Peter Anvin,
Andi Kleen, Masami Hiramatsu, Fabian Frederick,
vgoyal-H+wXaHxf7aLQT0dZR+AlfA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1413910077-9464-1-git-send-email-prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
(2014/10/22 1:47), Prarit Bhargava wrote:
> There have been several times where I have had to rebuild a kernel to
> cause a panic when hitting a WARN() in the code in order to get a crash
> dump from a system. Sometimes this is easy to do, other times (such as
> in the case of a remote admin) it is not trivial to send new images to the
> user.
>
> A much easier method would be a switch to change the WARN() over to a
> BUG(). This makes debugging easier in that I can now test the actual
> image the WARN() was seen on and I do not have to engage in remote
> debugging.
>
> This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
> warn_slowpath_common() path. The function will still print out the
> location of the warning.
>
> An example of the bug_on_warn output:
>
> The first line below is from the WARN_ON() to output the WARN_ON()'s location.
> After that the new BUG() call is displayed.
>
> WARNING: CPU: 27 PID: 3204 at
> /home/rhel7/redhat/debug/dummy-module/dummy-module.c:25 init_dummy+0x28/0x30
> [dummy_module]()
> bug_on_warn set, calling BUG()...
> ------------[ cut here ]------------
> kernel BUG at kernel/panic.c:434!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: dummy_module(OE+) sg nfsv3 rpcsec_gss_krb5 nfsv4
> dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal intel_powerclamp
> coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
> ghash_clmulni_intel igb iTCO_wdt aesni_intel iTCO_vendor_support lrw gf128mul
> sb_edac ptp edac_core glue_helper lpc_ich ioatdma pcspkr ablk_helper pps_core
> i2c_i801 mfd_core cryptd dca shpchp ipmi_si wmi ipmi_msghandler acpi_cpufreq
> nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sr_mod cdrom sd_mod
> mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper isci ttm
> drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror
> dm_region_hash dm_log dm_mod
> CPU: 27 PID: 3204 Comm: insmod Tainted: G OE 3.17.0+ #19
> Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
> RMLSDP.86I.00.29.D696.1311111329 11/11/2013
> task: ffff880034e75160 ti: ffff8807fc5ac000 task.ti: ffff8807fc5ac000
> RIP: 0010:[<ffffffff81076b81>] [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
> RSP: 0018:ffff8807fc5afc68 EFLAGS: 00010246
> RAX: 0000000000000021 RBX: ffff8807fc5afcb0 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffff88081efee5f8 RDI: ffff88081efee5f8
> RBP: ffff8807fc5afc98 R08: 0000000000000096 R09: 0000000000000000
> R10: 0000000000000711 R11: ffff8807fc5af93e R12: ffffffffa0424070
> R13: 0000000000000019 R14: ffffffffa0423068 R15: 0000000000000009
> FS: 00007f2d4b034740(0000) GS:ffff88081efe0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2d4a99f3c0 CR3: 00000007fd88b000 CR4: 00000000001407e0
> Stack:
> ffff8807fc5afcb8 ffffffff8199f020 ffff88080e396160 0000000000000000
> ffffffffa0423040 ffffffffa0425000 ffff8807fc5afd08 ffffffff81076be5
> 0000000000000008 ffffffffa0424053 ffff880700000018 ffff8807fc5afd18
> Call Trace:
> [<ffffffffa0423040>] ? dummy_greetings+0x40/0x40 [dummy_module]
> [<ffffffff81076be5>] warn_slowpath_fmt+0x55/0x70
> [<ffffffffa0423068>] init_dummy+0x28/0x30 [dummy_module]
> [<ffffffff81002144>] do_one_initcall+0xd4/0x210
> [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
> [<ffffffff810f8889>] load_module+0x16a9/0x1b30
> [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
> [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
> [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
> [<ffffffff8166ce29>] system_call_fastpath+0x12/0x17
> Code: c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 20 42 8a 81 31 c0 e8 fc
> 80 5e 00 eb 80 48 c7 c7 78 42 8a 81 31 c0 e8 ec 80 5e 00 <0f> 0b 66 66 66 66 2e
> 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
> RIP [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
> RSP <ffff8807fc5afc68>
> ---[ end trace 428218934a12088b ]---
>
> Successfully tested by me.
>
> Cc: Jonathan Corbet <corbet-T1hC0tSOHrs@public.gmane.org>
> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> Cc: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
> Cc: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
> Cc: Andi Kleen <ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Masami Hiramatsu <masami.hiramatsu.pt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org>
> Cc: Fabian Frederick <fabf-AgBVmzD5pcezQB+pC5nmwQ@public.gmane.org>
> Cc: vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A@public.gmane.org
> Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
> Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Prarit Bhargava <prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> [v2]: add /proc/sys/kernel/bug_on_warn, additional documentation, modify
> !slowpath cases
> ---
> Documentation/kdump/kdump.txt | 7 +++++++
> Documentation/kernel-parameters.txt | 3 +++
> Documentation/sysctl/kernel.txt | 12 ++++++++++++
> include/asm-generic/bug.h | 12 ++++++++++--
> include/linux/kernel.h | 1 +
> include/uapi/linux/sysctl.h | 1 +
> kernel/panic.c | 21 ++++++++++++++++++++-
> kernel/sysctl.c | 7 +++++++
> kernel/sysctl_binary.c | 1 +
> 9 files changed, 62 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index 6c0b9f2..a04ed72 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
>
> http://people.redhat.com/~anderson/
>
> +Trigger Kdump on WARN()
> +=======================
> +
> +The kernel parameter, bug_on_warn, calls BUG() in all WARN() paths. This
> +will cause a kdump to occur at the BUG() call. In cases where a user
> +wants to specify this during runtime, /proc/sys/kernel/bug_on_warn can be
> +set to 1 to achieve the same behaviour.
>
> Contact
> =======
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 988160a..3890a3a 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -553,6 +553,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> bttv.pll= See Documentation/video4linux/bttv/Insmod-options
> bttv.tuner=
>
> + bug_on_warn BUG() instead of WARN(). Useful to cause kdump
> + on a WARN().
> +
> bulk_remove=off [PPC] This parameter disables the use of the pSeries
> firmware feature for flushing multiple hpte entries
> at a time.
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 57baff5..dcadcdc 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -23,6 +23,7 @@ show up in /proc/sys/kernel:
> - auto_msgmni
> - bootloader_type [ X86 only ]
> - bootloader_version [ X86 only ]
> +- bug_on_warn
> - callhome [ S390 only ]
> - cap_last_cap
> - core_pattern
> @@ -152,6 +153,17 @@ Documentation/x86/boot.txt for additional information.
>
> ==============================================================
>
> +bug_on_warn:
> +
> +Calls BUG() in the WARN() path when set to 1. This is useful to avoid
> +a kernel rebuild when attempting to kdump at the location of a WARN().
> +
> +0: only WARN(), default behaviour.
> +
> +1: call BUG() after printing out WARN() location.
> +
> +==============================================================
> +
> callhome:
>
> Controls the kernel's callhome behavior in case of a kernel panic.
> diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
> index 630dd23..4d0c763 100644
> --- a/include/asm-generic/bug.h
> +++ b/include/asm-generic/bug.h
> @@ -75,10 +75,18 @@ extern void warn_slowpath_null(const char *file, const int line);
> #define __WARN_printf_taint(taint, arg...) \
> warn_slowpath_fmt_taint(__FILE__, __LINE__, taint, arg)
> #else
> -#define __WARN() __WARN_TAINT(TAINT_WARN)
> +#define check_bug_on_warn() \
> + do { \
> + if (bug_on_warn) \
> + BUG(); \
> + } while (0)
> +
> +#define __WARN() \
> + do { __WARN_TAINT(TAINT_WARN); check_bug_on_warn(); } while (0)
> +
> #define __WARN_printf(arg...) do { printk(arg); __WARN(); } while (0)
> #define __WARN_printf_taint(taint, arg...) \
> - do { printk(arg); __WARN_TAINT(taint); } while (0)
> + do { printk(arg); __WARN_TAINT(taint); check_bug_on_warn(); } while (0)
> #endif
>
> #ifndef WARN_ON
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 40728cf..4094a60 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -422,6 +422,7 @@ extern int panic_on_oops;
> extern int panic_on_unrecovered_nmi;
> extern int panic_on_io_nmi;
> extern int sysctl_panic_on_stackoverflow;
> +extern int bug_on_warn;
> /*
> * Only to be used by arch init code. If the user over-wrote the default
> * CONFIG_PANIC_TIMEOUT, honor it.
> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> index 43aaba1..2ba0a58 100644
> --- a/include/uapi/linux/sysctl.h
> +++ b/include/uapi/linux/sysctl.h
> @@ -153,6 +153,7 @@ enum
> KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
> KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
> KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> + KERN_BUG_ON_WARN=77, /* int: call BUG() in WARN() functions */
> };
>
>
> diff --git a/kernel/panic.c b/kernel/panic.c
> index d09dc5c..a6d2e2f 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -33,6 +33,7 @@ static int pause_on_oops;
> static int pause_on_oops_flag;
> static DEFINE_SPINLOCK(pause_on_oops_lock);
> static bool crash_kexec_post_notifiers;
> +int bug_on_warn;
>
> int panic_timeout = CONFIG_PANIC_TIMEOUT;
> EXPORT_SYMBOL_GPL(panic_timeout);
> @@ -420,13 +421,24 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
> {
> disable_trace_on_warning();
>
> - pr_warn("------------[ cut here ]------------\n");
> + if (!bug_on_warn)
> + pr_warn("------------[ cut here ]------------\n");
> pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS()\n",
> raw_smp_processor_id(), current->pid, file, line, caller);
>
> if (args)
> vprintk(args->fmt, args->args);
>
> + if (bug_on_warn) {
> + pr_warn("bug_on_warn set, calling BUG()...\n");
> + /*
> + * A flood of WARN()s may occur. Prevent further WARN()s
> + * from panicking the system.
> + */
> + bug_on_warn = 0;
> + BUG();
> + }
> +
> print_modules();
> dump_stack();
> print_oops_end_marker();
> @@ -501,3 +513,10 @@ static int __init oops_setup(char *s)
> return 0;
> }
> early_param("oops", oops_setup);
> +
> +static int __init bug_on_warn_setup(char *s)
> +{
> + bug_on_warn = 1;
> + return 0;
> +}
> +early_param("bug_on_warn", bug_on_warn_setup);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 4aada6d..030bb5d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1103,6 +1103,13 @@ static struct ctl_table kern_table[] = {
> .proc_handler = proc_dointvec,
> },
> #endif
> + {
> + .procname = "bug_on_warn",
> + .data = &bug_on_warn,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
How about use:
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
Document says as follows but it can set other vaule.
> +0: only WARN(), default behaviour.
> +
> +1: call BUG() after printing out WARN() location.
Thanks,
Yasuaki Ishimatsu
> + },
> { }
> };
>
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 9a4f750..28376bf 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -137,6 +137,7 @@ static const struct bin_table bin_kern_table[] = {
> { CTL_INT, KERN_COMPAT_LOG, "compat-log" },
> { CTL_INT, KERN_MAX_LOCK_DEPTH, "max_lock_depth" },
> { CTL_INT, KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" },
> + { CTL_INT, KERN_BUG_ON_WARN, "bug_on_warn" },
> {}
> };
>
>
^ permalink raw reply
* Re: [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)
From: David Drysdale @ 2014-10-22 20:13 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API
In-Reply-To: <87wq7sgfah.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Wed, Oct 22, 2014 at 6:55 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>> +.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +execveat \- execute program relative to a directory file descriptor
>> +.SH SYNOPSIS
>> +.B #include <unistd.h>
>> +.sp
>> +.BI "int execve(int " fd ", const char *" pathname ","
> ^^^^^^ That needs to be execveat
Ah yes, so it does -- thanks for spotting.
^ permalink raw reply
* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Tejun Heo @ 2014-10-22 19:42 UTC (permalink / raw)
To: Aditya Kali
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Ingo Molnar, Eric W. Biederman,
cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAGr1F2EBDCVrXZd7fOdffQ2C0c25T8co4wfxRc8P0Jb18yq2uQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Hello,
On Wed, Oct 22, 2014 at 11:37:55AM -0700, Aditya Kali wrote:
...
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).
Let's go with this w/o pinning. The only necessary feature for
cgroupns is making the /proc/*/cgroups relative to its own root. It's
not like containers can avoid trusting its outside world anyway and
playing tricks with things like this tend to lead to weird surprises
down the road. If userland messes up, userland messes up.
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
From: Aditya Kali @ 2014-10-22 19:06 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Ingo Molnar, Linux Containers, Eric W. Biederman
In-Reply-To: <20141017092814.GA8848-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
On Fri, Oct 17, 2014 at 2:28 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> Restrict following operations within the calling tasks:
>> * cgroup_mkdir & cgroup_rmdir
>> * cgroup_attach_task
>> * writes to cgroup files outside of task's cgroupns-root
>>
>> Also, read of /proc/<pid>/cgroup file is now restricted only
>> to tasks under same cgroupns-root. If a task tries to look
>> at cgroup of another task outside of its cgroupns-root, then
>> it won't be able to see anything for the default hierarchy.
>> This is same as if the cgroups are not mounted.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> So this is a bit different from some other namespaces - if I
> have an open fd to a file, then setns into a mntns where that
> file is not addressable, I can still use the file.
>
> I guess not allowing attach to a cgroup outside our ns is a
> good failsafe as we'll otherwise risk falling off a cliff in
> some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
> restrictions are needed. (And really I can fchdir to a
> directory not in my ns, so the cgroup-attach restriction is
> any more justified).
>
As discussed on another thread, most of the restrictions in this patch
are undesirable and will be removed in the next version. Even the
restriction in cgroup_attach_task() will change to something like:
- if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(current)))
return -EPERM;
i.e., we don't care the cgroup of the process being moved. We only
check if the writer has access to the dst_cgrp.
So I will just drop this patch in the next version and merge the
cgroup_attach_task() change in another patch.
> Still I'm not strictly opposed ot this, so
>
> Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>
> just wanted to point this out.
>
>> ---
>> kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>> 1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index f8099b4..2fc0dfa 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>> struct task_struct *task;
>> int ret;
>>
>> + /* Only allow changing cgroups accessible within task's cgroup
>> + * namespace. i.e. 'dst_cgrp' should be a descendant of task's
>> + * cgroupns->root_cgrp. */
>> + if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
>> + return -EPERM;
>> +
>> /* look up all src csets */
>> down_read(&css_set_rwsem);
>> rcu_read_lock();
>> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>> struct cgroup_subsys_state *css;
>> int ret;
>>
>> + /* Reject writes to cgroup files outside of task's cgroupns-root. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> + return -EINVAL;
>> +
>> if (cft->write)
>> return cft->write(of, buf, nbytes, off);
>>
>> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>> parent = cgroup_kn_lock_live(parent_kn);
>> if (!parent)
>> return -ENODEV;
>> +
>> + /* Allow mkdir only within process's cgroup namespace root. */
>> + if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
>> + ret = -EPERM;
>> + goto out_unlock;
>> + }
>> +
>> root = parent->root;
>>
>> /* allocate the cgroup and its ID, 0 is reserved for the root */
>> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>> if (!cgrp)
>> return 0;
>>
>> + /* Allow rmdir only within process's cgroup namespace root.
>> + * The process can't delete its own root anyways. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
>> + cgroup_kn_unlock(kn);
>> + return -EPERM;
>> + }
>> +
>> ret = cgroup_destroy_locked(cgrp);
>>
>> cgroup_kn_unlock(kn);
>> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>> if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>> continue;
>>
>> + cgrp = task_cgroup_from_root(tsk, root);
>> +
>> + /* The cgroup path on default hierarchy is shown only if it
>> + * falls under current task's cgroupns-root.
>> + */
>> + if (root == &cgrp_dfl_root &&
>> + !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> + continue;
>> +
>> seq_printf(m, "%d:", root->hierarchy_id);
>> for_each_subsys(ss, ssid)
>> if (root->subsys_mask & (1 << ssid))
>> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>> seq_printf(m, "%sname=%s", count ? "," : "",
>> root->name);
>> seq_putc(m, ':');
>> - cgrp = task_cgroup_from_root(tsk, root);
>> path = cgroup_path(cgrp, buf, PATH_MAX);
>> if (!path) {
>> retval = -ENAMETOOLONG;
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
Thanks for the reiview!
--
Aditya
^ permalink raw reply
* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Andy Lutomirski @ 2014-10-22 18:50 UTC (permalink / raw)
To: Aditya Kali
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAGr1F2EBDCVrXZd7fOdffQ2C0c25T8co4wfxRc8P0Jb18yq2uQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Wed, Oct 22, 2014 at 11:37 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>
>>>>>>> And with explicit permission from
>>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>>> semantics simple.
>>>>>>
>>>>>> I actually think it makes the semantics more complex. The less policy
>>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>>> that policy.
>>>>>>
>>>>>
>>>>> My inclination is towards keeping things simpler - both in code as
>>>>> well as in configuration. I agree that cgroupns might seem
>>>>> "less-flexible", but in its current form, it encourages consistent
>>>>> container configuration. If you have a process that needs to move
>>>>> around between cgroups belonging to different containers, then that
>>>>> process should probably not be inside any container's cgroup
>>>>> namespace. Allowing that will just make the cgroup namespace
>>>>> pretty-much meaningless.
>>>>
>>>> The problem with pinning is that preventing it causes problems
>>>> (specifically, either something potentially complex and incompatible
>>>> needs to be added or unprivileged processes will be able to pin
>>>> themselves).
>>>>
>>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>>> need kernel pinning support to effectively constrain its members'
>>>> cgroups.
>>>>
>>>
>>> So there are 2 scenarios to consider:
>>>
>>> We have 2 containers with cgroups: /container1 and /container2
>>> Assume process P is running under cgroupns-root '/container1'
>>>
>>> (1) process P wants to 'write' to cgroup.procs outside its
>>> cgroupns-root (say to /container2/cgroup.procs)
>>
>> This, at least, doesn't have the problem with unprivileged processes
>> pinning themselves.
>>
>>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>>> with cgroupns-root above /container1) wants to write pid of process P
>>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>>
>>> For (1), I think its ok to reject such a write. This is consistent
>>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>>> set. I believe this should be independent of visibility of the cgroup
>>> hierarchy for P.
>>>
>>> For (2), we may allow the write to succeed if we make sure that the
>>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>>> userns AND over P's cgroupns->user_ns).
>>
>> Why is its userns relevant?
>>
>> Why not just check whether the target cgroup is in the process doing
>> the write's cgroupns? (NB: you need to check f_cred, here, not
>> current_cred(), but that's orthogonal.) Then the policy becomes: no
>> user of cgroupfs can move any process outside of the cgroupfs's user's
>> cgroupns root.
>>
> Humm .. it doesn't have to be. I think its simpler to not enforce
> artificial permission checks unless there is a security concern (and
> in this case, there doesn't seem to be any). So I will leave the
> capability check out from here.
>
>> I think I'm okay with this.
>>
>>> If this write succeeds, then:
>>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>>> by 'self' or any other process in P's cgrgroupns. I would really like
>>> to avoid showing relative paths or paths outside the cgroupns-root
>>
>> The empty string seems just as problematic to me.
>
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).
>
Confused now. If ".." in /proc/pid/group would be ambiguous, then so
would a path relative to cgroupns root, right? Or am I missing
something?
(I'm not saying that ".." is beautiful or that it won't confuse
things. I'm just not sure why it's ambiguous.)
> * show absolute path
> -- this will also wrong as the process won't be able to make sense of
> it unless it has exposure to the global cgroup hierarchy.
> -- worse case is this that the global path also exists under the
> cgroupns-root ... so now the process thinks its in completely wrong
> cgroup
> -- this exposes system
>
> * show only "/"
> -- this is arguably better, but if the process tires to verify that
> its pid is in cgroup.procs of the cgroupns-root, its in for a
> surprise!
>
> In either case, whatever we expose, the userspace won't be able to use
> this path correctly (worse yet, it associates wrong cgroup for that
> path). So I think its best to not print out the line for default
> hierarchy at all. This happens today when cgroupfs is not mounted. I
> am open to other suggestions.
I suppose that ".." is a possible security problem. If I can force
you to see lots of ..s in there, then I might be about to get you to
write outside cgroupfs.
Grr. No great solution here. I suppose that the empty string isn't
so bad. We could also write something obviously invalid like
"(unreachable)". As long as no one actually creates a cgroup called
"(unreachable)", then this could result in errors but not actual
confusion.
>>> * should we then also allow setns() without first entering the
>>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>>> checks that your current cgroup is descendant of target cgroupns-root.
>>> Alternatively we can special-case setns() to own cgroupns so that it
>>> doesn't fail.
>>
>> I think setns should completely ignore the caller's cgroup and should
>> not change it. Userspace can do this.
>>
>
> All above changes more or less means that tasks cannot pin themselves
> by unsharing cgroupns. Do you agree that we don't need that "explicit
> permission from cgroupfs" anymore (via cgroup.may_unshare file or
> other mechanism)?
Yes, I agree.
>
>>> * migration for these processes will be tricky, if not impossible. But
>>> the admin trying to do this probably doesn't care about it or will
>>> provision for it.
>>
>> Migration for processes in a mntns that have a current directory
>> outside their mntns is also difficult or impossible. Same with
>> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
>> procfs. Nothing new here.
>>
>> --Andy
>
> Thanks for the review!
No problem.
^ permalink raw reply
* [PATCH RFC v3 16/16] virtio_blk: fix types for in memory structures
From: Michael S. Tsirkin @ 2014-10-22 18:45 UTC (permalink / raw)
To: linux-kernel; +Cc: Rusty Russell, virtualization, linux-api
In-Reply-To: <1414003404-505-1-git-send-email-mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_blk.h | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ad67b2..247c8ba 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -28,6 +28,7 @@
#include <linux/types.h>
#include <linux/virtio_ids.h>
#include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
/* Feature bits */
#define VIRTIO_BLK_F_BARRIER 0 /* Does host support barriers? */
@@ -114,18 +115,18 @@ struct virtio_blk_config {
/* This is the first element of the read scatter-gather list. */
struct virtio_blk_outhdr {
/* VIRTIO_BLK_T* */
- __u32 type;
+ __virtio32 type;
/* io priority. */
- __u32 ioprio;
+ __virtio32 ioprio;
/* Sector (ie. 512 byte offset) */
- __u64 sector;
+ __virtio64 sector;
};
struct virtio_scsi_inhdr {
- __u32 errors;
- __u32 data_len;
- __u32 sense_len;
- __u32 residual;
+ __virtio32 errors;
+ __virtio32 data_len;
+ __virtio32 sense_len;
+ __virtio32 residual;
};
/* And this is the final byte of the write scatter-gather list. */
--
MST
^ permalink raw reply related
* [PATCH RFC v3 15/16] virtio_net: fix types for in memory structures
From: Michael S. Tsirkin @ 2014-10-22 18:45 UTC (permalink / raw)
To: linux-kernel; +Cc: Rusty Russell, virtualization, linux-api
In-Reply-To: <1414003404-505-1-git-send-email-mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_net.h | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 172a7f0..b5f1677 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -28,6 +28,7 @@
#include <linux/types.h>
#include <linux/virtio_ids.h>
#include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
#include <linux/if_ether.h>
/* The feature bitmap for virtio net */
@@ -84,17 +85,17 @@ struct virtio_net_hdr {
#define VIRTIO_NET_HDR_GSO_TCPV6 4 // GSO frame, IPv6 TCP
#define VIRTIO_NET_HDR_GSO_ECN 0x80 // TCP has ECN set
__u8 gso_type;
- __u16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */
- __u16 gso_size; /* Bytes to append to hdr_len per frame */
- __u16 csum_start; /* Position to start checksumming from */
- __u16 csum_offset; /* Offset after that to place checksum */
+ __virtio16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */
+ __virtio16 gso_size; /* Bytes to append to hdr_len per frame */
+ __virtio16 csum_start; /* Position to start checksumming from */
+ __virtio16 csum_offset; /* Offset after that to place checksum */
};
/* This is the version of the header to use when the MRG_RXBUF
* feature has been negotiated. */
struct virtio_net_hdr_mrg_rxbuf {
struct virtio_net_hdr hdr;
- __u16 num_buffers; /* Number of merged rx buffers */
+ __virtio16 num_buffers; /* Number of merged rx buffers */
};
/*
@@ -149,7 +150,7 @@ typedef __u8 virtio_net_ctrl_ack;
* VIRTIO_NET_F_CTRL_MAC_ADDR feature is available.
*/
struct virtio_net_ctrl_mac {
- __u32 entries;
+ __virtio32 entries;
__u8 macs[][ETH_ALEN];
} __attribute__((packed));
@@ -193,7 +194,7 @@ struct virtio_net_ctrl_mac {
* specified.
*/
struct virtio_net_ctrl_mq {
- __u16 virtqueue_pairs;
+ __virtio16 virtqueue_pairs;
};
#define VIRTIO_NET_CTRL_MQ 4
--
MST
^ permalink raw reply related
* [PATCH RFC v3 09/16] virtio: set FEATURES_OK
From: Michael S. Tsirkin @ 2014-10-22 18:44 UTC (permalink / raw)
To: linux-kernel; +Cc: Rusty Russell, virtualization, linux-api
In-Reply-To: <1414003404-505-1-git-send-email-mst@redhat.com>
set FEATURES_OK as per virtio 1.0 spec
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_config.h | 2 ++
drivers/virtio/virtio.c | 29 ++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index f3fe33a..a6d0cde 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -38,6 +38,8 @@
#define VIRTIO_CONFIG_S_DRIVER 2
/* Driver has used its parts of the config, and is happy */
#define VIRTIO_CONFIG_S_DRIVER_OK 4
+/* Driver has finished configuring features */
+#define VIRTIO_CONFIG_S_FEATURES_OK 8
/* We've given up on this device. */
#define VIRTIO_CONFIG_S_FAILED 0x80
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index d213567..a3df817 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -160,6 +160,7 @@ static int virtio_dev_probe(struct device *_d)
struct virtio_device *dev = dev_to_virtio(_d);
struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
u64 device_features;
+ unsigned status;
/* We have a driver! */
add_status(dev, VIRTIO_CONFIG_S_DRIVER);
@@ -183,18 +184,32 @@ static int virtio_dev_probe(struct device *_d)
dev->config->finalize_features(dev);
+ if (virtio_has_feature(dev, VIRTIO_F_VERSION_1)) {
+ add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
+ status = dev->config->get_status(dev);
+ if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
+ printk(KERN_ERR "virtio: device refuses features: %x\n",
+ status);
+ err = -ENODEV;
+ goto err;
+ }
+ }
+
err = drv->probe(dev);
if (err)
- add_status(dev, VIRTIO_CONFIG_S_FAILED);
- else {
- add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
- if (drv->scan)
- drv->scan(dev);
+ goto err;
- virtio_config_enable(dev);
- }
+ add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
+ if (drv->scan)
+ drv->scan(dev);
+
+ virtio_config_enable(dev);
+ return 0;
+err:
+ add_status(dev, VIRTIO_CONFIG_S_FAILED);
return err;
+
}
static int virtio_dev_remove(struct device *_d)
--
MST
^ permalink raw reply related
* [PATCH RFC v3 05/16] virtio: add virtio 1.0 feature bit
From: Michael S. Tsirkin @ 2014-10-22 18:44 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1414003404-505-1-git-send-email-mst@redhat.com>
Based on original patches by Rusty Russell, Thomas Huth
and Cornelia Huck.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_config.h | 7 +++++--
drivers/virtio/virtio_ring.c | 2 ++
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 3ce768c..f3fe33a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -41,11 +41,11 @@
/* We've given up on this device. */
#define VIRTIO_CONFIG_S_FAILED 0x80
-/* Some virtio feature bits (currently bits 28 through 31) are reserved for the
+/* Some virtio feature bits (currently bits 28 through 32) are reserved for the
* transport being used (eg. virtio_ring), the rest are per-device feature
* bits. */
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 32
+#define VIRTIO_TRANSPORT_F_END 33
/* Do we get callbacks when the ring is completely used, even if we've
* suppressed them? */
@@ -54,4 +54,7 @@
/* Can the device handle any descriptor layout? */
#define VIRTIO_F_ANY_LAYOUT 27
+/* v1.0 compliant. */
+#define VIRTIO_F_VERSION_1 32
+
#endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index b311fa7..9f5dfe3 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -780,6 +780,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_RING_F_EVENT_IDX:
break;
+ case VIRTIO_F_VERSION_1:
+ break;
default:
/* We don't understand this bit. */
vdev->features &= ~(1ULL << i);
--
MST
^ permalink raw reply related
* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
From: Andy Lutomirski @ 2014-10-22 18:44 UTC (permalink / raw)
To: David Drysdale
Cc: Eric W. Biederman, Alexander Viro, Meredydd Luff,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API
In-Reply-To: <1413978269-17274-2-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Add a new system execveat(2) syscall. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>
> bprm->file = file;
> - bprm->filename = bprm->interp = filename->name;
> + if (fd == AT_FDCWD || filename->name[0] == '/') {
> + bprm->filename = filename->name;
> + } else {
> + /*
> + * Build a pathname that reflects how we got to the file,
> + * either "/dev/fd/<fd>" (for an empty filename) or
> + * "/dev/fd/<fd>/<filename>".
> + */
> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> + if (!pathbuf) {
> + retval = -ENOMEM;
> + goto out_unmark;
> + }
> + bprm->filename = pathbuf;
> + if (filename->name[0] == '\0')
> + sprintf(pathbuf, "/dev/fd/%d", fd);
If the fd is O_CLOEXEC, then this will result in a confused child
process. Should we fail exec attempts like that for non-static
programs? (E.g. set filename to "" or something and fix up the binfmt
drivers to handle that?)
> + else
> + snprintf(pathbuf, PATH_MAX,
> + "/dev/fd/%d/%s", fd, filename->name);
Does this need to handle the case where the result exceeds PATH_MAX?
--Andy
^ permalink raw reply
* [PATCH RFC v3 01/16] virtio: memory access APIs
From: Michael S. Tsirkin @ 2014-10-22 18:44 UTC (permalink / raw)
To: linux-kernel
Cc: Bjarke Istrup Pedersen, Anup Patel, Greg Kroah-Hartman,
virtualization, Geert Uytterhoeven, Sakari Ailus, linux-api,
David S. Miller, =?UTF-8?q?Piotr=20Kr=C3=B3l?=,
Alexei Starovoitov
In-Reply-To: <1414003404-505-1-git-send-email-mst@redhat.com>
virtio 1.0 makes all memory structures LE, so
we need APIs to conditionally do a byteswap on BE
architectures.
To make it easier to check code statically,
add virtio specific types for multi-byte integers
in memory.
Add low level wrappers that do a byteswap conditionally, these will be
useful e.g. for vhost. Add high level wrappers that will (in the
future) query device endian-ness and act accordingly.
At the moment, stub them out and assume native endian-ness everywhere.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/linux/virtio_byteorder.h | 29 ++++++++++++++++++++++++
include/linux/virtio_config.h | 16 +++++++++++++
include/uapi/linux/virtio_ring.h | 49 ++++++++++++++++++++--------------------
include/uapi/linux/Kbuild | 1 +
4 files changed, 71 insertions(+), 24 deletions(-)
create mode 100644 include/linux/virtio_byteorder.h
diff --git a/include/linux/virtio_byteorder.h b/include/linux/virtio_byteorder.h
new file mode 100644
index 0000000..7afdd8a
--- /dev/null
+++ b/include/linux/virtio_byteorder.h
@@ -0,0 +1,29 @@
+#ifndef _LINUX_VIRTIO_BYTEORDER_H
+#define _LINUX_VIRTIO_BYTEORDER_H
+#include <linux/types.h>
+#include <uapi/linux/virtio_types.h>
+
+/* Memory accessors for handling virtio in modern little endian and in
+ * compatibility big endian format. */
+
+#define __DEFINE_VIRTIO_XX_TO_CPU(bits) \
+static inline u##bits __virtio##bits##_to_cpu(bool little_endian, __virtio##bits val) \
+{ \
+ if (little_endian) \
+ return le##bits##_to_cpu((__force __le##bits)val); \
+ else \
+ return (__force u##bits)val; \
+} \
+static inline __virtio##bits __cpu_to_virtio##bits(bool little_endian, u##bits val) \
+{ \
+ if (little_endian) \
+ return (__force __virtio##bits)cpu_to_le##bits(val); \
+ else \
+ return val; \
+}
+
+__DEFINE_VIRTIO_XX_TO_CPU(16)
+__DEFINE_VIRTIO_XX_TO_CPU(32)
+__DEFINE_VIRTIO_XX_TO_CPU(64)
+
+#endif /* _LINUX_VIRTIO_BYTEORDER */
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 7f4ef66..d38d3c2 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -4,6 +4,7 @@
#include <linux/err.h>
#include <linux/bug.h>
#include <linux/virtio.h>
+#include <linux/virtio_byteorder.h>
#include <uapi/linux/virtio_config.h>
/**
@@ -152,6 +153,21 @@ int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
return 0;
}
+/* Memory accessors */
+#define DEFINE_VIRTIO_XX_TO_CPU(bits) \
+static inline u##bits virtio##bits##_to_cpu(struct virtio_device *vdev, __virtio##bits val) \
+{ \
+ return __virtio##bits##_to_cpu(false, val); \
+} \
+static inline __virtio##bits cpu_to_virtio##bits(struct virtio_device *vdev, u##bits val) \
+{ \
+ return __cpu_to_virtio##bits(false, val); \
+}
+
+DEFINE_VIRTIO_XX_TO_CPU(16)
+DEFINE_VIRTIO_XX_TO_CPU(32)
+DEFINE_VIRTIO_XX_TO_CPU(64)
+
/* Config space accessors. */
#define virtio_cread(vdev, structname, member, ptr) \
do { \
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index a99f9b7..6c00632 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -32,6 +32,7 @@
*
* Copyright Rusty Russell IBM Corporation 2007. */
#include <linux/types.h>
+#include <linux/virtio_types.h>
/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1
@@ -61,32 +62,32 @@
/* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
struct vring_desc {
/* Address (guest-physical). */
- __u64 addr;
+ __virtio64 addr;
/* Length. */
- __u32 len;
+ __virtio32 len;
/* The flags as indicated above. */
- __u16 flags;
+ __virtio16 flags;
/* We chain unused descriptors via this, too */
- __u16 next;
+ __virtio16 next;
};
struct vring_avail {
- __u16 flags;
- __u16 idx;
- __u16 ring[];
+ __virtio16 flags;
+ __virtio16 idx;
+ __virtio16 ring[];
};
/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
/* Index of start of used descriptor chain. */
- __u32 id;
+ __virtio32 id;
/* Total length of the descriptor chain which was used (written to) */
- __u32 len;
+ __virtio32 len;
};
struct vring_used {
- __u16 flags;
- __u16 idx;
+ __virtio16 flags;
+ __virtio16 idx;
struct vring_used_elem ring[];
};
@@ -109,25 +110,25 @@ struct vring {
* struct vring_desc desc[num];
*
* // A ring of available descriptor heads with free-running index.
- * __u16 avail_flags;
- * __u16 avail_idx;
- * __u16 available[num];
- * __u16 used_event_idx;
+ * __virtio16 avail_flags;
+ * __virtio16 avail_idx;
+ * __virtio16 available[num];
+ * __virtio16 used_event_idx;
*
* // Padding to the next align boundary.
* char pad[];
*
* // A ring of used descriptor heads with free-running index.
- * __u16 used_flags;
- * __u16 used_idx;
+ * __virtio16 used_flags;
+ * __virtio16 used_idx;
* struct vring_used_elem used[num];
- * __u16 avail_event_idx;
+ * __virtio16 avail_event_idx;
* };
*/
/* We publish the used event index at the end of the available ring, and vice
* versa. They are at the end for backwards compatibility. */
#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
-#define vring_avail_event(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
+#define vring_avail_event(vr) (*(__virtio16 *)&(vr)->used->ring[(vr)->num])
static inline void vring_init(struct vring *vr, unsigned int num, void *p,
unsigned long align)
@@ -135,29 +136,29 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p,
vr->num = num;
vr->desc = p;
vr->avail = p + num*sizeof(struct vring_desc);
- vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__u16)
+ vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__virtio16)
+ align-1) & ~(align - 1));
}
static inline unsigned vring_size(unsigned int num, unsigned long align)
{
- return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)
+ return ((sizeof(struct vring_desc) * num + sizeof(__virtio16) * (3 + num)
+ align - 1) & ~(align - 1))
- + sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
+ + sizeof(__virtio16) * 3 + sizeof(struct vring_used_elem) * num;
}
/* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
/* Assuming a given event_idx value from the other size, if
* we have just incremented index from old to new_idx,
* should we trigger an event? */
-static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
+static inline int vring_need_event(__virtio16 event_idx, __virtio16 new_idx, __virtio16 old)
{
/* Note: Xen has similar logic for notification hold-off
* in include/xen/interface/io/ring.h with req_event and req_prod
* corresponding to event_idx + 1 and new_idx respectively.
* Note also that req_event and req_prod in Xen start at 1,
* event indexes in virtio start at 0. */
- return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
+ return (__virtio16)(new_idx - event_idx - 1) < (__virtio16)(new_idx - old);
}
#endif /* _UAPI_LINUX_VIRTIO_RING_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 6cad974..39c161a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -419,6 +419,7 @@ header-y += virtio_blk.h
header-y += virtio_config.h
header-y += virtio_console.h
header-y += virtio_ids.h
+header-y += virtio_types.h
header-y += virtio_net.h
header-y += virtio_pci.h
header-y += virtio_ring.h
--
MST
^ permalink raw reply related
* Re: [PATCH RFC v2 01/16] virtio: memory access APIs
From: Michael S. Tsirkin @ 2014-10-22 18:40 UTC (permalink / raw)
To: Christopher Covington
Cc: Bjarke Istrup Pedersen, Anup Patel, Greg Kroah-Hartman,
linux-kernel, David Drysdale, Lad, Prabhakar, Geert Uytterhoeven,
linux-api, Alexei Starovoitov, virtualization, David S. Miller,
Piotr Król, Mauro Carvalho Chehab
In-Reply-To: <5447ED53.2030502@codeaurora.org>
On Wed, Oct 22, 2014 at 01:45:55PM -0400, Christopher Covington wrote:
> Hi Michael,
>
> On 10/22/2014 11:50 AM, Michael S. Tsirkin wrote:
> > virtio 1.0 makes all memory structures LE, so
> > we need APIs to conditionally do a byteswap on BE
> > architectures.
> >
> > To make it easier to check code statically,
> > add virtio specific types for multi-byte integers
> > in memory.
> >
> > Add low level wrappers that do a byteswap conditionally, these will be
> > useful e.g. for vhost. Add high level wrappers that will (in the
> > future) query device endian-ness and act accordingly.
> >
> > At the moment, stub them out and assume native endian-ness everywhere.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > include/linux/virtio_config.h | 16 +++++++++++++
> > include/uapi/linux/virtio_ring.h | 49 ++++++++++++++++++++--------------------
> > include/uapi/linux/Kbuild | 1 +
> > 3 files changed, 42 insertions(+), 24 deletions(-)
> >
> > diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> > index 7f4ef66..d38d3c2 100644
> > --- a/include/linux/virtio_config.h
> > +++ b/include/linux/virtio_config.h
> > @@ -4,6 +4,7 @@
> > #include <linux/err.h>
> > #include <linux/bug.h>
> > #include <linux/virtio.h>
> > +#include <linux/virtio_byteorder.h>
>
> What patch creates this file?
Oops I forgot to git add it.
Will resend.
> > #include <uapi/linux/virtio_config.h>
> >
> > /**
> > @@ -152,6 +153,21 @@ int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
> > return 0;
> > }
> >
> > +/* Memory accessors */
> > +#define DEFINE_VIRTIO_XX_TO_CPU(bits) \
> > +static inline u##bits virtio##bits##_to_cpu(struct virtio_device *vdev, __virtio##bits val) \
> > +{ \
> > + return __virtio##bits##_to_cpu(false, val); \
> > +} \
> > +static inline __virtio##bits cpu_to_virtio##bits(struct virtio_device *vdev, u##bits val) \
> > +{ \
> > + return __cpu_to_virtio##bits(false, val); \
> > +}
> > +
> > +DEFINE_VIRTIO_XX_TO_CPU(16)
> > +DEFINE_VIRTIO_XX_TO_CPU(32)
> > +DEFINE_VIRTIO_XX_TO_CPU(64)
> > +
> > /* Config space accessors. */
> > #define virtio_cread(vdev, structname, member, ptr) \
> > do { \
> > diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> > index a99f9b7..6c00632 100644
> > --- a/include/uapi/linux/virtio_ring.h
> > +++ b/include/uapi/linux/virtio_ring.h
> > @@ -32,6 +32,7 @@
> > *
> > * Copyright Rusty Russell IBM Corporation 2007. */
> > #include <linux/types.h>
> > +#include <linux/virtio_types.h>
> >
> > /* This marks a buffer as continuing via the next field. */
> > #define VRING_DESC_F_NEXT 1
> > @@ -61,32 +62,32 @@
> > /* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
> > struct vring_desc {
> > /* Address (guest-physical). */
> > - __u64 addr;
> > + __virtio64 addr;
> > /* Length. */
> > - __u32 len;
> > + __virtio32 len;
> > /* The flags as indicated above. */
> > - __u16 flags;
> > + __virtio16 flags;
> > /* We chain unused descriptors via this, too */
> > - __u16 next;
> > + __virtio16 next;
> > };
>
> How does __virtio64 differ from __le64?
>
> Thanks,
> Chris
__le64 would require cpu_to_le, I wanted to force the use
of our byte swapping macros.
> --
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
^ permalink raw reply
* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Aditya Kali @ 2014-10-22 18:37 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CALCETrUtqozUE=Lr5d2dBKd_vaLzfVvVv8g6ZALz1MWqVzj9dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>>
>>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>>> implementation.
>>>>>>>
>>>>>>> Could be. I'll defer to Aditya for that one.
>>>>>>>
>>>>>>
>>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>>> entering the container should also be implicitly contained within the
>>>>>> cgroup-root of that container.
>>>>>
>>>>> Why? Concretely, why should this be in the kernel namespace code
>>>>> instead of in userspace?
>>>>>
>>>>
>>>> Userspace can do it too. Though then there will be possibility of
>>>> having processes in the same mount namespace with different
>>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>>> more complex. Thats another reason why it might not be good idea to
>>>> tie cgroups with mount namespace.
>>>>
>>>>>> Implementing pivot_cgroup_root would
>>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>>
>>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>>> to remain inside their container.
>>>>>
>>>>> So don't let them out. None of the other namespaces have this kind of
>>>>> constraint:
>>>>>
>>>>> - If you're in a mntns, you can still use fds from outside.
>>>>> - If you're in a netns, you can still use sockets from outside the namespace.
>>>>> - If you're in an ipcns, you can still use ipc handles from outside.
>>>>
>>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>>> handles in the outside namespace. I think moving a process outside of
>>>> cgroupns-root is like allocating a resource outside of your namespace.
>>>
>>> In a pidns, you can see outside tasks if you have an outside procfs
>>> mounted, but, if you don't, then you can't. Wouldn't cgroupns be just
>>> like that? You wouldn't be able to escape your cgroup as long as you
>>> don't have an inappropriate cgroupfs mounted.
>>>
>>
>> I am not if we should only depend on restricted visibility for this
>> though. More details below.
>>
>>>
>>>>>
>>>>>> And with explicit permission from
>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>> semantics simple.
>>>>>
>>>>> I actually think it makes the semantics more complex. The less policy
>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>> that policy.
>>>>>
>>>>
>>>> My inclination is towards keeping things simpler - both in code as
>>>> well as in configuration. I agree that cgroupns might seem
>>>> "less-flexible", but in its current form, it encourages consistent
>>>> container configuration. If you have a process that needs to move
>>>> around between cgroups belonging to different containers, then that
>>>> process should probably not be inside any container's cgroup
>>>> namespace. Allowing that will just make the cgroup namespace
>>>> pretty-much meaningless.
>>>
>>> The problem with pinning is that preventing it causes problems
>>> (specifically, either something potentially complex and incompatible
>>> needs to be added or unprivileged processes will be able to pin
>>> themselves).
>>>
>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>> need kernel pinning support to effectively constrain its members'
>>> cgroups.
>>>
>>
>> So there are 2 scenarios to consider:
>>
>> We have 2 containers with cgroups: /container1 and /container2
>> Assume process P is running under cgroupns-root '/container1'
>>
>> (1) process P wants to 'write' to cgroup.procs outside its
>> cgroupns-root (say to /container2/cgroup.procs)
>
> This, at least, doesn't have the problem with unprivileged processes
> pinning themselves.
>
>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>> with cgroupns-root above /container1) wants to write pid of process P
>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>
>> For (1), I think its ok to reject such a write. This is consistent
>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>> set. I believe this should be independent of visibility of the cgroup
>> hierarchy for P.
>>
>> For (2), we may allow the write to succeed if we make sure that the
>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>> userns AND over P's cgroupns->user_ns).
>
> Why is its userns relevant?
>
> Why not just check whether the target cgroup is in the process doing
> the write's cgroupns? (NB: you need to check f_cred, here, not
> current_cred(), but that's orthogonal.) Then the policy becomes: no
> user of cgroupfs can move any process outside of the cgroupfs's user's
> cgroupns root.
>
Humm .. it doesn't have to be. I think its simpler to not enforce
artificial permission checks unless there is a security concern (and
in this case, there doesn't seem to be any). So I will leave the
capability check out from here.
> I think I'm okay with this.
>
>> If this write succeeds, then:
>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>> by 'self' or any other process in P's cgrgroupns. I would really like
>> to avoid showing relative paths or paths outside the cgroupns-root
>
> The empty string seems just as problematic to me.
Actually, there is no right answer here. Our options are:
* show relative path
-- this will break userspace as /proc/<pid>/cgroup does not show
relative paths today. This is also very ambiguous (is it relative to
cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).
* show absolute path
-- this will also wrong as the process won't be able to make sense of
it unless it has exposure to the global cgroup hierarchy.
-- worse case is this that the global path also exists under the
cgroupns-root ... so now the process thinks its in completely wrong
cgroup
-- this exposes system
* show only "/"
-- this is arguably better, but if the process tires to verify that
its pid is in cgroup.procs of the cgroupns-root, its in for a
surprise!
In either case, whatever we expose, the userspace won't be able to use
this path correctly (worse yet, it associates wrong cgroup for that
path). So I think its best to not print out the line for default
hierarchy at all. This happens today when cgroupfs is not mounted. I
am open to other suggestions.
>
>> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
>> only able to mount and see cgroup hierarchy under its cgroupns-root
>> (d) if process P tries to write to any cgroup file outside of its
>> cgroupns-root (assuming that hierarchy is visible to it for whatever
>> reason), it will fail as in (1)
>
> I'm still unconvinced that this serves any purpose. If you give
> DAC/MAC permission to a task to write to something, and you give it
> access to an fd or mount pointing there, and you don't want it writing
> there, then *don't do that*. I'm not really seeing why cgroupns needs
> special treatment here.
>
There was a suggestion on the previous version of this patch-set that
we need to prevent processes inside cgroupns to not be able to modify
settings of cgroups outside of its cgroupns-root. But I agree with
your point that cgroupns should not enforce unnecessary access-control
restrictions. Its job is only to virtualize the view of
/proc/<pid>/cgroup file as much as possible (100% virtualized for a
correctly setup container). This will get rid of most of patch 6/8
"cgroup: restrict cgroup operations within task's cgroupns" of this
series. The only check we keep is in cgroup_attach_task() which
ensures that target-cgroup is descendant of current's cgroupns-root
and prevents processes from escaping their cgroupns on their own.
>>
>> i.e., in summary, you can't escape out of cgroupns-root yourself. You
>> will need help from an admin process running under some parent
>> cgroupns-root to move you out. Is that workable for your usecase? Most
>> of the things above already happen with the current patch-set, so it
>> should be easy to enable this.
>>
>> Though there are still some open issues like:
>> * what happens if you move all the processes out of /container1 and
>> then 'rmdir /container1'? As it is now, you won't be able to setns()
>> to that cgroupns anymore. But the cgroupns will still hang around
>> until the processes switch their cgroupns.
>
> Seems okay.
>
>> * should we then also allow setns() without first entering the
>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>> checks that your current cgroup is descendant of target cgroupns-root.
>> Alternatively we can special-case setns() to own cgroupns so that it
>> doesn't fail.
>
> I think setns should completely ignore the caller's cgroup and should
> not change it. Userspace can do this.
>
All above changes more or less means that tasks cannot pin themselves
by unsharing cgroupns. Do you agree that we don't need that "explicit
permission from cgroupfs" anymore (via cgroup.may_unshare file or
other mechanism)?
>> * migration for these processes will be tricky, if not impossible. But
>> the admin trying to do this probably doesn't care about it or will
>> provision for it.
>
> Migration for processes in a mntns that have a current directory
> outside their mntns is also difficult or impossible. Same with
> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
> procfs. Nothing new here.
>
> --Andy
Thanks for the review!
--
Aditya
^ permalink raw reply
* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
From: Eric W. Biederman @ 2014-10-22 18:07 UTC (permalink / raw)
To: David Drysdale
Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff, linux-kernel,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api
In-Reply-To: <1413978269-17274-2-git-send-email-drysdale@google.com>
David Drysdale <drysdale@google.com> writes:
> Add a new system execveat(2) syscall. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>
> In addition, if the filename is empty and AT_EMPTY_PATH is specified,
> execveat() executes the file to which the file descriptor refers. This
> replicates the functionality of fexecve(), which is a system call in
> other UNIXen, but in Linux glibc it depends on opening
> "/proc/self/fd/<fd>" (and so relies on /proc being mounted).
>
> The filename fed to the executed program as argv[0] (or the name of the
> script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
> (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
> reflecting how the executable was found. This does however mean that
> execution of a script in a /proc-less environment won't work.
>
> Only x86-64, i386 and x32 ABIs are supported in this patch.
>
> Based on patches by Meredydd Luff <meredydd@senatehouse.org>
>
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
> arch/x86/ia32/audit.c | 1 +
> arch/x86/ia32/ia32entry.S | 1 +
> arch/x86/kernel/audit_64.c | 1 +
> arch/x86/kernel/entry_64.S | 28 ++++++++
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> arch/x86/um/sys_call_table_64.c | 1 +
> fs/exec.c | 130 ++++++++++++++++++++++++++++++++++----
> fs/namei.c | 2 +-
> include/linux/compat.h | 3 +
> include/linux/fs.h | 1 +
> include/linux/sched.h | 4 ++
> include/linux/syscalls.h | 4 ++
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/sys_ni.c | 3 +
> lib/audit.c | 3 +
> 16 files changed, 173 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
> index 5d7b381da692..2eccc8932ae6 100644
> --- a/arch/x86/ia32/audit.c
> +++ b/arch/x86/ia32/audit.c
> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
> case __NR_socketcall:
> return 4;
> case __NR_execve:
> + case __NR_execveat:
> return 5;
> default:
> return 1;
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 4299eb05023c..2516c09743e0 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -464,6 +464,7 @@ GLOBAL(\label)
> PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
> PTREGSCALL stub32_sigreturn, sys32_sigreturn
> PTREGSCALL stub32_execve, compat_sys_execve
> + PTREGSCALL stub32_execveat, compat_sys_execveat
> PTREGSCALL stub32_fork, sys_fork
> PTREGSCALL stub32_vfork, sys_vfork
>
> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
> index 06d3e5a14d9d..f3672508b249 100644
> --- a/arch/x86/kernel/audit_64.c
> +++ b/arch/x86/kernel/audit_64.c
> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
> case __NR_openat:
> return 3;
> case __NR_execve:
> + case __NR_execveat:
> return 5;
> default:
> return 0;
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 2fac1343a90b..00c4526e6ffe 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -665,6 +665,20 @@ ENTRY(stub_execve)
> CFI_ENDPROC
> END(stub_execve)
>
> +ENTRY(stub_execveat)
> + CFI_STARTPROC
> + addq $8, %rsp
> + PARTIAL_FRAME 0
> + SAVE_REST
> + FIXUP_TOP_OF_STACK %r11
> + call sys_execveat
> + RESTORE_TOP_OF_STACK %r11
> + movq %rax,RAX(%rsp)
> + RESTORE_REST
> + jmp int_ret_from_sys_call
> + CFI_ENDPROC
> +END(stub_execveat)
> +
> /*
> * sigreturn is special because it needs to restore all registers on return.
> * This cannot be done with SYSRET, so use the IRET return path instead.
> @@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
> CFI_ENDPROC
> END(stub_x32_execve)
>
> +ENTRY(stub_x32_execveat)
> + CFI_STARTPROC
> + addq $8, %rsp
> + PARTIAL_FRAME 0
> + SAVE_REST
> + FIXUP_TOP_OF_STACK %r11
> + call compat_sys_execveat
> + RESTORE_TOP_OF_STACK %r11
> + movq %rax,RAX(%rsp)
> + RESTORE_REST
> + jmp int_ret_from_sys_call
> + CFI_ENDPROC
> +END(stub_x32_execveat)
> +
> #endif
>
> /*
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 028b78168d85..2633e3195455 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -363,3 +363,4 @@
> 354 i386 seccomp sys_seccomp
> 355 i386 getrandom sys_getrandom
> 356 i386 memfd_create sys_memfd_create
> +357 i386 execveat sys_execveat stub32_execveat
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 35dd922727b9..1af5badd159c 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -327,6 +327,7 @@
> 318 common getrandom sys_getrandom
> 319 common memfd_create sys_memfd_create
> 320 common kexec_file_load sys_kexec_file_load
> +321 64 execveat stub_execveat
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -365,3 +366,4 @@
> 542 x32 getsockopt compat_sys_getsockopt
> 543 x32 io_setup compat_sys_io_setup
> 544 x32 io_submit compat_sys_io_submit
> +545 x32 execveat stub_x32_execveat
> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> index f2f0723070ca..20c3649d0691 100644
> --- a/arch/x86/um/sys_call_table_64.c
> +++ b/arch/x86/um/sys_call_table_64.c
> @@ -31,6 +31,7 @@
> #define stub_fork sys_fork
> #define stub_vfork sys_vfork
> #define stub_execve sys_execve
> +#define stub_execveat sys_execveat
> #define stub_rt_sigreturn sys_rt_sigreturn
>
> #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
> diff --git a/fs/exec.c b/fs/exec.c
> index a2b42a98c743..92a6e14f096a 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);
>
> #endif /* CONFIG_MMU */
>
> -static struct file *do_open_exec(struct filename *name)
> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
> {
> struct file *file;
> int err;
> @@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
> .intent = LOOKUP_OPEN,
> .lookup_flags = LOOKUP_FOLLOW,
> };
> + static const struct open_flags open_exec_nofollow_flags = {
> + .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
> + .acc_mode = MAY_EXEC | MAY_OPEN,
> + .intent = LOOKUP_OPEN,
> + .lookup_flags = 0,
> + };
>
> - file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
> - if (IS_ERR(file))
> - goto out;
> + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
> + return ERR_PTR(-EINVAL);
> +
> + if (name->name[0] != '\0') {
Is it really necessary to special case AT_EMPTY_PATH here. I would
have thought the existing logic in namei.c would have been fine
assuning we passed LOOKUP_EMPTY.
> + const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
> + ? &open_exec_nofollow_flags
> + : &open_exec_flags);
> +
> + file = do_filp_open(fd, name, oflags);
> + if (IS_ERR(file))
> + goto out;
> + } else {
> + file = fget(fd);
> + if (!file)
> + return ERR_PTR(-EBADF);
> +
> + err = inode_permission(file->f_path.dentry->d_inode,
> + open_exec_flags.acc_mode);
> + if (err)
> + goto exit;
> + }
>
> err = -EACCES;
> if (!S_ISREG(file_inode(file)->i_mode))
> @@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
> if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
> goto exit;
>
> - fsnotify_open(file);
> -
> err = deny_write_access(file);
> if (err)
> goto exit;
>
> + if (name->name[0] != '\0')
> + fsnotify_open(file);
> +
> out:
> return file;
>
> @@ -786,7 +811,7 @@ exit:
> struct file *open_exec(const char *name)
> {
> struct filename tmp = { .name = name };
> - return do_open_exec(&tmp);
> + return do_open_execat(AT_FDCWD, &tmp, 0);
> }
> EXPORT_SYMBOL(open_exec);
>
> @@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
> /*
> * sys_execve() executes a new program.
> */
> -static int do_execve_common(struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp)
> +static int do_execveat_common(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags)
> {
> + char *pathbuf = NULL;
> struct linux_binprm *bprm;
> struct file *file;
> struct files_struct *displaced;
> @@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> - file = do_open_exec(filename);
> + file = do_open_execat(fd, filename, flags);
> retval = PTR_ERR(file);
> if (IS_ERR(file))
> goto out_unmark;
> @@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - bprm->filename = bprm->interp = filename->name;
> + if (fd == AT_FDCWD || filename->name[0] == '/') {
> + bprm->filename = filename->name;
> + } else {
> + /*
> + * Build a pathname that reflects how we got to the file,
> + * either "/dev/fd/<fd>" (for an empty filename) or
> + * "/dev/fd/<fd>/<filename>".
> + */
> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> + if (!pathbuf) {
> + retval = -ENOMEM;
> + goto out_unmark;
> + }
> + bprm->filename = pathbuf;
> + if (filename->name[0] == '\0')
> + sprintf(pathbuf, "/dev/fd/%d", fd);
> + else
> + snprintf(pathbuf, PATH_MAX,
> + "/dev/fd/%d/%s", fd, filename->name);
> + }
> + bprm->interp = bprm->filename;
>
> retval = bprm_mm_init(bprm);
> if (retval)
> @@ -1532,6 +1579,7 @@ out_unmark:
>
> out_free:
> free_bprm(bprm);
> + kfree(pathbuf);
>
> out_files:
> if (displaced)
> @@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
> {
> struct user_arg_ptr argv = { .ptr.native = __argv };
> struct user_arg_ptr envp = { .ptr.native = __envp };
> - return do_execve_common(filename, argv, envp);
> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +int do_execveat(int fd, struct filename *filename,
> + const char __user *const __user *__argv,
> + const char __user *const __user *__envp,
> + int flags)
> +{
> + struct user_arg_ptr argv = { .ptr.native = __argv };
> + struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> + return do_execveat_common(fd, filename, argv, envp, flags);
> }
>
> #ifdef CONFIG_COMPAT
> @@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
> .is_compat = true,
> .ptr.compat = __envp,
> };
> - return do_execve_common(filename, argv, envp);
> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +static int compat_do_execveat(int fd, struct filename *filename,
> + const compat_uptr_t __user *__argv,
> + const compat_uptr_t __user *__envp,
> + int flags)
> +{
> + struct user_arg_ptr argv = {
> + .is_compat = true,
> + .ptr.compat = __argv,
> + };
> + struct user_arg_ptr envp = {
> + .is_compat = true,
> + .ptr.compat = __envp,
> + };
> + return do_execveat_common(fd, filename, argv, envp, flags);
> }
> #endif
>
> @@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
> {
> return do_execve(getname(filename), argv, envp);
> }
> +
> +SYSCALL_DEFINE5(execveat,
> + int, fd, const char __user *, filename,
> + const char __user *const __user *, argv,
> + const char __user *const __user *, envp,
> + int, flags)
> +{
> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> + return do_execveat(fd,
> + getname_flags(filename, lookup_flags, NULL),
> + argv, envp, flags);
> +}
> +
> #ifdef CONFIG_COMPAT
> COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
> const compat_uptr_t __user *, argv,
> @@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
> {
> return compat_do_execve(getname(filename), argv, envp);
> }
> +
> +COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
> + const char __user *, filename,
> + const compat_uptr_t __user *, argv,
> + const compat_uptr_t __user *, envp,
> + int, flags)
> +{
> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> + return compat_do_execveat(fd,
> + getname_flags(filename, lookup_flags, NULL),
> + argv, envp, flags);
> +}
> #endif
> diff --git a/fs/namei.c b/fs/namei.c
> index a7b05bf82d31..553c84d3e0cc 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -130,7 +130,7 @@ void final_putname(struct filename *name)
>
> #define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename))
>
> -static struct filename *
> +struct filename *
> getname_flags(const char __user *filename, int flags, int *empty)
> {
> struct filename *result, *err;
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index e6494261eaff..7450ca2ac1fc 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>
> asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
> const compat_uptr_t __user *envp);
> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
> + const compat_uptr_t __user *argv,
> + const compat_uptr_t __user *envp, int flags);
>
> asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
> compat_ulong_t __user *outp, compat_ulong_t __user *exp,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 94187721ad41..e9818574d738 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
> extern struct file * dentry_open(const struct path *, int, const struct cred *);
> extern int filp_close(struct file *, fl_owner_t id);
>
> +extern struct filename *getname_flags(const char __user *, int, int *);
> extern struct filename *getname(const char __user *);
> extern struct filename *getname_kernel(const char *);
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b867a4dab38a..33e056da7d33 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
> extern int do_execve(struct filename *,
> const char __user * const __user *,
> const char __user * const __user *);
> +extern int do_execveat(int, struct filename *,
> + const char __user * const __user *,
> + const char __user * const __user *,
> + int);
> extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
> struct task_struct *fork_idle(int);
> extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 0f86d85a9ce4..df5422294deb 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
> asmlinkage long sys_getrandom(char __user *buf, size_t count,
> unsigned int flags);
>
> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
> + const char __user *const __user *argv,
> + const char __user *const __user *envp, int flags);
> +
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 11d11bc5c78f..feef07d29663 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
> __SYSCALL(__NR_getrandom, sys_getrandom)
> #define __NR_memfd_create 279
> __SYSCALL(__NR_memfd_create, sys_memfd_create)
> +#define __NR_execveat 280
> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 280
> +#define __NR_syscalls 281
>
> /*
> * All syscalls below here should go away really,
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 391d4ddb6f4b..efb06058ad3e 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
>
> /* operate on Secure Computing state */
> cond_syscall(sys_seccomp);
> +
> +/* execveat */
> +cond_syscall(sys_execveat);
> diff --git a/lib/audit.c b/lib/audit.c
> index 1d726a22565b..b8fb5ee81e26 100644
> --- a/lib/audit.c
> +++ b/lib/audit.c
> @@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
> case __NR_socketcall:
> return 4;
> #endif
> +#ifdef __NR_execveat
> + case __NR_execveat:
> +#endif
> case __NR_execve:
> return 5;
> default:
^ permalink raw reply
* Re: [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)
From: Eric W. Biederman @ 2014-10-22 17:55 UTC (permalink / raw)
To: David Drysdale
Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff, linux-kernel,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api
In-Reply-To: <1413978269-17274-4-git-send-email-drysdale@google.com>
David Drysdale <drysdale@google.com> writes:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
> man2/execveat.2 | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 144 insertions(+)
> create mode 100644 man2/execveat.2
>
> diff --git a/man2/execveat.2 b/man2/execveat.2
> new file mode 100644
> index 000000000000..d19571a3eb9d
> --- /dev/null
> +++ b/man2/execveat.2
> @@ -0,0 +1,144 @@
> +.\" Copyright (c) 2014 Google, Inc.
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +execveat \- execute program relative to a directory file descriptor
> +.SH SYNOPSIS
> +.B #include <unistd.h>
> +.sp
> +.BI "int execve(int " fd ", const char *" pathname ","
^^^^^^ That needs to be execveat
> +.br
> +.BI " char *const " argv "[], char *const " envp "[],"
> +.br
> +.BI " int " flags);
> +.SH DESCRIPTION
> +The
> +.BR execveat ()
> +system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
> +The
> +.BR execveat ()
> +system call operates in exactly the same way as
> +.BR execve (2),
> +except for the differences described in this manual page.
> +
> +If the pathname given in
> +.I pathname
> +is relative, then it is interpreted relative to the directory
> +referred to by the file descriptor
> +.I fd
> +(rather than relative to the current working directory of
> +the calling process, as is done by
> +.BR execve (2)
> +for a relative pathname).
> +
> +If
> +.I pathname
> +is relative and
> +.I fd
> +is the special value
> +.BR AT_FDCWD ,
> +then
> +.I pathname
> +is interpreted relative to the current working
> +directory of the calling process (like
> +.BR execve (2)).
> +
> +If
> +.I pathname
> +is absolute, then
> +.I fd
> +is ignored.
> +
> +If
> +.I pathname
> +is an empty string and the
> +.BR AT_EMPTY_PATH
> +flag is specified, then the file descriptor
> +.I fd
> +specifies the file to be executed.
> +
> +.I flags
> +can either be 0, or include the following flags:
> +.TP
> +.BR AT_EMPTY_PATH
> +If
> +.I pathname
> +is an empty string, operate on the file referred to by
> +.IR fd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag).
> +.TP
> +.B AT_SYMLINK_NOFOLLOW
> +If the file identified by
> +.I fd
> +and a non-NULL
> +.I pathname
> +is a symbolic link, then the call fails with the error
> +.BR EINVAL .
> +.SH "RETURN VALUE"
> +On success,
> +.BR execveat ()
> +does not return. On error \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +The same errors that occur for
> +.BR execve (2)
> +can also occur for
> +.BR execveat ().
> +The following additional errors can occur for
> +.BR execveat ():
> +.TP
> +.B EBADF
> +.I fd
> +is not a valid file descriptor.
> +.TP
> +.B EINVAL
> +Invalid flag specified in
> +.IR flags .
> +.TP
> +.B ENOTDIR
> +.I pathname
> +is relative and
> +.I fd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR execveat ()
> +was added to Linux in kernel 3.???.
> +.SH NOTES
> +In addition to the reasons explained in
> +.BR openat (2),
> +the
> +.BR execveat ()
> +system call is also needed to allow
> +.BR fexecve (3)
> +to be implemented on systems that do not have the
> +.I /proc
> +filesystem mounted.
> +.SH SEE ALSO
> +.BR execve (2),
> +.BR fexecve (3)
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox