Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v8] kernel, add panic_on_warn
From: Prarit Bhargava @ 2014-11-07 11:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andi Kleen, Jonathan Corbet, jbaron-JqFfY2XvxFXQT0dZR+AlfA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabian Frederick,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	Masami Hiramatsu, Andrew Morton, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Vivek Goyal
In-Reply-To: <alpine.DEB.2.10.1411070214220.32405-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>



On 11/07/2014 05:15 AM, David Rientjes wrote:
> On Thu, 6 Nov 2014, Vivek Goyal wrote:
> 
>> On Thu, Nov 06, 2014 at 01:57:36PM -0800, David Rientjes wrote:
>>
>> [..]
>>> You see that doing
>>>
>>> 	if (panic_on_warn) {
>>> 		panic_on_warn = 0;
>>> 		panic(...);
>>> 	}
>>>
>>> is racy, I hope.  If two threads WARN() at the same time, then there's 
>>> nothing preventing a double panic() because WARN() itself is not 
>>> serialized against anything.  So both the current comment and your 
>>> suggested revision comment are bogus.
>>
>> panic() is serialized on panic_lock. So I guess it is fine to hit WARN()
>> on multiple cpus. Do you see an issue there?
>>
> 
> No issue at all, what's completely mysterious is the panic_on_warn = 0 and 
> the completely bogus comment that says "prevent further WARN()s from 
> panicking the system" when that's racy.  There's no need to clear 
> panic_on_warn at all.

There very much is.  Consider a thread that hits a WARN() and then panics.  Then
somewhere in the panic code the thread hits another WARN() ... and then panics
again.  Previously this would have caused the system to "finish" panick'ing.
Now it makes the system hang.

P.
> 

^ permalink raw reply

* Re: [PATCH v8] kernel, add panic_on_warn
From: David Rientjes @ 2014-11-07 10:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Prarit Bhargava, linux-kernel, Jonathan Corbet, Andrew Morton,
	Rusty Russell, H. Peter Anvin, Andi Kleen, Masami Hiramatsu,
	Fabian Frederick, isimatu.yasuaki, jbaron, linux-doc, kexec,
	linux-api
In-Reply-To: <20141106220731.GA13590@redhat.com>

On Thu, 6 Nov 2014, Vivek Goyal wrote:

> On Thu, Nov 06, 2014 at 01:57:36PM -0800, David Rientjes wrote:
> 
> [..]
> > You see that doing
> > 
> > 	if (panic_on_warn) {
> > 		panic_on_warn = 0;
> > 		panic(...);
> > 	}
> > 
> > is racy, I hope.  If two threads WARN() at the same time, then there's 
> > nothing preventing a double panic() because WARN() itself is not 
> > serialized against anything.  So both the current comment and your 
> > suggested revision comment are bogus.
> 
> panic() is serialized on panic_lock. So I guess it is fine to hit WARN()
> on multiple cpus. Do you see an issue there?
> 

No issue at all, what's completely mysterious is the panic_on_warn = 0 and 
the completely bogus comment that says "prevent further WARN()s from 
panicking the system" when that's racy.  There's no need to clear 
panic_on_warn at all.

^ permalink raw reply

* Re: [fuse-devel] [PATCH v5 7/7] add a flag for per-operation O_DSYNC semantics
From: Anton Altaparmakov @ 2014-11-07  6:43 UTC (permalink / raw)
  To: Anand Avati
  Cc: Jeff Moyer, linux-arch, linux-aio, linux-nfs, Volker Lendecke,
	Theodore Ts'o, linux-mm, fuse-devel@lists.sourceforge.net,
	linux-api, Linux Kernel Mailing List, Al Viro, Christoph Hellwig,
	Tejun Heo, Milosz Tanski, linux-fsdevel, Michael Kerrisk,
	ceph-devel, Christoph Hellwig, ocfs2-devel, Mel Gorman
In-Reply-To: <CAFboF2y2skt=H4crv54shfnXOmz23W-shYWtHWekK8ZUDkfP=A@mail.gmail.com>

Hi,

> On 7 Nov 2014, at 07:52, Anand Avati <avati@gluster.org> wrote:
> On Thu, Nov 6, 2014 at 8:22 PM, Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > On 7 Nov 2014, at 01:46, Jeff Moyer <jmoyer@redhat.com> wrote:
> > Minor nit, but I'd rather read something that looks like this:
> >
> >       if (type == READ && (flags & RWF_NONBLOCK))
> >               return -EAGAIN;
> >       else if (type == WRITE && (flags & RWF_DSYNC))
> >               return -EINVAL;
> 
> But your version is less logically efficient for the case where "type == READ" is true and "flags & RWF_NONBLOCK" is false because your version then has to do the "if (type == WRITE" check before discovering it does not need to take that branch either, whilst the original version does not have to do such a test at all.
> 
> Seriously?

Of course seriously.

> Just focus on the code readability/maintainability which makes the code most easily understood/obvious to a new pair of eyes, and leave such micro-optimizations to the compiler..

The original version is more readable (IMO) and this is not a micro-optimization.  It is people like you who are responsible for the fact that we need faster and faster computers to cope with the inefficient/poor code being written more and more...

And I really wouldn't hedge my bets on gcc optimizing something like that.  The amount of crap assembly produced from gcc that I have seen over the years suggests that it is quite likely it will make a hash of it instead...

Best regards,

	Anton

> Thanks

-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
University of Cambridge Information Services, Roger Needham Building
7 JJ Thomson Avenue, Cambridge, CB3 0RB, UK


^ permalink raw reply

* Re: [fuse-devel] [PATCH v5 7/7] add a flag for per-operation O_DSYNC semantics
From: Anand Avati @ 2014-11-07  5:52 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Jeff Moyer, linux-arch, linux-aio, linux-nfs, Volker Lendecke,
	Theodore Ts'o, linux-mm, fuse-devel@lists.sourceforge.net,
	linux-api, Linux Kernel Mailing List, Al Viro, Christoph Hellwig,
	Tejun Heo, Milosz Tanski, linux-fsdevel, Michael Kerrisk,
	ceph-devel, Christoph Hellwig, ocfs2-devel, Mel Gorman
In-Reply-To: <BF30FAEC-D4D3-4079-9ECD-2743747279BD@cam.ac.uk>

[-- Attachment #1: Type: text/plain, Size: 928 bytes --]

On Thu, Nov 6, 2014 at 8:22 PM, Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> > On 7 Nov 2014, at 01:46, Jeff Moyer <jmoyer@redhat.com> wrote:
> > Minor nit, but I'd rather read something that looks like this:
> >
> >       if (type == READ && (flags & RWF_NONBLOCK))
> >               return -EAGAIN;
> >       else if (type == WRITE && (flags & RWF_DSYNC))
> >               return -EINVAL;
>
> But your version is less logically efficient for the case where "type ==
> READ" is true and "flags & RWF_NONBLOCK" is false because your version then
> has to do the "if (type == WRITE" check before discovering it does not need
> to take that branch either, whilst the original version does not have to do
> such a test at all.
>

Seriously? Just focus on the code readability/maintainability which makes
the code most easily understood/obvious to a new pair of eyes, and leave
such micro-optimizations to the compiler..

Thanks

[-- Attachment #2: Type: text/html, Size: 1425 bytes --]

^ permalink raw reply

* Re: [PATCH v5 7/7] add a flag for per-operation O_DSYNC semantics
From: Anton Altaparmakov @ 2014-11-07  4:22 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Milosz Tanski, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Christoph Hellwig,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	ocfs2-devel-N0ozoZBvEnrZJqsBc5GL+g,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg
In-Reply-To: <x49r3xf28qn.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>

Hi Jeff,

> On 7 Nov 2014, at 01:46, Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> writes:
> 
>> -		if (type == READ && (flags & RWF_NONBLOCK))
>> -			return -EAGAIN;
>> +		if (type == READ) {
>> +			if (flags & RWF_NONBLOCK)
>> +				return -EAGAIN;
>> +		} else {
>> +			if (flags & RWF_DSYNC)
>> +				return -EINVAL;
>> +		}
> 
> Minor nit, but I'd rather read something that looks like this:
> 
> 	if (type == READ && (flags & RWF_NONBLOCK))
> 		return -EAGAIN;
> 	else if (type == WRITE && (flags & RWF_DSYNC))
> 		return -EINVAL;

But your version is less logically efficient for the case where "type == READ" is true and "flags & RWF_NONBLOCK" is false because your version then has to do the "if (type == WRITE" check before discovering it does not need to take that branch either, whilst the original version does not have to do such a test at all.

Best regards,

	Anton

> I won't lose sleep over it, though.
> 
> Reviewed-by: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
University of Cambridge Information Services, Roger Needham Building
7 JJ Thomson Avenue, Cambridge, CB3 0RB, UK

^ permalink raw reply

* Re: [PATCH v5 7/7] fs: add a flag for per-operation O_DSYNC semantics
From: Jeff Moyer @ 2014-11-06 23:46 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, ceph-devel, fuse-devel, linux-nfs, ocfs2-devel,
	linux-mm
In-Reply-To: <c188b04ede700ce5f986b19de12fa617d158540f.1415220890.git.milosz@adfin.com>

Milosz Tanski <milosz@adfin.com> writes:

> -		if (type == READ && (flags & RWF_NONBLOCK))
> -			return -EAGAIN;
> +		if (type == READ) {
> +			if (flags & RWF_NONBLOCK)
> +				return -EAGAIN;
> +		} else {
> +			if (flags & RWF_DSYNC)
> +				return -EINVAL;
> +		}

Minor nit, but I'd rather read something that looks like this:

	if (type == READ && (flags & RWF_NONBLOCK))
		return -EAGAIN;
	else if (type == WRITE && (flags & RWF_DSYNC))
		return -EINVAL;

I won't lose sleep over it, though.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v5 2/7] vfs: Define new syscalls preadv2,pwritev2
From: Jeff Moyer @ 2014-11-06 23:25 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, linux-mm
In-Reply-To: <dcc7d998033bbd999bbd92ef9c2041bce0255a3e.1415220890.git.milosz@adfin.com>

Milosz Tanski <milosz@adfin.com> writes:

> New syscalls that take an flag argument. This change does not add any specific
> flags.
>
> Signed-off-by: Milosz Tanski <milosz@adfin.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/read_write.c                   | 176 ++++++++++++++++++++++++++++++--------
>  include/linux/compat.h            |   6 ++
>  include/linux/syscalls.h          |   6 ++
>  include/uapi/asm-generic/unistd.h |   6 +-
>  mm/filemap.c                      |   5 +-
>  5 files changed, 158 insertions(+), 41 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 94b2d34..907735c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -866,6 +866,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
>  		return -EBADF;
>  	if (!(file->f_mode & FMODE_CAN_READ))
>  		return -EINVAL;
> +	if (flags & ~0)
> +		return -EINVAL;
>  
>  	return do_readv_writev(READ, file, vec, vlen, pos, flags);
>  }
> @@ -879,21 +881,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
>  		return -EBADF;
>  	if (!(file->f_mode & FMODE_CAN_WRITE))
>  		return -EINVAL;
> +	if (flags & ~0)
> +		return -EINVAL;
>  
>  	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
>  }

Hi, Milosz,

You've checked for invalid flags for the normal system calls, but not
for the compat variants.  Can you add that in, please?

Thanks!
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v8] kernel, add panic_on_warn
From: Prarit Bhargava @ 2014-11-06 22:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andi Kleen, Jonathan Corbet, jbaron-JqFfY2XvxFXQT0dZR+AlfA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabian Frederick,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	Masami Hiramatsu, David Rientjes, Andrew Morton,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141106220731.GA13590-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>



On 11/06/2014 05:07 PM, Vivek Goyal wrote:
> On Thu, Nov 06, 2014 at 01:57:36PM -0800, David Rientjes wrote:
> 
> [..]
>> You see that doing
>>
>> 	if (panic_on_warn) {
>> 		panic_on_warn = 0;
>> 		panic(...);
>> 	}
>>
>> is racy, I hope.  If two threads WARN() at the same time, then there's 
>> nothing preventing a double panic() because WARN() itself is not 
>> serialized against anything.  So both the current comment and your 
>> suggested revision comment are bogus.
> 
> panic() is serialized on panic_lock. So I guess it is fine to hit WARN()
> on multiple cpus. Do you see an issue there?

Oops -- didn't see this until just now.

Once again, Vivek beat me to the punch :) and much more succinctly than I will
ever be able to do. :)

P.

> 
> Thanks
> Vivek
> 

^ permalink raw reply

* Re: [PATCH v8] kernel, add panic_on_warn
From: Prarit Bhargava @ 2014-11-06 22:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andi Kleen, Jonathan Corbet, jbaron-JqFfY2XvxFXQT0dZR+AlfA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabian Frederick,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	Masami Hiramatsu, Andrew Morton, linux-api-u79uwXL29TY76Z2rM5mHXA,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <alpine.DEB.2.10.1411061352440.1526-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>



On 11/06/2014 04:57 PM, David Rientjes wrote:
> On Thu, 6 Nov 2014, Prarit Bhargava wrote:
> 
>>>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>>>> index 6c0b9f2..bc4bd5a 100644
>>>> --- a/Documentation/kdump/kdump.txt
>>>> +++ b/Documentation/kdump/kdump.txt
>>>> @@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
>>>>  
>>>>     http://people.redhat.com/~anderson/
>>>>  
>>>> +Trigger Kdump on WARN()
>>>> +=======================
>>>> +
>>>> +The kernel parameter, panic_on_warn, calls panic() in all WARN() paths.  This
>>>> +will cause a kdump to occur at the panic() call.  In cases where a user wants
>>>> +to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
>>>> +to achieve the same behaviour.
>>>>  
>>>>  Contact
>>>>  =======
>>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>>>> index 4c81a86..ea5d57c 100644
>>>> --- a/Documentation/kernel-parameters.txt
>>>> +++ b/Documentation/kernel-parameters.txt
>>>> @@ -2509,6 +2509,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>>>  			timeout < 0: reboot immediately
>>>>  			Format: <timeout>
>>>>  
>>>> +	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
>>>> +			on a WARN().
>>>> +
>>>>  	crash_kexec_post_notifiers
>>>>  			Run kdump after running panic-notifiers and dumping
>>>>  			kmsg. This only for the users who doubt kdump always
>>>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>>>> index 57baff5..b5d0c85 100644
>>>> --- a/Documentation/sysctl/kernel.txt
>>>> +++ b/Documentation/sysctl/kernel.txt
>>>> @@ -54,8 +54,9 @@ show up in /proc/sys/kernel:
>>>>  - overflowuid
>>>>  - panic
>>>>  - panic_on_oops
>>>> -- panic_on_unrecovered_nmi
>>>>  - panic_on_stackoverflow
>>>> +- panic_on_unrecovered_nmi
>>>> +- panic_on_warn
>>>>  - pid_max
>>>>  - powersave-nap               [ PPC only ]
>>>>  - printk
>>>> @@ -527,19 +528,6 @@ the recommended setting is 60.
>>>>  
>>>>  ==============================================================
>>>>  
>>>> -panic_on_unrecovered_nmi:
>>>> -
>>>> -The default Linux behaviour on an NMI of either memory or unknown is
>>>> -to continue operation. For many environments such as scientific
>>>> -computing it is preferable that the box is taken out and the error
>>>> -dealt with than an uncorrected parity/ECC error get propagated.
>>>> -
>>>> -A small number of systems do generate NMI's for bizarre random reasons
>>>> -such as power management so the default is off. That sysctl works like
>>>> -the existing panic controls already in that directory.
>>>> -
>>>> -==============================================================
>>>> -
>>>>  panic_on_oops:
>>>>  
>>>>  Controls the kernel's behaviour when an oops or BUG is encountered.
>>>> @@ -563,6 +551,30 @@ This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
>>>>  
>>>>  ==============================================================
>>>>  
>>>> +panic_on_unrecovered_nmi:
>>>> +
>>>> +The default Linux behaviour on an NMI of either memory or unknown is
>>>> +to continue operation. For many environments such as scientific
>>>> +computing it is preferable that the box is taken out and the error
>>>> +dealt with than an uncorrected parity/ECC error get propagated.
>>>> +
>>>> +A small number of systems do generate NMI's for bizarre random reasons
>>>> +such as power management so the default is off. That sysctl works like
>>>> +the existing panic controls already in that directory.
>>>> +
>>>> +==============================================================
>>>> +
>>>> +panic_on_warn:
>>>> +
>>>> +Calls panic() in the WARN() path when set to 1.  This is useful to avoid
>>>> +a kernel rebuild when attempting to kdump at the location of a WARN().
>>>> +
>>>> +0: only WARN(), default behaviour.
>>>> +
>>>> +1: call panic() after printing out WARN() location.
>>>> +
>>>> +==============================================================
>>>> +
>>>>  perf_cpu_time_max_percent:
>>>>  
>>>>  Hints to the kernel how much CPU time it should be allowed to
>>>> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
>>>> index 3d770f55..d60d31d 100644
>>>> --- a/include/linux/kernel.h
>>>> +++ b/include/linux/kernel.h
>>>> @@ -422,6 +422,7 @@ extern int panic_timeout;
>>>>  extern int panic_on_oops;
>>>>  extern int panic_on_unrecovered_nmi;
>>>>  extern int panic_on_io_nmi;
>>>> +extern int panic_on_warn;
>>>>  extern int sysctl_panic_on_stackoverflow;
>>>>  /*
>>>>   * Only to be used by arch init code. If the user over-wrote the default
>>>> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
>>>> index 43aaba1..0956373 100644
>>>> --- a/include/uapi/linux/sysctl.h
>>>> +++ b/include/uapi/linux/sysctl.h
>>>> @@ -153,6 +153,7 @@ enum
>>>>  	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
>>>>  	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
>>>>  	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
>>>> +	KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */
>>>>  };
>>>>  
>>>>  
>>>> diff --git a/kernel/panic.c b/kernel/panic.c
>>>> index d09dc5c..c6a7723 100644
>>>> --- a/kernel/panic.c
>>>> +++ b/kernel/panic.c
>>>> @@ -33,6 +33,7 @@ static int pause_on_oops;
>>>>  static int pause_on_oops_flag;
>>>>  static DEFINE_SPINLOCK(pause_on_oops_lock);
>>>>  static bool crash_kexec_post_notifiers;
>>>> +int panic_on_warn __read_mostly;
>>>>  
>>>>  int panic_timeout = CONFIG_PANIC_TIMEOUT;
>>>>  EXPORT_SYMBOL_GPL(panic_timeout);
>>>> @@ -420,13 +421,23 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
>>>>  {
>>>>  	disable_trace_on_warning();
>>>>  
>>>> -	pr_warn("------------[ cut here ]------------\n");
>>>> +	if (!panic_on_warn)
>>>> +		pr_warn("------------[ cut here ]------------\n");
>>>
>>> Is this really necessary?
>>
>> Yes IMO.  The WARN() prints out the line and it looks "weird" when we're doing a
>> panic because the finishing "end" doesn't print out.  I'm specifically
>> targetting this kernel option at end users and I think the way it looks matters.
>>
> 
> I disagree, I think it gives bug reporters guidance on what needs to be 
> reported and what doesn't need to be reported.  The trailing "end" isn't 
> needed if the system is going to panic.

True ... I'm not really here-or-there on it.  I can put it back.

> 
>>>
>>>>  	pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS()\n",
>>>>  		raw_smp_processor_id(), current->pid, file, line, caller);
>>>>  
>>>>  	if (args)
>>>>  		vprintk(args->fmt, args->args);
>>>>  
>>>> +	if (panic_on_warn) {
>>>> +		/*
>>>> +		 * A flood of WARN()s may occur.  Prevent further WARN()s
>>>> +		 * from panicking the system.
>>>> +		 */
>>>
>>> What synchronization is preventing this race and further WARN()s panicking 
>>> the system?
>>
>> Now that I re-read it, the flood comment is definitely misleading.
>> It should read "The panic path may lead to additional WARN()s.  Prevent
>> additional WARN()s from panicking the system."  I'll change that in the next
>> version.
>>
>> Your question spurred me to write a simple module that did this on a 160-core
>> system:
>>
>> static void warn_this_cpu(void *arg)
>> {
>>         WARN(1, "cpu = %d\n", smp_processor_id());
>> }
>>
>> static int init_dummy(void)
>> {
>>         on_each_cpu(warn_this_cpu, NULL, 1);
>>         return 0;
>> }
>>
>> to see if I could hit any races in this code.  While the WARN()s output overlap
>> each other I always see a single:
>>
> 
> You see that doing
> 
> 	if (panic_on_warn) {
> 		panic_on_warn = 0;
> 		panic(...);
> 	}
> 
> is racy, I hope.  

Yes and no.  Yes, I agree that panic_no_warn setting & panic() could race
leading to multiple threads hitting the function panic(),  but (see below)

>If two threads WARN() at the same time, then there's
> nothing preventing a double panic() because WARN() itself is not 
> serialized against anything.  So both the current comment and your 
> suggested revision comment are bogus.

panic(), after disabling local interrupts does spin_trylock(&panic_lock) so
your multiple thread panic cannot occur.  The stack trace on the other threads
will still be intact (this is no different from other situations where
multiple threads have panicked).  So no, I don't see the situation you describe
where multiple threads hitting that panic() can cause a problem here.

> 
>> Another issue: Are multiple WARN()s supposed to overlap output like that?  Do we
>> want them to?  AFAICT there is no way to distinguish the output from one WARN()
>> from another ...
>>
> 
> Usually one thread is encountering a path that spurs many WARN()s to 
> trigger so they don't interleave in the kernel log, your test case is 
> causing warnings on different cpus, some simultaneously.  To prevent the 
> interleaving of the stacks, it would need the same serialization that 
> doing
> 
> 	if (panic_on_warn) {
> 		panic_on_warn = 0;
> 		panic(...);
> 	}
> 

spin_lock(&warn_lock) or something around the whole thing.  In any case it is a
really contrived situation so I'm not worried about it [unless someone has seen
it in a real world scenario].

P.

> needs.
> 

^ permalink raw reply

* Re: [PATCH v8] kernel, add panic_on_warn
From: Vivek Goyal @ 2014-11-06 22:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Prarit Bhargava, Andi Kleen, Jonathan Corbet,
	jbaron-JqFfY2XvxFXQT0dZR+AlfA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabian Frederick,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	Masami Hiramatsu, Andrew Morton, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.10.1411061352440.1526-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>

On Thu, Nov 06, 2014 at 01:57:36PM -0800, David Rientjes wrote:

[..]
> You see that doing
> 
> 	if (panic_on_warn) {
> 		panic_on_warn = 0;
> 		panic(...);
> 	}
> 
> is racy, I hope.  If two threads WARN() at the same time, then there's 
> nothing preventing a double panic() because WARN() itself is not 
> serialized against anything.  So both the current comment and your 
> suggested revision comment are bogus.

panic() is serialized on panic_lock. So I guess it is fine to hit WARN()
on multiple cpus. Do you see an issue there?

Thanks
Vivek

^ permalink raw reply

* Re: [PATCH v8] kernel, add panic_on_warn
From: David Rientjes @ 2014-11-06 21:57 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jonathan Corbet,
	Andrew Morton, Rusty Russell, H. Peter Anvin, Andi Kleen,
	Masami Hiramatsu, Fabian Frederick, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A,
	jbaron-JqFfY2XvxFXQT0dZR+AlfA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <545B733C.6080903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Thu, 6 Nov 2014, Prarit Bhargava wrote:

> >> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> >> index 6c0b9f2..bc4bd5a 100644
> >> --- a/Documentation/kdump/kdump.txt
> >> +++ b/Documentation/kdump/kdump.txt
> >> @@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
> >>  
> >>     http://people.redhat.com/~anderson/
> >>  
> >> +Trigger Kdump on WARN()
> >> +=======================
> >> +
> >> +The kernel parameter, panic_on_warn, calls panic() in all WARN() paths.  This
> >> +will cause a kdump to occur at the panic() call.  In cases where a user wants
> >> +to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
> >> +to achieve the same behaviour.
> >>  
> >>  Contact
> >>  =======
> >> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> >> index 4c81a86..ea5d57c 100644
> >> --- a/Documentation/kernel-parameters.txt
> >> +++ b/Documentation/kernel-parameters.txt
> >> @@ -2509,6 +2509,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >>  			timeout < 0: reboot immediately
> >>  			Format: <timeout>
> >>  
> >> +	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
> >> +			on a WARN().
> >> +
> >>  	crash_kexec_post_notifiers
> >>  			Run kdump after running panic-notifiers and dumping
> >>  			kmsg. This only for the users who doubt kdump always
> >> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> >> index 57baff5..b5d0c85 100644
> >> --- a/Documentation/sysctl/kernel.txt
> >> +++ b/Documentation/sysctl/kernel.txt
> >> @@ -54,8 +54,9 @@ show up in /proc/sys/kernel:
> >>  - overflowuid
> >>  - panic
> >>  - panic_on_oops
> >> -- panic_on_unrecovered_nmi
> >>  - panic_on_stackoverflow
> >> +- panic_on_unrecovered_nmi
> >> +- panic_on_warn
> >>  - pid_max
> >>  - powersave-nap               [ PPC only ]
> >>  - printk
> >> @@ -527,19 +528,6 @@ the recommended setting is 60.
> >>  
> >>  ==============================================================
> >>  
> >> -panic_on_unrecovered_nmi:
> >> -
> >> -The default Linux behaviour on an NMI of either memory or unknown is
> >> -to continue operation. For many environments such as scientific
> >> -computing it is preferable that the box is taken out and the error
> >> -dealt with than an uncorrected parity/ECC error get propagated.
> >> -
> >> -A small number of systems do generate NMI's for bizarre random reasons
> >> -such as power management so the default is off. That sysctl works like
> >> -the existing panic controls already in that directory.
> >> -
> >> -==============================================================
> >> -
> >>  panic_on_oops:
> >>  
> >>  Controls the kernel's behaviour when an oops or BUG is encountered.
> >> @@ -563,6 +551,30 @@ This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
> >>  
> >>  ==============================================================
> >>  
> >> +panic_on_unrecovered_nmi:
> >> +
> >> +The default Linux behaviour on an NMI of either memory or unknown is
> >> +to continue operation. For many environments such as scientific
> >> +computing it is preferable that the box is taken out and the error
> >> +dealt with than an uncorrected parity/ECC error get propagated.
> >> +
> >> +A small number of systems do generate NMI's for bizarre random reasons
> >> +such as power management so the default is off. That sysctl works like
> >> +the existing panic controls already in that directory.
> >> +
> >> +==============================================================
> >> +
> >> +panic_on_warn:
> >> +
> >> +Calls panic() in the WARN() path when set to 1.  This is useful to avoid
> >> +a kernel rebuild when attempting to kdump at the location of a WARN().
> >> +
> >> +0: only WARN(), default behaviour.
> >> +
> >> +1: call panic() after printing out WARN() location.
> >> +
> >> +==============================================================
> >> +
> >>  perf_cpu_time_max_percent:
> >>  
> >>  Hints to the kernel how much CPU time it should be allowed to
> >> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> >> index 3d770f55..d60d31d 100644
> >> --- a/include/linux/kernel.h
> >> +++ b/include/linux/kernel.h
> >> @@ -422,6 +422,7 @@ extern int panic_timeout;
> >>  extern int panic_on_oops;
> >>  extern int panic_on_unrecovered_nmi;
> >>  extern int panic_on_io_nmi;
> >> +extern int panic_on_warn;
> >>  extern int sysctl_panic_on_stackoverflow;
> >>  /*
> >>   * Only to be used by arch init code. If the user over-wrote the default
> >> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> >> index 43aaba1..0956373 100644
> >> --- a/include/uapi/linux/sysctl.h
> >> +++ b/include/uapi/linux/sysctl.h
> >> @@ -153,6 +153,7 @@ enum
> >>  	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
> >>  	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
> >>  	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> >> +	KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */
> >>  };
> >>  
> >>  
> >> diff --git a/kernel/panic.c b/kernel/panic.c
> >> index d09dc5c..c6a7723 100644
> >> --- a/kernel/panic.c
> >> +++ b/kernel/panic.c
> >> @@ -33,6 +33,7 @@ static int pause_on_oops;
> >>  static int pause_on_oops_flag;
> >>  static DEFINE_SPINLOCK(pause_on_oops_lock);
> >>  static bool crash_kexec_post_notifiers;
> >> +int panic_on_warn __read_mostly;
> >>  
> >>  int panic_timeout = CONFIG_PANIC_TIMEOUT;
> >>  EXPORT_SYMBOL_GPL(panic_timeout);
> >> @@ -420,13 +421,23 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
> >>  {
> >>  	disable_trace_on_warning();
> >>  
> >> -	pr_warn("------------[ cut here ]------------\n");
> >> +	if (!panic_on_warn)
> >> +		pr_warn("------------[ cut here ]------------\n");
> > 
> > Is this really necessary?
> 
> Yes IMO.  The WARN() prints out the line and it looks "weird" when we're doing a
> panic because the finishing "end" doesn't print out.  I'm specifically
> targetting this kernel option at end users and I think the way it looks matters.
> 

I disagree, I think it gives bug reporters guidance on what needs to be 
reported and what doesn't need to be reported.  The trailing "end" isn't 
needed if the system is going to panic.

> > 
> >>  	pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS()\n",
> >>  		raw_smp_processor_id(), current->pid, file, line, caller);
> >>  
> >>  	if (args)
> >>  		vprintk(args->fmt, args->args);
> >>  
> >> +	if (panic_on_warn) {
> >> +		/*
> >> +		 * A flood of WARN()s may occur.  Prevent further WARN()s
> >> +		 * from panicking the system.
> >> +		 */
> > 
> > What synchronization is preventing this race and further WARN()s panicking 
> > the system?
> 
> Now that I re-read it, the flood comment is definitely misleading.
> It should read "The panic path may lead to additional WARN()s.  Prevent
> additional WARN()s from panicking the system."  I'll change that in the next
> version.
> 
> Your question spurred me to write a simple module that did this on a 160-core
> system:
> 
> static void warn_this_cpu(void *arg)
> {
>         WARN(1, "cpu = %d\n", smp_processor_id());
> }
> 
> static int init_dummy(void)
> {
>         on_each_cpu(warn_this_cpu, NULL, 1);
>         return 0;
> }
> 
> to see if I could hit any races in this code.  While the WARN()s output overlap
> each other I always see a single:
> 

You see that doing

	if (panic_on_warn) {
		panic_on_warn = 0;
		panic(...);
	}

is racy, I hope.  If two threads WARN() at the same time, then there's 
nothing preventing a double panic() because WARN() itself is not 
serialized against anything.  So both the current comment and your 
suggested revision comment are bogus.

> Another issue: Are multiple WARN()s supposed to overlap output like that?  Do we
> want them to?  AFAICT there is no way to distinguish the output from one WARN()
> from another ...
> 

Usually one thread is encountering a path that spurs many WARN()s to 
trigger so they don't interleave in the kernel log, your test case is 
causing warnings on different cpus, some simultaneously.  To prevent the 
interleaving of the stacks, it would need the same serialization that 
doing

	if (panic_on_warn) {
		panic_on_warn = 0;
		panic(...);
	}

needs.

^ permalink raw reply

* Re: [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
From: Konstantin Khlebnikov @ 2014-11-06 20:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel-qX2TKyscuCcdnm+yROfE0A, kvm-u79uwXL29TY76Z2rM5mHXA,
	Linux Kernel Mailing List,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike
In-Reply-To: <1412356087-16115-8-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Fri, Oct 3, 2014 at 9:07 PM, Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> We run out of 32bits in vm_flags, noop change for 64bit archs.

What? Again?
As I see there are some free bits: 0x200, 0x1000, 0x80000

I prefer to reserve 0x02000000 for VM_ARCH_2

>
> Signed-off-by: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/proc/task_mmu.c       | 4 ++--
>  include/linux/huge_mm.h  | 4 ++--
>  include/linux/ksm.h      | 4 ++--
>  include/linux/mm_types.h | 2 +-
>  mm/huge_memory.c         | 2 +-
>  mm/ksm.c                 | 2 +-
>  mm/madvise.c             | 2 +-
>  mm/mremap.c              | 2 +-
>  8 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c341568..ee1c3a2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>         /*
>          * Don't forget to update Documentation/ on changes.
>          */
> -       static const char mnemonics[BITS_PER_LONG][2] = {
> +       static const char mnemonics[BITS_PER_LONG+1][2] = {
>                 /*
>                  * In case if we meet a flag we don't know about.
>                  */
> -               [0 ... (BITS_PER_LONG-1)] = "??",
> +               [0 ... (BITS_PER_LONG)] = "??",
>
>                 [ilog2(VM_READ)]        = "rd",
>                 [ilog2(VM_WRITE)]       = "wr",
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 63579cb..3aa10e0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -121,7 +121,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
>  #error "hugepages can't be allocated by the buddy allocator"
>  #endif
>  extern int hugepage_madvise(struct vm_area_struct *vma,
> -                           unsigned long *vm_flags, int advice);
> +                           vm_flags_t *vm_flags, int advice);
>  extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>                                     unsigned long start,
>                                     unsigned long end,
> @@ -183,7 +183,7 @@ static inline int split_huge_page(struct page *page)
>  #define split_huge_page_pmd_mm(__mm, __address, __pmd) \
>         do { } while (0)
>  static inline int hugepage_madvise(struct vm_area_struct *vma,
> -                                  unsigned long *vm_flags, int advice)
> +                                  vm_flags_t *vm_flags, int advice)
>  {
>         BUG();
>         return 0;
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 3be6bb1..8b35253 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -18,7 +18,7 @@ struct mem_cgroup;
>
>  #ifdef CONFIG_KSM
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> -               unsigned long end, int advice, unsigned long *vm_flags);
> +               unsigned long end, int advice, vm_flags_t *vm_flags);
>  int __ksm_enter(struct mm_struct *mm);
>  void __ksm_exit(struct mm_struct *mm);
>
> @@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)
>
>  #ifdef CONFIG_MMU
>  static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> -               unsigned long end, int advice, unsigned long *vm_flags)
> +               unsigned long end, int advice, vm_flags_t *vm_flags)
>  {
>         return 0;
>  }
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e0b286..2c876d1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -217,7 +217,7 @@ struct page_frag {
>  #endif
>  };
>
> -typedef unsigned long __nocast vm_flags_t;
> +typedef unsigned long long __nocast vm_flags_t;
>
>  /*
>   * A region containing a mapping of a non-memory backed file under NOMMU
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d9a21d06..e913a19 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1942,7 +1942,7 @@ out:
>  #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
>
>  int hugepage_madvise(struct vm_area_struct *vma,
> -                    unsigned long *vm_flags, int advice)
> +                    vm_flags_t *vm_flags, int advice)
>  {
>         switch (advice) {
>         case MADV_HUGEPAGE:
> diff --git a/mm/ksm.c b/mm/ksm.c
> index fb75902..faf319e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
>  }
>
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> -               unsigned long end, int advice, unsigned long *vm_flags)
> +               unsigned long end, int advice, vm_flags_t *vm_flags)
>  {
>         struct mm_struct *mm = vma->vm_mm;
>         int err;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0938b30..d5aee71 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
>         struct mm_struct *mm = vma->vm_mm;
>         int error = 0;
>         pgoff_t pgoff;
> -       unsigned long new_flags = vma->vm_flags;
> +       vm_flags_t new_flags = vma->vm_flags;
>
>         switch (behavior) {
>         case MADV_NORMAL:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 05f1180..fa7db87 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  {
>         struct mm_struct *mm = vma->vm_mm;
>         struct vm_area_struct *new_vma;
> -       unsigned long vm_flags = vma->vm_flags;
> +       vm_flags_t vm_flags = vma->vm_flags;
>         unsigned long new_pgoff;
>         unsigned long moved_len;
>         unsigned long excess = 0;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a>

^ permalink raw reply

* Re: [PATCHv6 2/3] syscalls,x86: add selftest for execveat(2)
From: Kees Cook @ 2014-11-06 18:07 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro, Meredydd Luff,
	LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Arnd Bergmann, Rich Felker, Christoph Hellwig, x86@kernel.org,
	linux-arch, Linux API
In-Reply-To: <1415290033-15771-3-git-send-email-drysdale@google.com>

On Thu, Nov 6, 2014 at 8:07 AM, David Drysdale <drysdale@google.com> wrote:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  tools/testing/selftests/Makefile        |   1 +
>  tools/testing/selftests/exec/.gitignore |   7 +
>  tools/testing/selftests/exec/Makefile   |  25 +++
>  tools/testing/selftests/exec/execveat.c | 321 ++++++++++++++++++++++++++++++++
>  4 files changed, 354 insertions(+)
>  create mode 100644 tools/testing/selftests/exec/.gitignore
>  create mode 100644 tools/testing/selftests/exec/Makefile
>  create mode 100644 tools/testing/selftests/exec/execveat.c
>
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 36ff2e4c7b6f..210cf68b3511 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -14,6 +14,7 @@ TARGETS += powerpc
>  TARGETS += user
>  TARGETS += sysctl
>  TARGETS += firmware
> +TARGETS += exec
>
>  TARGETS_HOTPLUG = cpu-hotplug
>  TARGETS_HOTPLUG += memory-hotplug
> diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
> new file mode 100644
> index 000000000000..778147d01af9
> --- /dev/null
> +++ b/tools/testing/selftests/exec/.gitignore
> @@ -0,0 +1,7 @@
> +subdir*
> +script*
> +execveat
> +execveat.symlink
> +execveat.moved
> +execveat.ephemeral
> +execveat.denatured
> \ No newline at end of file
> diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
> new file mode 100644
> index 000000000000..c97e0aaea02d
> --- /dev/null
> +++ b/tools/testing/selftests/exec/Makefile
> @@ -0,0 +1,25 @@
> +CC = $(CROSS_COMPILE)gcc
> +CFLAGS = -Wall
> +BINARIES = execveat
> +DEPS = execveat.symlink execveat.denatured script subdir
> +all: $(BINARIES) $(DEPS)
> +
> +subdir:
> +       mkdir -p $@
> +script:
> +       echo '#!/bin/sh' > $@
> +       echo 'exit $$*' >> $@
> +       chmod +x $@
> +execveat.symlink: execveat
> +       ln -s -f $< $@
> +execveat.denatured: execveat
> +       cp $< $@
> +       chmod -x $@
> +%: %.c
> +       $(CC) $(CFLAGS) -o $@ $^
> +
> +run_tests: all
> +       ./execveat
> +
> +clean:
> +       rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved
> diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
> new file mode 100644
> index 000000000000..f6ea48176393
> --- /dev/null
> +++ b/tools/testing/selftests/exec/execveat.c
> @@ -0,0 +1,321 @@
> +/*
> + * Copyright (c) 2014 Google, Inc.
> + *
> + * Licensed under the terms of the GNU GPL License version 2
> + *
> + * Selftests for execveat(2).
> + */
> +
> +#define _GNU_SOURCE  /* to get O_PATH, AT_EMPTY_PATH */
> +#include <sys/sendfile.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <limits.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +static char *envp[] = { "IN_TEST=yes", NULL, NULL };
> +static char *argv[] = { "execveat", "99", NULL };
> +
> +static int execveat_(int fd, const char *path, char **argv, char **envp,
> +                    int flags)
> +{
> +#ifdef __NR_execveat
> +       return syscall(__NR_execveat, fd, path, argv, envp, flags);
> +#else
> +       errno = -ENOSYS;
> +       return -1;
> +#endif
> +}
> +
> +#define check_execveat_fail(fd, path, flags, errno)    \
> +       _check_execveat_fail(fd, path, flags, errno, #errno)
> +static int _check_execveat_fail(int fd, const char *path, int flags,
> +                               int expected_errno, const char *errno_str)
> +{
> +       int rc;
> +
> +       errno = 0;
> +       printf("Check failure of execveat(%d, '%s', %d) with %s... ",
> +               fd, path?:"(null)", flags, errno_str);
> +       rc = execveat_(fd, path, argv, envp, flags);
> +
> +       if (rc > 0) {
> +               printf("[FAIL] (unexpected success from execveat(2))\n");
> +               return 1;
> +       }
> +       if (errno != expected_errno) {
> +               printf("[FAIL] (expected errno %d (%s) not %d (%s)\n",
> +                       expected_errno, strerror(expected_errno),
> +                       errno, strerror(errno));
> +               return 1;
> +       }
> +       printf("[OK]\n");
> +       return 0;
> +}
> +
> +static int check_execveat_invoked_rc(int fd, const char *path, int flags,
> +                                    int expected_rc)
> +{
> +       int status;
> +       int rc;
> +       pid_t child;
> +
> +       printf("Check success of execveat(%d, '%s', %d)... ",
> +               fd, path?:"(null)", flags);
> +       child = fork();
> +       if (child < 0) {
> +               printf("[FAIL] (fork() failed)\n");
> +               return 1;
> +       }
> +       if (child == 0) {
> +               /* Child: do execveat(). */
> +               rc = execveat_(fd, path, argv, envp, flags);
> +               printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n",
> +                       rc, errno, strerror(errno));
> +               exit(1);  /* should not reach here */
> +       }
> +       /* Parent: wait for & check child's exit status. */
> +       rc = waitpid(child, &status, 0);
> +       if (rc != child) {
> +               printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc);
> +               return 1;
> +       }
> +       if (!WIFEXITED(status)) {
> +               printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n",
> +                       child, status);
> +               return 1;
> +       }
> +       if (WEXITSTATUS(status) != expected_rc) {
> +               printf("[FAIL] (child %d exited with %d not %d)\n",
> +                       child, WEXITSTATUS(status), expected_rc);
> +               return 1;
> +       }
> +       printf("[OK]\n");
> +       return 0;
> +}
> +
> +static int check_execveat(int fd, const char *path, int flags)
> +{
> +       return check_execveat_invoked_rc(fd, path, flags, 99);
> +}
> +
> +static char *concat(const char *left, const char *right)
> +{
> +       char *result = malloc(strlen(left) + strlen(right) + 1);
> +
> +       strcpy(result, left);
> +       strcat(result, right);
> +       return result;
> +}
> +
> +static int open_or_die(const char *filename, int flags)
> +{
> +       int fd = open(filename, flags);
> +
> +       if (fd < 0) {
> +               printf("Failed to open '%s'; "
> +                       "check prerequisites are available\n", filename);
> +               exit(1);
> +       }
> +       return fd;
> +}
> +
> +static int run_tests(void)
> +{
> +       int fail = 0;
> +       char *fullname = realpath("execveat", NULL);
> +       char *fullname_script = realpath("script", NULL);
> +       char *fullname_symlink = concat(fullname, ".symlink");
> +       int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY);
> +       int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
> +                                              O_DIRECTORY|O_RDONLY);
> +       int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
> +       int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
> +       int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
> +       int fd = open_or_die("execveat", O_RDONLY);
> +       int fd_path = open_or_die("execveat", O_RDONLY|O_PATH);
> +       int fd_symlink = open_or_die("execveat.symlink", O_RDONLY);
> +       int fd_denatured = open_or_die("execveat.denatured", O_RDONLY);
> +       int fd_script = open_or_die("script", O_RDONLY);
> +       int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY);
> +       int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY);
> +       int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC);
> +       int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC);
> +
> +       /* Change file position to confirm it doesn't affect anything */
> +       lseek(fd, 10, SEEK_SET);
> +
> +       /* Normal executable file: */
> +       /*   dfd + path */
> +       fail += check_execveat(subdir_dfd, "../execveat", 0);
> +       fail += check_execveat(dot_dfd, "execveat", 0);
> +       fail += check_execveat(dot_dfd_path, "execveat", 0);
> +       /*   absolute path */
> +       fail += check_execveat(AT_FDCWD, fullname, 0);
> +       /*   absolute path with nonsense dfd */
> +       fail += check_execveat(99, fullname, 0);
> +       /*   fd + no path */
> +       fail += check_execveat(fd, "", AT_EMPTY_PATH);
> +       /*   O_CLOEXEC fd + no path */
> +       fail += check_execveat(fd_cloexec, "", AT_EMPTY_PATH);
> +
> +       /* Mess with executable file that's already open: */
> +       /*   fd + no path to a file that's been renamed */
> +       rename("execveat.ephemeral", "execveat.moved");
> +       fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
> +       /*   fd + no path to a file that's been deleted */
> +       unlink("execveat.moved"); /* remove the file now fd open */
> +       fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
> +
> +       /* Invalid argument failures */
> +       fail += check_execveat_fail(fd, "", 0, ENOENT);
> +       fail += check_execveat_fail(fd, NULL, AT_EMPTY_PATH, EFAULT);
> +
> +       /* Symlink to executable file: */
> +       /*   dfd + path */
> +       fail += check_execveat(dot_dfd, "execveat.symlink", 0);
> +       fail += check_execveat(dot_dfd_path, "execveat.symlink", 0);
> +       /*   absolute path */
> +       fail += check_execveat(AT_FDCWD, fullname_symlink, 0);
> +       /*   fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */
> +       fail += check_execveat(fd_symlink, "", AT_EMPTY_PATH);
> +       fail += check_execveat(fd_symlink, "",
> +                              AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
> +
> +       /* Symlink fails when AT_SYMLINK_NOFOLLOW set: */
> +       /*   dfd + path */
> +       fail += check_execveat_fail(dot_dfd, "execveat.symlink",
> +                                   AT_SYMLINK_NOFOLLOW, ELOOP);
> +       fail += check_execveat_fail(dot_dfd_path, "execveat.symlink",
> +                                   AT_SYMLINK_NOFOLLOW, ELOOP);
> +       /*   absolute path */
> +       fail += check_execveat_fail(AT_FDCWD, fullname_symlink,
> +                                   AT_SYMLINK_NOFOLLOW, ELOOP);
> +
> +       /* Shell script wrapping executable file: */
> +       /*   dfd + path */
> +       fail += check_execveat(subdir_dfd, "../script", 0);
> +       fail += check_execveat(dot_dfd, "script", 0);
> +       fail += check_execveat(dot_dfd_path, "script", 0);
> +       /*   absolute path */
> +       fail += check_execveat(AT_FDCWD, fullname_script, 0);
> +       /*   fd + no path */
> +       fail += check_execveat(fd_script, "", AT_EMPTY_PATH);
> +       fail += check_execveat(fd_script, "",
> +                              AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
> +       /*   O_CLOEXEC fd fails for a script (as script file inaccessible) */
> +       fail += check_execveat_fail(fd_script_cloexec, "", AT_EMPTY_PATH,
> +                                   ENOENT);
> +       fail += check_execveat_fail(dot_dfd_cloexec, "script", 0, ENOENT);
> +
> +       /* Mess with script file that's already open: */
> +       /*   fd + no path to a file that's been renamed */
> +       rename("script.ephemeral", "script.moved");
> +       fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
> +       /*   fd + no path to a file that's been deleted */
> +       unlink("script.moved"); /* remove the file while fd open */
> +       fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
> +
> +       /* Rename a subdirectory in the path: */
> +       rename("subdir.ephemeral", "subdir.moved");
> +       fail += check_execveat(subdir_dfd_ephemeral, "../script", 0);
> +       fail += check_execveat(subdir_dfd_ephemeral, "script", 0);
> +       /* Remove the subdir and its contents */
> +       unlink("subdir.moved/script");
> +       unlink("subdir.moved");
> +       /* Shell loads via deleted subdir OK because name starts with .. */
> +       fail += check_execveat(subdir_dfd_ephemeral, "../script", 0);
> +       fail += check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT);
> +
> +       /* Flag values other than AT_SYMLINK_NOFOLLOW => EINVAL */
> +       fail += check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL);
> +       /* Invalid path => ENOENT */
> +       fail += check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT);
> +       fail += check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT);
> +       fail += check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT);
> +       /* Attempt to execute directory => EACCES */
> +       fail += check_execveat_fail(dot_dfd, "", AT_EMPTY_PATH, EACCES);
> +       /* Attempt to execute non-executable => EACCES */
> +       fail += check_execveat_fail(dot_dfd, "Makefile", 0, EACCES);
> +       fail += check_execveat_fail(fd_denatured, "", AT_EMPTY_PATH, EACCES);
> +       /* Attempt to execute file opened with O_PATH => EBADF */
> +       fail += check_execveat_fail(fd_path, "", AT_EMPTY_PATH, EBADF);
> +       /* Attempt to execute nonsense FD => EBADF */
> +       fail += check_execveat_fail(99, "", AT_EMPTY_PATH, EBADF);
> +       fail += check_execveat_fail(99, "execveat", 0, EBADF);
> +       /* Attempt to execute relative to non-directory => ENOTDIR */
> +       fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
> +

I'd add some tests that check PATH_MAX with the /dev/fd/n/filename
off-by-one I noticed. That could catch any regressions there.

-Kees

> +       return fail;
> +}
> +
> +static void exe_cp(const char *src, const char *dest)
> +{
> +       int in_fd = open_or_die(src, O_RDONLY);
> +       int out_fd = open(dest, O_RDWR|O_CREAT|O_TRUNC, 0755);
> +       struct stat info;
> +
> +       fstat(in_fd, &info);
> +       sendfile(out_fd, in_fd, NULL, info.st_size);
> +       close(in_fd);
> +       close(out_fd);
> +}
> +
> +static void prerequisites(void)
> +{
> +       int fd;
> +       const char *script = "#!/bin/sh\nexit $*\n";
> +
> +       /* Create ephemeral copies of files */
> +       exe_cp("execveat", "execveat.ephemeral");
> +       exe_cp("script", "script.ephemeral");
> +       mkdir("subdir.ephemeral", 0755);
> +
> +       fd = open("subdir.ephemeral/script", O_RDWR|O_CREAT|O_TRUNC, 0755);
> +       write(fd, script, strlen(script));
> +       close(fd);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       int ii;
> +       int rc;
> +       const char *verbose = getenv("VERBOSE");
> +
> +       if (argc >= 2) {
> +               /* If we are invoked with an argument, don't run tests. */
> +               const char *in_test = getenv("IN_TEST");
> +
> +               if (verbose) {
> +                       printf("  invoked with:");
> +                       for (ii = 0; ii < argc; ii++)
> +                               printf(" [%d]='%s'", ii, argv[ii]);
> +                       printf("\n");
> +               }
> +
> +               /* Check expected environment transferred. */
> +               if (!in_test || strcmp(in_test, "yes") != 0) {
> +                       printf("[FAIL] (no IN_TEST=yes in env)\n");
> +                       return 1;
> +               }
> +
> +               /* Use the final argument as an exit code. */
> +               rc = atoi(argv[argc - 1]);
> +               fflush(stdout);
> +       } else {
> +               prerequisites();
> +               if (verbose)
> +                       envp[1] = "VERBOSE=1";
> +               rc = run_tests();
> +               if (rc > 0)
> +                       printf("%d tests failed\n", rc);
> +       }
> +       return rc;
> +}
> --
> 2.1.0.rc2.206.gedb03e5
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply

* Re: [PATCHv6 1/3] syscalls,x86: implement execveat() system call
From: Kees Cook @ 2014-11-06 18:06 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro, Meredydd Luff,
	LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Arnd Bergmann, Rich Felker, Christoph Hellwig,
	x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, linux-arch,
	Linux API
In-Reply-To: <1415290033-15771-2-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

On Thu, Nov 6, 2014 at 8:07 AM, David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Add a new execveat(2) system call. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>
> In addition, if the filename is empty and AT_EMPTY_PATH is specified,
> execveat() executes the file to which the file descriptor refers. This
> replicates the functionality of fexecve(), which is a system call in
> other UNIXen, but in Linux glibc it depends on opening
> "/proc/self/fd/<fd>" (and so relies on /proc being mounted).
>
> The filename fed to the executed program as argv[0] (or the name of the
> script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
> (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
> reflecting how the executable was found.  This does however mean that
> execution of a script in a /proc-less environment won't work; also,
> script execution via an O_CLOEXEC file descriptor fails (as the file
> will not be accessible after exec).
>
> Only x86-64, i386 and x32 ABIs are supported in this patch.
>
> Based on patches by Meredydd Luff <meredydd-zPN50pYk8eUaUu29zAJCuw@public.gmane.org>

This'll be quite nice for doing launches into a tight sandbox.

>
> Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  arch/x86/ia32/audit.c             |   1 +
>  arch/x86/ia32/ia32entry.S         |   1 +
>  arch/x86/kernel/audit_64.c        |   1 +
>  arch/x86/kernel/entry_64.S        |  28 ++++++++++
>  arch/x86/syscalls/syscall_32.tbl  |   1 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  arch/x86/um/sys_call_table_64.c   |   1 +
>  fs/binfmt_em86.c                  |   4 ++
>  fs/binfmt_misc.c                  |   4 ++
>  fs/binfmt_script.c                |  10 ++++
>  fs/exec.c                         | 115 +++++++++++++++++++++++++++++++++-----
>  fs/namei.c                        |   8 ++-
>  include/linux/binfmts.h           |   4 ++
>  include/linux/compat.h            |   3 +
>  include/linux/fs.h                |   1 +
>  include/linux/namei.h             |   1 +
>  include/linux/sched.h             |   4 ++
>  include/linux/syscalls.h          |   4 ++
>  include/uapi/asm-generic/unistd.h |   4 +-
>  kernel/sys_ni.c                   |   3 +
>  lib/audit.c                       |   3 +
>  21 files changed, 188 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
> index 5d7b381da692..2eccc8932ae6 100644
> --- a/arch/x86/ia32/audit.c
> +++ b/arch/x86/ia32/audit.c
> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
>         case __NR_socketcall:
>                 return 4;
>         case __NR_execve:
> +       case __NR_execveat:
>                 return 5;
>         default:
>                 return 1;
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 4299eb05023c..2516c09743e0 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -464,6 +464,7 @@ GLOBAL(\label)
>         PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
>         PTREGSCALL stub32_sigreturn, sys32_sigreturn
>         PTREGSCALL stub32_execve, compat_sys_execve
> +       PTREGSCALL stub32_execveat, compat_sys_execveat
>         PTREGSCALL stub32_fork, sys_fork
>         PTREGSCALL stub32_vfork, sys_vfork
>
> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
> index 06d3e5a14d9d..f3672508b249 100644
> --- a/arch/x86/kernel/audit_64.c
> +++ b/arch/x86/kernel/audit_64.c
> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
>         case __NR_openat:
>                 return 3;
>         case __NR_execve:
> +       case __NR_execveat:
>                 return 5;
>         default:
>                 return 0;
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 2fac1343a90b..00c4526e6ffe 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -665,6 +665,20 @@ ENTRY(stub_execve)
>         CFI_ENDPROC
>  END(stub_execve)
>
> +ENTRY(stub_execveat)
> +       CFI_STARTPROC
> +       addq $8, %rsp
> +       PARTIAL_FRAME 0
> +       SAVE_REST
> +       FIXUP_TOP_OF_STACK %r11
> +       call sys_execveat
> +       RESTORE_TOP_OF_STACK %r11
> +       movq %rax,RAX(%rsp)
> +       RESTORE_REST
> +       jmp int_ret_from_sys_call
> +       CFI_ENDPROC
> +END(stub_execveat)
> +
>  /*
>   * sigreturn is special because it needs to restore all registers on return.
>   * This cannot be done with SYSRET, so use the IRET return path instead.
> @@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
>         CFI_ENDPROC
>  END(stub_x32_execve)
>
> +ENTRY(stub_x32_execveat)
> +       CFI_STARTPROC
> +       addq $8, %rsp
> +       PARTIAL_FRAME 0
> +       SAVE_REST
> +       FIXUP_TOP_OF_STACK %r11
> +       call compat_sys_execveat
> +       RESTORE_TOP_OF_STACK %r11
> +       movq %rax,RAX(%rsp)
> +       RESTORE_REST
> +       jmp int_ret_from_sys_call
> +       CFI_ENDPROC
> +END(stub_x32_execveat)
> +
>  #endif
>
>  /*
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 028b78168d85..2633e3195455 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -363,3 +363,4 @@
>  354    i386    seccomp                 sys_seccomp
>  355    i386    getrandom               sys_getrandom
>  356    i386    memfd_create            sys_memfd_create
> +357    i386    execveat                sys_execveat                    stub32_execveat
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 35dd922727b9..1af5badd159c 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -327,6 +327,7 @@
>  318    common  getrandom               sys_getrandom
>  319    common  memfd_create            sys_memfd_create
>  320    common  kexec_file_load         sys_kexec_file_load
> +321    64      execveat                stub_execveat
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -365,3 +366,4 @@
>  542    x32     getsockopt              compat_sys_getsockopt
>  543    x32     io_setup                compat_sys_io_setup
>  544    x32     io_submit               compat_sys_io_submit
> +545    x32     execveat                stub_x32_execveat
> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> index f2f0723070ca..20c3649d0691 100644
> --- a/arch/x86/um/sys_call_table_64.c
> +++ b/arch/x86/um/sys_call_table_64.c
> @@ -31,6 +31,7 @@
>  #define stub_fork sys_fork
>  #define stub_vfork sys_vfork
>  #define stub_execve sys_execve
> +#define stub_execveat sys_execveat
>  #define stub_rt_sigreturn sys_rt_sigreturn
>
>  #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
> diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
> index f37b08cea1f7..490538536cb4 100644
> --- a/fs/binfmt_em86.c
> +++ b/fs/binfmt_em86.c
> @@ -42,6 +42,10 @@ static int load_em86(struct linux_binprm *bprm)
>                         return -ENOEXEC;
>         }
>
> +       /* Need to be able to load the file after exec */
> +       if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
> +               return -ENOENT;
> +
>         allow_write_access(bprm->file);
>         fput(bprm->file);
>         bprm->file = NULL;
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index b60500300dd7..e659f5562356 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -127,6 +127,10 @@ static int load_misc_binary(struct linux_binprm *bprm)
>         if (!fmt)
>                 goto _ret;
>
> +       /* Need to be able to load the file after exec */
> +       if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
> +               return -ENOENT;
> +
>         if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) {
>                 retval = remove_arg_zero(bprm);
>                 if (retval)
> diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
> index 5027a3e14922..afdf4e3cafc2 100644
> --- a/fs/binfmt_script.c
> +++ b/fs/binfmt_script.c
> @@ -24,6 +24,16 @@ static int load_script(struct linux_binprm *bprm)
>
>         if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!'))
>                 return -ENOEXEC;
> +
> +       /*
> +        * If the script filename will be inaccessible after exec, typically
> +        * because it is a "/dev/fd/<fd>/.." path against an O_CLOEXEC fd, give
> +        * up now (on the assumption that the interpreter will want to load
> +        * this file).
> +        */
> +       if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
> +               return -ENOENT;
> +
>         /*
>          * This section does the #! interpretation.
>          * Sorta complicated, but hopefully it will work.  -TYT
> diff --git a/fs/exec.c b/fs/exec.c
> index a2b42a98c743..800d232c17bb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -747,18 +747,26 @@ EXPORT_SYMBOL(setup_arg_pages);
>
>  #endif /* CONFIG_MMU */
>
> -static struct file *do_open_exec(struct filename *name)
> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
>  {
>         struct file *file;
>         int err;
> -       static const struct open_flags open_exec_flags = {
> +       struct open_flags open_exec_flags = {
>                 .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
>                 .acc_mode = MAY_EXEC | MAY_OPEN,
>                 .intent = LOOKUP_OPEN,
>                 .lookup_flags = LOOKUP_FOLLOW,
>         };
>
> -       file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
> +       if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
> +               return ERR_PTR(-EINVAL);
> +       if (flags & AT_SYMLINK_NOFOLLOW)
> +               open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
> +       if (flags & AT_EMPTY_PATH)
> +               open_exec_flags.lookup_flags |= (LOOKUP_EMPTY |
> +                                                LOOKUP_EMPTY_NOPATH);
> +
> +       file = do_filp_open(fd, name, &open_exec_flags);
>         if (IS_ERR(file))
>                 goto out;
>
> @@ -769,12 +777,13 @@ static struct file *do_open_exec(struct filename *name)
>         if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
>                 goto exit;
>
> -       fsnotify_open(file);
> -
>         err = deny_write_access(file);
>         if (err)
>                 goto exit;
>
> +       if (name->name[0] != '\0')
> +               fsnotify_open(file);
> +
>  out:
>         return file;
>
> @@ -786,7 +795,7 @@ exit:
>  struct file *open_exec(const char *name)
>  {
>         struct filename tmp = { .name = name };
> -       return do_open_exec(&tmp);
> +       return do_open_execat(AT_FDCWD, &tmp, 0);
>  }
>  EXPORT_SYMBOL(open_exec);
>
> @@ -1422,10 +1431,12 @@ static int exec_binprm(struct linux_binprm *bprm)
>  /*
>   * sys_execve() executes a new program.
>   */
> -static int do_execve_common(struct filename *filename,
> -                               struct user_arg_ptr argv,
> -                               struct user_arg_ptr envp)
> +static int do_execveat_common(int fd, struct filename *filename,
> +                             struct user_arg_ptr argv,
> +                             struct user_arg_ptr envp,
> +                             int flags)
>  {
> +       char *pathbuf = NULL;
>         struct linux_binprm *bprm;
>         struct file *file;
>         struct files_struct *displaced;
> @@ -1466,7 +1477,7 @@ static int do_execve_common(struct filename *filename,
>         check_unsafe_exec(bprm);
>         current->in_execve = 1;
>
> -       file = do_open_exec(filename);
> +       file = do_open_execat(fd, filename, flags);
>         retval = PTR_ERR(file);
>         if (IS_ERR(file))
>                 goto out_unmark;
> @@ -1474,7 +1485,30 @@ static int do_execve_common(struct filename *filename,
>         sched_exec();
>
>         bprm->file = file;
> -       bprm->filename = bprm->interp = filename->name;
> +       if (fd == AT_FDCWD || filename->name[0] == '/') {
> +               bprm->filename = filename->name;
> +       } else {
> +               /* "/dev/fd/2147483647/" + filename->name */
> +               int maxlen = 19 + strlen(filename->name);

This should be 20 + strlen (to include the trailing NULL). However, I
think this whole bit of code could be replaced with kasprintf...

> +
> +               pathbuf = kmalloc(maxlen, GFP_TEMPORARY);
> +               if (!pathbuf) {
> +                       retval = -ENOMEM;
> +                       goto out_unmark;
> +               }
> +               if (filename->name[0] == '\0')
> +                       sprintf(pathbuf, "/dev/fd/%d", fd);
> +               else
> +                       snprintf(pathbuf, maxlen,
> +                                "/dev/fd/%d/%s", fd, filename->name);

Maybe something like this? A bit messy, so maybe your original if/else
would be more readable.

    } else {
        pathbuf = kasprintf("/dev/fd/%d%s%s", fd,
                                       filename->name[0] == '\0' ? "" : "/",
                                       filename->name[0] == '\0' ? ""
: filename->name);
        if (!pathbuf) {
            retval = -ENOMEM;
            goto out_unmark;
        }
    }

> +               /* Record that a name derived from an O_CLOEXEC fd will be
> +                * inaccessible after exec. Relies on having exclusive access to
> +                * current->files (due to unshare_files above). */
> +               if (close_on_exec(fd, current->files->fdt))
> +                       bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
> +               bprm->filename = pathbuf;
> +       }
> +       bprm->interp = bprm->filename;
>
>         retval = bprm_mm_init(bprm);
>         if (retval)
> @@ -1532,6 +1566,7 @@ out_unmark:
>
>  out_free:
>         free_bprm(bprm);
> +       kfree(pathbuf);
>
>  out_files:
>         if (displaced)
> @@ -1547,7 +1582,18 @@ int do_execve(struct filename *filename,
>  {
>         struct user_arg_ptr argv = { .ptr.native = __argv };
>         struct user_arg_ptr envp = { .ptr.native = __envp };
> -       return do_execve_common(filename, argv, envp);
> +       return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +int do_execveat(int fd, struct filename *filename,
> +               const char __user *const __user *__argv,
> +               const char __user *const __user *__envp,
> +               int flags)
> +{
> +       struct user_arg_ptr argv = { .ptr.native = __argv };
> +       struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> +       return do_execveat_common(fd, filename, argv, envp, flags);
>  }
>
>  #ifdef CONFIG_COMPAT
> @@ -1563,7 +1609,23 @@ static int compat_do_execve(struct filename *filename,
>                 .is_compat = true,
>                 .ptr.compat = __envp,
>         };
> -       return do_execve_common(filename, argv, envp);
> +       return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +static int compat_do_execveat(int fd, struct filename *filename,
> +                             const compat_uptr_t __user *__argv,
> +                             const compat_uptr_t __user *__envp,
> +                             int flags)
> +{
> +       struct user_arg_ptr argv = {
> +               .is_compat = true,
> +               .ptr.compat = __argv,
> +       };
> +       struct user_arg_ptr envp = {
> +               .is_compat = true,
> +               .ptr.compat = __envp,
> +       };
> +       return do_execveat_common(fd, filename, argv, envp, flags);
>  }
>  #endif
>
> @@ -1603,6 +1665,20 @@ SYSCALL_DEFINE3(execve,
>  {
>         return do_execve(getname(filename), argv, envp);
>  }
> +
> +SYSCALL_DEFINE5(execveat,
> +               int, fd, const char __user *, filename,
> +               const char __user *const __user *, argv,
> +               const char __user *const __user *, envp,
> +               int, flags)
> +{
> +       int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> +       return do_execveat(fd,
> +                          getname_flags(filename, lookup_flags, NULL),
> +                          argv, envp, flags);
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>         const compat_uptr_t __user *, argv,
> @@ -1610,4 +1686,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>  {
>         return compat_do_execve(getname(filename), argv, envp);
>  }
> +
> +COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
> +                      const char __user *, filename,
> +                      const compat_uptr_t __user *, argv,
> +                      const compat_uptr_t __user *, envp,
> +                      int,  flags)
> +{
> +       int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> +       return compat_do_execveat(fd,
> +                                 getname_flags(filename, lookup_flags, NULL),
> +                                 argv, envp, flags);
> +}
>  #endif
> diff --git a/fs/namei.c b/fs/namei.c
> index a7b05bf82d31..757df6777ae5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -130,7 +130,7 @@ void final_putname(struct filename *name)
>
>  #define EMBEDDED_NAME_MAX      (PATH_MAX - sizeof(struct filename))
>
> -static struct filename *
> +struct filename *
>  getname_flags(const char __user *filename, int flags, int *empty)
>  {
>         struct filename *result, *err;
> @@ -1891,6 +1891,12 @@ static int path_init(int dfd, const char *name, unsigned int flags,
>                                 fdput(f);
>                                 return -ENOTDIR;
>                         }
> +               } else if (flags & LOOKUP_EMPTY_NOPATH) {
> +                       /* When using the fd alone, disallow O_PATH files */
> +                       if (f.file->f_mode & FMODE_PATH) {
> +                               fdput(f);
> +                               return -EBADF;
> +                       }
>                 }
>
>                 nd->path = f.file->f_path;
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 61f29e5ea840..576e4639ca60 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -53,6 +53,10 @@ struct linux_binprm {
>  #define BINPRM_FLAGS_EXECFD_BIT 1
>  #define BINPRM_FLAGS_EXECFD (1 << BINPRM_FLAGS_EXECFD_BIT)
>
> +/* filename of the binary will be inaccessible after exec */
> +#define BINPRM_FLAGS_PATH_INACCESSIBLE_BIT 2
> +#define BINPRM_FLAGS_PATH_INACCESSIBLE (1 << BINPRM_FLAGS_PATH_INACCESSIBLE_BIT)
> +
>  /* Function parameter for binfmt->coredump */
>  struct coredump_params {
>         const siginfo_t *siginfo;
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index e6494261eaff..7450ca2ac1fc 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>
>  asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
>                      const compat_uptr_t __user *envp);
> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
> +                    const compat_uptr_t __user *argv,
> +                    const compat_uptr_t __user *envp, int flags);
>
>  asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
>                 compat_ulong_t __user *outp, compat_ulong_t __user *exp,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 94187721ad41..e9818574d738 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
>  extern struct file * dentry_open(const struct path *, int, const struct cred *);
>  extern int filp_close(struct file *, fl_owner_t id);
>
> +extern struct filename *getname_flags(const char __user *, int, int *);
>  extern struct filename *getname(const char __user *);
>  extern struct filename *getname_kernel(const char *);
>
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 492de72560fa..eaa25cc72213 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -55,6 +55,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
>  #define LOOKUP_JUMPED          0x1000
>  #define LOOKUP_ROOT            0x2000
>  #define LOOKUP_EMPTY           0x4000
> +#define LOOKUP_EMPTY_NOPATH    0x8000
>
>  extern int user_path_at(int, const char __user *, unsigned, struct path *);
>  extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b867a4dab38a..33e056da7d33 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
>  extern int do_execve(struct filename *,
>                      const char __user * const __user *,
>                      const char __user * const __user *);
> +extern int do_execveat(int, struct filename *,
> +                      const char __user * const __user *,
> +                      const char __user * const __user *,
> +                      int);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 0f86d85a9ce4..df5422294deb 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
>                               unsigned int flags);
>
> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
> +                       const char __user *const __user *argv,
> +                       const char __user *const __user *envp, int flags);
> +
>  #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 11d11bc5c78f..feef07d29663 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
>  __SYSCALL(__NR_getrandom, sys_getrandom)
>  #define __NR_memfd_create 279
>  __SYSCALL(__NR_memfd_create, sys_memfd_create)
> +#define __NR_execveat 280
> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>
>  #undef __NR_syscalls
> -#define __NR_syscalls 280
> +#define __NR_syscalls 281
>
>  /*
>   * All syscalls below here should go away really,
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 391d4ddb6f4b..efb06058ad3e 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
>
>  /* operate on Secure Computing state */
>  cond_syscall(sys_seccomp);
> +
> +/* execveat */
> +cond_syscall(sys_execveat);
> diff --git a/lib/audit.c b/lib/audit.c
> index 1d726a22565b..b8fb5ee81e26 100644
> --- a/lib/audit.c
> +++ b/lib/audit.c
> @@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
>         case __NR_socketcall:
>                 return 4;
>  #endif
> +#ifdef __NR_execveat
> +       case __NR_execveat:
> +#endif
>         case __NR_execve:
>                 return 5;
>         default:
> --
> 2.1.0.rc2.206.gedb03e5
>

Thanks for working on this!

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply

* Re: [PATCH net-next 1/7] bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command
From: Alexei Starovoitov @ 2014-11-06 17:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Andy Lutomirski,
	Hannes Frederic Sowa, Eric Dumazet, Linux API,
	Network Development, LKML
In-Reply-To: <545A3ACC.3080101@redhat.com>

On Wed, Nov 5, 2014 at 6:57 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 11/05/2014 12:04 AM, Alexei Starovoitov wrote:
>>
>> On Tue, Nov 4, 2014 at 1:25 AM, Daniel Borkmann <dborkman@redhat.com>
>> wrote:
>>>
>>> On 11/04/2014 03:54 AM, Alexei Starovoitov wrote:
>>>>
>>>>
>>>> the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
>>>> either update existing map element or create a new one.
>>>> Initially the plan was to add a new command to handle the case of
>>>> 'create new element if it didn't exist', but 'flags' style looks
>>>> cleaner and overall diff is much smaller (more code reused), so add
>>>> 'flags'
>>>> attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
>>>> enum {
>>>>     BPF_MAP_UPDATE_OR_CREATE = 0, /* add new element or update existing
>>>> */
>>>>     BPF_MAP_CREATE_ONLY,          /* add new element if it didn't exist
>>>> */
>>>>     BPF_MAP_UPDATE_ONLY           /* update existing element */
>>>> };
>>>
>>>
>>>  From you commit message/code I currently don't see an explanation why
>>> it cannot be done in typical ``flags style'' as various syscalls do,
>>> i.e. BPF_MAP_UPDATE_OR_CREATE rather represented as ...
>>>
>>>    BPF_MAP_CREATE | BPF_MAP_UPDATE
>>>
>>> Do you expect more than 64 different flags to be passed from user space
>>> for BPF_MAP?
>>
>>
>> several reasons:
>> - preserve flags==0 as default behavior
>> - avoid holes and extra checks for invalid combinations, so
>>    if (flags > BPF_MAP_UPDATE_ONLY) goto err, is enough.
>> - it looks much neater when user space uses
>>    BPF_MAP_UPDATE_OR_CREATE instead of ORing bits.
>>
>> Note this choice doesn't prevent adding bit-like flags
>> in the future. Today I cannot think of any new flags
>> for the update() command, but if somebody comes up with
>> a new selector that can apply to all three combinations,
>> we can add it as 3rd bit that can be ORed.
>
>
> Hm, mixing enums together with bitfield-like flags seems
> kind of hacky ... :/ Or, do you mean to say you're adding
> a 2nd flag field, i.e. splitting the 64bits into a 32bit
> ``cmd enum'' and 32bit ``flag section''?

something like this.
or splitting 64-bit into 2 and 62. We'll see.
First two encode this 'type' of update and the rest -
whatever else.

> Hm, my concern is that we start to add many *_OR_* enum
> elements once we find that a flag might be a useful in
> combination with many other flags ... even though if we
> only can think of 3 flags /right now/.

Agree. Adding many *_OR_* would look bad, that's
why I said that future additions can be bits. Like in
paragraph above.

Also, we don't have 3 flags now. In this patch I'm
showing 3 types and you're suggesting to treat
them as 2 flags. To me that's incorrect, since 'no flags'
becomes invalid combination, which logically incorrect.
Therefore I cannot see them as 'flags'. This is a 'type'
or 'style' of update() command.

I think it actually matches how open() defines things
in similar situation:
#define O_RDONLY        00000000
#define O_WRONLY        00000001
#define O_RDWR          00000002
We used to think of them as flags, but they're not
bit flags, though the rest of open() flags are bit-like.
If we apply your argument to open() then open()
should have defined O_RD as 1 and OR_WR as 2
and force everyone to mix and match them, but
then zero would be invalid. So I still think that
what I have is a cleaner API :)

^ permalink raw reply

* Re: [PATCHv2 0/7] CGroup Namespaces
From: Aditya Kali @ 2014-11-06 17:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Eric W. Biederman, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar
In-Reply-To: <20141104131030.GA2937-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
> [..]
>>  fs/kernfs/dir.c                  | 194 ++++++++++++++++++++++++++++++++++-----
>>  fs/kernfs/mount.c                |  48 ++++++++++
>>  fs/proc/namespaces.c             |   1 +
>>  include/linux/cgroup.h           |  41 ++++++++-
>>  include/linux/cgroup_namespace.h |  36 ++++++++
>>  include/linux/kernfs.h           |   5 +
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 +
>>  include/uapi/linux/sched.h       |   3 +-
>>  kernel/Makefile                  |   2 +-
>>  kernel/cgroup.c                  | 108 +++++++++++++++++-----
>>  kernel/cgroup_namespace.c        | 148 +++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 +++-
>
> Hi Aditya,
>
> Can we provide a documentation file for cgroup namespace behavior. Say,
> Documentation/namespaces/cgroup-namespace.txt.
>
Yes, definitely. I will add it as soon as we have a consensus on the
overall series.

> Namespaces are complicated and it might be a good idea to keep one .txt
> file for each namespace.
>
> Thanks
> Vivek


Thanks,
-- 
Aditya

^ permalink raw reply

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-06 17:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andy Lutomirski, Li Zefan, Serge Hallyn, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Ingo Molnar, Linux Containers, Rohit Jnagal
In-Reply-To: <20141104135726.GB14014-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On Tue, Nov 4, 2014 at 5:57 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Aditya.
>
> On Mon, Nov 03, 2014 at 03:12:28PM -0800, Aditya Kali wrote:
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>
> I'm not sure whether supporting mounting from inside a ns is even
> necessary but, if it is, can't you just test against cgrp_dfl_root?
> There's no reason to do anything differnetly for ns mounting.
>

I am not sure I fully understand what you mean. But we don't have a
way to test against cgrp_dfl_root while parsing mount-options. They
only way we know that user is trying to mount a default hierarchy is
via the sane_behavior flag. So I need to test against this flag it if
we want to restrict processes inside cgroupns to mounting the default
hierarchy only.
Or are you suggesting that its OK for nsown_capable(CAP_SYS_ADMIN)
processes to mount any cgroup hierarchy (irrespective of their
cgroupns)? I assumed that this will be a undesirable.

> Thanks.
>
> --
> tejun

Thanks,
-- 
Aditya

^ permalink raw reply

* Re: [RFC PATCH 0/1] arm64: Fix /proc/cpuinfo
From: Catalin Marinas @ 2014-11-06 17:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, linux-arm-kernel@lists.infradead.org,
	ghackmann@google.com, ijc@hellion.org.uk, Serban Constantinescu,
	cross-distro@lists.linaro.org, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20141106165430.GH31605@arm.com>

On Thu, Nov 06, 2014 at 04:54:31PM +0000, Will Deacon wrote:
> On Thu, Nov 06, 2014 at 04:43:12PM +0000, Catalin Marinas wrote:
> > On Fri, Oct 24, 2014 at 02:56:39PM +0100, Mark Rutland wrote:
> > > [d] Print different hwcaps dependent on the personality.
> > > 
> > >     This would allow for 32-bit and 64-bit applications to function
> > >     correctly, but for some 32-bit applications the personality would
> > >     need to be set explicitly by the user.
> > 
> > Which makes this option actually in line with the uname -m behaviour. My
> > vote goes for [d] with option [b] as a close alternative.
> > 
> > > [1] arm, v3.17, Versatile Express A15x2 A7x3 coretile
> > > Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm 
> > [...]
> > > [2] arm64, v3.17, Juno platform
> > > Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 
> > 
> > As an exercise, I'm trying to see what option [b] would look like when
> > CONFIG_COMPAT is enabled:
> > 
> > Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
> > 
> > The duplicate strings would only be listed once (evtstrm, aes, pmull,
> > sha1, sha2, crc32). New AArch64 features that we may expect to be
> > optional on AArch32 could be prefixed with "a64". If they are missing
> > entirely from AArch32, (like asimd), no need for the prefix.
> > 
> > The advantage is that we don't need to check the personality but we have
> > to assume that scripts would not search for substrings (sane people
> > shouldn't do this anyway as the Features string can always be extended).
> 
> And a big disadvantage is that I can imagine AArch64 applications checking
> for "neon" instead of "asimd", which will break if they're run under kernels
> without COMPAT support enabled.

That's because they do stupid things and I they probably deserve to
break ;). But I agree, there is a risk.

> So I'm inclined to stick with Mark's patch as it is.

If we don't hear otherwise, I propose sometime next week we queue Mark's
patch for -next.

-- 
Catalin

^ permalink raw reply

* Re: [PATCHv6 RFC 0/3] syscalls,x86: Add execveat() system call
From: Andy Lutomirski @ 2014-11-06 16:55 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Alexander Viro, Meredydd Luff,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, Rich Felker, Christoph Hellwig, X86 ML,
	linux-arch, Linux API
In-Reply-To: <1415290033-15771-1-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

On Thu, Nov 6, 2014 at 8:07 AM, David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Here's another pass at this. Some things to discuss in particular:
>
> 1) The current approach for interpreted execs (i.e. mostly "#!" scripts)
>    gives them an argv[1] filename like "/dev/fd/<fd>/<path>".  This
>    means that script execution in a /proc-less system isn't going to
>    work, at least until interpreters get smart enough to spot and
>    special-case the leading "/dev/fd/<fd>", or until there's something
>    to use in place of /dev/fd -> /proc/self/fd (e.g. Al's dupfs
>    suggestion, https://lkml.org/lkml/2014/10/19/141).
>
>    So is an execveat(2) that (currently) only works for non-interpreted
>    programs still useful?

I think it is.  I would make sure to return a distinguishable error
code in the event that the failure happens because of one of the
unsupported cases.

>
> 2) I don't like having to add a new LOOKUP_EMPTY_NOPATH flag
>    just to prevent O_PATH fds from being fexecve()ed -- alternative
>    suggestions welcomed.  (More generally, I don't have a great
>    feel for what O_PATH is for; how bad would it be to just allow
>    them to be fexecve()ed?)

If you fexecve an O_PATH fd, does it at least check that you have
execute permission on the inode?  If so, it seems okay to allow it.

--Andy

>
> .........
>
> This patch set adds execveat(2) for x86, and is derived from Meredydd
> Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
>
> The primary aim of adding an execveat syscall is to allow an
> implementation of fexecve(3) that does not rely on the /proc
> filesystem, at least for executables (rather than scripts).  The
> current glibc version of fexecve(3) is implemented via /proc, which
> causes problems in sandboxed or otherwise restricted environments.
>
> Given the desire for a /proc-free fexecve() implementation, HPA
> suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
> syscall would be an appropriate generalization.
>
> Also, having a new syscall means that it can take a flags argument
> without back-compatibility concerns.  The current implementation just
> defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other
> flags could be added in future -- for example, flags for new namespaces
> (as suggested at https://lkml.org/lkml/2006/7/11/474).
>
> Related history:
>  - https://lkml.org/lkml/2006/12/27/123 is an example of someone
>    realizing that fexecve() is likely to fail in a chroot environment.
>  - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
>    documenting the /proc requirement of fexecve(3) in its manpage, to
>    "prevent other people from wasting their time".
>  - https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
>    it's not possible to fexecve() a file descriptor for a script with
>    close-on-exec set (which is possible with the implementation here).

Confused.  How does it work for a close-on-exec script?  I understand
how it works for a close-on-exec ELF binary.

--Andy

>  - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
>    problem where a process that did setuid() could not fexecve()
>    because it no longer had access to /proc/self/fd; this has since
>    been fixed.
>
>
> Changes since v5:
>  - Set new flag in bprm->interp_flags for O_CLOEXEC fds, so that binfmts
>    that invoke an interpreter fail the exec (as they will not be able
>    to access the invoked file). [Andy Lutomirski]
>  - Don't truncate long paths. [Andy Lutomirski]
>  - Commonize code to open the executed file. [Eric W. Biederman]
>  - Mark O_PATH file descriptors so they cannot be fexecve()ed.
>  - Make self-test more helpful, and add additional cases:
>      - file offset non-zero
>      - binary file without execute bit
>      - O_CLOEXEC fds
>
> Changes since v4, suggested by Eric W. Biederman:
>  - Use empty filename with AT_EMPTY_PATH flag rather than NULL
>    pathname to request fexecve-like behaviour.
>  - Build pathname as "/dev/fd/<fd>/<filename>" (or "/dev/fd/<fd>")
>    rather than using d_path().
>  - Patch against v3.17 (bfe01a5ba249)
>
> Changes since Meredydd's v3 patch:
>  - Added a selftest.
>  - Added a man page.
>  - Left open_exec() signature untouched to reduce patch impact
>    elsewhere (as suggested by Al Viro).
>  - Filled in bprm->filename with d_path() into a buffer, to avoid use
>    of potentially-ephemeral dentry->d_name.
>  - Patch against v3.14 (455c6fdbd21916).
>
>
> David Drysdale (2):
>   syscalls,x86: implement execveat() system call
>   syscalls,x86: add selftest for execveat(2)
>
>  arch/x86/ia32/audit.c                   |   1 +
>  arch/x86/ia32/ia32entry.S               |   1 +
>  arch/x86/kernel/audit_64.c              |   1 +
>  arch/x86/kernel/entry_64.S              |  28 +++
>  arch/x86/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/syscalls/syscall_64.tbl        |   2 +
>  arch/x86/um/sys_call_table_64.c         |   1 +
>  fs/binfmt_em86.c                        |   4 +
>  fs/binfmt_misc.c                        |   4 +
>  fs/binfmt_script.c                      |  10 +
>  fs/exec.c                               | 115 ++++++++++--
>  fs/namei.c                              |   8 +-
>  include/linux/binfmts.h                 |   4 +
>  include/linux/compat.h                  |   3 +
>  include/linux/fs.h                      |   1 +
>  include/linux/namei.h                   |   1 +
>  include/linux/sched.h                   |   4 +
>  include/linux/syscalls.h                |   4 +
>  include/uapi/asm-generic/unistd.h       |   4 +-
>  kernel/sys_ni.c                         |   3 +
>  lib/audit.c                             |   3 +
>  tools/testing/selftests/Makefile        |   1 +
>  tools/testing/selftests/exec/.gitignore |   7 +
>  tools/testing/selftests/exec/Makefile   |  25 +++
>  tools/testing/selftests/exec/execveat.c | 321 ++++++++++++++++++++++++++++++++
>  25 files changed, 542 insertions(+), 15 deletions(-)
>  create mode 100644 tools/testing/selftests/exec/.gitignore
>  create mode 100644 tools/testing/selftests/exec/Makefile
>  create mode 100644 tools/testing/selftests/exec/execveat.c
>
> --
> 2.1.0.rc2.206.gedb03e5



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [RFC PATCH 0/1] arm64: Fix /proc/cpuinfo
From: Will Deacon @ 2014-11-06 16:54 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Mark Rutland,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	ghackmann-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	ijc-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org,
	Serban Constantinescu,
	cross-distro-cunTk1MwBs8s++Sfvej+rw@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20141106164311.GD19702-M2fw3Uu6cmfZROr8t4l/smS4ubULX0JqMm0uRHvK7Nw@public.gmane.org>

On Thu, Nov 06, 2014 at 04:43:12PM +0000, Catalin Marinas wrote:
> On Fri, Oct 24, 2014 at 02:56:39PM +0100, Mark Rutland wrote:
> > [d] Print different hwcaps dependent on the personality.
> > 
> >     This would allow for 32-bit and 64-bit applications to function
> >     correctly, but for some 32-bit applications the personality would
> >     need to be set explicitly by the user.
> 
> Which makes this option actually in line with the uname -m behaviour. My
> vote goes for [d] with option [b] as a close alternative.
> 
> > [1] arm, v3.17, Versatile Express A15x2 A7x3 coretile
> > Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm 
> [...]
> > [2] arm64, v3.17, Juno platform
> > Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 
> 
> As an exercise, I'm trying to see what option [b] would look like when
> CONFIG_COMPAT is enabled:
> 
> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
> 
> The duplicate strings would only be listed once (evtstrm, aes, pmull,
> sha1, sha2, crc32). New AArch64 features that we may expect to be
> optional on AArch32 could be prefixed with "a64". If they are missing
> entirely from AArch32, (like asimd), no need for the prefix.
> 
> The advantage is that we don't need to check the personality but we have
> to assume that scripts would not search for substrings (sane people
> shouldn't do this anyway as the Features string can always be extended).

And a big disadvantage is that I can imagine AArch64 applications checking
for "neon" instead of "asimd", which will break if they're run under kernels
without COMPAT support enabled.

So I'm inclined to stick with Mark's patch as it is.

Will

^ permalink raw reply

* [PATCH v4 3/3] perf: Sample additional clock value
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll
In-Reply-To: <1415292718-19785-1-git-send-email-pawel.moll@arm.com>

This patch adds an option to sample value of an additional clock with
any perf event, with the the aim of allowing time correlation between
data coming from perf and other sources like hardware trace which is
timestamped with an external clock.

The idea is to generate periodic perf record containing timestamps from
two different sources, sampled as close to each other as possible. This
allows performing simple linear approximation:

                      other event
       -----O--------------+-------------O------> t_other
            :              |             :
            :              V             :
       -----O----------------------------O------> t_perf
        perf event

User can request such samples for any standard perf event, with the most
obvious examples of cpu-clock (hrtimer) at selected frequency or other
periodic events like sched:sched_switch.

In order to do this, PERF_SAMPLE_CLOCK has to be set in struct
perf_event_attr.sample_type and a type of the clock to be sampled
selected by setting perf_event_attr.clock to a value corresponding to
a POSIX clock clk_id (see "man 2 clock_gettime")

Currently three clocks are implemented: CLOCK_REALITME = 0,
CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
5 bits wide to allow for future extension to custom, non-POSIX clock
sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
ARM CoreSight (hardware trace) timestamp generator.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---

Changes since v3:

- none

 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h | 16 ++++++++++++++--
 kernel/events/core.c            | 30 ++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 680767d..b690a0d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,8 @@ struct perf_sample_data {
 	 * Transaction flags for abort events:
 	 */
 	u64				txn;
+	/* Clock value (additional timestamp for time correlation) */
+	u64				clock;
 };
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9a64eb1..53a7a72 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_CLOCK			= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -304,7 +305,16 @@ struct perf_event_attr {
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				uevents        :  1, /* allow uevents into the buffer */
-				__reserved_1   : 38;
+
+				/*
+				 * clock: one of the POSIX clock IDs:
+				 *
+				 * 0 - CLOCK_REALTIME
+				 * 1 - CLOCK_MONOTONIC
+				 * 4 - CLOCK_MONOTONIC_RAW
+				 */
+				clock          :  5, /* clock type */
+				__reserved_1   : 33;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -544,6 +554,7 @@ enum perf_event_type {
 	 * 	{ u64			id;       } && PERF_SAMPLE_ID
 	 * 	{ u64			stream_id;} && PERF_SAMPLE_STREAM_ID
 	 * 	{ u32			cpu, res; } && PERF_SAMPLE_CPU
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 *	{ u64			id;	  } && PERF_SAMPLE_IDENTIFIER
 	 * } && perf_event_attr::sample_id_all
 	 *
@@ -687,6 +698,7 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9a2d082..816baa6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1264,6 +1264,9 @@ static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		size += sizeof(data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->header_size = size;
 }
 
@@ -1291,6 +1294,9 @@ static void perf_event__id_header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_CPU)
 		size += sizeof(data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->id_header_size = size;
 }
 
@@ -4631,6 +4637,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
 		data->cpu_entry.cpu	 = raw_smp_processor_id();
 		data->cpu_entry.reserved = 0;
 	}
+
+	if (sample_type & PERF_SAMPLE_CLOCK) {
+		switch (event->attr.clock) {
+		case CLOCK_REALTIME:
+			data->clock = ktime_get_real_ns();
+			break;
+		case CLOCK_MONOTONIC:
+			data->clock = ktime_get_mono_fast_ns();
+			break;
+		case CLOCK_MONOTONIC_RAW:
+			data->clock = ktime_get_raw_ns();
+			break;
+		default:
+			data->clock = 0;
+			break;
+		}
+	}
+
 }
 
 void perf_event_header__init_id(struct perf_event_header *header,
@@ -4661,6 +4685,9 @@ static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_CPU)
 		perf_output_put(handle, data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (sample_type & PERF_SAMPLE_IDENTIFIER)
 		perf_output_put(handle, data->id);
 }
@@ -4889,6 +4916,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 2/3] perf: Userspace event
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Pawel Moll
In-Reply-To: <1415292718-19785-1-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
any process to inject custom data into perf data stream as a new
PERF_RECORD_UEVENT record, if such process is being observed or if it
is running on a CPU being observed by the perf framework.

The prctl call takes the following arguments:

        prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);

- type: a number meaning to describe content of the following data.
  Kernel does not pay attention to it and merely passes it further in
  the perf data, therefore its use must be agreed between the events
  producer (the process being observed) and the consumer (performance
  analysis tool). The perf userspace tool will contain a repository of
  "well known" types and reference implementation of their decoders.
- size: Length in bytes of the data.
- data: Pointer to the data.
- flags: Reserved for future use. Always pass zero.

Perf context that are supposed to receive events generated with the
prctl above must be opened with perf_event_attr.uevent set to 1. The
PERF_RECORD_UEVENT records consist of a standard perf event header,
32-bit type value, 32-bit data size and the data itself, followed by
padding to align the overall record size to 8 bytes and optional,
standard sample_id field.

Example use cases:

- "perf_printf" like mechanism to add logging messages to perf data;
  in the simplest case it can be just

        prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);

- synchronisation of performance data generated in user space with the
  perf stream coming from the kernel. For example, the marker can be
  inserted by a JIT engine after it generated portion of the code, but
  before the code is executed for the first time, allowing the
  post-processor to pick the correct debugging information.

Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>
---

Changes since v3:

- none

 include/linux/perf_event.h      |  4 +++
 include/uapi/linux/perf_event.h | 23 ++++++++++++-
 include/uapi/linux/prctl.h      | 10 ++++++
 kernel/events/core.c            | 71 +++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                    |  5 +++
 5 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..680767d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -721,6 +721,8 @@ extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_fork(struct task_struct *tsk);
+extern int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		       const char __user *data);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -829,6 +831,8 @@ static inline void perf_event_mmap(struct vm_area_struct *vma)		{ }
 static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		              const char __user *data)			{ return -1; };
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d84540..9a64eb1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -303,7 +303,8 @@ struct perf_event_attr {
 				exclude_callchain_user   : 1, /* exclude user callchains */
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
-				__reserved_1   : 39;
+				uevents        :  1, /* allow uevents into the buffer */
+				__reserved_1   : 38;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -712,6 +713,26 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Data in userspace event record is transparent for the kernel
+	 *
+	 * Userspace perf tool code maintains a list of known types with
+	 * reference implementations of parsers for the data field.
+	 *
+	 * Overall size of the record (including type and size fields)
+	 * is always aligned to 8 bytes by adding padding after the data.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				type;
+	 *	u32				size;
+	 *	char				data[size];
+	 *	char				__padding[-size & 7];
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_UEVENT			= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..2a6852f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,14 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Perf userspace event generation
+ *
+ * second argument: event type
+ * third argument:  data size
+ * fourth argument: pointer to data
+ * fifth argument:  flags (currently unused, pass 0)
+ */
+#define PR_TASK_PERF_UEVENT	43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5d0aa03..9a2d082 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5597,6 +5597,77 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 }
 
 /*
+ * Userspace-generated event
+ */
+
+struct perf_uevent {
+	struct perf_event_header	header;
+	u32				type;
+	u32				size;
+	u8				data[0];
+};
+
+static void perf_uevent_output(struct perf_event *event, void *data)
+{
+	struct perf_uevent *uevent = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int size = uevent->header.size;
+
+	if (!event->attr.uevents)
+		return;
+
+	perf_event_header__init_id(&uevent->header, &sample, event);
+
+	if (perf_output_begin(&handle, event, uevent->header.size) != 0)
+		goto out;
+	perf_output_put(&handle, uevent->header);
+	perf_output_put(&handle, uevent->type);
+	perf_output_put(&handle, uevent->size);
+	__output_copy(&handle, uevent->data, uevent->size);
+
+	/* Padding to align overall data size to 8 bytes */
+	perf_output_skip(&handle, -uevent->size & (sizeof(u64) - 1));
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	uevent->header.size = size;
+}
+
+int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		const char __user *data)
+{
+	struct perf_uevent *uevent;
+
+	/* Need some reasonable limit */
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	uevent = kmalloc(sizeof(*uevent) + size, GFP_KERNEL);
+	if (!uevent)
+		return -ENOMEM;
+
+	uevent->header.type = PERF_RECORD_UEVENT;
+	uevent->header.size = sizeof(*uevent) + ALIGN(size, sizeof(u64));
+
+	uevent->type = type;
+	uevent->size = size;
+	if (copy_from_user(uevent->data, data, size)) {
+		kfree(uevent);
+		return -EFAULT;
+	}
+
+	perf_event_aux(perf_uevent_output, uevent, NULL);
+
+	kfree(uevent);
+
+	return 0;
+}
+
+
+/*
  * Generic event overflow handling, sampling.
  */
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..1c83677 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2121,6 +2121,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_TASK_PERF_EVENTS_ENABLE:
 		error = perf_event_task_enable();
 		break;
+	case PR_TASK_PERF_UEVENT:
+		if (arg5 != 0)
+			return -EINVAL;
+		error = perf_uevent(me, arg2, arg3, (char __user *)arg4);
+		break;
 	case PR_GET_TIMERSLACK:
 		error = current->timer_slack_ns;
 		break;
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll
In-Reply-To: <1415292718-19785-1-git-send-email-pawel.moll@arm.com>

Until now, perf framework never defined the meaning of the timestamps
captured as PERF_SAMPLE_TIME sample type. The values were obtaining
from local (sched) clock, which is unavailable in userspace. This made
it impossible to correlate perf data with any other events. Other
tracing solutions have the source configurable (ftrace) or just share
a common time domain between kernel and userspace (LTTng).

Follow the trend by using monotonic clock, which is readily available
as POSIX CLOCK_MONOTONIC.

Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
by the user to obtain the clk_id to be used with POSIX clock API (eg.
clock_gettime()) to obtain a time value comparable with perf samples.

Old behaviour can be restored by using "perf_use_local_clock" kernel
parameter.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
Ingo, I remember your comments about this approach in the past, but
during discussions at LPC Thomas was convinced that it's the right
thing to do - see cover letter for the series...

Changes since v3:

- Added "perf_use_lock_clock" parameter...

- ... and creating the sysctl value only when it's not defined (turned
  out that negative clk_ids are not invalid - they are used to describe
  dynamic clocks)

- Had to keep sysctl_perf_sample_time_clk_id non-const, because struct
  ctl_table.data is non-const

 Documentation/kernel-parameters.txt |  9 +++++++++
 kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4c81a86..8ead8d8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -91,6 +91,7 @@ parameter is applicable:
 	NUMA	NUMA support is enabled.
 	NFS	Appropriate NFS support is enabled.
 	OSS	OSS sound support is enabled.
+	PERF	Performance events and counters support is enabled.
 	PV_OPS	A paravirtualized kernel is enabled.
 	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
 	PARISC	The PA-RISC architecture is enabled.
@@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			allocator.  This parameter is primarily	for debugging
 			and performance comparison.
 
+	perf_use_local_clock
+			[PERF]
+			Use local_clock() as a source for perf timestamps
+			generation. This was be the default behaviour and
+			this parameter can be used to maintain backward
+			compatibility or on older hardware with expensive
+			monotonic clock source.
+
 	pf.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9f..5d0aa03 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,7 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
+#include <linux/sysctl.h>
 
 #include "internal.h"
 
@@ -322,8 +323,41 @@ extern __weak const char *perf_pmu_name(void)
 	return "pmu";
 }
 
+static bool perf_use_local_clock;
+static int __init perf_use_local_clock_setup(char *__unused)
+{
+	perf_use_local_clock = true;
+	return 1;
+}
+__setup("perf_use_local_clock", perf_use_local_clock_setup);
+
+static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
+static struct ctl_table perf_sample_time_kern_table[] = {
+	{
+		.procname       = "perf_sample_time_clk_id",
+		.data           = &sysctl_perf_sample_time_clk_id,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table perf_sample_time_root_table[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= perf_sample_time_kern_table,
+	},
+	{}
+};
+
 static inline u64 perf_clock(void)
 {
+	if (likely(!perf_use_local_clock))
+		return ktime_get_mono_fast_ns();
+
 	return local_clock();
 }
 
@@ -8230,6 +8264,9 @@ void __init perf_event_init(void)
 	 */
 	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
 		     != 1024);
+
+	if (!perf_use_local_clock)
+		register_sysctl_table(perf_sample_time_root_table);
 }
 
 static int __init perf_event_sysfs_init(void)
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 0/3] perf: User/kernel time correlation and event generation
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

This is a second version of the post-LPC series. The changes are limited
to the first patch, adding a backward-compatibility kernel parameter.

The v3 cover letter:

I've organised a session on the subject during the tracing minisummit
at LPC 2014 in Dusseldorf. Notes taken from the discussion taken by
Steven Rostedt (thanks Steve!)

http://www.linuxplumbersconf.org/2014/wp-content/uploads/2014/10/LPC2014_Tracing.txt

The following three patches address three main topics. They are pretty
much orthogonal and (subject to small and obvious modifications) could
be applied independently from each other.

An executive summary of both discussion and the patches:

1. User/kernel perf timestamps correlation

Thomas suggested that, instead of jumping through loops of correlation,
perf should simply generate monotonic clock timestamps, instead of
using sched clock. Peter and I pointed out that Ingo didn't like this
idea as monotonic can be slow, but apparently the cases when it is are
irrelevant. Thomas offered to fly to Budapest to physically convince
Ingo - I hope it won't be necessary and he will be able to achieve
this here, on mailing lists :-)

Setting aside potential performance problems, it would be a really
great solution, unifying all trace systems (perf, ftrace and LTTng)
in this respect. I'm more than happy to work on potential improvements
in the are of monotonic clock if it was to help the cause.

If it is a definite no-go, we still have the third patch, allowing post-
capture correlation based on synchronisation events.

2. User event generation

Everyone present agreed that it would be a very-nice-to-have feature.
There was some discussion about implementation details, so I welcome
feedback and comments regarding my take on the matter.

3. Correlation with external timestamps

This is a new issue, which surfaced recently while I was working on
hardware trace infrastructure. It can timestamp trace packets, but is
using yet another, completely different time source to do this.

Thomas suggested solution which gets down to my original proposal for
sched/monotonic clock correlation - an additional sample type so events
can be "double stamped" using different clock sources providing
synchronisation points for later time approximation. I've just extended
the implementation with configuration value to select the clock source.
If the first patch (making perf timestamps monotonic) gets accepted,
there will be no immediate need for this one, but I'd like to gain some
feedback anyway.

That's all for this series. Previous versions:

- RFC: http://www.spinics.net/lists/kernel/msg1824419.html
- v1: http://thread.gmane.org/gmane.linux.kernel/1790231
- v2: http://thread.gmane.org/gmane.linux.kernel/1793272
- v3: http://thread.gmane.org/gmane.linux.kernel.api/5681

Pawel Moll (3):
  perf: Use monotonic clock as a source for timestamps
  perf: Userspace event
  perf: Sample additional clock value

 Documentation/kernel-parameters.txt |   9 +++
 include/linux/perf_event.h          |   6 ++
 include/uapi/linux/perf_event.h     |  37 +++++++++-
 include/uapi/linux/prctl.h          |  10 +++
 kernel/events/core.c                | 138 ++++++++++++++++++++++++++++++++++++
 kernel/sys.c                        |   5 ++
 6 files changed, 203 insertions(+), 2 deletions(-)

-- 
2.1.0

^ permalink raw reply

* Re: [RFC PATCH 0/1] arm64: Fix /proc/cpuinfo
From: Catalin Marinas @ 2014-11-06 16:43 UTC (permalink / raw)
  To: Mark Rutland
  Cc: cross-distro-cunTk1MwBs8s++Sfvej+rw@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Serban Constantinescu, Will Deacon,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ghackmann-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
In-Reply-To: <1414159000-27059-1-git-send-email-mark.rutland-5wv7dgnIgG8@public.gmane.org>

On Fri, Oct 24, 2014 at 02:56:39PM +0100, Mark Rutland wrote:
> Currently, the arm64 /proc/cpuinfo format differs from that of arm, in a
> manner which prevents some otherwise portable applications from
> functioning as expected. Specifically, the "Features" line describes the
> 64-bit hwcaps exclusive of the 32-bit hwcaps, which causes issues for
> certain applications which attempt to parse /proc/cpuinfo to detect
> features rather than directly using the hwcaps exposed via auxval.
> 
> Additionally, the arm64 /proc/cpuinfo format only provides identifying
> information for a single CPU (unlike 32-bit), which is problematic for
> systems with heterogeneous CPUs (i.e. big.LITTLE).
> 
> This patch attempts to solve both issues.

I'm perfectly fine with the heterogeneous CPUs part.

> I believe the contentious part
> is what to do with the Features line, and for that there are a number of
> possibilities:
> 
> [a] Only print the 64-bit hwcaps
> 
>     This would match our current behaviour. However certain 32-bit
>     applications will not detect CPU features correctly, and could fail
>     to launch. The appropriate hwcaps are available in auxval, but this
>     will not be of help to existing binaries.
> 
> [b] Append the 64-bit and 32-bit hwcaps
> 
>     This would allow for a consistent format. However, some
>     human-readable hwcap names have been reused for analogous
>     instruction set features (e.g. "aes") despite 32-bit and 64-bit
>     instruction set support being largely unrelated per the
>     architecture. This could lead to applications mis-detecting
>     instruction set support on some CPUs in future, and may be
>     misleading to a casual reader.

The only overlap between 32 and 64-bit is aes, pmull, sha1, sha2, crc32.
An ARMv8 CPU implementing both AArch32 and AArch64 will likely support
these extensions in both modes. However, "likely" is not enough and we
need to get some confirmation from the architecture people. If that's
the case, point [b] is not too bad.

> [c] Print different hwcaps for compat tasks
> 
>     This would allow for 32-bit and 64-bit applications to function
>     correctly. Having the format differ depending on the instruction set
>     of the application reading /proc/cpuinfo may be misleading in some
>     cases (e.g. a human using a 32-bit cat to read /proc/cpuinfo on a
>     64-bit system).

Long time ago we decided that "uname -m" would report "aarch64"
independent of whether it is compat or not. That's the case for x86 as
well, you need PER_LINUX32 to report the actual compat UTS name.

> [d] Print different hwcaps dependent on the personality.
> 
>     This would allow for 32-bit and 64-bit applications to function
>     correctly, but for some 32-bit applications the personality would
>     need to be set explicitly by the user.

Which makes this option actually in line with the uname -m behaviour. My
vote goes for [d] with option [b] as a close alternative.

> [1] arm, v3.17, Versatile Express A15x2 A7x3 coretile
> Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm 
[...]
> [2] arm64, v3.17, Juno platform
> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 

As an exercise, I'm trying to see what option [b] would look like when
CONFIG_COMPAT is enabled:

Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae

The duplicate strings would only be listed once (evtstrm, aes, pmull,
sha1, sha2, crc32). New AArch64 features that we may expect to be
optional on AArch32 could be prefixed with "a64". If they are missing
entirely from AArch32, (like asimd), no need for the prefix.

The advantage is that we don't need to check the personality but we have
to assume that scripts would not search for substrings (sane people
shouldn't do this anyway as the Features string can always be extended).

-- 
Catalin

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox