Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Khalid Aziz @ 2014-12-19 21:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api
In-Reply-To: <alpine.DEB.2.11.1412190045590.17382@nanos>

On 12/18/2014 05:27 PM, Thomas Gleixner wrote:
> On Thu, 18 Dec 2014, Khalid Aziz wrote:
>> On 12/18/2014 04:02 PM, Thomas Gleixner wrote:
>>> If we can solve it with a proper designed and well thought out
>>> functionality in the kernel based on a futex like mechanism, why cant
>>> java and databases not switch over to that and simply use it?
>>>
>>> You need to modify user space anyway, so it does not matter whether
>>> you modify it in a sane or in a hacky way.
>>
>> Actually userspace does not need to be modified. The code to use this
>> functionality is already present in database code since this same
>> functionality exists on other OSs (the API is a little different but those
>> details can be handled with a simple header file in userspace). Userspace code
>> has already been tested and debugged thoroughly on the OSs that support this
>> functionality and that has significant impact on testing effort. So for
>> userspace it is simply a matter of turning that code on on Linux as well and
>> recompiling. This would be a multi-platform solution for database/java as
>> opposed to a Linux specific solution.
>
> Bullshit. If you turn that option on, it's a modification from the QA
> point of view and you need to run a full validation no matter
> what. Anything else is just QA by crystal ball.
>
> Of course you carefully avoided (again) to answer the real question:
>
>> But its simpler to hack crap into the scheduler than coming up with a
>> proper solution to the problem, right?
>
> I can answer it for you: Yes, it is simpler.
>
> But as you might have figured out it's not really popular and therefor
> not simpler to be accepted by the people who actually care about sane
> designs. I can whip you up special purpose hacks for that which will
> give you way more guarantees with way less lines of horrible code, but
> that does not mean that such hacks are an acceptable solution. You can
> carry those hacks in your private tree and ship it to your customers,
> but do not expect that any sane maintainer will care about it.
>
> Now the very same maintainers asked you several times to answer the
> question why this can't be done with proper futex like spin
> mechanisms, which would solve a bunch of related problems as well.
>
>   You never even tried to answer that question simply because you never
>   tried to think about it for real. Your only answer is that you want A
>   because A is already used on other OSs and therefor solution B is not
>   an option.
>
>   But if solution B would gain 4% performance, then according to your
>   previous argumentation it would become suddenly very interesting,
>   right?
>
> So unless you even show any sign of thinking about different
> approaches and technically arguing why they cannot deliver the same
> value you wont get anywhere with this and I can tell you why.
>
> You create a new user space ABI
>
>   That forces the kernel to support it forever, which in consequence
>   imposes restrictions on the kernel scheduler forever.
>
>   We have enough restrictions by misdesigned ABIs (e.g. sched_yield())
>   already, so we really do not need more of that.
>
> You ignore any request to prove why a proper designed spin futex
> interface would not be a sensible solution for the problem.
>
>   Of course you are free to ignore that (as you are free to ignore
>   important review comments), but you don't have to be suprised when
>   the responsible maintainers ignore any further attempt from you to
>   get this merged.
>
> Aside of that, you still fail to provide a proper test case which is
> publically usable for the people involved in this to reproduce your 3%
> gain and analyze the problem at hand properly. The provided:
>
>        enable_hack();
>        while (/*some condition */) {
>        	    /* bla */
> 	    /* blub */
> 	    /* blurb */
> 	    /* yay! */
>        }
>        disable_hack();
>
> is beyond useless.
>
> Thanks,
>
> 	tglx
>

Fair enough. Implications of a new userspace ABI can be significant and 
I can accept not introducing a new one in the kernel.

The queuing problem caused by a task taking a contended lock just before 
its current timeslice is up which userspace app wouldn't know about, is 
a real problem nevertheless. My patch attempts to avoid the contention 
in the first place. futex with adaptive spinning is a post-contention 
solution that tries to minimize the cost of contention but does nothing 
to avoid the contention. Solving this problem using futex can help only 
if the userspace lock uses futex.

I have looked at solving this problem in userspace using priority 
inheritance semaphore but ran into many problems. I will go back and 
take another look at it.

I appreciate your feedback.

Thanks,
Khalid

^ permalink raw reply

* [RFC PATCH v3 1/3] vfio: platform: add device properties skeleton and user API
From: Antonios Motakis @ 2014-12-19 21:20 UTC (permalink / raw)
  To: kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	alex.williamson-H+wXaHxf7aLQT0dZR+AlfA
  Cc: open list:VFIO DRIVER, eric.auger-QSEj5FYQhm4dnm+yROfE0A,
	open list:ABI/API, open list, Antonios Motakis,
	tech-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J,
	christoffer.dall-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <1419024032-1269-1-git-send-email-a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>

This patch introduced the API to return device properties about
a PLATFORM device (if described by a device tree or ACPI) and the
skeleton of the implementation for VFIO_PLATFORM. Information about any
device node bound by VFIO_PLATFORM should be queried via the introduced
ioctl VFIO_DEVICE_GET_DEV_PROPERTY.

The user needs to know the name and the data type of the property he is
accessing.

Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
---
 drivers/vfio/platform/Makefile                |  3 +-
 drivers/vfio/platform/properties.c            | 61 +++++++++++++++++++++++++++
 drivers/vfio/platform/vfio_platform_common.c  | 35 +++++++++++++++
 drivers/vfio/platform/vfio_platform_private.h |  7 +++
 include/uapi/linux/vfio.h                     | 26 ++++++++++++
 5 files changed, 131 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/platform/properties.c

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index 81de144..99f3ba1 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -1,5 +1,6 @@
 
-vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o
+vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o \
+		   devtree.o
 
 obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
 
diff --git a/drivers/vfio/platform/properties.c b/drivers/vfio/platform/properties.c
new file mode 100644
index 0000000..8b90465
--- /dev/null
+++ b/drivers/vfio/platform/properties.c
@@ -0,0 +1,61 @@
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <linux/property.h>
+#include "vfio_platform_private.h"
+
+static int dev_property_get_strings(struct device *dev,
+				    char *name, unsigned *lenp,
+				    void __user *datap, unsigned long datasz)
+{
+	return -EINVAL;
+}
+
+static int dev_property_get_uint(struct device *dev, char *name,
+				 uint32_t type, unsigned *lenp,
+				 void __user *datap, unsigned long datasz)
+{
+	return -EINVAL;
+}
+
+int vfio_platform_dev_properties(struct device *dev,
+				 uint32_t type, unsigned *lenp,
+				 void __user *datap, unsigned long datasz)
+{
+	char *name;
+	long namesz;
+	int ret;
+
+	namesz = strnlen_user(datap, datasz);
+	if (!namesz)
+		return -EFAULT;
+	if (namesz > datasz)
+		return -EINVAL;
+
+	name = kzalloc(namesz, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+	if (strncpy_from_user(name, datap, namesz) <= 0) {
+		kfree(name);
+		return -EFAULT;
+	}
+
+	switch (type) {
+	case VFIO_DEV_PROPERTY_TYPE_STRINGS:
+		ret = dev_property_get_strings(dev, name, lenp, datap, datasz);
+		break;
+
+	case VFIO_DEV_PROPERTY_TYPE_U64:
+	case VFIO_DEV_PROPERTY_TYPE_U32:
+	case VFIO_DEV_PROPERTY_TYPE_U16:
+	case VFIO_DEV_PROPERTY_TYPE_U8:
+		ret = dev_property_get_uint(dev, name, type, lenp,
+					    datap, datasz);
+		break;
+
+	default:
+		ret = -EINVAL;
+	}
+
+	kfree(name);
+	return ret;
+}
diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
index a532a25..83ad4b74 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/types.h>
 #include <linux/vfio.h>
+#include <linux/property.h>
 
 #include "vfio_platform_private.h"
 
@@ -251,6 +252,34 @@ static long vfio_platform_ioctl(void *device_data,
 
 		return ret;
 
+	} else if (cmd == VFIO_DEVICE_GET_DEV_PROPERTY) {
+		struct vfio_dev_property info;
+		void __user *datap;
+		unsigned long datasz;
+		int ret;
+
+		if (!vdev->dev)
+			return -EINVAL;
+
+		minsz = offsetofend(struct vfio_dev_property, length);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		datap = (void __user *) arg + minsz;
+		datasz = info.argsz - minsz;
+
+		ret = vfio_platform_dev_properties(vdev->dev, info.type,
+						   &info.length, datap, datasz);
+
+		if (copy_to_user((void __user *)arg, &info, minsz))
+			ret = -EFAULT;
+
+		return ret;
+
 	} else if (cmd == VFIO_DEVICE_RESET)
 		return -EINVAL;
 
@@ -501,6 +530,12 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev,
 		return ret;
 	}
 
+	/* add device properties flag */
+	if (device_property_present(dev, "name")) {
+		vdev->dev = dev;
+		vdev->flags |= VFIO_DEVICE_FLAGS_DEV_PROPERTIES;
+	}
+
 	mutex_init(&vdev->igate);
 
 	return 0;
diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h
index c3d5b4b..fc6b1fb 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -53,6 +53,7 @@ struct vfio_platform_device {
 	u32				num_irqs;
 	int				refcnt;
 	struct mutex			igate;
+	struct device			*dev;
 
 	/*
 	 * These fields should be filled by the bus specific binder
@@ -79,4 +80,10 @@ extern int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev,
 					unsigned start, unsigned count,
 					void *data);
 
+/* device properties support in devtree.c */
+extern int vfio_platform_dev_properties(struct device *dev,
+					uint32_t type, unsigned *lenp,
+					void __user *datap,
+					unsigned long datasz);
+
 #endif /* VFIO_PLATFORM_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 544d3d8..ac50ab3f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -161,12 +161,38 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
 #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
+#define VFIO_DEVICE_FLAGS_DEV_PROPERTIES (1 << 4) /* Device properties */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
 #define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 7)
 
 /**
+ * VFIO_DEVICE_GET_DEV_PROPERTY - _IOR(VFIO_TYPE, VFIO_BASE + 16,
+ *						struct vfio_devtree_info)
+ *
+ * Retrive a device property, e.g. from a device tree if available.
+ * Caller will initialize data[] with a single string with the requested
+ * devicetree property name, and type depending on whether a array of strings
+ * or an array of u32 values is expected. On success, data[] will be extended
+ * with the requested information, either as an array of u32, or with a list
+ * of strings sepparated by the NULL terminating character.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_dev_property {
+	__u32	argsz;
+	__u32	type;
+#define VFIO_DEV_PROPERTY_TYPE_STRINGS	0
+#define VFIO_DEV_PROPERTY_TYPE_U8	1
+#define VFIO_DEV_PROPERTY_TYPE_U16	2
+#define VFIO_DEV_PROPERTY_TYPE_U32	3
+#define VFIO_DEV_PROPERTY_TYPE_U64	4
+	__u32	length;
+	__u8	data[];
+};
+#define VFIO_DEVICE_GET_DEV_PROPERTY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
  *				       struct vfio_region_info)
  *
-- 
2.1.3

^ permalink raw reply related

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Thomas Gleixner @ 2014-12-19  0:27 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api
In-Reply-To: <54936562.5070502@oracle.com>

On Thu, 18 Dec 2014, Khalid Aziz wrote:
> On 12/18/2014 04:02 PM, Thomas Gleixner wrote:
> > If we can solve it with a proper designed and well thought out
> > functionality in the kernel based on a futex like mechanism, why cant
> > java and databases not switch over to that and simply use it?
> > 
> > You need to modify user space anyway, so it does not matter whether
> > you modify it in a sane or in a hacky way.
> 
> Actually userspace does not need to be modified. The code to use this
> functionality is already present in database code since this same
> functionality exists on other OSs (the API is a little different but those
> details can be handled with a simple header file in userspace). Userspace code
> has already been tested and debugged thoroughly on the OSs that support this
> functionality and that has significant impact on testing effort. So for
> userspace it is simply a matter of turning that code on on Linux as well and
> recompiling. This would be a multi-platform solution for database/java as
> opposed to a Linux specific solution.

Bullshit. If you turn that option on, it's a modification from the QA
point of view and you need to run a full validation no matter
what. Anything else is just QA by crystal ball.

Of course you carefully avoided (again) to answer the real question:

> But its simpler to hack crap into the scheduler than coming up with a
> proper solution to the problem, right?

I can answer it for you: Yes, it is simpler.

But as you might have figured out it's not really popular and therefor
not simpler to be accepted by the people who actually care about sane
designs. I can whip you up special purpose hacks for that which will
give you way more guarantees with way less lines of horrible code, but
that does not mean that such hacks are an acceptable solution. You can
carry those hacks in your private tree and ship it to your customers,
but do not expect that any sane maintainer will care about it.

Now the very same maintainers asked you several times to answer the
question why this can't be done with proper futex like spin
mechanisms, which would solve a bunch of related problems as well.

 You never even tried to answer that question simply because you never
 tried to think about it for real. Your only answer is that you want A
 because A is already used on other OSs and therefor solution B is not
 an option.

 But if solution B would gain 4% performance, then according to your
 previous argumentation it would become suddenly very interesting,
 right?

So unless you even show any sign of thinking about different
approaches and technically arguing why they cannot deliver the same
value you wont get anywhere with this and I can tell you why.

You create a new user space ABI

 That forces the kernel to support it forever, which in consequence
 imposes restrictions on the kernel scheduler forever.

 We have enough restrictions by misdesigned ABIs (e.g. sched_yield())
 already, so we really do not need more of that.

You ignore any request to prove why a proper designed spin futex
interface would not be a sensible solution for the problem.

 Of course you are free to ignore that (as you are free to ignore
 important review comments), but you don't have to be suprised when
 the responsible maintainers ignore any further attempt from you to
 get this merged.

Aside of that, you still fail to provide a proper test case which is
publically usable for the people involved in this to reproduce your 3%
gain and analyze the problem at hand properly. The provided:

      enable_hack();
      while (/*some condition */) {
      	    /* bla */
	    /* blub */
	    /* blurb */
	    /* yay! */
      }
      disable_hack();

is beyond useless.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Khalid Aziz @ 2014-12-18 23:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api
In-Reply-To: <alpine.DEB.2.11.1412182355020.17382@nanos>

On 12/18/2014 04:02 PM, Thomas Gleixner wrote:
> On Thu, 18 Dec 2014, Khalid Aziz wrote:
>> On 12/18/2014 03:28 PM, Peter Zijlstra wrote:
>>> On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
>>>> sched/fair: Add advisory flag for borrowing a timeslice
>>>
>>> yuck hatred and much of that.
>>>
>>> Also, you fail to explain why a kernel side spin futex-lock is not an
>>> option.
>>>
>>
>> I had explained that when this question came up in the last round of
>> discussions. The queuing problem I am trying to solve stems from userspace
>> locks. Java and databases implement their own userspace locks that do not use
>> futex. Solving this with futex will not help the primary users of this
>> solution.
>
> If we can solve it with a proper designed and well thought out
> functionality in the kernel based on a futex like mechanism, why cant
> java and databases not switch over to that and simply use it?
>
> You need to modify user space anyway, so it does not matter whether
> you modify it in a sane or in a hacky way.

Actually userspace does not need to be modified. The code to use this 
functionality is already present in database code since this same 
functionality exists on other OSs (the API is a little different but 
those details can be handled with a simple header file in userspace). 
Userspace code has already been tested and debugged thoroughly on the 
OSs that support this functionality and that has significant impact on 
testing effort. So for userspace it is simply a matter of turning that 
code on on Linux as well and recompiling. This would be a multi-platform 
solution for database/java as opposed to a Linux specific solution.

--
Khalid

^ permalink raw reply

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Thomas Gleixner @ 2014-12-18 23:02 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api
In-Reply-To: <54935842.5020507@oracle.com>

On Thu, 18 Dec 2014, Khalid Aziz wrote:
> On 12/18/2014 03:28 PM, Peter Zijlstra wrote:
> > On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
> > > sched/fair: Add advisory flag for borrowing a timeslice
> > 
> > yuck hatred and much of that.
> > 
> > Also, you fail to explain why a kernel side spin futex-lock is not an
> > option.
> > 
> 
> I had explained that when this question came up in the last round of
> discussions. The queuing problem I am trying to solve stems from userspace
> locks. Java and databases implement their own userspace locks that do not use
> futex. Solving this with futex will not help the primary users of this
> solution.

If we can solve it with a proper designed and well thought out
functionality in the kernel based on a futex like mechanism, why cant
java and databases not switch over to that and simply use it?

You need to modify user space anyway, so it does not matter whether
you modify it in a sane or in a hacky way.

But its simpler to hack crap into the scheduler than coming up with a
proper solution to the problem, right?

> Hope this helps.

Not at all.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Khalid Aziz @ 2014-12-18 22:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, corbet, mingo, hpa, riel, akpm, rientjes, ak, mgorman,
	raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api
In-Reply-To: <20141218222846.GH30905@twins.programming.kicks-ass.net>

On 12/18/2014 03:28 PM, Peter Zijlstra wrote:
> On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
>> sched/fair: Add advisory flag for borrowing a timeslice
>
> yuck hatred and much of that.
>
> Also, you fail to explain why a kernel side spin futex-lock is not an
> option.
>

I had explained that when this question came up in the last round of 
discussions. The queuing problem I am trying to solve stems from 
userspace locks. Java and databases implement their own userspace locks 
that do not use futex. Solving this with futex will not help the primary 
users of this solution.

Hope this helps.

Thanks,
Khalid

^ permalink raw reply

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Peter Zijlstra @ 2014-12-18 22:28 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, corbet, mingo, hpa, riel, akpm, rientjes, ak, mgorman,
	raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api
In-Reply-To: <1418928259-6311-1-git-send-email-khalid.aziz@oracle.com>

On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
> sched/fair: Add advisory flag for borrowing a timeslice

yuck hatred and much of that.

Also, you fail to explain why a kernel side spin futex-lock is not an
option.

^ permalink raw reply

* [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Khalid Aziz @ 2014-12-18 18:44 UTC (permalink / raw)
  To: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: Khalid Aziz, linux-kernel, linux-doc, linux-api

sched/fair: Add advisory flag for borrowing a timeslice

This patch adds a way for a task to request to borrow one timeslice
from future if it is about to be preempted, so it could delay
preemption and complete any critical task it is in the middle of.

This feature improves performance for apps that use userspace locking
across large number of threads, for example large databases and Java,
and similar solutions have been used for many years on other OSs.
This feature helps in situation where a task acquires a lock before
performing a critical operation on shared data and happens to have
acquired the lock just before its timeslice is up which means it gets
preempted before it completes its task. This lock being held causes
all other tasks that also acquire the same lock to perform their
critical operation on shared data, to start queueing up and causing
large number of context switches. This queueing problem can be avoided
if the task that acquires lock first could request scheduler to let it
borrow one timeslice once it enters its critical section and hence
allow it to complete its critical section without causing queueing
problem. If critical section completes before the task is due for
preemption, the task can desassert its request which causes scheduler
to proceed with normal preemption. A task sends the scheduler this
request by setting a flag in a memory location it has shared with the
kernel. Kernel uses bytes in the same memory location to let the task
know when its request for amnesty from preemption has been granted.

These rules apply to the use of this feature:

- Request to borrow timeslice is not guranteed to be honored.
- If the task is allowed to borrow, kernel will inform the task
  of this. When this happens, task must yield the processor as soon
  as it completes its critical section.
- If the task fails to yield processor after being allowed to
  borrow, it is penalized by not honoring its next request for
  extra timeslice.
- Task is charged additional time for the borrowed timeslice as
  accumulated run time. This pushes it further down in consideration
  for the next task to run.

This feature was tested with a TPC-C workload. TPC-C workload shows
a 3% improvement in tpcc throughput when using this feature, which
is a significant improvement.

A new sysctl tunable kernel.preempt_delay_available enables this
feature at run time. The kernel boots up with this feature disabled
by default.

Documentation file included in this patch contains details on how to
use this feature, and conditions associated with its use. This patch
also adds a new field in scheduler statistics which keeps track of
how many times a task was granted amnesty from preemption.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---

With this version of this patch, the kernel will not enable
preemption delay by default. This feature must be turned on by using
sysctl tunable kernel.preempt_delay_available. With this change, there
are now two ways to eliminate the impact of this feature on systems
that do not intend to use it and are sensitive to scheduling delays
that may be caused by the use of this feature. This feature can be
configured out for custom built kernels. For pre-compiled kernels where
this feature may have been configured in, it will stay off until enabled
through sysctl tunable.

Changelog:
v4:
	- Added a shared data structure to define the memory location
	  used for requesting preemption delay.
	- Fixed a hole in the code that allowed preemption delay to
	  continue to happen if preemption delay feature was disabled
	  with sysctl after a task had already started using this.
	- Removed the restriction on setting location of shared data
	  structure only when one is not set currently.
	- Moved almost all conditionally compiled code into header files
	- Cleaned up config dependency for CONFIG_SCHED_PREEMPT_DELAY
	- Changed permission on preempt_delay_available sysctl to allow
	  users to read it.
	- Updated documentation file to match the code changes
v3:
	- Use prctl() syscall to give kernel the location for shared flag 
	  instead of using a proc file.
	- Disabled this feature by default on a newly booted kernel and
	  added a sysctl tunable to enable/disable it at runtime.
v2:
	- Replaced mmap operation with a more memory efficient futex
	  like communication between userspace and kernel
	- Added a flag to let userspace know if it was granted amnesty
	- Added a penalty for tasks failing to yield CPU when they
	  are granted amnesty from pre-emption

v1:
	- Initial RFC patch with mmap for communication between userspace
	  and kernel

 Documentation/scheduler/sched-preempt-delay.txt | 112 ++++++++++++++++++++
 arch/x86/Kconfig                                |  11 ++
 include/linux/sched.h                           |  38 +++++++
 include/linux/sched/sysctl.h                    |   4 +
 include/uapi/linux/prctl.h                      |   3 +
 include/uapi/linux/sched.h                      |   9 ++
 kernel/fork.c                                   |   2 +
 kernel/sched/core.c                             |   1 +
 kernel/sched/debug.c                            |   1 +
 kernel/sched/fair.c                             | 129 +++++++++++++++++++++++-
 kernel/sys.c                                    |   6 ++
 kernel/sysctl.c                                 |   9 ++
 12 files changed, 322 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/scheduler/sched-preempt-delay.txt

diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..4c9e111
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,112 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section happens to grab the lock just before its timeslice is up and is
+preempted by the scheduler, scheduler ends up scheduling other
+tasks which immediately try to grab the lock to enter the critical
+section. This only results in lots of context switches as tasks wake up
+and go to sleep immediately. If on the other hand, the original task
+were allowed to run for an extra timeslice, it could have completed
+executing its critical section allowing other tasks to make progress
+when they get scheduled. Preemption delay feature allows a task to
+request scheduler to let it borrow one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is compiled in the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. By default, the
+kernel boots up with this feature disabled. Enable it using sysctl
+tunable kernel.preempt_delay_available. Once this feature is
+enabled, the userspace process communicates with the kernel using a
+4-byte memory location in its address space. This location must be
+aligned to 4-byte boundary. It first gives the kernel address for this
+memory location by making a prctl() system call with PR_SET_PREEMPT_DELAY
+option. This memory location is interpreted as the following data
+structure (defined in linux/sched.h):
+
+struct sched_delay_req {
+	unsigned char nopreempt;	/* flag to request preemption delay */
+	unsigned char yield;		/* flag from kernel indicating  */
+					/* preemption delay was granted */
+	unsigned char rsvd[2];		/* reserved */
+};
+
+Task requests a preemption delay by writing a non-zero value to the
+first byte - nopreempt. Scheduler checks this value before preempting
+the task. Scheduler can choose to grant one and only an additional
+time slice to the task for each delay request but this delay is not
+guaranteed. If scheduler does grant an additional timeslice, it will
+set the flag in second byte. Upon completion of the section of code
+where the task wants preemption delay, task should check the second byte.
+If the flag in second byte is set, it should clear this flag and call
+sched_yield() so as to not hog the processor. If a thread was granted
+additional timeslice and it fails to call sched_yield(), scheduler
+will penalize it by denying its next request for additional timeslice.
+Following sample code illustrates how to use this feature:
+
+#include <linux/sched.h>
+
+int main()
+{
+	unsigned char buf[256];
+	struct sched_delay_req delay;
+
+	bzero(&delay, sizeof(delay));
+
+	/* Tell kernel where the flag lives */
+	prctl(PR_SET_PREEMPT_DELAY, &delay);
+
+	while (/* some condition is true */) {
+		/* do some work and get ready to enter critical section */
+		delay.nopreempt = 1;
+		/*
+		 * Obtain lock for critical section
+		 */
+		/*
+		 * critical section
+		 */
+		/*
+		 * Release lock for critical section
+		 */
+		delay.nopreempt = 0;
+		/* Give the CPU up if required */
+		if (delay.yield) {
+			delay.yield = 0;
+			sched_yield();
+		}
+		/* do some more work */
+	}
+	/*
+	 * Tell kernel we are done asking for preemption delay
+	 */
+	prctl(PR_SET_PREEMPT_DELAY, NULL);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+Preemption delay features adds a new field to scheduler statictics -
+nr_preempt_delayed. This is a per thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested for one. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. If the location of shared flag is not aligned to 4-byte boundary,
+   prctl() will terminate with EFAULT.
+2. Userspace app should zero out the sched_delay_req structure before
+   giving kernel the address of this structure. Stale data in this
+   structure could cause unintended requests for preemption delay.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 41a503c..6c24167 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -852,6 +852,17 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_PREEMPT_DELAY
+	def_bool n
+	prompt "Scheduler preemption delay support"
+	---help---
+	  Say Y here if you want to be able to delay scheduler preemption
+	  when possible by setting a flag in a memory location after
+	  sharing the address of this location with kernel using
+	  PR_SET_PREEMPT_DELAY prctl() call. See
+	  Documentation/scheduler/sched-preempt-delay.txt for details.
+	  If in doubt, say "N".
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..0b2f911 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,6 +1116,7 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+	u64			nr_preempt_delayed;
 };
 #endif
 
@@ -1232,6 +1233,14 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+struct preempt_delay {
+	struct sched_delay_req *delay_req;	/* delay request flag pointer */
+	unsigned char delay_granted;		/* currently in delay */
+	unsigned char yield_penalty;		/* failure to yield penalty */
+};
+#endif
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1324,6 +1333,9 @@ struct task_struct {
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	struct preempt_delay sched_preempt_delay;
+#endif
 
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
@@ -3031,4 +3043,30 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+static inline void task_init_preempt_delay(struct task_struct *p)
+{
+	memset(&p->sched_preempt_delay, 0, sizeof(struct preempt_delay));
+}
+static inline void task_clear_preempt_yield(struct task_struct *p)
+{
+	p->sched_preempt_delay.yield_penalty = 0;
+}
+extern int preempt_delay_write(struct task_struct *task,
+					unsigned long preempt_delay_addr);
+#define SCHED_SET_PREEMPT_DELAY(a)	preempt_delay_write(current, a)
+#define SCHED_GET_PREEMPT_DELAY(a)	\
+		put_user((unsigned long)current->sched_preempt_delay.delay_req,\
+				(unsigned long __user *)a)
+#else
+static inline void task_init_preempt_delay(struct task_struct *p)
+{
+}
+static inline void task_clear_preempt_yield(struct task_struct *p)
+{
+}
+#define SCHED_SET_PREEMPT_DELAY(a)	(-EINVAL)
+#define SCHED_GET_PREEMPT_DELAY(a)	(-EINVAL)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 #endif
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e0..516f74e 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -107,4 +107,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+extern int sysctl_preempt_delay_available;
+#endif
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..ecfd2cd 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,7 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+#define PR_SET_PREEMPT_DELAY	43
+#define PR_GET_PREEMPT_DELAY	44
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b932be9..66a2f67 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -49,4 +49,13 @@
  */
 #define SCHED_FLAG_RESET_ON_FORK	0x01
 
+/*
+ * struct for requesting preemption delay from scheduler
+ */
+struct sched_delay_req {
+	unsigned char nopreempt;
+	unsigned char yield;
+	unsigned char rsvd[2];
+};
+
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..aea655b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1672,6 +1672,8 @@ long do_fork(unsigned long clone_flags,
 			get_task_struct(p);
 		}
 
+		task_init_preempt_delay(p);
+
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24beb9b..a926eea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4201,6 +4201,7 @@ SYSCALL_DEFINE0(sched_yield)
 {
 	struct rq *rq = this_rq_lock();
 
+	task_clear_preempt_yield(current);
 	schedstat_inc(rq, yld_count);
 	current->sched_class->yield_task(rq);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ce33780..618d2ac 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -597,6 +597,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	P(se.statistics.nr_preempt_delayed);
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef2b104..a880c6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -428,6 +428,129 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+int sysctl_preempt_delay_available;
+
+int
+preempt_delay_write(struct task_struct *task, unsigned long preempt_delay_addr)
+{
+	/*
+	 * Do not allow write if preemption delay feature is disabled
+	 */
+	if (!sysctl_preempt_delay_available)
+		return -EPERM;
+
+	if ((void *)preempt_delay_addr == NULL) {
+		task->sched_preempt_delay.delay_req = NULL;
+		return 0;
+	}
+
+	/*
+	 * Validate the pointer. It should be naturally aligned
+	 */
+	if (unlikely((preempt_delay_addr % sizeof(u32)) != 0))
+		return -EFAULT;
+	if (unlikely(!access_ok(rw, preempt_delay_addr,
+					sizeof(struct sched_delay_req))))
+		return -EFAULT;
+
+	task->sched_preempt_delay.delay_req =
+				(struct sched_delay_req *) preempt_delay_addr;
+	return 0;
+}
+
+/*
+ * delay_resched_rq(): Check if the task about to be preempted has
+ *	requested an additional time slice. If it has, grant it additional
+ *	timeslice once.
+ */
+static void
+delay_resched_rq(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_entity *se;
+	struct sched_delay_req *delay_req, delay_flag;
+	int ret;
+
+	if (!sysctl_preempt_delay_available)
+		goto resched_now;
+
+	/*
+	 * Check if task is using pre-emption delay feature. If address
+	 * for preemption delay request flag is not set, this task is
+	 * not using preemption delay feature, we can reschedule without
+	 * any delay
+	 */
+	delay_req = curr->sched_preempt_delay.delay_req;
+	if (delay_req == NULL)
+		goto resched_now;
+
+	/*
+	 * Pre-emption delay will  be granted only once. If this task
+	 * has already been granted delay, rechedule now
+	 */
+	if (curr->sched_preempt_delay.delay_granted) {
+		curr->sched_preempt_delay.delay_granted = 0;
+		goto resched_now;
+	}
+
+	/*
+	 * Get the value of preemption delay request flag from userspace.
+	 * Task had already passed us the address where the flag is stored
+	 * in userspace earlier. If there is a page fault accessing this
+	 * flag in userspace, that means userspace has not touched this
+	 * flag recently and we can assume no preemption delay is needed.
+	 *
+	 * If task is not requesting additional timeslice, resched now
+	 */
+	pagefault_disable();
+	ret = __copy_from_user_inatomic(&delay_flag, delay_req,
+			sizeof(u32));
+	pagefault_enable();
+	if (ret || !delay_flag.nopreempt)
+		goto resched_now;
+
+	/*
+	 * Current thread has requested preemption delay and has not
+	 * been granted an extension yet. If this thread failed to yield
+	 * processor after being granted amnesty last time, penalize it
+	 * by not granting this delay request, otherwise give it an extra
+	 * timeslice.
+	 */
+	if (curr->sched_preempt_delay.yield_penalty) {
+		curr->sched_preempt_delay.yield_penalty = 0;
+		goto resched_now;
+	}
+
+	se = &curr->se;
+	curr->sched_preempt_delay.delay_granted = 1;
+
+	/*
+	 * Set the penalty flag for failing to yield the processor after
+	 * being granted immunity. This flag will be cleared in
+	 * sched_yield() if the thread indeed calls sched_yield
+	 */
+	curr->sched_preempt_delay.yield_penalty = 1;
+
+	/*
+	 * Let the thread know it got amnesty and it should call
+	 * sched_yield() when it is done to avoid penalty next time
+	 * it wants amnesty.
+	 */
+	delay_flag.nopreempt = 0;
+	delay_flag.yield = 1;
+	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+	__copy_to_user_inatomic(delay_req, &delay_flag, sizeof(u32));
+
+	return;
+
+resched_now:
+	resched_curr(rq);
+}
+#else
+#define delay_resched_rq(rq) resched_curr(rq)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -2951,7 +3074,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
 		 * re-elected due to buddy favours.
@@ -2975,7 +3098,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 
 	if (delta > ideal_runtime)
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 }
 
 static void
@@ -4792,7 +4915,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
-	resched_curr(rq);
+	delay_resched_rq(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
 	 * on the rq. This can happen when a wakeup gets interleaved
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..a8b1eff 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2203,6 +2203,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			me->mm->def_flags &= ~VM_NOHUGEPAGE;
 		up_write(&me->mm->mmap_sem);
 		break;
+	case PR_SET_PREEMPT_DELAY:
+		error = SCHED_SET_PREEMPT_DELAY(arg2);
+		break;
+	case PR_GET_PREEMPT_DELAY:
+		error = SCHED_GET_PREEMPT_DELAY(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511..c1cd344 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	{
+		.procname	= "preempt_delay_available",
+		.data		= &sysctl_preempt_delay_available,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.9.1

^ permalink raw reply related

* Re: selftests: standardized results output?
From: Michael Ellerman @ 2014-12-18  3:28 UTC (permalink / raw)
  To: Young, David
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Shuah Khan (shuahkhan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org),
	grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <7F8EB1E5A1A71643ACAD0928E03FFDE90E3DABAC24-/aPJBeqlRkAyyULh10xnlCahX2Nl78RV0E9HWUfgJXw@public.gmane.org>

On Mon, 2014-12-15 at 11:42 -0500, Young, David wrote:
> Hi,  I'm looking into what sorts of tools can consume the selftest output, and
> found this on the wikipage:
> 
> https://kselftest.wiki.kernel.org/standardize_the_test_output
> 
> The current suggestion (as of the last-modified date on that wiki page for
> October) is to use the Test Anything Protocol [TAP] for standard output.  I
> notice that there is at least one test that conforms to TAP output, but
> the majority of the tests are not using it.
> 
> There's also the kselftest.h file that suggests exit codes for the individual
> test applications.
> 
> I'm interested in knowing what is the intention of this standardization, so
> that I can put some work into a "glue" layer for a tool like buildbot,
> autotest, or Jenkins for executing and consuming the results of these
> selftests.

I'm not the selftests "maintainer", but I have written a bunch of code under
selftests, so the following is my 2c.

The barrier for entry to selftests should be as low as we can make it. Our
preference should always be for more tests, even if some of them start out a
bit crufty.

As such, requiring tests to produce perfectly formatted output or error codes
is the wrong approach in my opinion.

So ..

> 1) Is this output standard a "nice to have" that won't be much enforced?

Yes I think so. I think at best we would hope for tests to converge on the
standard over time, but there will always be new tests arriving that are not so
well behaved.

> 2) Will the exit codes be utilized outside of the current makefile-based
>    approach for executing the tests?  The current make target just runs all the
>    tests without really concerning itself with the exit values of the
>    individual tests.  It's simple, which isn't a bad thing, but it lacks a
>    summarized result.  Is the intention to use a different harness to consume
>    and report results?

One of the advantages of producing TAP or some other standard format would be
to avoid needing our own harness just to produce a summary.

> 3) Are the TAP results intended to be the exclusive (std)output of the tests,
>    or will the tests report in a hybrid fashion?
>    Such an example would be a test that produces some verbose stdout to the
>    console, while simultaneously creating a TAP-compliant results.tap file as
>    well... or vice versa with the stdout being TAP and a more verbose but
>    free-form log.txt 

We probably want to be able to do both of those.

FWIW, (most of) the tests under tools/testing/selftests/powerpc produce subunit
v1 output (https://pypi.python.org/pypi/python-subunit).

So you can do:

 $ cd tools/testing/selftests/powerpc
 $ make run_tests 2>&1 | subunit-1to2 | subunit-stats --no-passthrough
 Total tests:      32
 Passed tests:     32
 Failed tests:      0
 Skipped tests:     0
 Seen tags: git_version:next-20141209-0-g5348e33

subunit has the advantage over TAP IMO that it uses "success", "failure" and
"error" rather than "ok", "not ok".

There is now a subunit v2, which is a binary protocol. I don't think we want
a binary protocol, but it does have the advantage of being much more robust
than text based output.

cheers

^ permalink raw reply

* Re: [PATCH v4 0/5] Add Spreadtrum Sharkl64 Platform support
From: Lyra Zhang @ 2014-12-17  6:16 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Chunyan Zhang, gregkh@linuxfoundation.org, arnd@arndb.de,
	gnomes@lxorguk.ukuu.org.uk, broonie@kernel.org,
	robh+dt@kernel.org, Pawel Moll, ijc+devicetree@hellion.org.uk,
	galak@codeaurora.org, Will Deacon, Catalin Marinas,
	jslaby@suse.cz, jason@lakedaemon.net, heiko@sntech.de,
	florian.vaussard@epfl.ch, andrew@lunn.ch, rrichter@cavium.com,
	hytszk@gmail.com, grant.likely@linaro.org
In-Reply-To: <20141215133817.GD8264@leverpostej>

Hi, Mark

On Mon, Dec 15, 2014 at 9:38 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Fri, Dec 05, 2014 at 12:27:23PM +0000, Lyra Zhang wrote:
>> On Fri, Dec 5, 2014 at 6:40 PM, Mark Rutland <mark.rutland@arm.com> wrote:
>> > Hi,
>> >
>> > On Thu, Dec 04, 2014 at 11:34:15AM +0000, Chunyan Zhang wrote:
>> >> Spreadtrum is a rapid growing chip vendor providing smart phone total solutions.
>> >>
>> >> Sharkl64 Platform is nominated as a SoC infrastructure that supports 4G/3G/2G
>> >> standards based on ARMv8 multiple core architecture.Now we have only one
>> >> SoC(SC9836) based on this Platform in developing.
>> >>
>> >> This patchset adds Sharkl64 support in arm64 device tree and the serial driver
>> >> of SC9836-UART.
>> >>
>> >> This patchset also has patches which address "sprd" prefix and DT compatible
>> >> strings for nodes which appear un-documented.
>> >>
>> >> This version code was tesed both on Fast Mode and sc9836-fpga board.
>> >> We use the latest boot-wrapper-aarch64 as the bootloader.
>> >>
>> >> Changes from v3:
>> >> * Addressed review comments:
>> >>       - Added the description of clock property for sc9836-uart
>> >>       - Revised the size of GICC to be 8KiB
>> >>       - Added another compatible string for psci-0.1
>> >
>> > I had open questions on v3 regarding your PSCI imlpementation. You
>> > mentioned that you are using the aarch64 bootwrapper, but your DT
>> > describes PSCI 0.2, and the (upstream) bootwrapper does not implement
>> > PSCI 0.2. Adding the old PSCI compatible string is _not_ sufficient if
>> > you do not have a full PSCI 0.2 implementation.
>> >
>> > Given that PSCI 0.2 requires more functionality to be implemented, I'd
>> > like to know that your implementation is spec-compliant (implementing
>> > the mandatory functions, nters the kernel in the correct state, etc),
>> > and that it has been tested.
>> >
>> > Would you be able to look at my comments from the last posting please?
>> >
>> > Thanks,
>> > Mark.
>> > --
>>
>> Hi, Mark
>>
>> Ok, I'll check it again with our related engineers.
>>
>> Actually, I had read all of your comments carefully before sending
>> each version of patches, and I replied you a few days early, I guess
>> you may miss it :)
>>
>> If we just implemented psci-0.1 until now, can we submit this path
>> without "compatible = "arm,psci-0.2"", but only with " compatible =
>> "arm,psci" ".
>
> Elsewhere I've recommended such things are placed in the DTB by the
> bootloader (at least for those cases where the bootlaoder is tied to the
> firmware), as it can fill in the appropriate enable-method (and add any
> other nodes as required).
>
> Having compatible = "arm,psci", and a couple of other properties for now
> is certainly preferable to describing a not-yet-implemented PSCI 0.2.
>
> Thanks,
> Mark.
> --

Thank you for the replay.
I'll still ping my colleagues to implement PSCI 0.2, and test it.
But it perhaps would not be finished soon, because our internal
developing code right now is based on a little old version Linux
kernel which doesn't support  PSCI-0.2

Thanks again,
Chunyan

^ permalink raw reply

* Re: Re: [tpmdd-devel] [PATCH v10 0/8] TPM 2.0 support
From: Jarkko Sakkinen @ 2014-12-16 21:34 UTC (permalink / raw)
  To: Peter Huewe
  Cc: christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Ashley Lai,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, josh-iaAMLnmF4UmaiuxdJuQwMA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	trousers-tech-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <trinity-c704c9fe-b92f-48bd-80cc-a452bd18cc28-1418765537847@3capp-gmx-bs15>

On Tue, Dec 16, 2014 at 10:32:18PM +0100, Peter Huewe wrote:
> Hi,
> if I fix this (already adressed issue) 
>  #define TPM2_STARTUP_IN_SIZE \
>         (sizeof(struct tpm_input_header) + \
> -        sizeof(struct tpm2_pcr_read_in))
> +        sizeof(struct tpm2_startup_in))
>  
> it gets detected correctly.
>  
> I'll fix it, no updated patch needed.

Oops, I was too fast :)

> Not sure however if I think to reuse tpm2_startup_in also for shutdown, simply
> because it has  the same size, but ok.

They are equivalent messages except for the command code.

> Peter

/Jarkko

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v10 0/8] TPM 2.0 support
From: Peter Huewe @ 2014-12-16 21:32 UTC (permalink / raw)
  To: Peter Huewe
  Cc: christophe.ricard, linux-api, Ashley Lai, linux-kernel, josh,
	tpmdd-devel, jason.gunthorpe, trousers-tech
In-Reply-To: <trinity-c582edfe-4951-4a31-911d-db0c6f723157-1418764766771@3capp-gmx-bs15>

[-- Attachment #1: Type: text/html, Size: 812 bytes --]

[-- Attachment #2: Type: text/plain, Size: 440 bytes --]

------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

[-- Attachment #3: Type: text/plain, Size: 170 bytes --]

_______________________________________________
TrouSerS-tech mailing list
TrouSerS-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/trousers-tech

^ permalink raw reply

* Re: Re: [PATCH v10 0/8] TPM 2.0 support
From: Jarkko Sakkinen @ 2014-12-16 21:31 UTC (permalink / raw)
  To: Peter Huewe
  Cc: Ashley Lai, Marcel Selhorst,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, josh-iaAMLnmF4UmaiuxdJuQwMA,
	christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
	jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	stefanb-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	trousers-tech-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <trinity-c582edfe-4951-4a31-911d-db0c6f723157-1418764766771@3capp-gmx-bs15>

[-- Attachment #1: Type: text/plain, Size: 2442 bytes --]


On Tue, Dec 16, 2014 at 10:19:26PM +0100, Peter Huewe wrote:
> > So, what do you think? Are we now settled with this?
> 
> Meh...
> It does not recognize my tpm20 with force=1 :( :( :(
> Probably since I don't have a machine with bios integration -> no tpm startup,
> and your startup is returned with RC_SIZE :(

I don't know why it worked for me. My dTPM2 does not export ACPI device
but I guess firmware does execute startup. 

I root caused this quickly and attached patch here. Does that resolve
your issue?


>  rmmod tpm; modprobe tpm_tis force=1 ; dmesg 
> modprobe: ERROR: could not insert 'tpm_tis': No such device
> [   77.744050] tpm_tis tpm_tis: 2.0 TPM (device-id 0x1A, rev-id 16)
> [   77.816049] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   77.888043] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   77.960043] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   77.960057] genirq: Flags mismatch irq 6. 00000080 (tpm0) vs. 00000000 (floppy)
> [   77.960067] tpm_tis tpm_tis: Unable to request irq: 6 for probe
> [   77.960073] genirq: Flags mismatch irq 7. 00000080 (tpm0) vs. 00000000 (parport0)
> [   77.960077] tpm_tis tpm_tis: Unable to request irq: 7 for probe
> [   77.960082] genirq: Flags mismatch irq 8. 00000080 (tpm0) vs. 00000000 (rtc0)
> [   77.960087] tpm_tis tpm_tis: Unable to request irq: 8 for probe
> [   78.032041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   78.104043] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   78.176041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   78.248041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   78.320041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   78.392040] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   78.464041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
> [   78.524041] tpm_tis tpm_tis: A TPM error (256) occurred continue selftest
> [   78.524048] tpm_tis tpm_tis: Firmware has not started TPM
> [   78.596041] tpm_tis tpm_tis: A TPM error (149) occurred attempting to start the TPM
> [   78.596048] tpm_tis tpm_tis: TPM self test failed
> 
> 
> ;(
> Peter

/Jarkko

[-- Attachment #2: 0001-tpm-invalid-size-for-struct-tpm2_startup_header.patch --]
[-- Type: text/x-diff, Size: 1062 bytes --]

>From e8337658b6f29fc1259b1ddaf81e51ede91219e8 Mon Sep 17 00:00:00 2001
From: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Date: Tue, 16 Dec 2014 23:28:51 +0200
Subject: [PATCH] tpm: invalid size for struct tpm2_startup_header

sizeof(tpm2_pcr_read_in) was used in place of sizeof(tpm2_startup_in).
This patch fixes the issue.

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 drivers/char/tpm/tpm2-cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index 610dc87..d1b99df 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -411,7 +411,7 @@ ssize_t tpm2_get_tpm_pt(struct tpm_chip *chip, u32 property_id,  u32 *value,
 
 #define TPM2_STARTUP_IN_SIZE \
 	(sizeof(struct tpm_input_header) + \
-	 sizeof(struct tpm2_pcr_read_in))
+	 sizeof(struct tpm2_startup_in))
 
 static const struct tpm_input_header tpm2_startup_header = {
 	.tag = cpu_to_be16(TPM2_ST_NO_SESSIONS),
-- 
2.1.0


^ permalink raw reply related

* Aw: Re: [PATCH v10 0/8] TPM 2.0 support
From: Peter Huewe @ 2014-12-16 21:19 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Ashley Lai, Marcel Selhorst, tpmdd-devel, linux-kernel, josh,
	christophe.ricard, jason.gunthorpe, stefanb, linux-api,
	trousers-tech
In-Reply-To: <20141215201757.GA27923@intel.com>

> So, what do you think? Are we now settled with this?

Meh...
It does not recognize my tpm20 with force=1 :( :( :(
Probably since I don't have a machine with bios integration -> no tpm startup,
and your startup is returned with RC_SIZE :(

 rmmod tpm; modprobe tpm_tis force=1 ; dmesg 
modprobe: ERROR: could not insert 'tpm_tis': No such device
[   77.744050] tpm_tis tpm_tis: 2.0 TPM (device-id 0x1A, rev-id 16)
[   77.816049] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   77.888043] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   77.960043] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   77.960057] genirq: Flags mismatch irq 6. 00000080 (tpm0) vs. 00000000 (floppy)
[   77.960067] tpm_tis tpm_tis: Unable to request irq: 6 for probe
[   77.960073] genirq: Flags mismatch irq 7. 00000080 (tpm0) vs. 00000000 (parport0)
[   77.960077] tpm_tis tpm_tis: Unable to request irq: 7 for probe
[   77.960082] genirq: Flags mismatch irq 8. 00000080 (tpm0) vs. 00000000 (rtc0)
[   77.960087] tpm_tis tpm_tis: Unable to request irq: 8 for probe
[   78.032041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   78.104043] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   78.176041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   78.248041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   78.320041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   78.392040] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   78.464041] tpm_tis tpm_tis: A TPM error (256) occurred attempting to generate an interrupt
[   78.524041] tpm_tis tpm_tis: A TPM error (256) occurred continue selftest
[   78.524048] tpm_tis tpm_tis: Firmware has not started TPM
[   78.596041] tpm_tis tpm_tis: A TPM error (149) occurred attempting to start the TPM
[   78.596048] tpm_tis tpm_tis: TPM self test failed

;(
Peter

^ permalink raw reply

* Re: [GIT PULL RESEND] kselftest-3.19-rc1
From: Linus Torvalds @ 2014-12-16 21:17 UTC (permalink / raw)
  To: Shuah Khan; +Cc: linux-kernel@vger.kernel.org, Linux API, Shuah Khan
In-Reply-To: <548F57D8.4080809@osg.samsung.com>

On Mon, Dec 15, 2014 at 1:51 PM, Shuah Khan <shuahkh@osg.samsung.com> wrote:
>
> Please pull the following Kselftest updates for 3.19-rc1. Hope
> I have this right this time.

Looks good, thanks,

                  Linus

^ permalink raw reply

* [GIT PULL] User namespace related fixes
From: Eric W. Biederman @ 2014-12-16 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA


Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: db86da7cb76f797a1a8b445166a15cb922c6ff85 userns: Unbreak the unprivileged remount tests


This set of changes addresses all of the bugs with user namespaces that
I am aware of except for an unfortunate interaction between unprivileged
MNT_DETACH and MNT_LOCKED which can allow you to see under mount points
if you are clever with your use of MNT_DETACH.

As these are bug fixes almost all of thes changes are marked for
backporting to stable.

The first change (implicitly adding MNT_NODEV on remount) addresses a
regression that was created when security issues with unprivileged 
remount were closed.  I go on to update the remount test to make it
easy to detect if this issue reoccurs.

Then there are a handful of mount and umount related fixes.

Then half of the changes deal with the a recently discovered design bug
in the permission checks of gid_map.  Unix since the beginning has
allowed setting group permissions on files to less than the user and
other permissions (aka ---rwx---rwx).  As the unix permission checks
stop as soon as a group matches, and setgroups allows setting groups
that can not later be dropped, results in a situtation where it is
possible to legitimately use a group to assign fewer privileges to a
process.   Which means dropping a group can increase a processes
privileges.

The fix I have adopted is that gid_map is now no longer writable
without privilege unless the new file /proc/self/setgroups has
been set to permanently disable setgroups.

The bulk of user namespace using applications even the applications
using applications using user namespaces without privilege remain
unaffected by this change.  Unfortunately this ix breaks a couple user
space applications, that were relying on the problematic behavior (one
of which was tools/selftests/mount/unprivileged-remount-test.c).

To hopefully prevent needing a regression fix on top of my security fix
I rounded folks who work with the container implementations mostly like
to be affected and encouraged them to test the changes.

> So far nothing broke on my libvirt-lxc test bed. :-)
> Tested with openSUSE 13.2 and libvirt 1.2.9.
> Tested-by: Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org>

> Tested on Fedora20 with libvirt 1.2.11, works fine.
> Tested-by: Chen Hanxiao <chenhanxiao-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

> Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
> Just to be sure I was testing the right thing I also tested using
> my unprivileged nsexec testcases, and they failed on setgroup/setgid
> as now expected, and succeeded there without your patches.
> Tested-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> I tested this with Sandstorm.  It breaks as is and it works if I add
> the setgroups thing.
> Tested-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> # breaks things as designed :(

Eric

Eric W. Biederman (18):
      mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
      mnt: Update unprivileged remount test
      umount: Disallow unprivileged mount force
      umount: Do not allow unmounting rootfs.
      mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
      mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
      mnt: Clear mnt_expire during pivot_root
      groups: Consolidate the setgroups permission checks
      userns: Document what the invariant required for safe unprivileged mappings.
      userns: Don't allow setgroups until a gid mapping has been setablished
      userns: Don't allow unprivileged creation of gid mappings
      userns: Check euid no fsuid when establishing an unprivileged uid mapping
      userns: Only allow the creator of the userns unprivileged mappings
      userns: Rename id_map_mutex to userns_state_mutex
      userns: Add a knob to disable setgroups on a per user namespace basis
      userns: Allow setting gid_maps without privilege when setgroups is disabled
      userns; Correct the comment in map_write
      userns: Unbreak the unprivileged remount tests

 arch/s390/kernel/compat_linux.c                    |   2 +-
 fs/namespace.c                                     |  18 +-
 fs/pnode.c                                         |   1 +
 fs/proc/base.c                                     |  53 ++++++
 include/linux/cred.h                               |   1 +
 include/linux/user_namespace.h                     |  12 ++
 kernel/groups.c                                    |  11 +-
 kernel/uid16.c                                     |   2 +-
 kernel/user.c                                      |   1 +
 kernel/user_namespace.c                            | 124 +++++++++++--
 .../selftests/mount/unprivileged-remount-test.c    | 204 +++++++++++++++++----
 11 files changed, 374 insertions(+), 55 deletions(-)


diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index ca38139423ae..437e61159279 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -249,7 +249,7 @@ COMPAT_SYSCALL_DEFINE2(s390_setgroups16, int, gidsetsize, u16 __user *, grouplis
 	struct group_info *group_info;
 	int retval;
 
-	if (!capable(CAP_SETGID))
+	if (!may_setgroups())
 		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
diff --git a/fs/namespace.c b/fs/namespace.c
index 5b66b2b3624d..fe1c77145a78 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -963,7 +963,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	}
 
 	/* Don't allow unprivileged users to reveal what is under a mount */
-	if ((flag & CL_UNPRIVILEGED) && list_empty(&old->mnt_expire))
+	if ((flag & CL_UNPRIVILEGED) &&
+	    (!(flag & CL_EXPIRE) || list_empty(&old->mnt_expire)))
 		mnt->mnt.mnt_flags |= MNT_LOCKED;
 
 	atomic_inc(&sb->s_active);
@@ -1544,6 +1545,9 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 		goto dput_and_out;
 	if (mnt->mnt.mnt_flags & MNT_LOCKED)
 		goto dput_and_out;
+	retval = -EPERM;
+	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
+		goto dput_and_out;
 
 	retval = do_umount(mnt, flags);
 dput_and_out:
@@ -1610,7 +1614,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 	if (IS_ERR(q))
 		return q;
 
-	q->mnt.mnt_flags &= ~MNT_LOCKED;
 	q->mnt_mountpoint = mnt->mnt_mountpoint;
 
 	p = mnt;
@@ -2098,7 +2101,13 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 	}
 	if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
 	    !(mnt_flags & MNT_NODEV)) {
-		return -EPERM;
+		/* Was the nodev implicitly added in mount? */
+		if ((mnt->mnt_ns->user_ns != &init_user_ns) &&
+		    !(sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) {
+			mnt_flags |= MNT_NODEV;
+		} else {
+			return -EPERM;
+		}
 	}
 	if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
 	    !(mnt_flags & MNT_NOSUID)) {
@@ -2958,6 +2967,8 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	/* mount new_root on / */
 	attach_mnt(new_mnt, real_mount(root_parent.mnt), root_mp);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
+	/* A moved mount should not expire automatically */
+	list_del_init(&new_mnt->mnt_expire);
 	unlock_mount_hash();
 	chroot_fs_refs(&root, &new);
 	put_mountpoint(root_mp);
@@ -3002,6 +3013,7 @@ static void __init init_mount_tree(void)
 
 	root.mnt = mnt;
 	root.dentry = mnt->mnt_root;
+	mnt->mnt_flags |= MNT_LOCKED;
 
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
diff --git a/fs/pnode.c b/fs/pnode.c
index aae331a5d03b..260ac8f898a4 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -242,6 +242,7 @@ static int propagate_one(struct mount *m)
 	child = copy_tree(last_source, last_source->mnt.mnt_root, type);
 	if (IS_ERR(child))
 		return PTR_ERR(child);
+	child->mnt.mnt_flags &= ~MNT_LOCKED;
 	mnt_set_mountpoint(m, mp, child);
 	last_dest = m;
 	last_source = child;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 772efa45a452..7dc3ea89ef1a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2464,6 +2464,57 @@ static const struct file_operations proc_projid_map_operations = {
 	.llseek		= seq_lseek,
 	.release	= proc_id_map_release,
 };
+
+static int proc_setgroups_open(struct inode *inode, struct file *file)
+{
+	struct user_namespace *ns = NULL;
+	struct task_struct *task;
+	int ret;
+
+	ret = -ESRCH;
+	task = get_proc_task(inode);
+	if (task) {
+		rcu_read_lock();
+		ns = get_user_ns(task_cred_xxx(task, user_ns));
+		rcu_read_unlock();
+		put_task_struct(task);
+	}
+	if (!ns)
+		goto err;
+
+	if (file->f_mode & FMODE_WRITE) {
+		ret = -EACCES;
+		if (!ns_capable(ns, CAP_SYS_ADMIN))
+			goto err_put_ns;
+	}
+
+	ret = single_open(file, &proc_setgroups_show, ns);
+	if (ret)
+		goto err_put_ns;
+
+	return 0;
+err_put_ns:
+	put_user_ns(ns);
+err:
+	return ret;
+}
+
+static int proc_setgroups_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+	struct user_namespace *ns = seq->private;
+	int ret = single_release(inode, file);
+	put_user_ns(ns);
+	return ret;
+}
+
+static const struct file_operations proc_setgroups_operations = {
+	.open		= proc_setgroups_open,
+	.write		= proc_setgroups_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_setgroups_release,
+};
 #endif /* CONFIG_USER_NS */
 
 static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
@@ -2572,6 +2623,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
+	REG("setgroups",  S_IRUGO|S_IWUSR, proc_setgroups_operations),
 #endif
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	REG("timers",	  S_IRUGO, proc_timers_operations),
@@ -2913,6 +2965,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
+	REG("setgroups",  S_IRUGO|S_IWUSR, proc_setgroups_operations),
 #endif
 };
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
index b2d0820837c4..2fb2ca2127ed 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -68,6 +68,7 @@ extern void groups_free(struct group_info *);
 extern int set_current_groups(struct group_info *);
 extern void set_groups(struct cred *, struct group_info *);
 extern int groups_search(const struct group_info *, kgid_t);
+extern bool may_setgroups(void);
 
 /* access the groups "array" with this macro */
 #define GROUP_AT(gi, i) \
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index e95372654f09..9f3579ff543d 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -17,6 +17,10 @@ struct uid_gid_map {	/* 64 bytes -- 1 cache line */
 	} extent[UID_GID_MAP_MAX_EXTENTS];
 };
 
+#define USERNS_SETGROUPS_ALLOWED 1UL
+
+#define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
+
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
@@ -27,6 +31,7 @@ struct user_namespace {
 	kuid_t			owner;
 	kgid_t			group;
 	unsigned int		proc_inum;
+	unsigned long		flags;
 
 	/* Register of per-UID persistent keyrings for this namespace */
 #ifdef CONFIG_PERSISTENT_KEYRINGS
@@ -63,6 +68,9 @@ extern const struct seq_operations proc_projid_seq_operations;
 extern ssize_t proc_uid_map_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t proc_gid_map_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t, loff_t *);
+extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
+extern int proc_setgroups_show(struct seq_file *m, void *v);
+extern bool userns_may_setgroups(const struct user_namespace *ns);
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -87,6 +95,10 @@ static inline void put_user_ns(struct user_namespace *ns)
 {
 }
 
+static inline bool userns_may_setgroups(const struct user_namespace *ns)
+{
+	return true;
+}
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/groups.c b/kernel/groups.c
index 451698f86cfa..664411f171b5 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/user_namespace.h>
 #include <asm/uaccess.h>
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
@@ -213,6 +214,14 @@ out:
 	return i;
 }
 
+bool may_setgroups(void)
+{
+	struct user_namespace *user_ns = current_user_ns();
+
+	return ns_capable(user_ns, CAP_SETGID) &&
+		userns_may_setgroups(user_ns);
+}
+
 /*
  *	SMP: Our groups are copy-on-write. We can set them safely
  *	without another task interfering.
@@ -223,7 +232,7 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
+	if (!may_setgroups())
 		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
diff --git a/kernel/uid16.c b/kernel/uid16.c
index 602e5bbbceff..d58cc4d8f0d1 100644
--- a/kernel/uid16.c
+++ b/kernel/uid16.c
@@ -176,7 +176,7 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
+	if (!may_setgroups())
 		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
diff --git a/kernel/user.c b/kernel/user.c
index 4efa39350e44..2d09940c9632 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -51,6 +51,7 @@ struct user_namespace init_user_ns = {
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
 	.proc_inum = PROC_USER_INIT_INO,
+	.flags = USERNS_INIT_FLAGS,
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 	.persistent_keyring_register_sem =
 	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index aa312b0dc3ec..ad419b04c146 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -24,6 +24,7 @@
 #include <linux/fs_struct.h>
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
+static DEFINE_MUTEX(userns_state_mutex);
 
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
@@ -99,6 +100,11 @@ int create_user_ns(struct cred *new)
 	ns->owner = owner;
 	ns->group = group;
 
+	/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
+	mutex_lock(&userns_state_mutex);
+	ns->flags = parent_ns->flags;
+	mutex_unlock(&userns_state_mutex);
+
 	set_cred_user_ns(new, ns);
 
 #ifdef CONFIG_PERSISTENT_KEYRINGS
@@ -583,9 +589,6 @@ static bool mappings_overlap(struct uid_gid_map *new_map,
 	return false;
 }
 
-
-static DEFINE_MUTEX(id_map_mutex);
-
 static ssize_t map_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos,
 			 int cap_setid,
@@ -602,7 +605,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	ssize_t ret = -EINVAL;
 
 	/*
-	 * The id_map_mutex serializes all writes to any given map.
+	 * The userns_state_mutex serializes all writes to any given map.
 	 *
 	 * Any map is only ever written once.
 	 *
@@ -620,7 +623,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	 * order and smp_rmb() is guaranteed that we don't have crazy
 	 * architectures returning stale data.
 	 */
-	mutex_lock(&id_map_mutex);
+	mutex_lock(&userns_state_mutex);
 
 	ret = -EPERM;
 	/* Only allow one successful write to the map */
@@ -640,7 +643,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	if (!page)
 		goto out;
 
-	/* Only allow <= page size writes at the beginning of the file */
+	/* Only allow < page size writes at the beginning of the file */
 	ret = -EINVAL;
 	if ((*ppos != 0) || (count >= PAGE_SIZE))
 		goto out;
@@ -750,7 +753,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	*ppos = count;
 	ret = count;
 out:
-	mutex_unlock(&id_map_mutex);
+	mutex_unlock(&userns_state_mutex);
 	if (page)
 		free_page(page);
 	return ret;
@@ -812,16 +815,21 @@ static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
 				struct uid_gid_map *new_map)
 {
-	/* Allow mapping to your own filesystem ids */
-	if ((new_map->nr_extents == 1) && (new_map->extent[0].count == 1)) {
+	const struct cred *cred = file->f_cred;
+	/* Don't allow mappings that would allow anything that wouldn't
+	 * be allowed without the establishment of unprivileged mappings.
+	 */
+	if ((new_map->nr_extents == 1) && (new_map->extent[0].count == 1) &&
+	    uid_eq(ns->owner, cred->euid)) {
 		u32 id = new_map->extent[0].lower_first;
 		if (cap_setid == CAP_SETUID) {
 			kuid_t uid = make_kuid(ns->parent, id);
-			if (uid_eq(uid, file->f_cred->fsuid))
+			if (uid_eq(uid, cred->euid))
 				return true;
 		} else if (cap_setid == CAP_SETGID) {
 			kgid_t gid = make_kgid(ns->parent, id);
-			if (gid_eq(gid, file->f_cred->fsgid))
+			if (!(ns->flags & USERNS_SETGROUPS_ALLOWED) &&
+			    gid_eq(gid, cred->egid))
 				return true;
 		}
 	}
@@ -841,6 +849,100 @@ static bool new_idmap_permitted(const struct file *file,
 	return false;
 }
 
+int proc_setgroups_show(struct seq_file *seq, void *v)
+{
+	struct user_namespace *ns = seq->private;
+	unsigned long userns_flags = ACCESS_ONCE(ns->flags);
+
+	seq_printf(seq, "%s\n",
+		   (userns_flags & USERNS_SETGROUPS_ALLOWED) ?
+		   "allow" : "deny");
+	return 0;
+}
+
+ssize_t proc_setgroups_write(struct file *file, const char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	struct seq_file *seq = file->private_data;
+	struct user_namespace *ns = seq->private;
+	char kbuf[8], *pos;
+	bool setgroups_allowed;
+	ssize_t ret;
+
+	/* Only allow a very narrow range of strings to be written */
+	ret = -EINVAL;
+	if ((*ppos != 0) || (count >= sizeof(kbuf)))
+		goto out;
+
+	/* What was written? */
+	ret = -EFAULT;
+	if (copy_from_user(kbuf, buf, count))
+		goto out;
+	kbuf[count] = '\0';
+	pos = kbuf;
+
+	/* What is being requested? */
+	ret = -EINVAL;
+	if (strncmp(pos, "allow", 5) == 0) {
+		pos += 5;
+		setgroups_allowed = true;
+	}
+	else if (strncmp(pos, "deny", 4) == 0) {
+		pos += 4;
+		setgroups_allowed = false;
+	}
+	else
+		goto out;
+
+	/* Verify there is not trailing junk on the line */
+	pos = skip_spaces(pos);
+	if (*pos != '\0')
+		goto out;
+
+	ret = -EPERM;
+	mutex_lock(&userns_state_mutex);
+	if (setgroups_allowed) {
+		/* Enabling setgroups after setgroups has been disabled
+		 * is not allowed.
+		 */
+		if (!(ns->flags & USERNS_SETGROUPS_ALLOWED))
+			goto out_unlock;
+	} else {
+		/* Permanently disabling setgroups after setgroups has
+		 * been enabled by writing the gid_map is not allowed.
+		 */
+		if (ns->gid_map.nr_extents != 0)
+			goto out_unlock;
+		ns->flags &= ~USERNS_SETGROUPS_ALLOWED;
+	}
+	mutex_unlock(&userns_state_mutex);
+
+	/* Report a successful write */
+	*ppos = count;
+	ret = count;
+out:
+	return ret;
+out_unlock:
+	mutex_unlock(&userns_state_mutex);
+	goto out;
+}
+
+bool userns_may_setgroups(const struct user_namespace *ns)
+{
+	bool allowed;
+
+	mutex_lock(&userns_state_mutex);
+	/* It is not safe to use setgroups until a gid mapping in
+	 * the user namespace has been established.
+	 */
+	allowed = ns->gid_map.nr_extents != 0;
+	/* Is setgroups allowed? */
+	allowed = allowed && (ns->flags & USERNS_SETGROUPS_ALLOWED);
+	mutex_unlock(&userns_state_mutex);
+
+	return allowed;
+}
+
 static void *userns_get(struct task_struct *task)
 {
 	struct user_namespace *user_ns;
diff --git a/tools/testing/selftests/mount/unprivileged-remount-test.c b/tools/testing/selftests/mount/unprivileged-remount-test.c
index 1b3ff2fda4d0..517785052f1c 100644
--- a/tools/testing/selftests/mount/unprivileged-remount-test.c
+++ b/tools/testing/selftests/mount/unprivileged-remount-test.c
@@ -6,6 +6,8 @@
 #include <sys/types.h>
 #include <sys/mount.h>
 #include <sys/wait.h>
+#include <sys/vfs.h>
+#include <sys/statvfs.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <fcntl.h>
@@ -32,11 +34,14 @@
 # define CLONE_NEWPID 0x20000000
 #endif
 
+#ifndef MS_REC
+# define MS_REC 16384
+#endif
 #ifndef MS_RELATIME
-#define MS_RELATIME (1 << 21)
+# define MS_RELATIME (1 << 21)
 #endif
 #ifndef MS_STRICTATIME
-#define MS_STRICTATIME (1 << 24)
+# define MS_STRICTATIME (1 << 24)
 #endif
 
 static void die(char *fmt, ...)
@@ -48,17 +53,14 @@ static void die(char *fmt, ...)
 	exit(EXIT_FAILURE);
 }
 
-static void write_file(char *filename, char *fmt, ...)
+static void vmaybe_write_file(bool enoent_ok, char *filename, char *fmt, va_list ap)
 {
 	char buf[4096];
 	int fd;
 	ssize_t written;
 	int buf_len;
-	va_list ap;
 
-	va_start(ap, fmt);
 	buf_len = vsnprintf(buf, sizeof(buf), fmt, ap);
-	va_end(ap);
 	if (buf_len < 0) {
 		die("vsnprintf failed: %s\n",
 		    strerror(errno));
@@ -69,6 +71,8 @@ static void write_file(char *filename, char *fmt, ...)
 
 	fd = open(filename, O_WRONLY);
 	if (fd < 0) {
+		if ((errno == ENOENT) && enoent_ok)
+			return;
 		die("open of %s failed: %s\n",
 		    filename, strerror(errno));
 	}
@@ -87,6 +91,65 @@ static void write_file(char *filename, char *fmt, ...)
 	}
 }
 
+static void maybe_write_file(char *filename, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vmaybe_write_file(true, filename, fmt, ap);
+	va_end(ap);
+
+}
+
+static void write_file(char *filename, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vmaybe_write_file(false, filename, fmt, ap);
+	va_end(ap);
+
+}
+
+static int read_mnt_flags(const char *path)
+{
+	int ret;
+	struct statvfs stat;
+	int mnt_flags;
+
+	ret = statvfs(path, &stat);
+	if (ret != 0) {
+		die("statvfs of %s failed: %s\n",
+			path, strerror(errno));
+	}
+	if (stat.f_flag & ~(ST_RDONLY | ST_NOSUID | ST_NODEV | \
+			ST_NOEXEC | ST_NOATIME | ST_NODIRATIME | ST_RELATIME | \
+			ST_SYNCHRONOUS | ST_MANDLOCK)) {
+		die("Unrecognized mount flags\n");
+	}
+	mnt_flags = 0;
+	if (stat.f_flag & ST_RDONLY)
+		mnt_flags |= MS_RDONLY;
+	if (stat.f_flag & ST_NOSUID)
+		mnt_flags |= MS_NOSUID;
+	if (stat.f_flag & ST_NODEV)
+		mnt_flags |= MS_NODEV;
+	if (stat.f_flag & ST_NOEXEC)
+		mnt_flags |= MS_NOEXEC;
+	if (stat.f_flag & ST_NOATIME)
+		mnt_flags |= MS_NOATIME;
+	if (stat.f_flag & ST_NODIRATIME)
+		mnt_flags |= MS_NODIRATIME;
+	if (stat.f_flag & ST_RELATIME)
+		mnt_flags |= MS_RELATIME;
+	if (stat.f_flag & ST_SYNCHRONOUS)
+		mnt_flags |= MS_SYNCHRONOUS;
+	if (stat.f_flag & ST_MANDLOCK)
+		mnt_flags |= ST_MANDLOCK;
+
+	return mnt_flags;
+}
+
 static void create_and_enter_userns(void)
 {
 	uid_t uid;
@@ -100,13 +163,10 @@ static void create_and_enter_userns(void)
 			strerror(errno));
 	}
 
+	maybe_write_file("/proc/self/setgroups", "deny");
 	write_file("/proc/self/uid_map", "0 %d 1", uid);
 	write_file("/proc/self/gid_map", "0 %d 1", gid);
 
-	if (setgroups(0, NULL) != 0) {
-		die("setgroups failed: %s\n",
-			strerror(errno));
-	}
 	if (setgid(0) != 0) {
 		die ("setgid(0) failed %s\n",
 			strerror(errno));
@@ -118,7 +178,8 @@ static void create_and_enter_userns(void)
 }
 
 static
-bool test_unpriv_remount(int mount_flags, int remount_flags, int invalid_flags)
+bool test_unpriv_remount(const char *fstype, const char *mount_options,
+			 int mount_flags, int remount_flags, int invalid_flags)
 {
 	pid_t child;
 
@@ -151,9 +212,11 @@ bool test_unpriv_remount(int mount_flags, int remount_flags, int invalid_flags)
 			strerror(errno));
 	}
 
-	if (mount("testing", "/tmp", "ramfs", mount_flags, NULL) != 0) {
-		die("mount of /tmp failed: %s\n",
-			strerror(errno));
+	if (mount("testing", "/tmp", fstype, mount_flags, mount_options) != 0) {
+		die("mount of %s with options '%s' on /tmp failed: %s\n",
+		    fstype,
+		    mount_options? mount_options : "",
+		    strerror(errno));
 	}
 
 	create_and_enter_userns();
@@ -181,62 +244,127 @@ bool test_unpriv_remount(int mount_flags, int remount_flags, int invalid_flags)
 
 static bool test_unpriv_remount_simple(int mount_flags)
 {
-	return test_unpriv_remount(mount_flags, mount_flags, 0);
+	return test_unpriv_remount("ramfs", NULL, mount_flags, mount_flags, 0);
 }
 
 static bool test_unpriv_remount_atime(int mount_flags, int invalid_flags)
 {
-	return test_unpriv_remount(mount_flags, mount_flags, invalid_flags);
+	return test_unpriv_remount("ramfs", NULL, mount_flags, mount_flags,
+				   invalid_flags);
+}
+
+static bool test_priv_mount_unpriv_remount(void)
+{
+	pid_t child;
+	int ret;
+	const char *orig_path = "/dev";
+	const char *dest_path = "/tmp";
+	int orig_mnt_flags, remount_mnt_flags;
+
+	child = fork();
+	if (child == -1) {
+		die("fork failed: %s\n",
+			strerror(errno));
+	}
+	if (child != 0) { /* parent */
+		pid_t pid;
+		int status;
+		pid = waitpid(child, &status, 0);
+		if (pid == -1) {
+			die("waitpid failed: %s\n",
+				strerror(errno));
+		}
+		if (pid != child) {
+			die("waited for %d got %d\n",
+				child, pid);
+		}
+		if (!WIFEXITED(status)) {
+			die("child did not terminate cleanly\n");
+		}
+		return WEXITSTATUS(status) == EXIT_SUCCESS ? true : false;
+	}
+
+	orig_mnt_flags = read_mnt_flags(orig_path);
+
+	create_and_enter_userns();
+	ret = unshare(CLONE_NEWNS);
+	if (ret != 0) {
+		die("unshare(CLONE_NEWNS) failed: %s\n",
+			strerror(errno));
+	}
+
+	ret = mount(orig_path, dest_path, "bind", MS_BIND | MS_REC, NULL);
+	if (ret != 0) {
+		die("recursive bind mount of %s onto %s failed: %s\n",
+			orig_path, dest_path, strerror(errno));
+	}
+
+	ret = mount(dest_path, dest_path, "none",
+		    MS_REMOUNT | MS_BIND | orig_mnt_flags , NULL);
+	if (ret != 0) {
+		/* system("cat /proc/self/mounts"); */
+		die("remount of /tmp failed: %s\n",
+		    strerror(errno));
+	}
+
+	remount_mnt_flags = read_mnt_flags(dest_path);
+	if (orig_mnt_flags != remount_mnt_flags) {
+		die("Mount flags unexpectedly changed during remount of %s originally mounted on %s\n",
+			dest_path, orig_path);
+	}
+	exit(EXIT_SUCCESS);
 }
 
 int main(int argc, char **argv)
 {
-	if (!test_unpriv_remount_simple(MS_RDONLY|MS_NODEV)) {
+	if (!test_unpriv_remount_simple(MS_RDONLY)) {
 		die("MS_RDONLY malfunctions\n");
 	}
-	if (!test_unpriv_remount_simple(MS_NODEV)) {
+	if (!test_unpriv_remount("devpts", "newinstance", MS_NODEV, MS_NODEV, 0)) {
 		die("MS_NODEV malfunctions\n");
 	}
-	if (!test_unpriv_remount_simple(MS_NOSUID|MS_NODEV)) {
+	if (!test_unpriv_remount_simple(MS_NOSUID)) {
 		die("MS_NOSUID malfunctions\n");
 	}
-	if (!test_unpriv_remount_simple(MS_NOEXEC|MS_NODEV)) {
+	if (!test_unpriv_remount_simple(MS_NOEXEC)) {
 		die("MS_NOEXEC malfunctions\n");
 	}
-	if (!test_unpriv_remount_atime(MS_RELATIME|MS_NODEV,
-				       MS_NOATIME|MS_NODEV))
+	if (!test_unpriv_remount_atime(MS_RELATIME,
+				       MS_NOATIME))
 	{
 		die("MS_RELATIME malfunctions\n");
 	}
-	if (!test_unpriv_remount_atime(MS_STRICTATIME|MS_NODEV,
-				       MS_NOATIME|MS_NODEV))
+	if (!test_unpriv_remount_atime(MS_STRICTATIME,
+				       MS_NOATIME))
 	{
 		die("MS_STRICTATIME malfunctions\n");
 	}
-	if (!test_unpriv_remount_atime(MS_NOATIME|MS_NODEV,
-				       MS_STRICTATIME|MS_NODEV))
+	if (!test_unpriv_remount_atime(MS_NOATIME,
+				       MS_STRICTATIME))
 	{
-		die("MS_RELATIME malfunctions\n");
+		die("MS_NOATIME malfunctions\n");
 	}
-	if (!test_unpriv_remount_atime(MS_RELATIME|MS_NODIRATIME|MS_NODEV,
-				       MS_NOATIME|MS_NODEV))
+	if (!test_unpriv_remount_atime(MS_RELATIME|MS_NODIRATIME,
+				       MS_NOATIME))
 	{
-		die("MS_RELATIME malfunctions\n");
+		die("MS_RELATIME|MS_NODIRATIME malfunctions\n");
 	}
-	if (!test_unpriv_remount_atime(MS_STRICTATIME|MS_NODIRATIME|MS_NODEV,
-				       MS_NOATIME|MS_NODEV))
+	if (!test_unpriv_remount_atime(MS_STRICTATIME|MS_NODIRATIME,
+				       MS_NOATIME))
 	{
-		die("MS_RELATIME malfunctions\n");
+		die("MS_STRICTATIME|MS_NODIRATIME malfunctions\n");
 	}
-	if (!test_unpriv_remount_atime(MS_NOATIME|MS_NODIRATIME|MS_NODEV,
-				       MS_STRICTATIME|MS_NODEV))
+	if (!test_unpriv_remount_atime(MS_NOATIME|MS_NODIRATIME,
+				       MS_STRICTATIME))
 	{
-		die("MS_RELATIME malfunctions\n");
+		die("MS_NOATIME|MS_DIRATIME malfunctions\n");
 	}
-	if (!test_unpriv_remount(MS_STRICTATIME|MS_NODEV, MS_NODEV,
-				 MS_NOATIME|MS_NODEV))
+	if (!test_unpriv_remount("ramfs", NULL, MS_STRICTATIME, 0, MS_NOATIME))
 	{
 		die("Default atime malfunctions\n");
 	}
+	if (!test_priv_mount_unpriv_remount()) {
+		die("Mount flags unexpectedly changed after remount\n");
+	}
 	return EXIT_SUCCESS;
 }

^ permalink raw reply related

* [PATCH 5/5] if_tun: drop broken IFF_VNET_LE
From: Michael S. Tsirkin @ 2014-12-16 13:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Miller, netdev, Dan Carpenter, linux-api, Jason Wang
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

Everyone should use TUNSETVNETLE/TUNGETVNETLE instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/if_tun.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 274630c..50ae243 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -59,7 +59,6 @@
 #define IFF_ONE_QUEUE	0x2000
 #define IFF_VNET_HDR	0x4000
 #define IFF_TUN_EXCL	0x8000
-#define IFF_VNET_LE	0x10000
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_ATTACH_QUEUE 0x0200
 #define IFF_DETACH_QUEUE 0x0400
-- 
MST

^ permalink raw reply related

* [PATCH 2/5] if_tun: add TUNSETVNETLE/TUNGETVNETLE
From: Michael S. Tsirkin @ 2014-12-16 13:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Miller, netdev, Dan Carpenter, linux-api, Jason Wang
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

ifreq flags field is only 16 bit wide, so setting IFF_VNET_LE there has
no effect:
doesn't fit in two bytes.

The tests passed apparently because they have an even number of bugs,
all cancelling out.

Luckily we didn't release a kernel with this flag, so it's
not too late to fix this.

Add TUNSETVNETLE/TUNGETVNETLE to really achieve the purpose
of IFF_VNET_LE.

This has an added benefit that if we ever want a BE flag,
we won't have to deal with weird configurations like
setting both LE and BE at the same time.

IFF_VNET_LE will be dropped in a follow-up patch.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/if_tun.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 18b2403..274630c 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -48,6 +48,8 @@
 #define TUNSETQUEUE  _IOW('T', 217, int)
 #define TUNSETIFINDEX	_IOW('T', 218, unsigned int)
 #define TUNGETFILTER _IOR('T', 219, struct sock_fprog)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNGETVNETLE _IOR('T', 221, int)

 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
-- 
MST

^ permalink raw reply related

* Re: [CFT] Can I get some Tested-By's on this series?
From: Richard Weinberger @ 2014-12-16  9:23 UTC (permalink / raw)
  To: Andy Lutomirski, Eric W. Biederman
  Cc: Serge E. Hallyn, Linux Containers, Josh Triplett, Andrew Morton,
	Kees Cook, Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, LSM,
	Casey Schaufler, Kenton Varda, stable
In-Reply-To: <CALCETrWXp3eq2O068NZcd+KPCg+N2y0T57Q0JzHcZLjzq+mXLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Am 16.12.2014 um 03:05 schrieb Andy Lutomirski:
> On Wed, Dec 10, 2014 at 8:39 AM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Will people please test these patches with their container project?
>>
>> These changes break container userspace (hopefully in a minimal way) if
>> I could have that confirmed by testing I would really appreciate it.  I
>> really don't want to send out a bug fix that accidentally breaks
>> userspace again.
>>
>> The only issue sort of under discussion is if there is a better name for
>> /proc/<pid>/setgroups, and the name of the file will not affect the
>> functionality of the patchset.
>>
>> With the code reviewed and written in simple obviously correct, easily
>> reviewable ways I am hoping/planning to send this to Linus ASAP.
> 
> 
> I tested this with Sandstorm.  It breaks as is and it works if I add
> the setgroups thing.
> 
> Tested-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> # breaks things as designed :(
> 
> I still don't like the name "setgroups".

I agree that the name is not optimal.
But I don't have a counterproposal as my naming skills are miserable.

Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [CFT] Can I get some Tested-By's on this series?
From: Andy Lutomirski @ 2014-12-16  2:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-man, Kees Cook, Richard Weinberger, Linux Containers,
	Josh Triplett, stable,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kenton Varda, LSM, Michael Kerrisk-manpages, Linux API,
	Casey Schaufler, Andrew Morton
In-Reply-To: <87mw6vh31e.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Wed, Dec 10, 2014 at 8:39 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Will people please test these patches with their container project?
>
> These changes break container userspace (hopefully in a minimal way) if
> I could have that confirmed by testing I would really appreciate it.  I
> really don't want to send out a bug fix that accidentally breaks
> userspace again.
>
> The only issue sort of under discussion is if there is a better name for
> /proc/<pid>/setgroups, and the name of the file will not affect the
> functionality of the patchset.
>
> With the code reviewed and written in simple obviously correct, easily
> reviewable ways I am hoping/planning to send this to Linus ASAP.


I tested this with Sandstorm.  It breaks as is and it works if I add
the setgroups thing.

Tested-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> # breaks things as designed :(

I still don't like the name "setgroups".

--Andy

>
> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* [GIT PULL RESEND] kselftest-3.19-rc1
From: Shuah Khan @ 2014-12-15 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Shuah Khan, Shuah Khan
In-Reply-To: <548EEA88.2060206-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

Hi Linus,

Please pull the following Kselftest updates for 3.19-rc1. Hope
I have this right this time. Sorry again for the messed up
first request.

thanks,
-- Shuah

The following changes since commit 0df1f2487d2f0d04703f142813d53615d62a1da4:

  Linux 3.18-rc3 (2014-11-02 15:01:51 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-3.19-rc1

for you to fetch changes up to 3ce51050fadd63737c03627293ca2dc4be238891:

  selftest: size: Add size test for Linux kernel (2014-12-03 19:27:47 -0700)

----------------------------------------------------------------
kselftest updates for 3.19-rc1

----------------------------------------------------------------
Michael Ellerman (3):
      kcmp: Move kcmp.h into uapi
      selftests/kcmp: Don't include kernel headers
      selftests/kcmp: Always try to build the test

Shuah Khan (7):
      selftests/net: move test out of Makefile into a shell script
      selftests/user: move test out of Makefile into a shell script
      selftests: add kselftest framework for uniform test reporting
      selftests/breakpoints: change test to use ksft framework
      selftests/ipc: change test to use ksft framework
      selftests/kcmp: change test to use ksft framework
      selftests/timers: change test to use ksft framework

Tim Bird (1):
      selftest: size: Add size test for Linux kernel

 include/uapi/linux/Kbuild                          |   1 +
 include/{ => uapi}/linux/kcmp.h                    |   6 +-
 tools/testing/selftests/Makefile                   |   1 +
 .../selftests/breakpoints/breakpoint_test.c        |  10 ++-
 tools/testing/selftests/ipc/msgque.c               |  26 +++---
 tools/testing/selftests/kcmp/Makefile              |  22 +----
 tools/testing/selftests/kcmp/kcmp_test.c           |  27 ++++--
 tools/testing/selftests/kselftest.h                |  62 +++++++++++++
 tools/testing/selftests/net/Makefile               |   8 +-
 tools/testing/selftests/net/test_bpf.sh            |  10 +++
 tools/testing/selftests/size/.gitignore            |   1 +
 tools/testing/selftests/size/Makefile              |  12 +++
 tools/testing/selftests/size/get_size.c            | 100
+++++++++++++++++++++
 tools/testing/selftests/timers/posix_timers.c      |  14 +--
 tools/testing/selftests/user/Makefile              |   8 +-
 tools/testing/selftests/user/test_user_copy.sh     |  10 +++
 16 files changed, 252 insertions(+), 66 deletions(-)
 rename include/{ => uapi}/linux/kcmp.h (62%)
 create mode 100644 tools/testing/selftests/kselftest.h
tools/testing/selftests/net/Makefile               |   8 +-
 tools/testing/selftests/net/test_bpf.sh            |  10 +++
 tools/testing/selftests/size/.gitignore            |   1 +
 tools/testing/selftests/size/Makefile              |  12 +++
 tools/testing/selftests/size/get_size.c            | 100
+++++++++++++++++++++
 tools/testing/selftests/timers/posix_timers.c      |  14 +--
 tools/testing/selftests/user/Makefile              |   8 +-
 tools/testing/selftests/user/test_user_copy.sh     |  10 +++
 16 files changed, 252 insertions(+), 66 deletions(-)
 rename include/{ => uapi}/linux/kcmp.h (62%)
 create mode 100644 tools/testing/selftests/kselftest.h
 create mode 100755 tools/testing/selftests/net/test_bpf.sh
 create mode 100644 tools/testing/selftests/size/.gitignore
 create mode 100644 tools/testing/selftests/size/Makefile
 create mode 100644 tools/testing/selftests/size/get_size.c
 create mode 100755 tools/testing/selftests/user/test_user_copy.sh


-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Open Source Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* [PATCH RFC 5/5] virtio_pci: macros for PCI layout offsets.
From: Michael S. Tsirkin @ 2014-12-15 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, virtualization, linux-sh, cornelia.huck, linux-api
In-Reply-To: <1418679168-1011-1-git-send-email-mst@redhat.com>

QEMU wants it, so why not?  Trust, but verify.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 include/uapi/linux/virtio_pci.h    | 30 ++++++++++++++++++
 drivers/virtio/virtio_pci_modern.c | 63 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 28c2ce0..5d546c6 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -159,6 +159,36 @@ struct virtio_pci_common_cfg {
 	__le32 queue_used_hi;		/* read-write */
 };
 
+/* Macro versions of offsets for the Old Timers! */
+#define VIRTIO_PCI_CAP_VNDR		0
+#define VIRTIO_PCI_CAP_NEXT		1
+#define VIRTIO_PCI_CAP_LEN		2
+#define VIRTIO_PCI_CAP_TYPE_AND_BAR	3
+#define VIRTIO_PCI_CAP_OFFSET		4
+#define VIRTIO_PCI_CAP_LENGTH		8
+
+#define VIRTIO_PCI_NOTIFY_CAP_MULT	12
+
+#define VIRTIO_PCI_COMMON_DFSELECT	0
+#define VIRTIO_PCI_COMMON_DF		4
+#define VIRTIO_PCI_COMMON_GFSELECT	8
+#define VIRTIO_PCI_COMMON_GF		12
+#define VIRTIO_PCI_COMMON_MSIX		16
+#define VIRTIO_PCI_COMMON_NUMQ		18
+#define VIRTIO_PCI_COMMON_STATUS	20
+#define VIRTIO_PCI_COMMON_CFGGENERATION	21
+#define VIRTIO_PCI_COMMON_Q_SELECT	22
+#define VIRTIO_PCI_COMMON_Q_SIZE	24
+#define VIRTIO_PCI_COMMON_Q_MSIX	26
+#define VIRTIO_PCI_COMMON_Q_ENABLE	28
+#define VIRTIO_PCI_COMMON_Q_NOFF	30
+#define VIRTIO_PCI_COMMON_Q_DESCLO	32
+#define VIRTIO_PCI_COMMON_Q_DESCHI	36
+#define VIRTIO_PCI_COMMON_Q_AVAILLO	40
+#define VIRTIO_PCI_COMMON_Q_AVAILHI	44
+#define VIRTIO_PCI_COMMON_Q_USEDLO	48
+#define VIRTIO_PCI_COMMON_Q_USEDHI	52
+
 #endif /* VIRTIO_PCI_NO_MODERN */
 
 #endif
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 636666c..5b0c012 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -446,6 +446,67 @@ static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type,
 	return 0;
 }
 
+/* This is part of the ABI.  Don't screw with it. */
+static inline void check_offsets(void)
+{
+	/* Note: disk space was harmed in compilation of this function. */
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_VNDR !=
+		     offsetof(struct virtio_pci_cap, cap_vndr));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_NEXT !=
+		     offsetof(struct virtio_pci_cap, cap_next));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
+		     offsetof(struct virtio_pci_cap, cap_len));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_TYPE_AND_BAR !=
+		     offsetof(struct virtio_pci_cap, type_and_bar));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
+		     offsetof(struct virtio_pci_cap, offset));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
+		     offsetof(struct virtio_pci_cap, length));
+	BUILD_BUG_ON(VIRTIO_PCI_NOTIFY_CAP_MULT !=
+		     offsetof(struct virtio_pci_notify_cap,
+			      notify_off_multiplier));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      device_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DF !=
+		     offsetof(struct virtio_pci_common_cfg, device_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      guest_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GF !=
+		     offsetof(struct virtio_pci_common_cfg, guest_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, msix_config));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_NUMQ !=
+		     offsetof(struct virtio_pci_common_cfg, num_queues));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_STATUS !=
+		     offsetof(struct virtio_pci_common_cfg, device_status));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_CFGGENERATION !=
+		     offsetof(struct virtio_pci_common_cfg, config_generation));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SELECT !=
+		     offsetof(struct virtio_pci_common_cfg, queue_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SIZE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_size));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, queue_msix_vector));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_ENABLE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_enable));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_NOFF !=
+		     offsetof(struct virtio_pci_common_cfg, queue_notify_off));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_hi));
+}
+
 /* the PCI probing function */
 int virtio_pci_modern_probe(struct pci_dev *pci_dev,
 			    const struct pci_device_id *id)
@@ -455,6 +516,8 @@ int virtio_pci_modern_probe(struct pci_dev *pci_dev,
 	struct virtio_device_id virtio_id;
 	u32 notify_length;
 
+	check_offsets();
+
 	/* We only own devices >= 0x1000 and <= 0x107f: leave the rest. */
 	if (pci_dev->device < 0x1000 || pci_dev->device > 0x107f)
 		return -ENODEV;
-- 
MST

^ permalink raw reply related

* [PATCH RFC 3/5] virtio-pci: define layout for virtio 1.0
From: Michael S. Tsirkin @ 2014-12-15 21:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-sh, linux-api, virtualization
In-Reply-To: <1418679168-1011-1-git-send-email-mst@redhat.com>

From: Rusty Russell <rusty@rustcorp.com.au>

Based on patches by Michael S. Tsirkin <mst@redhat.com>, but I found it
hard to follow so changed to use structures which are more
self-documenting.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_pci.h | 62 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 35b552c..28c2ce0 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -99,4 +99,66 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+#ifndef VIRTIO_PCI_NO_MODERN
+
+/* IDs for different capabilities.  Must all exist. */
+
+/* Common configuration */
+#define VIRTIO_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
+/* ISR access */
+#define VIRTIO_PCI_CAP_ISR_CFG		3
+/* Device specific confiuration */
+#define VIRTIO_PCI_CAP_DEVICE_CFG	4
+
+/* This is the PCI capability header: */
+struct virtio_pci_cap {
+	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
+	__u8 cap_next;		/* Generic PCI field: next ptr. */
+	__u8 cap_len;		/* Generic PCI field: capability length */
+	__u8 type_and_bar;	/* Upper 3 bits: bar.
+				 * Lower 3 is VIRTIO_PCI_CAP_*_CFG. */
+	__le32 offset;		/* Offset within bar. */
+	__le32 length;		/* Length. */
+};
+
+#define VIRTIO_PCI_CAP_BAR_SHIFT	5
+#define VIRTIO_PCI_CAP_BAR_MASK		0x7
+#define VIRTIO_PCI_CAP_TYPE_SHIFT	0
+#define VIRTIO_PCI_CAP_TYPE_MASK	0x7
+
+struct virtio_pci_notify_cap {
+	struct virtio_pci_cap cap;
+	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
+};
+
+/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
+struct virtio_pci_common_cfg {
+	/* About the whole device. */
+	__le32 device_feature_select;	/* read-write */
+	__le32 device_feature;		/* read-only */
+	__le32 guest_feature_select;	/* read-write */
+	__le32 guest_feature;		/* read-write */
+	__le16 msix_config;		/* read-write */
+	__le16 num_queues;		/* read-only */
+	__u8 device_status;		/* read-write */
+	__u8 config_generation;		/* read-only */
+
+	/* About a specific virtqueue. */
+	__le16 queue_select;		/* read-write */
+	__le16 queue_size;		/* read-write, power of 2. */
+	__le16 queue_msix_vector;	/* read-write */
+	__le16 queue_enable;		/* read-write */
+	__le16 queue_notify_off;	/* read-only */
+	__le32 queue_desc_lo;		/* read-write */
+	__le32 queue_desc_hi;		/* read-write */
+	__le32 queue_avail_lo;		/* read-write */
+	__le32 queue_avail_hi;		/* read-write */
+	__le32 queue_used_lo;		/* read-write */
+	__le32 queue_used_hi;		/* read-write */
+};
+
+#endif /* VIRTIO_PCI_NO_MODERN */
+
 #endif
-- 
MST

^ permalink raw reply related

* [PATCH 5/6] virtio_pci: add VIRTIO_PCI_NO_LEGACY
From: Michael S. Tsirkin @ 2014-12-15 21:23 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Rusty Russell,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1418678575-31755-1-git-send-email-mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Add macro to disable all legacy register defines.
Helpful to make sure legacy macros don't leak
through into modern code.

Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/virtio_pci.h | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index e5ec1ca..35b552c 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -41,6 +41,8 @@
 
 #include <linux/virtio_config.h>
 
+#ifndef VIRTIO_PCI_NO_LEGACY
+
 /* A 32-bit r/o bitmask of the features supported by the host */
 #define VIRTIO_PCI_HOST_FEATURES	0
 
@@ -67,16 +69,11 @@
  * a read-and-acknowledge. */
 #define VIRTIO_PCI_ISR			19
 
-/* The bit of the ISR which indicates a device configuration change. */
-#define VIRTIO_PCI_ISR_CONFIG		0x2
-
 /* MSI-X registers: only enabled if MSI-X is enabled. */
 /* A 16-bit vector for configuration changes. */
 #define VIRTIO_MSI_CONFIG_VECTOR        20
 /* A 16-bit vector for selected queue notifications. */
 #define VIRTIO_MSI_QUEUE_VECTOR         22
-/* Vector value used to disable MSI for queue */
-#define VIRTIO_MSI_NO_VECTOR            0xffff
 
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
@@ -94,4 +91,12 @@
 /* The alignment to use between consumer and producer parts of vring.
  * x86 pagesize again. */
 #define VIRTIO_PCI_VRING_ALIGN		4096
+
+#endif /* VIRTIO_PCI_NO_LEGACY */
+
+/* The bit of the ISR which indicates a device configuration change. */
+#define VIRTIO_PCI_ISR_CONFIG		0x2
+/* Vector value used to disable MSI for queue */
+#define VIRTIO_MSI_NO_VECTOR            0xffff
+
 #endif
-- 
MST

^ permalink raw reply related

* Re: [CFT] Can I get some Tested-By's on this series?
From: Serge Hallyn @ 2014-12-15 20:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-man, Kees Cook, Richard Weinberger, Linux Containers,
	Josh Triplett, stable, Andy Lutomirski, Kenton Varda, LSM,
	Michael Kerrisk-manpages, Linux API, Casey Schaufler,
	Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <8761dcwu40.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >> Stéphane Graber <stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> >> 
> >> > On Fri, Dec 12, 2014 at 03:38:18PM -0600, Eric W. Biederman wrote:
> >> >> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> >> >> 
> >> >> > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >> >> >> 
> >> >> >> Will people please test these patches with their container project?
> >> >> >> 
> >> >> >> These changes break container userspace (hopefully in a minimal way) if
> >> >> >> I could have that confirmed by testing I would really appreciate it.  I
> >> >> >> really don't want to send out a bug fix that accidentally breaks
> >> >> >> userspace again.
> >> >> >> 
> >> >> >> The only issue sort of under discussion is if there is a better name for
> >> >> >> /proc/<pid>/setgroups, and the name of the file will not affect the
> >> >> >> functionality of the patchset.
> >> >> >> 
> >> >> >> With the code reviewed and written in simple obviously correct, easily
> >> >> >> reviewable ways I am hoping/planning to send this to Linus ASAP.
> >> >> >> 
> >> >> >> Eric
> >> >> >
> >> >> > Is there a git tree we can clone?
> >> >> 
> >> >> Have either of you been able to check to see if any of my changes
> >> >> affects lxc?
> >> >> 
> >> >> I am trying to gauge how hard and how fast I should push to Linus.  lxc
> >> >> being the largest adopter of unprivileged user namespaces for general
> >> >> purpose containers.
> >> >> 
> >> >> I expect you just call newuidmap and newgidmap and don't actually care
> >> >> about not being able to set gid_map without privilege.  But I really
> >> >> want to avoid pushing a security fix and then being surprised that
> >> >> things like lxc break.
> >> >> 
> >> >> Eric
> >> >
> >> > Hi Eric,
> >> >
> >> > I've unfortunately been pretty busy this week as I was (well, still am)
> >> > travelling to South Africa for a meeting. I don't have a full kernel
> >> > tree around here and a full git clone isn't really doable over the kind
> >> > of Internet I've got here :)
> >> >
> >> > Hopefully Serge can give it a quick try, otherwise I should be able to
> >> > do some tests on Tuesday when I'm back home.
> >> 
> >> I thought Serge was going to but I haven't heard yet so I am prodding ;-)
> >
> > Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
> > Just to be sure I was testing the right thing I also tested using
> > my unprivileged nsexec testcases, and they failed on setgroup/setgid
> > as now expected, and succeeded there without your patches.
> 
> Thanks.
> 
> Serge unless you object will add your Tested-By to my pull message to Linus.

Sounds good.

> Minor question do you runprivileged nsexec test cases test to see if the
> write to gid_map succeeds?  I would have expected the gid_map write to
> fail before the setgroups setgid system calls came into play.

Yes, I did that by hand, and it failed (with your kernel).

-serge
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox