Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
From: Peter Zijlstra @ 2015-01-05 13:00 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Thomas Gleixner,
	Tomeu Vizoso, linux-kernel, linux-api
In-Reply-To: <1415292718-19785-2-git-send-email-pawel.moll@arm.com>

On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
>  Documentation/kernel-parameters.txt |  9 +++++++++
>  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)

> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 4c81a86..8ead8d8 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt

> @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			allocator.  This parameter is primarily	for debugging
>  			and performance comparison.
>  
> +	perf_use_local_clock
> +			[PERF]
> +			Use local_clock() as a source for perf timestamps
> +			generation. This was be the default behaviour and
> +			this parameter can be used to maintain backward
> +			compatibility or on older hardware with expensive
> +			monotonic clock source.
> +
>  	pf.		[PARIDE]
>  			See Documentation/blockdev/paride.txt.

So I'm always terminally confused on the naming of kernel parameters,
sometimes things are modules (even when they're not actually =m capable)
and get a module::foo naming or so and sometimes they're not.

So we want to use the module naming thing or not?

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 2b02c9f..5d0aa03 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c

> @@ -322,8 +323,41 @@ extern __weak const char *perf_pmu_name(void)
>  	return "pmu";
>  }
>  
> +static bool perf_use_local_clock;
> +static int __init perf_use_local_clock_setup(char *__unused)
> +{
> +	perf_use_local_clock = true;
> +	return 1;
> +}
> +__setup("perf_use_local_clock", perf_use_local_clock_setup);

>  static inline u64 perf_clock(void)
>  {
> +	if (likely(!perf_use_local_clock))
> +		return ktime_get_mono_fast_ns();
> +
>  	return local_clock();
>  }

Since this all is boot time, should we not use things like static_key
and avoid the 'pointless' conditional at runtime?

^ permalink raw reply

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
From: Peter Zijlstra @ 2015-01-05 13:01 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Ingo Molnar, Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Thomas Gleixner,
	Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1418305153.4037.1.camel-5wv7dgnIgG8@public.gmane.org>

On Thu, Dec 11, 2014 at 01:39:13PM +0000, Pawel Moll wrote:
> On Thu, 2014-11-27 at 15:05 +0000, Pawel Moll wrote:
> > It's been 3 weeks without any negative feedback (no feedback at all, but
> > I take the optimistic view :-)...
> > 
> > How about queuing at least this patch alone for the incoming merge
> > window? Or at least getting it into -next, with the view at 3.20?
> 
> It's been another two weeks... How about getting it queued for v3.20
> then?

Yes sorry about that, I had me a kid and took a wee bit of time away
from computers, then xmas and new years happened.

In any case, best wishes for the new year to all.

^ permalink raw reply

* Re: [PATCH v4 2/3] perf: Userspace event
From: Peter Zijlstra @ 2015-01-05 13:12 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Thomas Gleixner,
	Tomeu Vizoso, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415292718-19785-3-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

On Thu, Nov 06, 2014 at 04:51:57PM +0000, Pawel Moll wrote:
> This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
> any process to inject custom data into perf data stream as a new
> PERF_RECORD_UEVENT record, if such process is being observed or if it
> is running on a CPU being observed by the perf framework.
> 
> The prctl call takes the following arguments:
> 
>         prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);
> 
> - type: a number meaning to describe content of the following data.
>   Kernel does not pay attention to it and merely passes it further in
>   the perf data, therefore its use must be agreed between the events
>   producer (the process being observed) and the consumer (performance
>   analysis tool). The perf userspace tool will contain a repository of
>   "well known" types and reference implementation of their decoders.
> - size: Length in bytes of the data.
> - data: Pointer to the data.
> - flags: Reserved for future use. Always pass zero.
> 
> Perf context that are supposed to receive events generated with the
> prctl above must be opened with perf_event_attr.uevent set to 1. The
> PERF_RECORD_UEVENT records consist of a standard perf event header,
> 32-bit type value, 32-bit data size and the data itself, followed by
> padding to align the overall record size to 8 bytes and optional,
> standard sample_id field.
> 
> Example use cases:
> 
> - "perf_printf" like mechanism to add logging messages to perf data;
>   in the simplest case it can be just
> 
>         prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);
> 
> - synchronisation of performance data generated in user space with the
>   perf stream coming from the kernel. For example, the marker can be
>   inserted by a JIT engine after it generated portion of the code, but
>   before the code is executed for the first time, allowing the
>   post-processor to pick the correct debugging information.
> 

The think I remember being raised was a unified means of these msgs
across perf/ftrace/lttng. I am not seeing that mentioned.

Also, I would like a stronger rationale for the @type argument, if it
has no actual meaning why is it separate from the binary msg data?

^ permalink raw reply

* Re: [PATCH v4 3/3] perf: Sample additional clock value
From: Peter Zijlstra @ 2015-01-05 13:45 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Thomas Gleixner,
	Tomeu Vizoso, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415292718-19785-4-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
> Currently three clocks are implemented: CLOCK_REALITME = 0,
> CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
> 5 bits wide to allow for future extension to custom, non-POSIX clock
> sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
> ARM CoreSight (hardware trace) timestamp generator.

> @@ -304,7 +305,16 @@ struct perf_event_attr {
>  				mmap2          :  1, /* include mmap with inode data     */
>  				comm_exec      :  1, /* flag comm events that are due to an exec */
>  				uevents        :  1, /* allow uevents into the buffer */
> -				__reserved_1   : 38;
> +
> +				/*
> +				 * clock: one of the POSIX clock IDs:
> +				 *
> +				 * 0 - CLOCK_REALTIME
> +				 * 1 - CLOCK_MONOTONIC
> +				 * 4 - CLOCK_MONOTONIC_RAW
> +				 */
> +				clock          :  5, /* clock type */
> +				__reserved_1   : 33;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */

This would put a constraint on actually changing MAX_CLOCKS, are the
time people OK with that? Thomas, John?

I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
clocks, preferably those would register themselves with the POSIX clock
interface.

> @@ -4631,6 +4637,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
>  		data->cpu_entry.cpu	 = raw_smp_processor_id();
>  		data->cpu_entry.reserved = 0;
>  	}
> +
> +	if (sample_type & PERF_SAMPLE_CLOCK) {
> +		switch (event->attr.clock) {
> +		case CLOCK_REALTIME:
> +			data->clock = ktime_get_real_ns();
> +			break;
> +		case CLOCK_MONOTONIC:
> +			data->clock = ktime_get_mono_fast_ns();
> +			break;
> +		case CLOCK_MONOTONIC_RAW:
> +			data->clock = ktime_get_raw_ns();
> +			break;
> +		default:
> +			data->clock = 0;
> +			break;
> +		}
> +	}
> +
>  }

This is broken. There is nothing stopping this from being called from
NMI context and only the MONO one is NMI safe.

Also, one would expect something like:

	default: {
		struct k_clock *kc = clockid_to_kclock(event->attr.clock);
		struct timespec ts;
		if (kc) {
			kc->clock_get(event->attr.clock, &ts);
			data->clock = ktime_to_ns(timespec_to_ktime(ts));
		} else {
			data->clock = 0;
		}
	}

Albeit preferably slightly less horrible -- of course, one would first
need to deal with the NMI issue.

^ permalink raw reply

* Re: [PATCH v3 05/20] selftests/ftrace: add install target to enable test install
From: Shuah Khan @ 2015-01-05 18:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mmarek, gregkh, akpm, mingo, davem, keescook, tranmanphong, mpe,
	cov, dh.herrmann, hughd, bobby.prani, serge.hallyn, ebiederm,
	tim.bird, josh, koct9i, linux-kbuild, linux-kernel, linux-api,
	netdev, Shuah Khan
In-Reply-To: <20150102104526.29df5641@gandalf.local.home>

On 01/02/2015 08:45 AM, Steven Rostedt wrote:
> On Wed, 24 Dec 2014 09:27:41 -0700
> Shuah Khan <shuahkh@osg.samsung.com> wrote:
> 
>> Add a new make target to enable installing test. This target
>> installs test in the kselftest install location and add to the
>> kselftest script to run the test. Install target can be run
>> only from top level kernel source directory.
>>
>> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
>> ---
>>  tools/testing/selftests/ftrace/Makefile | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/ftrace/Makefile b/tools/testing/selftests/ftrace/Makefile
>> index 76cc9f1..7c7cf42 100644
>> --- a/tools/testing/selftests/ftrace/Makefile
>> +++ b/tools/testing/selftests/ftrace/Makefile
>> @@ -1,7 +1,16 @@
>> +TEST_STR = /bin/sh ./ftracetest || echo ftrace selftests: [FAIL]
> 
> Is it ok that this removes the quotes around the echo string? I don't
> see anything wrong about it, but I don't know if there's a shell out
> there that will fail due to it.

Right. both sh and bash are fine without the quotes. In this case there
are no variables to interpret, so quotes don't do anything. I might as
well play it safe. I will have to fix a few other tests to address this
comment. Will generate v4s for a few tests in this series.

Thanks,
-- Shuah

> 
> Other than than,
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>
> 
> -- Steve
> 
> 
>> +
>>  all:
>>  
>> +install:
>> +ifdef INSTALL_KSFT_PATH
>> +	install ./ftracetest $(INSTALL_KSFT_PATH)
>> +	@cp -r test.d $(INSTALL_KSFT_PATH)
>> +	echo "$(TEST_STR)" >> $(KSELFTEST)
>> +endif
>> +
>>  run_tests:
>> -	@/bin/sh ./ftracetest || echo "ftrace selftests: [FAIL]"
>> +	@$(TEST_STR)
>>  
>>  clean:
>>  	rm -rf logs/*
> 


-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* [PATCH RESEND v3] gpio: lib-sysfs: Add 'wakeup' attribute
From: Soren Brinkmann @ 2015-01-05 18:16 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Alexandre Courbot, linux-api, linux-kernel, linux-gpio, linux-doc,
	Soren Brinkmann

Add an attribute 'wakeup' to the GPIO sysfs interface which allows
marking/unmarking a GPIO as wake IRQ.
The file 'wakeup' is created in each exported GPIOs directory, if an IRQ
is associated with that GPIO and the irqchip implements set_wake().
Writing 'enabled' to that file will enable wake for that GPIO, while
writing 'disabled' will disable wake.
Reading that file will return either 'disabled' or 'enabled' depening on
the currently set flag for the GPIO's IRQ.

Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
---
v3:
 - add documentation
v2:
 - fix error path to unlock mutex before return
---
 Documentation/ABI/testing/sysfs-gpio |  1 +
 Documentation/gpio/sysfs.txt         |  8 ++++
 drivers/gpio/gpiolib-sysfs.c         | 77 +++++++++++++++++++++++++++++++++---
 3 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-gpio b/Documentation/ABI/testing/sysfs-gpio
index 80f4c94c7bef..4cc7f4b3f724 100644
--- a/Documentation/ABI/testing/sysfs-gpio
+++ b/Documentation/ABI/testing/sysfs-gpio
@@ -20,6 +20,7 @@ Description:
 	    /value ... always readable, writes fail for input GPIOs
 	    /direction ... r/w as: in, out (default low); write: high, low
 	    /edge ... r/w as: none, falling, rising, both
+	    /wakeup ... r/w as: enabled, disabled
 	/gpiochipN ... for each gpiochip; #N is its first GPIO
 	    /base ... (r/o) same as N
 	    /label ... (r/o) descriptive, not necessarily unique
diff --git a/Documentation/gpio/sysfs.txt b/Documentation/gpio/sysfs.txt
index c2c3a97f8ff7..f703377d528f 100644
--- a/Documentation/gpio/sysfs.txt
+++ b/Documentation/gpio/sysfs.txt
@@ -97,6 +97,14 @@ and have the following read/write attributes:
 		for "rising" and "falling" edges will follow this
 		setting.
 
+	"wakeup" ... reads as either "enabled" or "disabled". Write these
+		strings to set/clear the 'wakeup' flag of the IRQ associated
+		with this GPIO. If the IRQ has the 'wakeup' flag set, it can
+		wake the system from sleep states.
+
+		This file exists only if the pin can generate interrupts and
+		the driver implements the required infrastructure.
+
 GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
 controller implementing GPIOs starting at #42) and have the following
 read-only attributes:
diff --git a/drivers/gpio/gpiolib-sysfs.c b/drivers/gpio/gpiolib-sysfs.c
index 2ac1800b58bb..05a5fc9503bd 100644
--- a/drivers/gpio/gpiolib-sysfs.c
+++ b/drivers/gpio/gpiolib-sysfs.c
@@ -286,6 +286,58 @@ found:
 
 static DEVICE_ATTR(edge, 0644, gpio_edge_show, gpio_edge_store);
 
+static ssize_t gpio_wakeup_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	ssize_t			status;
+	const struct gpio_desc	*desc = dev_get_drvdata(dev);
+	int			irq = gpiod_to_irq(desc);
+	struct irq_desc		*irq_desc = irq_to_desc(irq);
+
+	mutex_lock(&sysfs_lock);
+
+	if (irqd_is_wakeup_set(&irq_desc->irq_data))
+		status = sprintf(buf, "enabled\n");
+	else
+		status = sprintf(buf, "disabled\n");
+
+	mutex_unlock(&sysfs_lock);
+
+	return status;
+}
+
+static ssize_t gpio_wakeup_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t size)
+{
+	int			ret;
+	unsigned int		on;
+	struct gpio_desc	*desc = dev_get_drvdata(dev);
+	int			irq = gpiod_to_irq(desc);
+
+	mutex_lock(&sysfs_lock);
+
+	if (sysfs_streq("enabled", buf)) {
+		on = true;
+	} else if (sysfs_streq("disabled", buf)) {
+		on = false;
+	} else {
+		mutex_unlock(&sysfs_lock);
+		return -EINVAL;
+	}
+
+	ret = irq_set_irq_wake(irq, on);
+
+	mutex_unlock(&sysfs_lock);
+
+	if (ret)
+		pr_warn("%s: failed to %s wake\n", __func__,
+				on ? "enable" : "disable");
+
+	return size;
+}
+
+static DEVICE_ATTR(wakeup, 0644, gpio_wakeup_show, gpio_wakeup_store);
+
 static int sysfs_set_active_low(struct gpio_desc *desc, struct device *dev,
 				int value)
 {
@@ -526,7 +578,7 @@ static struct class gpio_class = {
 int gpiod_export(struct gpio_desc *desc, bool direction_may_change)
 {
 	unsigned long		flags;
-	int			status;
+	int			status, irq;
 	const char		*ioname = NULL;
 	struct device		*dev;
 	int			offset;
@@ -582,11 +634,24 @@ int gpiod_export(struct gpio_desc *desc, bool direction_may_change)
 			goto fail_unregister_device;
 	}
 
-	if (gpiod_to_irq(desc) >= 0 && (direction_may_change ||
-				       !test_bit(FLAG_IS_OUT, &desc->flags))) {
-		status = device_create_file(dev, &dev_attr_edge);
-		if (status)
-			goto fail_unregister_device;
+	irq = gpiod_to_irq(desc);
+	if (irq >= 0) {
+		struct irq_desc *irq_desc = irq_to_desc(irq);
+		struct irq_chip	*irqchip = irq_desc_get_chip(irq_desc);
+
+		if (direction_may_change ||
+				!test_bit(FLAG_IS_OUT, &desc->flags)) {
+			status = device_create_file(dev, &dev_attr_edge);
+			if (status)
+				goto fail_unregister_device;
+		}
+
+		if (irqchip->flags & IRQCHIP_SKIP_SET_WAKE ||
+				irqchip->irq_set_wake) {
+			status = device_create_file(dev, &dev_attr_wakeup);
+			if (status)
+				goto fail_unregister_device;
+		}
 	}
 
 	set_bit(FLAG_EXPORT, &desc->flags);
-- 
2.2.1.1.gb42cc81


^ permalink raw reply related

* Re: [PATCH net-next v1 0/7] net: extend ethtool link mode bitmaps to 48 bits
From: David Decotigny @ 2015-01-05 18:16 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Maciej Żenczykowski, Amir Vadai, Florian Fainelli,
	Linux NetDev, Linux Kernel Mailing List, linux-api,
	David S. Miller, Jason Wang, Michael S. Tsirkin, Herbert Xu,
	Al Viro, Masatake YAMATO, Xi Wang, Neil Horman, WANG Cong,
	Flavio Leitner, Tom Gundersen, Jiri Pirko, Vlad Yasevich,
	Eric W. Biederman, Saeed Mahameed, Venkata Duvvuru
In-Reply-To: <1420425003.5686.155.camel@decadent.org.uk>

Thanks Ben, I will send an updated version.

About rejecting high bits in drivers that don't support them: a basic
fix (in a separate patch series) could be something like follows in
ethtool_set_settings callbacks:  if (ecmd->advertising_hi) return
-EINVAL; with comments. But I don't find it very nice. Or: allocate a
new net_device::priv_flags bit and ask net/core/ethtool.c to accept
high advertising bits only when this flag is set? Any preference/other
options?

Related: lately, each new class of link modes declared == 4 new bits
allocated. At current pace these 16 new bits buy us only 4 new
classes, ie. a little more than 5 years if I extrapolate from the
recent past. Is the longer term plan to create a new ethtool ioctl
command specialized in link modes with variable length masks? Or to
switch to a brand new netlink interface altogether and take advantage
of that to revisit the link mode reporting/configuration with variable
length masks?

On Sun, Jan 4, 2015 at 6:30 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Mon, 2015-01-05 at 01:34 +0100, Maciej Żenczykowski wrote:
>> >> I can send updates to other drivers, even though it's rather pointless
>> >> to update 1G drivers at this point for example. Please let me know,
>> >> but I'd prefer to do this in follow-up patches outside this first
>> >> patch series.
>> > [...]
>> >
>> > They should be changed to ensure they reject setting any of the high
>> > advertising flags, but it's not urgent.
>>
>> if old drivers advertised a get/set_bits function while new drivers
>> advertised a get/set_new_bits function,
>> you could not updated any old drivers, and simply take care of
>> rejecting invalid bits in core, by calling set_new_bits if provided,
>> if not, rejecting bad bits and calling set_bits if no bad bits were
>> set.
>
> We've never checked that the reserved fields are zero before, and I
> think there are still drivers that don't fully validate the existing 32
> bits.  So while I think drivers should fully validate the advertising
> flags, userland generally can't assume they do.  And therefore I don't
> think it's worth adding complexity to the ethtool core that only partly
> fixes this.
>
> Ben.
>
> --
> Ben Hutchings
> This sentence contradicts itself - no actually it doesn't.

^ permalink raw reply

* Re: [PATCH] selftests/exec: allow shell return code of 126
From: Shuah Khan @ 2015-01-05 19:03 UTC (permalink / raw)
  To: David Drysdale, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Geert Uytterhoeven, Andreas Schwab
  Cc: Andrew Morton, linux-m68k-cunTk1MwBs8S/qaLPR03pWD2FQJk+8+b,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Shuah Khan
In-Reply-To: <1420456786-5989-1-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

On 01/05/2015 04:19 AM, David Drysdale wrote:
> When the shell fails to invoke a script because its path name
> is too long (ENAMETOOLONG), most shells return 127 to indicate
> command not found.  However, some systems report 126 (which POSIX
> suggests should indicate a non-executable file) for this case,
> so allow that too.
> 
> Reported-by: Geert Uytterhoeven <geert-Td1EMuHUCqxL1ZNQvxDV9g@public.gmane.org>
> Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  tools/testing/selftests/exec/execveat.c | 28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
> index d273624c93a6..0d940c6e26bd 100644
> --- a/tools/testing/selftests/exec/execveat.c
> +++ b/tools/testing/selftests/exec/execveat.c
> @@ -62,7 +62,7 @@ static int _check_execveat_fail(int fd, const char *path, int flags,
>  }
>  
>  static int check_execveat_invoked_rc(int fd, const char *path, int flags,
> -				     int expected_rc)
> +				     int expected_rc, int expected_rc2)

This logic doesn't scale well if there other expected return
values to account for. Please think about a re-write to handle
multiple expected return codes.


>  {
>  	int status;
>  	int rc;
> @@ -99,9 +99,19 @@ static int check_execveat_invoked_rc(int fd, const char *path, int flags,
>  		return 1;
>  	}
>  	if (WEXITSTATUS(status) != expected_rc) {
> -		printf("[FAIL] (child %d exited with %d not %d)\n",
> -			child, WEXITSTATUS(status), expected_rc);
> -		return 1;
> +		if (expected_rc != expected_rc2) {

Please collapse expected_rc and expected_rc2 checks. You can rephrase
the error message to work for both cases.

thanks,
-- Shuah

-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* Re: [PATCH v4 3/3] perf: Sample additional clock value
From: John Stultz @ 2015-01-05 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Thomas Gleixner,
	Tomeu Vizoso, lkml, Linux API
In-Reply-To: <20150105134514.GS30905@twins.programming.kicks-ass.net>

On Mon, Jan 5, 2015 at 5:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
>> Currently three clocks are implemented: CLOCK_REALITME = 0,
>> CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
>> 5 bits wide to allow for future extension to custom, non-POSIX clock
>> sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
>> ARM CoreSight (hardware trace) timestamp generator.
>
>> @@ -304,7 +305,16 @@ struct perf_event_attr {
>>                               mmap2          :  1, /* include mmap with inode data     */
>>                               comm_exec      :  1, /* flag comm events that are due to an exec */
>>                               uevents        :  1, /* allow uevents into the buffer */
>> -                             __reserved_1   : 38;
>> +
>> +                             /*
>> +                              * clock: one of the POSIX clock IDs:
>> +                              *
>> +                              * 0 - CLOCK_REALTIME
>> +                              * 1 - CLOCK_MONOTONIC
>> +                              * 4 - CLOCK_MONOTONIC_RAW
>> +                              */
>> +                             clock          :  5, /* clock type */
>> +                             __reserved_1   : 33;
>>
>>       union {
>>               __u32           wakeup_events;    /* wakeup every n events */
>
> This would put a constraint on actually changing MAX_CLOCKS, are the
> time people OK with that? Thomas, John?

Yea, I'd not want to enshrine MAX_CLOCKS if we can avoid it. We're
getting a bit close to it these days.

thanks
-john

^ permalink raw reply

* Re: Edited seccomp.2 man page for review [v2]
From: Kees Cook @ 2015-01-05 20:25 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Daniel Borkmann, Linux API,
	linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lkml,
	Will Drewry
In-Reply-To: <54A300D0.7090802-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Tue, Dec 30, 2014 at 11:45 AM, Michael Kerrisk (man-pages)
<mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> The program counter will be as though the system call happened
>>> (i.e., it will not point to the system call instruction).
>>> The return value register will contain an architecture\-dependent value;
>>> if resuming execution, set it to something sensible.
>>> .\" FIXME Regarding the preceding line, can you give an example(s)
>>> .\"       of "something sensible"? (Depending on the answer, maybe it
>>> .\"       might be useful to add some text on this point.)
>>
>> This means sensible in the context of the syscall made, or the desired
>> behavior. For example, setting the return value to ELOOP for something
>> like a "bind" syscall isn't very sensible.
>
> Okay -- I did s/sensible/appropriate for the system call/

Yes, perfect. That captures it nicely.

>>> .\"
>>> .\" FIXME Please check:
>>> .\"     In an attempt to make the text clearer, I changed
>>> .\"     "replacing it with" to "setting the return value register to"
>>> .\"     Okay?
>>> (The architecture dependency is because setting the return value register to
>>> .BR ENOSYS
>>> could overwrite some useful information.)
>>
>> Well, the arch dependency is really because _how_ to change the
>> register, and the register itself, is different between architectures.
>> (i.e. which ptrace call is needed, and which register is being
>> changed.) The overwriting of useful information is certainly true too,
>> though.
>
> So, revert to the previous wording? Or do you have a suggested
> better wording?

I think the previous wording is better. I'm struggling to produce
language that makes more sense here.

> Thanks. We're getting close now.

Excellent! :)

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: WIP alternative - was Re: [PATCH v3 14/20] selftests/size: add install target to enable test install
From: Shuah Khan @ 2015-01-05 21:28 UTC (permalink / raw)
  To: Tim Bird, mmarek@suse.cz, gregkh@linuxfoundation.org,
	akpm@linux-foundation.org, rostedt@goodmis.org, mingo@redhat.com,
	davem@davemloft.net, keescook@chromium.org,
	tranmanphong@gmail.com, mpe@ellerman.id.au, cov@codeaurora.org,
	dh.herrmann@gmail.com, hughd@google.com, bobby.prani@gmail.com,
	serge.hallyn@ubuntu.com, ebiederm@xmission.com,
	josh@joshtriplett.org, koct9i@gmail.com
  Cc: linux-kbuild@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-api@vger.kernel.org, netdev@vger.kernel.org, Shuah Khan
In-Reply-To: <54A4B174.2010501@sonymobile.com>

On 12/31/2014 07:31 PM, Tim Bird wrote:
> On 12/24/2014 08:27 AM, Shuah Khan wrote:
>> Add a new make target to enable installing test. This target
>> installs test in the kselftest install location and add to the
>> kselftest script to run the test. Install target can be run
>> only from top level kernel source directory.
>>
>> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
>> ---
>>  tools/testing/selftests/size/Makefile | 12 +++++++++++-
>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/size/Makefile b/tools/testing/selftests/size/Makefile
>> index 04dc25e..bb7113b 100644
>> --- a/tools/testing/selftests/size/Makefile
>> +++ b/tools/testing/selftests/size/Makefile
>> @@ -1,12 +1,22 @@
>>  CC = $(CROSS_COMPILE)gcc
>>  
>> +TEST_STR = ./get_size || echo get_size selftests: [FAIL]
>> +
>>  all: get_size
>>  
>>  get_size: get_size.c
>>  	$(CC) -static -ffreestanding -nostartfiles -s $< -o $@
>>  
>> +install:
>> +ifdef INSTALL_KSFT_PATH
>> +	install ./get_size $(INSTALL_KSFT_PATH)
>> +	@echo "$(TEST_STR)" >> $(KSELFTEST)
>> +else
>> +	@echo Run make kselftest_install in top level source directory
>> +endif
>> +
>>  run_tests: all
>> -	./get_size
>> +	@$(TEST_STR)
>>  
>>  clean:
>>  	$(RM) get_size
>>
> 
> The install phase is desperately needed for usage of kselftest in
> cross-target situations (applicable to almost all embedded).  So this
> is great stuff.

Thanks.

> 
> I worked a bit on isolating the install stuff to a makefile include file.
> This allows simplifying some of the sub-level Makefiles a bit, and allowing
> control of some of the install and run logic in less places.
> 
> This is still a work in progress, but before I got too far along, I wanted
> to post it for people to provide feedback.  A couple of problems cropped
> up that are worth discussing, IMHO.
> 
> 1) I think it should be a requirement that each test has a single
> "main" program to execute to run the tests.  If multiple tests are supported
> or more flexibility is desired for additional arguments, or that sort of
> thing, then that's fine, but the automated script builder should be able
> to run just a single program or script to have things work.  This also
> makes things more consistent.  In the case of the firmware test, I created
> a single fw_both.sh script to do this, instead of having two separate
> blocks in the kselftest.sh script.

It is a good goal for individual tests to use a main program to run
tests, even though, I would not make it a requirement. I would like to
leave that decision up to the individual test writer.

> 
> 2) I've added a CROSS_INSTALL variable, which can call an arbitrary program
> to place files on the target system (rather than just calling 'install').
> In my case, I'd use my own 'ttc cp' command, which I can extend as necessary
> to put things on a remote machine.  This works for a single directory,
> but things get dicier with sub-directory trees full of files (like
> the ftrace test uses.)
> 
> If additional items need to be installed to the target, then maybe a setup
> program should be used, rather than just copying files.
> 
> 3) Some of the scripts were using /bin/bash to execute them, rather
> than rely on the interpreter line in the script itself (and having
> the script have executable privileges).  Is there a reason for this?
> I modified a few scripts to be executable, and got rid of the
> explicit execution with /bin/bash.

Probably no reason other than the choice made by the test writer.
It could be cleaned up and made consistent, however, I would see
this as a separate enhancement type work that could be done on its
own and not include it in the install work.

> 
> The following is just a start...  Let me know if this direction looks
> OK, and I'll finish this up.  The main item to look at is
> kselftest.include file.  Note that these patches are based on Shuah's
> series - but if you want to use these ideas I can rebase them onto
> mainline, and break them out per test sub-level like Shuah did.

One of the reasons I picked install target approach is to enable the
feature by extending the existing run_tests support. This way we will
have the feature available quickly. Once that is supported, we can work
on evolving to a generic approach to use the include file approach, as
the changes you are proposing are based on the the series I sent out,
and makes improvements to it.

kselftest.include file approach could work for several tests and tests
that can't use the generic could add their own install support.

I propose evolving to a generic kselftest.include as the second step in
evolving the install feature. Can I count on you do the work and update
the tests to use kselftest.include, CROSS_INSTALL variable support?

thanks,
-- Shuah


Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* Re: WIP alternative - was Re: [PATCH v3 14/20] selftests/size: add install target to enable test install
From: Tim Bird @ 2015-01-05 21:56 UTC (permalink / raw)
  To: Shuah Khan, mmarek-AlSwsSmVLrQ@public.gmane.org,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org,
	mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org,
	tranmanphong-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org,
	cov-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	bobby.prani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org,
	josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org,
	koct9i-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
  Cc: linux-kbuild-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <54AB0215.5020409-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On 01/05/2015 01:28 PM, Shuah Khan wrote:
> On 12/31/2014 07:31 PM, Tim Bird wrote:
...
>> The install phase is desperately needed for usage of kselftest in
>> cross-target situations (applicable to almost all embedded).  So this
>> is great stuff.
> 
> Thanks.
> 
>>
>> I worked a bit on isolating the install stuff to a makefile include file.
>> This allows simplifying some of the sub-level Makefiles a bit, and allowing
>> control of some of the install and run logic in less places.
>>
>> This is still a work in progress, but before I got too far along, I wanted
>> to post it for people to provide feedback.  A couple of problems cropped
>> up that are worth discussing, IMHO.
>>
>> 1) I think it should be a requirement that each test has a single
>> "main" program to execute to run the tests.  If multiple tests are supported
>> or more flexibility is desired for additional arguments, or that sort of
>> thing, then that's fine, but the automated script builder should be able
>> to run just a single program or script to have things work.  This also
>> makes things more consistent.  In the case of the firmware test, I created
>> a single fw_both.sh script to do this, instead of having two separate
>> blocks in the kselftest.sh script.
> 
> It is a good goal for individual tests to use a main program to run
> tests, even though, I would not make it a requirement. I would like to
> leave that decision up to the individual test writer.
>
OK.  It helps to have a single line when trying to isolate
RUN_TEST creation into the include file, but there may be other
ways to accomplish this.

>>
>> 2) I've added a CROSS_INSTALL variable, which can call an arbitrary program
>> to place files on the target system (rather than just calling 'install').
>> In my case, I'd use my own 'ttc cp' command, which I can extend as necessary
>> to put things on a remote machine.  This works for a single directory,
>> but things get dicier with sub-directory trees full of files (like
>> the ftrace test uses.)
>>
>> If additional items need to be installed to the target, then maybe a setup
>> program should be used, rather than just copying files.
>>
>> 3) Some of the scripts were using /bin/bash to execute them, rather
>> than rely on the interpreter line in the script itself (and having
>> the script have executable privileges).  Is there a reason for this?
>> I modified a few scripts to be executable, and got rid of the
>> explicit execution with /bin/bash.
> 
> Probably no reason other than the choice made by the test writer.
> It could be cleaned up and made consistent, however, I would see
> this as a separate enhancement type work that could be done on its
> own and not include it in the install work.

OK - this was also something that simplified the creation
of the RUN_TEST variable in the isolated include file.
Also, having the interpreter explicit in the invocation line
in the Makefile as well as in the script itself is a bit redundant.
>>
>> The following is just a start...  Let me know if this direction looks
>> OK, and I'll finish this up.  The main item to look at is
>> kselftest.include file.  Note that these patches are based on Shuah's
>> series - but if you want to use these ideas I can rebase them onto
>> mainline, and break them out per test sub-level like Shuah did.
> 
> One of the reasons I picked install target approach is to enable the
> feature by extending the existing run_tests support. This way we will
> have the feature available quickly. Once that is supported, we can work
> on evolving to a generic approach to use the include file approach, as
> the changes you are proposing are based on the the series I sent out,
> and makes improvements to it.
> 
> kselftest.include file approach could work for several tests and tests
> that can't use the generic could add their own install support.
> 
> I propose evolving to a generic kselftest.include as the second step in
> evolving the install feature. Can I count on you do the work and update
> the tests to use kselftest.include, CROSS_INSTALL variable support?

Yes.  I'd be happy to evolve it through phases to support the include
file and cross-target install feature.

Is there anything I can help with in the mean time?  Some of the tests
require a directory tree of files rather than just a few top-level files
(e.g. ftrace).

I was thinking about doing some work to tar-ify the needed directories of
data files during build, and untar it in the execution area during the
install step.  Do you want me to propose something there?
 -- Tim

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Aditya Kali @ 2015-01-05 22:48 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Eric W. Biederman, Tejun Heo,
	cgroups mailinglist, Ingo Molnar
In-Reply-To: <548E17CE.8010704-/L3Ra7n9ekc@public.gmane.org>

On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
> Aditya,
>
> I gave your patch set a try but it does not work for me.
> Maybe you can bring some light into the issues I'm facing.
> Sadly I still had no time to dig into your code.
>
> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>  1 file changed, 147 insertions(+)
>>  create mode 100644 Documentation/cgroups/namespace.txt
>>
>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>> new file mode 100644
>> index 0000000..6480379
>> --- /dev/null
>> +++ b/Documentation/cgroups/namespace.txt
>> @@ -0,0 +1,147 @@
>> +                     CGroup Namespaces
>> +
>> +CGroup Namespace provides a mechanism to virtualize the view of the
>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>> +clone() and unshare() syscalls to create a new cgroup namespace.
>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>> +at the time of creation of the cgroup namespace.
>> +
>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>> +may leak potential system level information to the isolated processes.
>> +
>> +For Example:
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>> +and its desirable to not expose it to the isolated process.
>> +
>> +CGroup Namespaces can be used to restrict visibility of this path.
>> +For Example:
>> +  # Before creating cgroup namespace
>> +  $ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>> +  $ ~/unshare -c
>> +  [ns]$ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>> +  # From within new cgroupns, process sees that its in the root cgroup
>> +  [ns]$ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +
>> +  # From global cgroupns:
>> +  $ cat /proc/<pid>/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # Unshare cgroupns along with userns and mountns
>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> +  # sets up uid/gid map and execs /bin/bash
>> +  $ ~/unshare -c -u -m
>
> This command does not issue CLONE_NEWUSER, -U does.
>
I was using a custom unshare binary. But I will update the command
line to be similar to the one in util-linux.

>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>> +  # hierarchy.
>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> +  [ns]$ ls -l /tmp/cgroup
>> +  total 0
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
> And /proc/self/cgroup still shows the cgroup from outside.
>
> ---cut---
> container:/ # ls /sys/fs/cgroup/
> container:/ # mount -t cgroup none /sys/fs/cgroup/

You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
container, only unified hierarchy can be mounted. So, for now, that
flag is needed. I will fix the documentation too.

> mount: wrong fs type, bad option, bad superblock on none,
>        missing codepage or helper program, or other error
>
>        In some cases useful info is found in syslog - try
>        dmesg | tail or so.
> container:/ # cat /proc/self/cgroup
> 8:memory:/machine/test00.libvirt-lxc
> 7:devices:/machine/test00.libvirt-lxc
> 6:hugetlb:/
> 5:cpuset:/machine/test00.libvirt-lxc
> 4:blkio:/machine/test00.libvirt-lxc
> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
> 2:freezer:/machine/test00.libvirt-lxc
> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
> container:/ # ls -la /proc/self/ns
> total 0
> dr-x--x--x 2 root root 0 Dec 14 23:02 .
> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
> container:/ #
>
> #host side
> lxc-os132:~ # ls -la /proc/self/ns
> total 0
> dr-x--x--x 2 root root 0 Dec 14 23:56 .
> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
> ---cut---
>
> Any ideas?
>

Please try with "-o __DEVEL_sane_behavior" flag to the mount command.

> Thanks,
> //richard


Thanks,
-- 
Aditya

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Richard Weinberger @ 2015-01-05 22:52 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Eric W. Biederman, Tejun Heo,
	cgroups mailinglist, Ingo Molnar
In-Reply-To: <CAGr1F2HA6mzFwgp5ngX8P7=198-5CmCjLmuCJ8j3eQ08J2d9Qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Am 05.01.2015 um 23:48 schrieb Aditya Kali:
> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>> Aditya,
>>
>> I gave your patch set a try but it does not work for me.
>> Maybe you can bring some light into the issues I'm facing.
>> Sadly I still had no time to dig into your code.
>>
>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 147 insertions(+)
>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>
>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>> new file mode 100644
>>> index 0000000..6480379
>>> --- /dev/null
>>> +++ b/Documentation/cgroups/namespace.txt
>>> @@ -0,0 +1,147 @@
>>> +                     CGroup Namespaces
>>> +
>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>> +at the time of creation of the cgroup namespace.
>>> +
>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>> +may leak potential system level information to the isolated processes.
>>> +
>>> +For Example:
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>> +and its desirable to not expose it to the isolated process.
>>> +
>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>> +For Example:
>>> +  # Before creating cgroup namespace
>>> +  $ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>> +  $ ~/unshare -c
>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>> +  [ns]$ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>> +
>>> +  # From global cgroupns:
>>> +  $ cat /proc/<pid>/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # Unshare cgroupns along with userns and mountns
>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>> +  # sets up uid/gid map and execs /bin/bash
>>> +  $ ~/unshare -c -u -m
>>
>> This command does not issue CLONE_NEWUSER, -U does.
>>
> I was using a custom unshare binary. But I will update the command
> line to be similar to the one in util-linux.
> 
>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>> +  # hierarchy.
>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>> +  [ns]$ ls -l /tmp/cgroup
>>> +  total 0
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>> And /proc/self/cgroup still shows the cgroup from outside.
>>
>> ---cut---
>> container:/ # ls /sys/fs/cgroup/
>> container:/ # mount -t cgroup none /sys/fs/cgroup/
> 
> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
> container, only unified hierarchy can be mounted. So, for now, that
> flag is needed. I will fix the documentation too.
> 
>> mount: wrong fs type, bad option, bad superblock on none,
>>        missing codepage or helper program, or other error
>>
>>        In some cases useful info is found in syslog - try
>>        dmesg | tail or so.
>> container:/ # cat /proc/self/cgroup
>> 8:memory:/machine/test00.libvirt-lxc
>> 7:devices:/machine/test00.libvirt-lxc
>> 6:hugetlb:/
>> 5:cpuset:/machine/test00.libvirt-lxc
>> 4:blkio:/machine/test00.libvirt-lxc
>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>> 2:freezer:/machine/test00.libvirt-lxc
>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>> container:/ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>> container:/ #
>>
>> #host side
>> lxc-os132:~ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>> ---cut---
>>
>> Any ideas?
>>
> 
> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.

Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.

Thanks,
//richard

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Aditya Kali @ 2015-01-05 22:54 UTC (permalink / raw)
  To: Zefan Li
  Cc: Richard Weinberger, Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Ingo Molnar, Eric W. Biederman, Tejun Heo,
	cgroups mailinglist
In-Reply-To: <548AAD42.5010002-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Thanks for the review. I have made the suggested fixes. Regarding
relative path, please see inline.

On Fri, Dec 12, 2014 at 12:54 AM, Zefan Li <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
>> +In its current form, the cgroup namespaces patcheset provides following
>> +behavior:
>> +
>> +(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
>> +    the process calling unshare is running.
>> +    For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
>> +    cgroup /batchjobs/container_id1 becomes the cgroupns-root.
>> +    For the init_cgroup_ns, this is the real root ('/') cgroup
>> +    (identified in code as cgrp_dfl_root.cgrp).
>> +
>> +(2) The cgroupns-root cgroup does not change even if the namespace
>> +    creator process later moves to a different cgroup.
>> +    $ ~/unshare -c # unshare cgroupns in some cgroup
>> +    [ns]$ cat /proc/self/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +    [ns]$ mkdir sub_cgrp_1
>> +    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> +    [ns]$ cat /proc/self/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
>> +(a) Processes running inside the cgroup namespace will be able to see
>> +    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>> +    [ns]$ sleep 100000 &  # From within unshared cgroupns
>> +    [1] 7353
>> +    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> +    [ns]$ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(b) From global cgroupns, the real cgroup path will be visible:
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
>> +
>> +(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
>> +    path relative to its own cgroupns-root will be shown:
>> +    # ns2's cgroupns-root is at '/batchjobs/container_id2'
>> +    [ns2]$ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
>
> Should be ../container_id1/sub_cgrp_1 ?
>

Starting with '/' was deliberate.

>> +
>> +    Note that the relative path always starts with '/' to indicate that its
>> +    relative to the cgroupns-root of the caller.
>
> If a path doesn't start with '/', then it's a relative path, so why make it start with '/'?
>

This is so as not to surprise the apps parsing /proc/<pid>/cgroup
files and using the path in it as absolute path. The existing paths
there always start with '/' right now. Retaining the '/' means path
generated by userspace continuous to work. Does this makes sense?

>> +
>> +(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
>> +    (if they have proper access to external cgroups).
>> +    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
>> +    # assuming that the global hierarchy is still accessible inside cgroupns:
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +    $ echo 7353 > batchjobs/container_id2/cgroup.procs
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
>> +
>> +    Note that this kind of setup is not encouraged. A task inside cgroupns
>> +    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
>> +    the virtualization of /proc/<pid>/cgroup less useful.
>> +
>> +(5) Setns to another cgroup namespace is allowed when:
>> +    (a) the process has CAP_SYS_ADMIN in its current userns
>> +    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
>> +    No implicit cgroup changes happen with attaching to another cgroupns. It
>> +    is expected that the somone moves the attaching process under the target
>> +    cgroupns-root.
>> +
>
> s/the somone/someone
>
fixed.

>> +(6) When some thread from a multi-threaded process unshares its
>> +    cgroup-namespace, the new cgroupns gets applied to the entire
>> +    process (all the threads). This should be OK since
>> +    unified-hierarchy only allows process-level containerization. So
>> +    all the threads in the process will have the same cgroup.
>> +
>> +(7) The cgroup namespace is alive as long as there is atleast 1
>
> s/atelast/at least
>
fixed.

>> +    process inside it. When the last process exits, the cgroup
>> +    namespace is destroyed. The cgroupns-root and the actual cgroups
>> +    remain though.
>> +
>> +(8) Namespace specific cgroup hierarchy can be mounted by a process running
>> +    inside cgroupns:
>> +    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
>> +
>> +    This will mount the unified cgroup hierarchy with cgroupns-root as the
>> +    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
>> +
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thanks!
-- 
Aditya

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-01-05 23:53 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Aditya Kali, Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	cgroups mailinglist, linux-kernel@vger.kernel.org, Linux API,
	Ingo Molnar, Linux Containers, Rohit Jnagal, Vivek Goyal
In-Reply-To: <54AB15BD.8020007-/L3Ra7n9ekc@public.gmane.org>

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>>> Aditya,
>>>
>>> I gave your patch set a try but it does not work for me.
>>> Maybe you can bring some light into the issues I'm facing.
>>> Sadly I still had no time to dig into your code.
>>>
>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 147 insertions(+)
>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>
>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>> new file mode 100644
>>>> index 0000000..6480379
>>>> --- /dev/null
>>>> +++ b/Documentation/cgroups/namespace.txt
>>>> @@ -0,0 +1,147 @@
>>>> +                     CGroup Namespaces
>>>> +
>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>> +at the time of creation of the cgroup namespace.
>>>> +
>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>> +may leak potential system level information to the isolated processes.
>>>> +
>>>> +For Example:
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>> +and its desirable to not expose it to the isolated process.
>>>> +
>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>> +For Example:
>>>> +  # Before creating cgroup namespace
>>>> +  $ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>> +  $ ~/unshare -c
>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>> +  [ns]$ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>> +
>>>> +  # From global cgroupns:
>>>> +  $ cat /proc/<pid>/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # Unshare cgroupns along with userns and mountns
>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>> +  # sets up uid/gid map and execs /bin/bash
>>>> +  $ ~/unshare -c -u -m
>>>
>>> This command does not issue CLONE_NEWUSER, -U does.
>>>
>> I was using a custom unshare binary. But I will update the command
>> line to be similar to the one in util-linux.
>> 
>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>> +  # hierarchy.
>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>> +  [ns]$ ls -l /tmp/cgroup
>>>> +  total 0
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>
>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>
>>> ---cut---
>>> container:/ # ls /sys/fs/cgroup/
>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>> 
>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>> container, only unified hierarchy can be mounted. So, for now, that
>> flag is needed. I will fix the documentation too.
>> 
>>> mount: wrong fs type, bad option, bad superblock on none,
>>>        missing codepage or helper program, or other error
>>>
>>>        In some cases useful info is found in syslog - try
>>>        dmesg | tail or so.
>>> container:/ # cat /proc/self/cgroup
>>> 8:memory:/machine/test00.libvirt-lxc
>>> 7:devices:/machine/test00.libvirt-lxc
>>> 6:hugetlb:/
>>> 5:cpuset:/machine/test00.libvirt-lxc
>>> 4:blkio:/machine/test00.libvirt-lxc
>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>> 2:freezer:/machine/test00.libvirt-lxc
>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>> container:/ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>> container:/ #
>>>
>>> #host side
>>> lxc-os132:~ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>> ---cut---
>>>
>>> Any ideas?
>>>
>> 
>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>
> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.

Ugh.  It sounds like there is a real mess here.  At the very least there
is misunderstanding.

I have a memory that systemd should have been able to use a unified
hierarchy.  As you could still mount the different controllers
independently (they just use the same directory structure on each
mount).

That said from a practical standpoint I am not certain that a cgroup
namespace is viable if it can not support the behavior of cgroupsfs
that everyone is using.

Eric

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Richard Weinberger @ 2015-01-06  0:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Tejun Heo, cgroups mailinglist, Ingo Molnar
In-Reply-To: <87lhlgpyxk.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Am 06.01.2015 um 00:53 schrieb Eric W. Biederman:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> 
>> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.
> 
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
> 
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).

Luckily systemd folks want to move to the unified but as of now it does not work.
Please see this mail from Lennart:
https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html

Maybe the porting is easy. Dunno.
I had no time yet to look into that.

> That said from a practical standpoint I am not certain that a cgroup
> namespace is viable if it can not support the behavior of cgroupsfs
> that everyone is using.

Yep.

systemd *really* wants to own cgroupfs, so it has to mount it within the container.
Currently libvirt does nasty hacks using bind mounts which are also problematic.
My hope was that with cgroup namespaces I can simply cheat systemd and give it
a cgroupfs to mess with.

Thanks,
//richard

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Aditya Kali @ 2015-01-06  0:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Tejun Heo, Li Zefan, Serge Hallyn,
	Andy Lutomirski, cgroups mailinglist,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Ingo Molnar, Linux Containers, Rohit Jnagal, Vivek Goyal
In-Reply-To: <87lhlgpyxk.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Mon, Jan 5, 2015 at 3:53 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
>
>> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.
>
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
>
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).
>
In theory, if you boot kernel with
"cgroup__DEVEL__legacy_files_on_dfl" command-line parameter, and mount
cgroups with sane-behavior flag, then it should be more-or-less
similar to mounting all hierarchies together at the same mount-point
(mount -t cgroup -o __DEVEL_sane_behavior none $mntpt). I haven't
tried this, but systemd should be able to work with it and you can
enable cgroup-namespace too.

> That said from a practical standpoint I am not certain that a cgroup
> namespace is viable if it can not support the behavior of cgroupsfs
> that everyone is using.
>

Since the old/default behavior is on its way out, I didn't invest time
in fixing that. Also, some of the properties that make
cgroup-namespace simpler are only provided by unified hierarchy (for
example: a single root-cgroup per container).


> Eric

-- 
Aditya

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Richard Weinberger @ 2015-01-06  0:17 UTC (permalink / raw)
  To: Aditya Kali, Eric W. Biederman
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	cgroups mailinglist,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Ingo Molnar, Linux Containers, Rohit Jnagal, Vivek Goyal
In-Reply-To: <CAGr1F2HSi_D07r2c5CKOsjSR1+58k9G2MrtACsd+HV6XKvJ7cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Am 06.01.2015 um 01:10 schrieb Aditya Kali:
> Since the old/default behavior is on its way out, I didn't invest time
> in fixing that. Also, some of the properties that make
> cgroup-namespace simpler are only provided by unified hierarchy (for
> example: a single root-cgroup per container).

Does the new sane cgroupfs behavior even have a single real world user?
I always thought it isn't stable yet.

Linux distros currently use systemd v210. They don't dare to use a newer one.
Even *if* systemd would support the sane sane cgroupfs behavior in the most recent
version it will take 1-2 years until it would hit a recent distro.

So please support also the old and nasty behavior such that one day we can run current
systemd distros in Linux containers.

Thanks,
//richard

^ permalink raw reply

* [PATCH net-next v2 0/8] net: extend ethtool link mode bitmaps to 48 bits
From: David Decotigny @ 2015-01-06  2:54 UTC (permalink / raw)
  To: Amir Vadai, Florian Fainelli, netdev, linux-kernel, linux-api
  Cc: David Decotigny, David S. Miller, Jason Wang, Michael S. Tsirkin,
	Herbert Xu, Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang,
	Neil Horman, WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Saeed Mahameed, Venkata Duvvuru,
	Govindarajulu Varadarajan

From: David Decotigny <decot@googlers.com>

History:
  v2:
    - added ethtool_ops::get_compat_flags() to be able to check user
      parameters in generic ethtool layer when a driver does not
      support a new feature.
    - in generic ethtool code, EINVAL when userspace sets a link mode
      with bits 32..47 set, unless driver explicitly indicates that it
      supports it (get_compat_flags). also print a warning once (per
      lifetime, not per-interface) if driver sets bits 32..47 of link
      mode, unless it explicitly indicated that it supports 48bit link
      modes.
    - don't update phydev->advertising too early (ie. keep original behavior)
  v1:
    initial draft

This patch series increases the width of the supported/advertising
ethtool masks from 32 bits to 48. This should allow to breathe for a
couple more years (or... months?).

It should not cause any backward compatibility issues for
now. However, if user-space passed non-0 to previously-reserved fields
to ethtool_set_settings or ethtool_set_eee, they will be rejected by
non-updated (ie. no get_compat_flags) drivers. Updated drivers will
handle the new expanded link mode masks appropriately. It is
recommended we gradually adopt the new get/set API, and provide the
associated get_compat_flags(), in order to correctly support future
link modes. See ethtool.h for details.

I updated a couple drivers (mlx4, veth, tun), and some shared code in
drivers/net (phy, mii, mdio). It might be overkill for phy/mii/mdio,
and I might have missed other shared code in drivers/net. Please let
me know.

I used the compiler on my private copy to 1/ track where the
ethtool_cmd/ethtool_eee link mode fields were used 2/ track where the
link mode bitmaps are used and updated. So I believe that there is
some sort of transitive closure for the code I updated. Maybe tools
like coccinelle or clang could allow to automate these processes (1/
would probably be easy-ish, but 2/ seems a bit more complex)? I
reverted these internal "tracking" tricks for this version to minimize
impact on uapi/ethtool.h. The main resulting visible artifact of those
tricks is the new ethtool_link_mode_mask_t typedef (aka. u64). I kept
this trivial typedef here to help future refactoring, but I'd be happy
to rename it as plain "u64" if you prefer.

I can send updates to other drivers, even though it's rather pointless
to update 1G drivers at this point for example. Please let me know,
but I'd prefer to do this in follow-up patches outside this first
patch series.

############################################
# Patch Set Summary:

David Decotigny (8):
  net: ethtool: internal compatibility flags to reject non-zero reserved
    fields
  net: ethtool: extend link mode support to 48 bits
  net: phy: extend link mode support to 48 bits
  net: mii: extend link mode support to 48 bits
  net: mdio: extend link mode support to 48 bits
  net: veth: extend link mode support to 48 bits
  net: tun: extend link mode support to 48 bits
  net: mlx4_en: extend link mode support to 48 bits

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  73 ++++++++-----
 drivers/net/mdio.c                              |  59 +++++-----
 drivers/net/mii.c                               |  52 +++++----
 drivers/net/phy/phy.c                           |  30 +++---
 drivers/net/phy/phy_device.c                    |   4 +-
 drivers/net/tun.c                               |   4 +-
 drivers/net/veth.c                              |   4 +-
 include/linux/ethtool.h                         |  25 ++++-
 include/linux/mdio.h                            |  15 +--
 include/linux/mii.h                             |  31 +++---
 include/linux/phy.h                             |  10 +-
 include/uapi/linux/ethtool.h                    | 137 ++++++++++++++++++++----
 net/core/ethtool.c                              |  44 ++++++++
 13 files changed, 343 insertions(+), 145 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply

* [PATCH net-next v2 1/8] net: ethtool: internal compatibility flags to reject non-zero reserved fields
From: David Decotigny @ 2015-01-06  2:54 UTC (permalink / raw)
  To: Amir Vadai, Florian Fainelli, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: David Decotigny, David S. Miller, Jason Wang, Michael S. Tsirkin,
	Herbert Xu, Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang,
	Neil Horman, WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Saeed Mahameed, Venkata Duvvuru,
	Govindarajulu Varadarajan
In-Reply-To: <1420512850-24699-1-git-send-email-ddecotig-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>

The public ethtool API progressively shrinks "reserved" fields to
expand some other fields (eg. link mode masks). This patch allows
drivers to declare that they fully support expanded fields. When they
don't do so, the generic ethtool layer may reject (-EINVAL) userspace
requests that assign values utilizing the newly allocated areas in
these expanded fields (ie. when some reserved field receives != 0).

Signed-off-by: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>
---
 include/linux/ethtool.h      | 19 +++++++++++++++++--
 include/uapi/linux/ethtool.h |  6 +++++-
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 653dc9c..dcb08c1 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -44,6 +44,17 @@ extern int __ethtool_get_settings(struct net_device *dev,
 				  struct ethtool_cmd *cmd);
 
 /**
+ * enum ethtool_compat_flags - bit indices used for %get_compat_flags() bitmaps
+ * @__ETHTOOL_COMPAT_PLACEHOLDER_BIT: to avoid a compiler error,
+ *                                    superseded by next patches
+ */
+enum ethtool_compat_flags {
+	__ETHTOOL_COMPAT_PLACEHOLDER_BIT,
+};
+
+#define __ETH_COMPAT_MASK(name)	(1UL << (ETHTOOL_COMPAT_ ## name ## _BIT))
+
+/**
  * enum ethtool_phys_id_state - indicator state for physical identification
  * @ETHTOOL_ID_INACTIVE: Physical ID indicator should be deactivated
  * @ETHTOOL_ID_ACTIVE: Physical ID indicator should be activated
@@ -99,6 +110,11 @@ static inline u32 ethtool_rxfh_indir_default(u32 index, u32 n_rx_rings)
 
 /**
  * struct ethtool_ops - optional netdev operations
+ * @get_compat_flags: Internal function. Returns the internal ethtool
+ *      compatibility flags: in the absence of this method, or of
+ *      specific compatilibilty flags, the generic layer enforces
+ *      additional constraints on the user space values before
+ *      calling the callbacks below.
  * @get_settings: Get various device settings including Ethernet link
  *	settings. The @cmd parameter is expected to have been cleared
  *	before get_settings is called. Returns a negative error code or
@@ -279,7 +295,6 @@ struct ethtool_ops {
 			       const struct ethtool_tunable *, void *);
 	int	(*set_tunable)(struct net_device *,
 			       const struct ethtool_tunable *, const void *);
-
-
+	u32	(*get_compat_flags)(struct net_device *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 5f66d9c..d063368 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -88,7 +88,8 @@
  * other than @cmd that are not described as read-only or deprecated,
  * and must ignore all fields described as read-only.
  *
- * Deprecated fields should be ignored by both users and drivers.
+ * Deprecated and reserved fields should be ignored by both users and
+ * drivers. If reserved fields must be set, store the value 0 in them.
  */
 struct ethtool_cmd {
 	__u32	cmd;
@@ -300,6 +301,9 @@ struct ethtool_eeprom {
  * @tx_lpi_timer: Time in microseconds the interface delays prior to asserting
  *	its tx lpi (after reaching 'idle' state). Effective only when eee
  *	was negotiated and tx_lpi_enabled was set.
+ *
+ * Deprecated and reserved fields should be ignored by both users and
+ * drivers. If reserved fields must be set, store the value 0 in them.
  */
 struct ethtool_eee {
 	__u32	cmd;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 2/8] net: ethtool: extend link mode support to 48 bits
From: David Decotigny @ 2015-01-06  2:54 UTC (permalink / raw)
  To: Amir Vadai, Florian Fainelli, netdev, linux-kernel, linux-api
  Cc: David Decotigny, David S. Miller, Jason Wang, Michael S. Tsirkin,
	Herbert Xu, Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang,
	Neil Horman, WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Saeed Mahameed, Venkata Duvvuru,
	Govindarajulu Varadarajan
In-Reply-To: <1420512850-24699-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

This patch also ensures that requests from userspace asking to
advertise >= 32b link mode masks will be rejected, unless the driver
explicitly supports this.

Signed-off-by: David Decotigny <decot@googlers.com>
---
 include/linux/ethtool.h      |  12 +++-
 include/uapi/linux/ethtool.h | 131 ++++++++++++++++++++++++++++++++++++-------
 net/core/ethtool.c           |  44 +++++++++++++++
 3 files changed, 164 insertions(+), 23 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index dcb08c1..9baa80f 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -45,15 +45,21 @@ extern int __ethtool_get_settings(struct net_device *dev,
 
 /**
  * enum ethtool_compat_flags - bit indices used for %get_compat_flags() bitmaps
- * @__ETHTOOL_COMPAT_PLACEHOLDER_BIT: to avoid a compiler error,
- *                                    superseded by next patches
+ * @ETHTOOL_COMPAT_SUPPORT_LINK_MODE_48b_BIT: when set, the driver
+ *      handles 48-bit link mode requests from userspace. In its
+ *      absence, the generic %ethtool_set_settings/%ethtool_set_eee
+ *      ioctl handlers will reject the request if user passed an
+ *      advertising link mode with any of the bits 32..47 set.
  */
 enum ethtool_compat_flags {
-	__ETHTOOL_COMPAT_PLACEHOLDER_BIT,
+	ETHTOOL_COMPAT_SUPPORT_LINK_MODE_48b_BIT,
 };
 
 #define __ETH_COMPAT_MASK(name)	(1UL << (ETHTOOL_COMPAT_ ## name ## _BIT))
 
+#define ETH_COMPAT_SUPPORT_LINK_MODE_48b	\
+	__ETH_COMPAT_MASK(SUPPORT_LINK_MODE_48b)
+
 /**
  * enum ethtool_phys_id_state - indicator state for physical identification
  * @ETHTOOL_ID_INACTIVE: Physical ID indicator should be deactivated
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index d063368..ab7c11d 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -23,14 +23,16 @@
 /**
  * struct ethtool_cmd - link control and status
  * @cmd: Command number = %ETHTOOL_GSET or %ETHTOOL_SSET
- * @supported: Bitmask of %SUPPORTED_* flags for the link modes,
- *	physical connectors and other link features for which the
- *	interface supports autonegotiation or auto-detection.
- *	Read-only.
- * @advertising: Bitmask of %ADVERTISED_* flags for the link modes,
- *	physical connectors and other link features that are
- *	advertised through autonegotiation or enabled for
- *	auto-detection.
+ * @supported: Low bits of bitmask of %SUPPORTED_* flags for the link
+ *	modes, physical connectors and other link features for which
+ *	the interface supports autonegotiation or auto-detection.
+ *	Read-only. Please do not access this field directly, use the
+ *	%ethtool_cmd_supported_* family of functions instead.
+ * @advertising: Low bits of bitmask of %ADVERTISED_* flags for the
+ *	link modes, physical connectors and other link features that
+ *	are advertised through autonegotiation or enabled for
+ *	auto-detection. Please do not access this field directly, use
+ *	the %ethtool_cmd_advertising_* family of functions instead.
  * @speed: Low bits of the speed
  * @duplex: Duplex mode; one of %DUPLEX_*
  * @port: Physical connector type; one of %PORT_*
@@ -56,10 +58,22 @@
  *	yield %ETH_TP_MDI_INVALID and writes may be ignored or rejected.
  *	When written successfully, the link should be renegotiated if
  *	necessary.
- * @lp_advertising: Bitmask of %ADVERTISED_* flags for the link modes
- *	and other link features that the link partner advertised
- *	through autonegotiation; 0 if unknown or not applicable.
- *	Read-only.
+ * @lp_advertising: Low bits of bitmask of %ADVERTISED_* flags for the
+ *      link modes and other link features that the link partner
+ *      advertised through autonegotiation; 0 if unknown or not
+ *      applicable.  Read-only. Please do not access this field
+ *      directly, use the %ethtool_cmd_lp_advertising_* family of
+ *      functions instead.
+ * @supported_hi: High bits of bitmask of %SUPPORTED_* flags. See
+ *      %supported. Read-only. Please do not access this field directly,
+ *      use the %ethtool_cmd_supported_* family of functions instead.
+ * @advertising_hi: High bits of bitmask of %ADVERTISING_* flags. See
+ *      %advertising. Please do not access this field directly, use the
+ *      %ethtool_cmd_advertising_* family of functions instead.
+ * @lp_advertising_hi: High bits of bitmask of %ADVERTISING_* flags.
+ *      See %lp_advertising. Read-only. Please do not access this
+ *      field directly, use the %ethtool_cmd_lp_advertising_* family
+ *      of functions instead.
  *
  * The link speed in Mbps is split between @speed and @speed_hi.  Use
  * the ethtool_cmd_speed() and ethtool_cmd_speed_set() functions to
@@ -108,7 +122,10 @@ struct ethtool_cmd {
 	__u8	eth_tp_mdix;
 	__u8	eth_tp_mdix_ctrl;
 	__u32	lp_advertising;
-	__u32	reserved[2];
+	__u16	supported_hi;
+	__u16	advertising_hi;
+	__u16	lp_advertising_hi;
+	__u16	reserved;
 };
 
 static inline void ethtool_cmd_speed_set(struct ethtool_cmd *ep,
@@ -124,6 +141,45 @@ static inline __u32 ethtool_cmd_speed(const struct ethtool_cmd *ep)
 	return (ep->speed_hi << 16) | ep->speed;
 }
 
+/**
+ * ETHTOOL_MAKE_LINK_MODE_ACCESSORS - create the link_mode accessors
+ * Macro to generate the %ethtool_cmd_supported_*,
+ * %ethtool_cmd_advertising_*, %ethtool_cmd_lp_advertising_*,
+ * %ethtool_eee_supported_*, %ethtool_eee_advertised_*,
+ * %ethtool_eee_lp_advertised_* families of functions.
+ *
+ *  @struct_name: either %ethtool_cmd or %ethtool_eee
+ *  @field_name: name of the fields in %struct_name to
+ *      access. supported/advertising/lp_advertising for ethtool_cmd,
+ *      supported/advertised/lp_advertised for ethtool_eee
+ *
+ * Generates the following static functions:
+ *  ethtool_cmd_supported(const struct ethtool_cmd*): returns
+ *      the 64b value of %supported fields (the upper bits 63..48 are 0)
+ *  ethtool_cmd_supported_set(struct ethtool_cmd*,
+ *      ethtool_link_mode_mask_t value): set the %supported fields to
+ *      given %value (only the lower 48 bits used, upper bits 63..48
+ *      ignored)
+ *
+ * Same doc for all the other functions.
+ */
+#define ETHTOOL_MAKE_LINK_MODE_ACCESSORS(struct_name, field_name)	\
+	static inline ethtool_link_mode_mask_t				\
+	struct_name ## _ ## field_name(const struct struct_name *cmd)	\
+	{ ethtool_link_mode_mask_t r = cmd->field_name;			\
+	  r |= ((__u64)cmd->field_name ## _hi) << 32; return r; }	\
+	static inline void						\
+	struct_name ## _ ## field_name ## _set(struct struct_name *cmd,	\
+					       ethtool_link_mode_mask_t mask) \
+	{ cmd->field_name = mask & 0xffffffff;				\
+	  cmd->field_name ## _hi = (mask >> 32) & 0xffff; }
+
+typedef __u64 ethtool_link_mode_mask_t;
+
+ETHTOOL_MAKE_LINK_MODE_ACCESSORS(ethtool_cmd, supported);
+ETHTOOL_MAKE_LINK_MODE_ACCESSORS(ethtool_cmd, advertising);
+ETHTOOL_MAKE_LINK_MODE_ACCESSORS(ethtool_cmd, lp_advertising);
+
 /* Device supports clause 22 register access to PHY or peripherals
  * using the interface defined in <linux/mii.h>.  This should not be
  * set if there are known to be no such peripherals present or if
@@ -288,12 +344,18 @@ struct ethtool_eeprom {
 /**
  * struct ethtool_eee - Energy Efficient Ethernet information
  * @cmd: ETHTOOL_{G,S}EEE
- * @supported: Mask of %SUPPORTED_* flags for the speed/duplex combinations
- *	for which there is EEE support.
- * @advertised: Mask of %ADVERTISED_* flags for the speed/duplex combinations
- *	advertised as eee capable.
- * @lp_advertised: Mask of %ADVERTISED_* flags for the speed/duplex
- *	combinations advertised by the link partner as eee capable.
+ * @supported: Low bits of mask of %SUPPORTED_* flags for the
+ *	speed/duplex combinations for which there is EEE
+ *	support. Please do not access this field directly, use the
+ *	%ethtool_eee_supported_* family of functions instead.
+ * @advertised: Low bits of mask of %ADVERTISED_* flags for the
+ *	speed/duplex combinations advertised as eee capable. Please do
+ *	not access this field directly, use the
+ *	%ethtool_eee_advertised_* family of functions instead.
+ * @lp_advertised: Low bits of mask of %ADVERTISED_* flags for the
+ *	speed/duplex combinations advertised by the link partner as
+ *	eee capable.  Please do not access this field directly, use
+ *	the %ethtool_eee_lp_advertised_* family of functions instead.
  * @eee_active: Result of the eee auto negotiation.
  * @eee_enabled: EEE configured mode (enabled/disabled).
  * @tx_lpi_enabled: Whether the interface should assert its tx lpi, given
@@ -301,6 +363,16 @@ struct ethtool_eeprom {
  * @tx_lpi_timer: Time in microseconds the interface delays prior to asserting
  *	its tx lpi (after reaching 'idle' state). Effective only when eee
  *	was negotiated and tx_lpi_enabled was set.
+ * @supported_hi: High bits of mask of %SUPPORTED_* flags. See
+ *      %supported. Please do not access this field directly, use the
+ *      %ethtool_eee_supported_* family of functions instead.
+ * @advertised_hi: High bits of mask of %ADVERTISING_* flags. See
+ *      %advertising. Please do not access this field directly, use
+ *      the %ethtool_eee_advertising_* family of functions instead.
+ * @lp_advertised_hi: High bits of mask of %ADVERTISING_* flags.
+ *      See %lp_advertising. Please do not access this field directly,
+ *      use the %ethtool_eee_lp_advertising_* family of functions
+ *      instead.
  *
  * Deprecated and reserved fields should be ignored by both users and
  * drivers. If reserved fields must be set, store the value 0 in them.
@@ -314,9 +386,17 @@ struct ethtool_eee {
 	__u32	eee_enabled;
 	__u32	tx_lpi_enabled;
 	__u32	tx_lpi_timer;
-	__u32	reserved[2];
+	__u16	supported_hi;
+	__u16	advertised_hi;
+	__u16	lp_advertised_hi;
+	__u16	reserved;
 };
 
+ETHTOOL_MAKE_LINK_MODE_ACCESSORS(ethtool_eee, supported);
+ETHTOOL_MAKE_LINK_MODE_ACCESSORS(ethtool_eee, advertised);
+ETHTOOL_MAKE_LINK_MODE_ACCESSORS(ethtool_eee, lp_advertised);
+
+
 /**
  * struct ethtool_modinfo - plugin module eeprom information
  * @cmd: %ETHTOOL_GMODULEINFO
@@ -1196,6 +1276,11 @@ enum ethtool_sfeatures_retval_bits {
 #define SPARC_ETH_GSET		ETHTOOL_GSET
 #define SPARC_ETH_SSET		ETHTOOL_SSET
 
+/*
+ * Do not use the following macros directly to update
+ * ethtool_cmd::supported, ethtool_eee::supported. Please use
+ * ethtool_(cmd|eee)_supported(|_set) instead.
+ */
 #define SUPPORTED_10baseT_Half		(1 << 0)
 #define SUPPORTED_10baseT_Full		(1 << 1)
 #define SUPPORTED_100baseT_Half		(1 << 2)
@@ -1228,6 +1313,12 @@ enum ethtool_sfeatures_retval_bits {
 #define SUPPORTED_56000baseSR4_Full	(1 << 29)
 #define SUPPORTED_56000baseLR4_Full	(1 << 30)
 
+/*
+ * Do not use the following macros directly to update
+ * ethtool_cmd::advertising, ethtool_cmd::lp_advertising,
+ * ethtool_eee::advertised, ethtool_eee::lp_advertised. Please use
+ * ethtool_(cmd|eee)_*(|_set).
+ */
 #define ADVERTISED_10baseT_Half		(1 << 0)
 #define ADVERTISED_10baseT_Full		(1 << 1)
 #define ADVERTISED_100baseT_Half	(1 << 2)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 550892c..462a8f4 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -362,6 +362,19 @@ static int ethtool_get_settings(struct net_device *dev, void __user *useraddr)
 	if (err < 0)
 		return err;
 
+	/* Log a warning if driver reports high link mode bits when
+	 * it's not officially supporting them
+	 */
+	if (!dev->ethtool_ops->get_compat_flags ||
+	    !(dev->ethtool_ops->get_compat_flags(dev)
+	      & ETH_COMPAT_SUPPORT_LINK_MODE_48b)) {
+		WARN_ONCE((0 != cmd.supported_hi) ||
+			  (0 != cmd.advertising_hi) ||
+			  (0 != cmd.lp_advertising_hi),
+			  "%s: using ethtool link mode high bits",
+			  netdev_name(dev));
+	}
+
 	if (copy_to_user(useraddr, &cmd, sizeof(cmd)))
 		return -EFAULT;
 	return 0;
@@ -377,6 +390,15 @@ static int ethtool_set_settings(struct net_device *dev, void __user *useraddr)
 	if (copy_from_user(&cmd, useraddr, sizeof(cmd)))
 		return -EFAULT;
 
+	/* Reject the high advertising bits if driver doesn't support
+	 * them
+	 */
+	if (cmd.advertising_hi &&
+	    (!dev->ethtool_ops->get_compat_flags ||
+	     !(dev->ethtool_ops->get_compat_flags(dev)
+	       & ETH_COMPAT_SUPPORT_LINK_MODE_48b)))
+		return -EINVAL;
+
 	return dev->ethtool_ops->set_settings(dev, &cmd);
 }
 
@@ -955,6 +977,19 @@ static int ethtool_get_eee(struct net_device *dev, char __user *useraddr)
 	if (rc)
 		return rc;
 
+	/* Log a warning if driver reports high link mode bits when
+	 * it's not officially supporting them
+	 */
+	if (!dev->ethtool_ops->get_compat_flags ||
+	    !(dev->ethtool_ops->get_compat_flags(dev)
+	      & ETH_COMPAT_SUPPORT_LINK_MODE_48b)) {
+		WARN_ONCE((0 != edata.supported_hi) ||
+			  (0 != edata.advertised_hi) ||
+			  (0 != edata.lp_advertised_hi),
+			  "%s: using ethtool EEE link mode high bits",
+			  netdev_name(dev));
+	}
+
 	if (copy_to_user(useraddr, &edata, sizeof(edata)))
 		return -EFAULT;
 
@@ -971,6 +1006,15 @@ static int ethtool_set_eee(struct net_device *dev, char __user *useraddr)
 	if (copy_from_user(&edata, useraddr, sizeof(edata)))
 		return -EFAULT;
 
+	/* Reject the high advertising bits if driver doesn't support
+	 * them
+	 */
+	if (edata.advertised_hi &&
+	    (!dev->ethtool_ops->get_compat_flags ||
+	     !(dev->ethtool_ops->get_compat_flags(dev)
+		     & ETH_COMPAT_SUPPORT_LINK_MODE_48b)))
+		return -EINVAL;
+
 	return dev->ethtool_ops->set_eee(dev, &edata);
 }
 
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 3/8] net: phy: extend link mode support to 48 bits
From: David Decotigny @ 2015-01-06  2:54 UTC (permalink / raw)
  To: Amir Vadai, Florian Fainelli, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: David Decotigny, David S. Miller, Jason Wang, Michael S. Tsirkin,
	Herbert Xu, Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang,
	Neil Horman, WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Saeed Mahameed, Venkata Duvvuru,
	Govindarajulu Varadarajan
In-Reply-To: <1420512850-24699-1-git-send-email-ddecotig-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>

Signed-off-by: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>
---
 drivers/net/phy/phy.c        | 30 ++++++++++++++++--------------
 drivers/net/phy/phy_device.c |  4 ++--
 include/linux/phy.h          | 10 +++++-----
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 767cd11..77d4eb9 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -132,7 +132,7 @@ static inline int phy_aneg_done(struct phy_device *phydev)
 struct phy_setting {
 	int speed;
 	int duplex;
-	u32 setting;
+	ethtool_link_mode_mask_t setting;
 };
 
 /* A mapping of all SUPPORTED settings to speed/duplex */
@@ -227,7 +227,8 @@ static inline unsigned int phy_find_setting(int speed, int duplex)
  *   the mask in features.  Returns the index of the last setting
  *   if nothing else matches.
  */
-static inline unsigned int phy_find_valid(unsigned int idx, u32 features)
+static inline unsigned int phy_find_valid(unsigned int idx,
+					  ethtool_link_mode_mask_t features)
 {
 	while (idx < MAX_NUM_SETTINGS && !(settings[idx].setting & features))
 		idx++;
@@ -245,7 +246,7 @@ static inline unsigned int phy_find_valid(unsigned int idx, u32 features)
  */
 static void phy_sanitize_settings(struct phy_device *phydev)
 {
-	u32 features = phydev->supported;
+	ethtool_link_mode_mask_t features = phydev->supported;
 	unsigned int idx;
 
 	/* Sanitize settings based on PHY capabilities */
@@ -274,18 +275,19 @@ static void phy_sanitize_settings(struct phy_device *phydev)
 int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd)
 {
 	u32 speed = ethtool_cmd_speed(cmd);
+	ethtool_link_mode_mask_t eth_adv = ethtool_cmd_advertising(cmd);
 
 	if (cmd->phy_address != phydev->addr)
 		return -EINVAL;
 
 	/* We make sure that we don't pass unsupported values in to the PHY */
-	cmd->advertising &= phydev->supported;
+	eth_adv &= phydev->supported;
 
 	/* Verify the settings we care about. */
 	if (cmd->autoneg != AUTONEG_ENABLE && cmd->autoneg != AUTONEG_DISABLE)
 		return -EINVAL;
 
-	if (cmd->autoneg == AUTONEG_ENABLE && cmd->advertising == 0)
+	if (cmd->autoneg == AUTONEG_ENABLE && eth_adv == 0)
 		return -EINVAL;
 
 	if (cmd->autoneg == AUTONEG_DISABLE &&
@@ -300,7 +302,7 @@ int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd)
 
 	phydev->speed = speed;
 
-	phydev->advertising = cmd->advertising;
+	phydev->advertising = eth_adv;
 
 	if (AUTONEG_ENABLE == cmd->autoneg)
 		phydev->advertising |= ADVERTISED_Autoneg;
@@ -318,10 +320,10 @@ EXPORT_SYMBOL(phy_ethtool_sset);
 
 int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd)
 {
-	cmd->supported = phydev->supported;
+	ethtool_cmd_supported_set(cmd, phydev->supported);
 
-	cmd->advertising = phydev->advertising;
-	cmd->lp_advertising = phydev->lp_advertising;
+	ethtool_cmd_advertising_set(cmd, phydev->advertising);
+	ethtool_cmd_lp_advertising_set(cmd, phydev->lp_advertising);
 
 	ethtool_cmd_speed_set(cmd, phydev->speed);
 	cmd->duplex = phydev->duplex;
@@ -1040,7 +1042,7 @@ int phy_init_eee(struct phy_device *phydev, bool clk_stop_enable)
 	    (phydev->interface == PHY_INTERFACE_MODE_RGMII) ||
 	     phy_is_internal(phydev))) {
 		int eee_lp, eee_cap, eee_adv;
-		u32 lp, cap, adv;
+		ethtool_link_mode_mask_t cap, adv, lp;
 		int status;
 		unsigned int idx;
 
@@ -1132,21 +1134,21 @@ int phy_ethtool_get_eee(struct phy_device *phydev, struct ethtool_eee *data)
 				    MDIO_MMD_PCS, phydev->addr);
 	if (val < 0)
 		return val;
-	data->supported = mmd_eee_cap_to_ethtool_sup_t(val);
+	ethtool_eee_supported_set(data, mmd_eee_cap_to_ethtool_sup_t(val));
 
 	/* Get advertisement EEE */
 	val = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
 				    MDIO_MMD_AN, phydev->addr);
 	if (val < 0)
 		return val;
-	data->advertised = mmd_eee_adv_to_ethtool_adv_t(val);
+	ethtool_eee_advertised_set(data, mmd_eee_adv_to_ethtool_adv_t(val));
 
 	/* Get LP advertisement EEE */
 	val = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_LPABLE,
 				    MDIO_MMD_AN, phydev->addr);
 	if (val < 0)
 		return val;
-	data->lp_advertised = mmd_eee_adv_to_ethtool_adv_t(val);
+	ethtool_eee_lp_advertised_set(data, mmd_eee_adv_to_ethtool_adv_t(val));
 
 	return 0;
 }
@@ -1161,7 +1163,7 @@ EXPORT_SYMBOL(phy_ethtool_get_eee);
  */
 int phy_ethtool_set_eee(struct phy_device *phydev, struct ethtool_eee *data)
 {
-	int val = ethtool_adv_to_mmd_eee_adv_t(data->advertised);
+	int val = ethtool_adv_to_mmd_eee_adv_t(ethtool_eee_advertised(data));
 
 	phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV, MDIO_MMD_AN,
 			       phydev->addr, val);
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 3fc91e8..4391dc7 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -734,7 +734,7 @@ EXPORT_SYMBOL(phy_resume);
  */
 static int genphy_config_advert(struct phy_device *phydev)
 {
-	u32 advertise;
+	ethtool_link_mode_mask_t advertise;
 	int oldadv, adv, bmsr;
 	int err, changed = 0;
 
@@ -1086,7 +1086,7 @@ EXPORT_SYMBOL(genphy_soft_reset);
 int genphy_config_init(struct phy_device *phydev)
 {
 	int val;
-	u32 features;
+	ethtool_link_mode_mask_t features;
 
 	features = (SUPPORTED_TP | SUPPORTED_MII
 			| SUPPORTED_AUI | SUPPORTED_FIBRE |
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 22af8f8..aabf808 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -390,10 +390,10 @@ struct phy_device {
 	u32 interrupts;
 
 	/* Union of PHY and Attached devices' supported modes */
-	/* See mii.h for more info */
-	u32 supported;
-	u32 advertising;
-	u32 lp_advertising;
+	/* See ethtool.h for more info */
+	ethtool_link_mode_mask_t supported;
+	ethtool_link_mode_mask_t advertising;
+	ethtool_link_mode_mask_t lp_advertising;
 
 	int autoneg;
 
@@ -447,7 +447,7 @@ struct phy_driver {
 	u32 phy_id;
 	char *name;
 	unsigned int phy_id_mask;
-	u32 features;
+	ethtool_link_mode_mask_t features;
 	u32 flags;
 	const void *driver_data;
 
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 4/8] net: mii: extend link mode support to 48 bits
From: David Decotigny @ 2015-01-06  2:54 UTC (permalink / raw)
  To: Amir Vadai, Florian Fainelli, netdev, linux-kernel, linux-api
  Cc: David Decotigny, David S. Miller, Jason Wang, Michael S. Tsirkin,
	Herbert Xu, Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang,
	Neil Horman, WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Saeed Mahameed, Venkata Duvvuru,
	Govindarajulu Varadarajan
In-Reply-To: <1420512850-24699-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

Signed-off-by: David Decotigny <decot@googlers.com>
---
 drivers/net/mii.c   | 52 +++++++++++++++++++++++++++++-----------------------
 include/linux/mii.h | 31 ++++++++++++++++---------------
 2 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/drivers/net/mii.c b/drivers/net/mii.c
index 4a99c39..2be59ba 100644
--- a/drivers/net/mii.c
+++ b/drivers/net/mii.c
@@ -33,7 +33,7 @@
 #include <linux/ethtool.h>
 #include <linux/mii.h>
 
-static u32 mii_get_an(struct mii_if_info *mii, u16 addr)
+static ethtool_link_mode_mask_t mii_get_an(struct mii_if_info *mii, u16 addr)
 {
 	int advert;
 
@@ -56,14 +56,15 @@ int mii_ethtool_gset(struct mii_if_info *mii, struct ethtool_cmd *ecmd)
 {
 	struct net_device *dev = mii->dev;
 	u16 bmcr, bmsr, ctrl1000 = 0, stat1000 = 0;
-	u32 nego;
+	ethtool_link_mode_mask_t supported_link_modes, advertising_link_modes;
+	ethtool_link_mode_mask_t nego;
 
-	ecmd->supported =
+	supported_link_modes =
 	    (SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full |
 	     SUPPORTED_100baseT_Half | SUPPORTED_100baseT_Full |
 	     SUPPORTED_Autoneg | SUPPORTED_TP | SUPPORTED_MII);
 	if (mii->supports_gmii)
-		ecmd->supported |= SUPPORTED_1000baseT_Half |
+		supported_link_modes |= SUPPORTED_1000baseT_Half |
 			SUPPORTED_1000baseT_Full;
 
 	/* only supports twisted-pair */
@@ -76,7 +77,7 @@ int mii_ethtool_gset(struct mii_if_info *mii, struct ethtool_cmd *ecmd)
 	ecmd->phy_address = mii->phy_id;
 	ecmd->mdio_support = ETH_MDIO_SUPPORTS_C22;
 
-	ecmd->advertising = ADVERTISED_TP | ADVERTISED_MII;
+	advertising_link_modes = ADVERTISED_TP | ADVERTISED_MII;
 
 	bmcr = mii->mdio_read(dev, mii->phy_id, MII_BMCR);
 	bmsr = mii->mdio_read(dev, mii->phy_id, MII_BMSR);
@@ -85,23 +86,25 @@ int mii_ethtool_gset(struct mii_if_info *mii, struct ethtool_cmd *ecmd)
 		stat1000 = mii->mdio_read(dev, mii->phy_id, MII_STAT1000);
 	}
 	if (bmcr & BMCR_ANENABLE) {
-		ecmd->advertising |= ADVERTISED_Autoneg;
+		ethtool_link_mode_mask_t lp_adv;
+
+		advertising_link_modes |= ADVERTISED_Autoneg;
 		ecmd->autoneg = AUTONEG_ENABLE;
 
-		ecmd->advertising |= mii_get_an(mii, MII_ADVERTISE);
+		advertising_link_modes |= mii_get_an(mii, MII_ADVERTISE);
 		if (mii->supports_gmii)
-			ecmd->advertising |=
+			advertising_link_modes |=
 					mii_ctrl1000_to_ethtool_adv_t(ctrl1000);
 
 		if (bmsr & BMSR_ANEGCOMPLETE) {
-			ecmd->lp_advertising = mii_get_an(mii, MII_LPA);
-			ecmd->lp_advertising |=
-					mii_stat1000_to_ethtool_lpa_t(stat1000);
+			lp_adv = mii_get_an(mii, MII_LPA);
+			lp_adv |= mii_stat1000_to_ethtool_lpa_t(stat1000);
 		} else {
-			ecmd->lp_advertising = 0;
+			lp_adv = 0;
 		}
 
-		nego = ecmd->advertising & ecmd->lp_advertising;
+		ethtool_cmd_lp_advertising_set(ecmd, lp_adv);
+		nego = advertising_link_modes & lp_adv;
 
 		if (nego & (ADVERTISED_1000baseT_Full |
 			    ADVERTISED_1000baseT_Half)) {
@@ -128,6 +131,8 @@ int mii_ethtool_gset(struct mii_if_info *mii, struct ethtool_cmd *ecmd)
 	}
 
 	mii->full_duplex = ecmd->duplex;
+	ethtool_cmd_supported_set(ecmd, supported_link_modes);
+	ethtool_cmd_advertising_set(ecmd, advertising_link_modes);
 
 	/* ignore maxtxpkt, maxrxpkt for now */
 
@@ -168,13 +173,15 @@ int mii_ethtool_sset(struct mii_if_info *mii, struct ethtool_cmd *ecmd)
 	if (ecmd->autoneg == AUTONEG_ENABLE) {
 		u32 bmcr, advert, tmp;
 		u32 advert2 = 0, tmp2 = 0;
-
-		if ((ecmd->advertising & (ADVERTISED_10baseT_Half |
-					  ADVERTISED_10baseT_Full |
-					  ADVERTISED_100baseT_Half |
-					  ADVERTISED_100baseT_Full |
-					  ADVERTISED_1000baseT_Half |
-					  ADVERTISED_1000baseT_Full)) == 0)
+		ethtool_link_mode_mask_t ethtool_adv;
+
+		ethtool_adv = ethtool_cmd_advertising(ecmd);
+		if ((ethtool_adv & (ADVERTISED_10baseT_Half |
+				    ADVERTISED_10baseT_Full |
+				    ADVERTISED_100baseT_Half |
+				    ADVERTISED_100baseT_Full |
+				    ADVERTISED_1000baseT_Half |
+				    ADVERTISED_1000baseT_Full)) == 0)
 			return -EINVAL;
 
 		/* advertise only what has been requested */
@@ -184,11 +191,10 @@ int mii_ethtool_sset(struct mii_if_info *mii, struct ethtool_cmd *ecmd)
 			advert2 = mii->mdio_read(dev, mii->phy_id, MII_CTRL1000);
 			tmp2 = advert2 & ~(ADVERTISE_1000HALF | ADVERTISE_1000FULL);
 		}
-		tmp |= ethtool_adv_to_mii_adv_t(ecmd->advertising);
+		tmp |= ethtool_adv_to_mii_adv_t(ethtool_adv);
 
 		if (mii->supports_gmii)
-			tmp2 |=
-			      ethtool_adv_to_mii_ctrl1000_t(ecmd->advertising);
+			tmp2 |= ethtool_adv_to_mii_ctrl1000_t(ethtool_adv);
 		if (advert != tmp) {
 			mii->mdio_write(dev, mii->phy_id, MII_ADVERTISE, tmp);
 			mii->advertising = tmp;
diff --git a/include/linux/mii.h b/include/linux/mii.h
index 47492c9..6f5336c 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -106,7 +106,7 @@ static inline unsigned int mii_duplex (unsigned int duplex_lock,
  * settings to phy autonegotiation advertisements for the
  * MII_ADVERTISE register.
  */
-static inline u32 ethtool_adv_to_mii_adv_t(u32 ethadv)
+static inline u32 ethtool_adv_to_mii_adv_t(ethtool_link_mode_mask_t ethadv)
 {
 	u32 result = 0;
 
@@ -133,9 +133,9 @@ static inline u32 ethtool_adv_to_mii_adv_t(u32 ethadv)
  * A small helper function that translates MII_ADVERTISE bits
  * to ethtool advertisement settings.
  */
-static inline u32 mii_adv_to_ethtool_adv_t(u32 adv)
+static inline ethtool_link_mode_mask_t mii_adv_to_ethtool_adv_t(u32 adv)
 {
-	u32 result = 0;
+	ethtool_link_mode_mask_t result = 0;
 
 	if (adv & ADVERTISE_10HALF)
 		result |= ADVERTISED_10baseT_Half;
@@ -161,7 +161,8 @@ static inline u32 mii_adv_to_ethtool_adv_t(u32 adv)
  * settings to phy autonegotiation advertisements for the
  * MII_CTRL1000 register when in 1000T mode.
  */
-static inline u32 ethtool_adv_to_mii_ctrl1000_t(u32 ethadv)
+static inline u32
+ethtool_adv_to_mii_ctrl1000_t(ethtool_link_mode_mask_t ethadv)
 {
 	u32 result = 0;
 
@@ -181,9 +182,9 @@ static inline u32 ethtool_adv_to_mii_ctrl1000_t(u32 ethadv)
  * bits, when in 1000Base-T mode, to ethtool
  * advertisement settings.
  */
-static inline u32 mii_ctrl1000_to_ethtool_adv_t(u32 adv)
+static inline ethtool_link_mode_mask_t mii_ctrl1000_to_ethtool_adv_t(u32 adv)
 {
-	u32 result = 0;
+	ethtool_link_mode_mask_t result = 0;
 
 	if (adv & ADVERTISE_1000HALF)
 		result |= ADVERTISED_1000baseT_Half;
@@ -201,9 +202,9 @@ static inline u32 mii_ctrl1000_to_ethtool_adv_t(u32 adv)
  * bits, when in 1000Base-T mode, to ethtool
  * LP advertisement settings.
  */
-static inline u32 mii_lpa_to_ethtool_lpa_t(u32 lpa)
+static inline ethtool_link_mode_mask_t mii_lpa_to_ethtool_lpa_t(u32 lpa)
 {
-	u32 result = 0;
+	ethtool_link_mode_mask_t result = 0;
 
 	if (lpa & LPA_LPACK)
 		result |= ADVERTISED_Autoneg;
@@ -219,9 +220,9 @@ static inline u32 mii_lpa_to_ethtool_lpa_t(u32 lpa)
  * bits, when in 1000Base-T mode, to ethtool
  * advertisement settings.
  */
-static inline u32 mii_stat1000_to_ethtool_lpa_t(u32 lpa)
+static inline ethtool_link_mode_mask_t mii_stat1000_to_ethtool_lpa_t(u32 lpa)
 {
-	u32 result = 0;
+	ethtool_link_mode_mask_t result = 0;
 
 	if (lpa & LPA_1000HALF)
 		result |= ADVERTISED_1000baseT_Half;
@@ -239,7 +240,7 @@ static inline u32 mii_stat1000_to_ethtool_lpa_t(u32 lpa)
  * settings to phy autonegotiation advertisements for the
  * MII_CTRL1000 register when in 1000Base-X mode.
  */
-static inline u32 ethtool_adv_to_mii_adv_x(u32 ethadv)
+static inline u32 ethtool_adv_to_mii_adv_x(ethtool_link_mode_mask_t ethadv)
 {
 	u32 result = 0;
 
@@ -263,9 +264,9 @@ static inline u32 ethtool_adv_to_mii_adv_x(u32 ethadv)
  * bits, when in 1000Base-X mode, to ethtool
  * advertisement settings.
  */
-static inline u32 mii_adv_to_ethtool_adv_x(u32 adv)
+static inline ethtool_link_mode_mask_t mii_adv_to_ethtool_adv_x(u32 adv)
 {
-	u32 result = 0;
+	ethtool_link_mode_mask_t result = 0;
 
 	if (adv & ADVERTISE_1000XHALF)
 		result |= ADVERTISED_1000baseT_Half;
@@ -287,9 +288,9 @@ static inline u32 mii_adv_to_ethtool_adv_x(u32 adv)
  * bits, when in 1000Base-X mode, to ethtool
  * LP advertisement settings.
  */
-static inline u32 mii_lpa_to_ethtool_lpa_x(u32 lpa)
+static inline ethtool_link_mode_mask_t mii_lpa_to_ethtool_lpa_x(u32 lpa)
 {
-	u32 result = 0;
+	ethtool_link_mode_mask_t result = 0;
 
 	if (lpa & LPA_LPACK)
 		result |= ADVERTISED_Autoneg;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 5/8] net: mdio: extend link mode support to 48 bits
From: David Decotigny @ 2015-01-06  2:54 UTC (permalink / raw)
  To: Amir Vadai, Florian Fainelli, netdev, linux-kernel, linux-api
  Cc: David Decotigny, David S. Miller, Jason Wang, Michael S. Tsirkin,
	Herbert Xu, Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang,
	Neil Horman, WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Saeed Mahameed, Venkata Duvvuru,
	Govindarajulu Varadarajan
In-Reply-To: <1420512850-24699-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

Signed-off-by: David Decotigny <decot@googlers.com>
---
 drivers/net/mdio.c   | 59 +++++++++++++++++++++++++++++-----------------------
 include/linux/mdio.h | 15 +++++++------
 2 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/drivers/net/mdio.c b/drivers/net/mdio.c
index 3e027ed..5cac2ac 100644
--- a/drivers/net/mdio.c
+++ b/drivers/net/mdio.c
@@ -148,9 +148,10 @@ int mdio45_nway_restart(const struct mdio_if_info *mdio)
 }
 EXPORT_SYMBOL(mdio45_nway_restart);
 
-static u32 mdio45_get_an(const struct mdio_if_info *mdio, u16 addr)
+static ethtool_link_mode_mask_t
+mdio45_get_an(const struct mdio_if_info *mdio, u16 addr)
 {
-	u32 result = 0;
+	ethtool_link_mode_mask_t result = 0;
 	int reg;
 
 	reg = mdio->mdio_read(mdio->dev, mdio->prtad, MDIO_MMD_AN, addr);
@@ -185,9 +186,11 @@ static u32 mdio45_get_an(const struct mdio_if_info *mdio, u16 addr)
  */
 void mdio45_ethtool_gset_npage(const struct mdio_if_info *mdio,
 			       struct ethtool_cmd *ecmd,
-			       u32 npage_adv, u32 npage_lpa)
+			       ethtool_link_mode_mask_t npage_adv,
+			       ethtool_link_mode_mask_t npage_lpa)
 {
 	int reg;
+	ethtool_link_mode_mask_t supported_link_modes, advertising_link_modes;
 	u32 speed;
 
 	BUILD_BUG_ON(MDIO_SUPPORTS_C22 != ETH_MDIO_SUPPORTS_C22);
@@ -206,64 +209,64 @@ void mdio45_ethtool_gset_npage(const struct mdio_if_info *mdio,
 	case MDIO_PMA_CTRL2_100BTX:
 	case MDIO_PMA_CTRL2_10BT:
 		ecmd->port = PORT_TP;
-		ecmd->supported = SUPPORTED_TP;
+		supported_link_modes = SUPPORTED_TP;
 		reg = mdio->mdio_read(mdio->dev, mdio->prtad, MDIO_MMD_PMAPMD,
 				      MDIO_SPEED);
 		if (reg & MDIO_SPEED_10G)
-			ecmd->supported |= SUPPORTED_10000baseT_Full;
+			supported_link_modes |= SUPPORTED_10000baseT_Full;
 		if (reg & MDIO_PMA_SPEED_1000)
-			ecmd->supported |= (SUPPORTED_1000baseT_Full |
+			supported_link_modes |= (SUPPORTED_1000baseT_Full |
 					    SUPPORTED_1000baseT_Half);
 		if (reg & MDIO_PMA_SPEED_100)
-			ecmd->supported |= (SUPPORTED_100baseT_Full |
+			supported_link_modes |= (SUPPORTED_100baseT_Full |
 					    SUPPORTED_100baseT_Half);
 		if (reg & MDIO_PMA_SPEED_10)
-			ecmd->supported |= (SUPPORTED_10baseT_Full |
+			supported_link_modes |= (SUPPORTED_10baseT_Full |
 					    SUPPORTED_10baseT_Half);
-		ecmd->advertising = ADVERTISED_TP;
+		advertising_link_modes = ADVERTISED_TP;
 		break;
 
 	case MDIO_PMA_CTRL2_10GBCX4:
 		ecmd->port = PORT_OTHER;
-		ecmd->supported = 0;
-		ecmd->advertising = 0;
+		supported_link_modes = 0;
+		advertising_link_modes = 0;
 		break;
 
 	case MDIO_PMA_CTRL2_10GBKX4:
 	case MDIO_PMA_CTRL2_10GBKR:
 	case MDIO_PMA_CTRL2_1000BKX:
 		ecmd->port = PORT_OTHER;
-		ecmd->supported = SUPPORTED_Backplane;
+		supported_link_modes = SUPPORTED_Backplane;
 		reg = mdio->mdio_read(mdio->dev, mdio->prtad, MDIO_MMD_PMAPMD,
 				      MDIO_PMA_EXTABLE);
 		if (reg & MDIO_PMA_EXTABLE_10GBKX4)
-			ecmd->supported |= SUPPORTED_10000baseKX4_Full;
+			supported_link_modes |= SUPPORTED_10000baseKX4_Full;
 		if (reg & MDIO_PMA_EXTABLE_10GBKR)
-			ecmd->supported |= SUPPORTED_10000baseKR_Full;
+			supported_link_modes |= SUPPORTED_10000baseKR_Full;
 		if (reg & MDIO_PMA_EXTABLE_1000BKX)
-			ecmd->supported |= SUPPORTED_1000baseKX_Full;
+			supported_link_modes |= SUPPORTED_1000baseKX_Full;
 		reg = mdio->mdio_read(mdio->dev, mdio->prtad, MDIO_MMD_PMAPMD,
 				      MDIO_PMA_10GBR_FECABLE);
 		if (reg & MDIO_PMA_10GBR_FECABLE_ABLE)
-			ecmd->supported |= SUPPORTED_10000baseR_FEC;
-		ecmd->advertising = ADVERTISED_Backplane;
+			supported_link_modes |= SUPPORTED_10000baseR_FEC;
+		advertising_link_modes = ADVERTISED_Backplane;
 		break;
 
 	/* All the other defined modes are flavours of optical */
 	default:
 		ecmd->port = PORT_FIBRE;
-		ecmd->supported = SUPPORTED_FIBRE;
-		ecmd->advertising = ADVERTISED_FIBRE;
+		supported_link_modes = SUPPORTED_FIBRE;
+		advertising_link_modes = ADVERTISED_FIBRE;
 		break;
 	}
 
 	if (mdio->mmds & MDIO_DEVS_AN) {
-		ecmd->supported |= SUPPORTED_Autoneg;
+		supported_link_modes |= SUPPORTED_Autoneg;
 		reg = mdio->mdio_read(mdio->dev, mdio->prtad, MDIO_MMD_AN,
 				      MDIO_CTRL1);
 		if (reg & MDIO_AN_CTRL1_ENABLE) {
 			ecmd->autoneg = AUTONEG_ENABLE;
-			ecmd->advertising |=
+			advertising_link_modes |=
 				ADVERTISED_Autoneg |
 				mdio45_get_an(mdio, MDIO_AN_ADVERTISE) |
 				npage_adv;
@@ -275,21 +278,22 @@ void mdio45_ethtool_gset_npage(const struct mdio_if_info *mdio,
 	}
 
 	if (ecmd->autoneg) {
-		u32 modes = 0;
+		ethtool_link_mode_mask_t modes = 0;
 		int an_stat = mdio->mdio_read(mdio->dev, mdio->prtad,
 					      MDIO_MMD_AN, MDIO_STAT1);
 
 		/* If AN is complete and successful, report best common
 		 * mode, otherwise report best advertised mode. */
 		if (an_stat & MDIO_AN_STAT1_COMPLETE) {
-			ecmd->lp_advertising =
+			ethtool_link_mode_mask_t lp_adv =
 				mdio45_get_an(mdio, MDIO_AN_LPA) | npage_lpa;
 			if (an_stat & MDIO_AN_STAT1_LPABLE)
-				ecmd->lp_advertising |= ADVERTISED_Autoneg;
-			modes = ecmd->advertising & ecmd->lp_advertising;
+				lp_adv |= ADVERTISED_Autoneg;
+			ethtool_cmd_lp_advertising_set(ecmd, lp_adv);
+			modes = advertising_link_modes & lp_adv;
 		}
 		if ((modes & ~ADVERTISED_Autoneg) == 0)
-			modes = ecmd->advertising;
+			modes = advertising_link_modes;
 
 		if (modes & (ADVERTISED_10000baseT_Full |
 			     ADVERTISED_10000baseKX4_Full |
@@ -338,6 +342,9 @@ void mdio45_ethtool_gset_npage(const struct mdio_if_info *mdio,
 			break;
 		}
 	}
+
+	ethtool_cmd_supported_set(ecmd, supported_link_modes);
+	ethtool_cmd_advertising_set(ecmd, advertising_link_modes);
 }
 EXPORT_SYMBOL(mdio45_ethtool_gset_npage);
 
diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index b42963b..18c649a 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -69,7 +69,8 @@ extern int mdio45_links_ok(const struct mdio_if_info *mdio, u32 mmds);
 extern int mdio45_nway_restart(const struct mdio_if_info *mdio);
 extern void mdio45_ethtool_gset_npage(const struct mdio_if_info *mdio,
 				      struct ethtool_cmd *ecmd,
-				      u32 npage_adv, u32 npage_lpa);
+				      ethtool_link_mode_mask_t npage_adv,
+				      ethtool_link_mode_mask_t npage_lpa);
 
 /**
  * mdio45_ethtool_gset - get settings for ETHTOOL_GSET
@@ -97,9 +98,10 @@ extern int mdio_mii_ioctl(const struct mdio_if_info *mdio,
  * A small helper function that translates MMD EEE Capability (3.20) bits
  * to ethtool supported settings.
  */
-static inline u32 mmd_eee_cap_to_ethtool_sup_t(u16 eee_cap)
+static inline ethtool_link_mode_mask_t
+mmd_eee_cap_to_ethtool_sup_t(u16 eee_cap)
 {
-	u32 supported = 0;
+	ethtool_link_mode_mask_t supported = 0;
 
 	if (eee_cap & MDIO_EEE_100TX)
 		supported |= SUPPORTED_100baseT_Full;
@@ -125,9 +127,10 @@ static inline u32 mmd_eee_cap_to_ethtool_sup_t(u16 eee_cap)
  * and MMD EEE Link Partner Ability (7.61) bits to ethtool advertisement
  * settings.
  */
-static inline u32 mmd_eee_adv_to_ethtool_adv_t(u16 eee_adv)
+static inline ethtool_link_mode_mask_t
+mmd_eee_adv_to_ethtool_adv_t(u16 eee_adv)
 {
-	u32 adv = 0;
+	ethtool_link_mode_mask_t adv = 0;
 
 	if (eee_adv & MDIO_EEE_100TX)
 		adv |= ADVERTISED_100baseT_Full;
@@ -153,7 +156,7 @@ static inline u32 mmd_eee_adv_to_ethtool_adv_t(u16 eee_adv)
  * to EEE advertisements for the MMD EEE Advertisement (7.60) and
  * MMD EEE Link Partner Ability (7.61) registers.
  */
-static inline u16 ethtool_adv_to_mmd_eee_adv_t(u32 adv)
+static inline u16 ethtool_adv_to_mmd_eee_adv_t(ethtool_link_mode_mask_t adv)
 {
 	u16 reg = 0;
 
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox