Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH] capabilities: Ambient capability set V1
From: Serge Hallyn @ 2015-02-23 18:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Serge E. Hallyn, Serge Hallyn, Andy Lutomirski, Aaron Jones,
	Ted Ts'o, linux-security-module, akpm, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-kernel, linux-api, Michael Kerrisk, Jonathan Corbet
In-Reply-To: <alpine.DEB.2.11.1502231049110.21926@gentwo.org>

Quoting Christoph Lameter (cl@linux.com):
> On Mon, 23 Feb 2015, Serge E. Hallyn wrote:
> 
> > > I do not see a problem with dropping privilege since the ambient set
> > > is supposed to be preserved across a drop of priviledge.
> >
> > Because you're tricking the program into thinking it has dropped
> > the privilege, when in fact it has not.
> 
> So the cap was dropped from the cap perm set but it is still active
> in the ambient set?

Right, and the legacy program doesn't know to check the new set.

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Christoph Lameter @ 2015-02-23 18:27 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Serge E. Hallyn, Serge Hallyn, Andy Lutomirski, Aaron Jones,
	Ted Ts'o, linux-security-module, akpm, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-kernel, linux-api, Michael Kerrisk, Jonathan Corbet
In-Reply-To: <20150223181553.GE25477@ubuntumail>

On Mon, 23 Feb 2015, Serge Hallyn wrote:

> > So the cap was dropped from the cap perm set but it is still active
> > in the ambient set?
>
> Right, and the legacy program doesn't know to check the new set.

Does it have to? The ambient set could not show up in permitted.


^ permalink raw reply

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Alexei Starovoitov @ 2015-02-23 18:55 UTC (permalink / raw)
  To: He Kuang
  Cc: Steven Rostedt, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
	Network Development, LKML, Linus Torvalds, Peter Zijlstra,
	Eric W. Biederman, wangnan0-hv44wF8Li93QT0dZR+AlfA

On Mon, Feb 16, 2015 at 6:26 AM, He Kuang <hekuang-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
> Hi, Alexei
>
> Another suggestion on bpf syscall interface. Currently, BPF +
> syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on
> commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this
> case, bpf bytecode cannot be loaded to the kernel.

I'm seeing a flurry of use cases for bpf in ovs, tc, tracing, etc
When it's all ready, we can turn that config on by default.

> If we turn the functionality of BPF_SYSCALL into a loadable module, then
> we can use it without any dependencies on the kernel. What about change
> bpf syscall to a /dev node or /sys file which can be exported by a
> kernel module?

I don't think we will allow extending bpf by modules.
'bpf in modules' is an interface that is too easy to abuse.
So all of bpf core, helper functions and program types will be builtin.

As far as bpf+tracing the plan is to do bpf+kprobe and bpf+syscalls
first. Then add right set of helpers to make sure that use cases
like 'tcp stack instrumentation' are fully addressed.
Then there were few great ideas of accelerating kprobes
with trace markers and debug tracepoints that we can do later.

^ permalink raw reply

* [GIT PULL RESEND] kselftest-3.20-rc1 (4.0-rc1)
From: Shuah Khan @ 2015-02-23 19:23 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, Shuah Khan, linux-api

Hi Linus,

I sent the following pull request in time for rc-1. I didn't see
it included in 4.0-rc1. Wondering what happened? Resending just
in case it got lost.

thanks,
-- Shuah

The following changes since commit 97bf6af1f928216fd6c5a66e8a57bfa95a659672:

  Linux 3.19-rc1 (2014-12-20 17:08:50 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-3.20-rc1

for you to fetch changes up to 6ddf898c23d62c974e148efd9e509731324a167a:

  selftests/exec: Check if the syscall exists and bail if not
(2015-02-04 10:17:35 -0700)

----------------------------------------------------------------
Kselftest updates for 3.20-rc1

This update adds:
- Kselftest install target feature
- Fix for selftests/exec test

----------------------------------------------------------------
Michael Ellerman (1):
      selftests/exec: Check if the syscall exists and bail if not

Shuah Khan (20):
      selftests/breakpoints: add install target to enable test install
      selftests/cpu-hotplug: add install target to enable test install
      selftests/efivarfs: add install target to enable test install
      selftests/firmware: add install target to enable test install
      selftests/ftrace: add install target to enable test install
      selftests/ipc: add install target to enable test install
      selftests/kcmp: add install target to enable test install
      selftests/memfd: add install target to enable test install
      selftests/memory-hotplug: add install target to enable test install
      selftests/mount: add install target to enable test install
      selftests/mqueue: add install target to enable test install
      selftests/net: add install target to enable test install
      selftests/ptrace: add install target to enable test install
      selftests/size: add install target to enable test install
      selftests/sysctl: add install target to enable test install
      selftests/timers: add install target to enable test install
      selftests/user: add install target to enable test install
      selftests/vm: add install target to enable test install
      selftests: add install target to enable test install
      kbuild: add a new kselftest_install make target to install selftests

 Makefile                                           | 14 +++++-
 tools/testing/selftests/Makefile                   | 54
+++++++++++++++++++++-
 tools/testing/selftests/breakpoints/Makefile       | 19 +++++++-
 tools/testing/selftests/cpu-hotplug/Makefile       | 14 +++++-
 .../{on-off-test.sh => cpu-on-off-test.sh}         |  0
 tools/testing/selftests/efivarfs/Makefile          | 16 ++++++-
 tools/testing/selftests/exec/execveat.c            | 10 +++-
 tools/testing/selftests/firmware/Makefile          | 43 ++++++++++-------
 tools/testing/selftests/ftrace/Makefile            | 13 +++++-
 tools/testing/selftests/ipc/Makefile               | 19 +++++++-
 tools/testing/selftests/kcmp/Makefile              | 13 +++++-
 tools/testing/selftests/memfd/Makefile             | 17 +++++--
 tools/testing/selftests/memory-hotplug/Makefile    | 14 +++++-
 .../{on-off-test.sh => mem-on-off-test.sh}         |  0
 tools/testing/selftests/mount/Makefile             | 12 ++++-
 tools/testing/selftests/mqueue/Makefile            | 18 ++++++--
 tools/testing/selftests/net/Makefile               | 20 ++++++--
 tools/testing/selftests/ptrace/Makefile            | 16 +++++--
 tools/testing/selftests/size/Makefile              | 12 ++++-
 tools/testing/selftests/sysctl/Makefile            | 17 ++++++-
 tools/testing/selftests/timers/Makefile            | 12 ++++-
 tools/testing/selftests/user/Makefile              | 12 ++++-
 tools/testing/selftests/vm/Makefile                | 11 ++++-
 23 files changed, 326 insertions(+), 50 deletions(-)
 rename tools/testing/selftests/cpu-hotplug/{on-off-test.sh =>
cpu-on-off-test.sh} (100%)
 rename tools/testing/selftests/memory-hotplug/{on-off-test.sh =>
mem-on-off-test.sh} (100%)

-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* Re: [PATCH] fork: report pid reservation failure properly
From: Michal Hocko @ 2015-02-23 20:17 UTC (permalink / raw)
  To: linux-api
  Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Michael Kerrisk,
	LKML
In-Reply-To: <1423046628-23638-1-git-send-email-mhocko@suse.cz>

ping on this one? Should I just resend (your way Andrew)? Or are there
any objections to the patch as is.

On Wed 04-02-15 11:43:48, Michal Hocko wrote:
> copy_process will report any failure in alloc_pid as ENOMEM currently
> which is misleading because the pid allocation might fail not only when
> the memory is short but also when the pid space is consumed already.
> 
> The current man page even mentions this case:
> "
> EAGAIN
> 
>       A system-imposed limit on the number of threads was encountered.
>       There are a number of limits that may trigger this error: the
>       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
>       limits the number of processes and threads for a real user ID, was
>       reached; the kernel's system-wide limit on the number of processes
>       and threads, /proc/sys/kernel/threads-max, was reached (see
>       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
>       was reached (see proc(5)).
> "
> 
> so the current behavior is also incorrect wrt. documentation. POSIX man
> page also suggest returing EAGAIN when the process count limit is
> reached.
> 
> This patch simply propagates error code from alloc_pid and makes sure we
> return -EAGAIN due to reservation failure. This will make behavior of
> fork closer to both our documentation and POSIX.
> 
> alloc_pid might alsoo fail when the reaper in the pid namespace is dead
> (the namespace basically disallows all new processes) and there is no
> good error code which would match documented ones. We have traditionally
> returned ENOMEM for this case which is misleading as well but as per
> Eric W. Biederman this behavior is documented in man pid_namespaces(7)
> "
> If the "init" process of a PID namespace terminates, the kernel
> terminates all of the processes in the namespace via a SIGKILL signal.
> This behavior reflects the fact that the "init" process is essential for
> the correct operation of a PID namespace.  In this case, a subsequent
> fork(2) into this PID namespace will fail with the error ENOMEM; it is
> not possible to create a new processes in a PID namespace whose "init"
> process has terminated.
> "
> and introducing a new error code would be too risky so let's stick to
> ENOMEM for this case.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  kernel/fork.c |  5 +++--
>  kernel/pid.c  | 15 ++++++++-------
>  2 files changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3ce8d57cff09..c37b88a162c5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		goto bad_fork_cleanup_io;
>  
>  	if (pid != &init_struct_pid) {
> -		retval = -ENOMEM;
>  		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> -		if (!pid)
> +		if (IS_ERR(pid)) {
> +			retval = PTR_ERR(pid);
>  			goto bad_fork_cleanup_io;
> +		}
>  	}
>  
>  	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 82430c858d69..bbb805ccd893 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -179,7 +179,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			spin_unlock_irq(&pidmap_lock);
>  			kfree(page);
>  			if (unlikely(!map->page))
> -				break;
> +				return -ENOMEM;
>  		}
>  		if (likely(atomic_read(&map->nr_free))) {
>  			for ( ; ; ) {
> @@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  		}
>  		pid = mk_pid(pid_ns, map, offset);
>  	}
> -	return -1;
> +	return -EAGAIN;
>  }
>  
>  int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
> @@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	int i, nr;
>  	struct pid_namespace *tmp;
>  	struct upid *upid;
> +	int retval = -ENOMEM;
>  
>  	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
>  	if (!pid)
> -		goto out;
> +		return ERR_PTR(retval);
>  
>  	tmp = ns;
>  	pid->level = ns->level;
>  	for (i = ns->level; i >= 0; i--) {
>  		nr = alloc_pidmap(tmp);
> -		if (nr < 0)
> +		if (IS_ERR_VALUE(nr)) {
> +			retval = nr;
>  			goto out_free;
> +		}
>  
>  		pid->numbers[i].nr = nr;
>  		pid->numbers[i].ns = tmp;
> @@ -336,7 +339,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	}
>  	spin_unlock_irq(&pidmap_lock);
>  
> -out:
>  	return pid;
>  
>  out_unlock:
> @@ -348,8 +350,7 @@ out_free:
>  		free_pidmap(pid->numbers + i);
>  
>  	kmem_cache_free(ns->pid_cachep, pid);
> -	pid = NULL;
> -	goto out;
> +	return ERR_PTR(retval);
>  }
>  
>  void disable_pid_allocation(struct pid_namespace *ns)
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 1/3] media: Fix ALSA and DVB representation at media controller API
From: Laurent Pinchart @ 2015-02-23 21:20 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Devin Heitmueller, Hans Verkuil, Linux Media Mailing List,
	Mauro Carvalho Chehab, Hans Verkuil, Sakari Ailus,
	Antti Palosaari, Ricardo Ribalda, Marek Szyprowski,
	Ramakrishnan Muthukrishnan, linux-api
In-Reply-To: <20150223105508.54c730fc@recife.lan>

Hi Mauro,

On Monday 23 February 2015 10:55:08 Mauro Carvalho Chehab wrote:
> Em Mon, 26 Jan 2015 09:41:41 -0500 Devin Heitmueller escreveu:
> >> It is actually trivial to get the device nodes once you have the
> >> major/minor. The media-ctl library does that for you. See:
> >
> > No objection then.
> > 
> > On a related note, you would be very well served to consider testing
> > your dvb changes with a device that has more than one DVB tuner (such
> > as the hvr-2200/2250).  That will help you shake out any edge cases
> > related to ensuring that the different DVB nodes appear in different
> > groups.
> 
> Hi Devin,
> 
> I did some tests (and fixes) for WinTV Nova-TD, with has two adapters.
> 
> I saw two alternatives for it:
> 
> 1) to create a media controller device for each adapter;
> 2) to create just one media controller.
> 
> I actually implemented (1), as, in the case of this device, AFAIKT, the
> two devices are indepentent, e. g. it is not possible to, for example,
> share the same tuner with two demods:
> 
> $ ls -la /dev/media?
> crw-rw----. 1 root video 249, 0 Fev 23 10:02 /dev/media0
> crw-rw----. 1 root video 249, 1 Fev 23 10:02 /dev/media1
> 
> The adapter 0 corresponds to /dev/media0, and the adapter 1
> to /dev/media1:
> 
> $ media-ctl --print-dot -d /dev/media0
> digraph board {
> 	rankdir=TB
> 	n00000001 [label="dvb-demux\n/dev/dvb/adapter0/demux0", shape=box,
> style=filled, fillcolor=yellow] n00000001 -> n00000002
> 	n00000002 [label="dvb-dvr\n/dev/dvb/adapter0/dvr0", shape=box,
> style=filled, fillcolor=yellow] n00000003
> [label="dvb-net\n/dev/dvb/adapter0/net0", shape=box, style=filled,
> fillcolor=yellow] n00000004 [label="DiBcom
> 7000PC\n/dev/dvb/adapter0/frontend0", shape=box, style=filled,
> fillcolor=yellow] n00000004 -> n00000001
> }
> 
> $ media-ctl --print-dot -d /dev/media1
> digraph board {
> 	rankdir=TB
> 	n00000001 [label="dvb-demux\n/dev/dvb/adapter1/demux0", shape=box,
> style=filled, fillcolor=yellow] n00000001 -> n00000002
> 	n00000002 [label="dvb-dvr\n/dev/dvb/adapter1/dvr0", shape=box,
> style=filled, fillcolor=yellow] n00000003
> [label="dvb-net\n/dev/dvb/adapter1/net0", shape=box, style=filled,
> fillcolor=yellow] n00000004 [label="DiBcom
> 7000PC\n/dev/dvb/adapter1/frontend0", shape=box, style=filled,
> fillcolor=yellow] n00000004 -> n00000001
> }
> 
> On a more complex hardware where some components may be rewired
> on a different way, however, using just one media controller
> would be a better approach.
> 
> Comments?

A few, yes :-)

There's surprisingly many "details" that are not fully specified in MC, and 
this is one of them. Historically the design idea was to create one media 
device per instance of master video device. For PCI devices that's roughly one 
media device per card, for platform devices one media device per master IP 
core (or group of IP cores) instance, and for USB devices one media device per 
USB control interface.

We can depart from that model in two ways.

As you mentioned above, it could make sense to create separate media device 
instances for a single master device (USB in this case) when the master device 
contains several completely independent pipelines. Completely independent 
means that no data link connects the two pipelines, and that no resource is 
shared between them in a way that affects the device's behaviour.

In the example above the only shared hardware resource seems to be the USB 
bridge chip, which implements two independent DMA channels through the same 
USB interface. If the DMA channels are really independent, in the sense that 
they have no influence on each other such as e.g. bandwidth sharing, then two 
media devices could be created. Note that there's no requirement to do so, 
creating a single media device in this case is still perfectly valid.

It should also be noted that split pipelines could be difficult to represent 
as independent media devices when an external chip is shared between the two 
pipelines, even if that chip is itself split in two independent parts. This is 
caused by how the kernel APIs used to manage composite devices (v4l2-async and 
the component framework) handle components. For instance, in the DT case, we 
can't use v4l2-async to bind to one of the subdevs create for a single I2C 
chip or IP core, as they would share the same hardware identifier (device 
name, DT node, ...) used to match subdevs. Arguably that's a framework issue 
that should be fixed, but it wouldn't be trivial.

The second way to depart from the existing model is unrelated to the DVB 
examples above, but is still worth mentioning for completeness. When several 
master devices share resources (either on-chip or off-chip), a single media 
device is required. I've discussed such a case with Jean-Michel Hautbois and 
Hans Verkuil at the FOSDEM for the i.MX6 SoC. The chip includes two 
independent IPUs (Image Processing Units), with two independent parallel 
receivers but one shared CSI-2 receiver. On the system Jean-Michel is working 
with, an external FPGA is connected to the two parallel receivers. The 
resulting media pipeline thus includes the FPGA, the CSI-2 receiver and the 
two IPUs, requiring a single media device to cover both IPUs. As the IPU 
driver is the master driver in this case, it would normally create one media 
device per hardware device instance.

A similar situation occurs when video capture devices, memory to memory 
devices and/or display devices are connected together through deep pipelining. 
Such an example can be found in the Renesas R-Car Gen2 SoCs, where a memory to 
memory video processing device (VSP1) has one output connected directly to the 
display engine (DU). The VSP1 and DU are otherwise completely separate IP 
cores, with the former supported by a V4L2 driver and the latter by a DRM/KMS 
driver. We also need a single media device to cover the whole graph, without 
intrusive changes in either the VSP1 or the DU driver as the two devices can 
operate completely independently and without both being present.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* Re: [PATCH 1/3] media: Fix ALSA and DVB representation at media controller API
From: Laurent Pinchart @ 2015-02-23 22:58 UTC (permalink / raw)
  To: Hans Verkuil
  Cc: Mauro Carvalho Chehab, Linux Media Mailing List,
	Mauro Carvalho Chehab, Hans Verkuil, Sakari Ailus,
	Antti Palosaari, Ricardo Ribalda, Marek Szyprowski,
	Ramakrishnan Muthukrishnan, linux-api
In-Reply-To: <54C64421.1030302@xs4all.nl>

Hi Mauro and Hans,

On Monday 26 January 2015 14:41:53 Hans Verkuil wrote:
> On 01/26/2015 02:34 PM, Mauro Carvalho Chehab wrote:
> > Em Mon, 26 Jan 2015 14:11:50 +0100 Hans Verkuil escreveu:
> >> On 01/26/2015 01:47 PM, Mauro Carvalho Chehab wrote:
> >>> The previous provision for DVB media controller support were to
> >>> define an ID (likely meaning the adapter number) for the DVB
> >>> devnodes.
> >>> 
> >>> This is just plain wrong. Just like V4L, DVB devices (and ALSA,
> >>> or whatever) are identified via a (major, minor) tuple.
> >>> 
> >>> This is enough to uniquely identify a devnode, no matter what
> >>> API it implements.
> >>> 
> >>> So, before we go too far, let's mark the old v4l, dvb and alsa
> >>> "devnode" info as deprecated, and just call it as "dev".
> >>> 
> >>> As we don't want to break compilation on already existing apps,
> >>> let's just keep the old definitions as-is, adding a note that
> >>> those are deprecated at media-entity.h.
> >>> 
> >>> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

[snip]

> >>> diff --git a/include/media/media-entity.h b/include/media/media-entity.h
> >>> index e00459185d20..d6d74bcfe183 100644
> >>> --- a/include/media/media-entity.h
> >>> +++ b/include/media/media-entity.h
> >>> @@ -87,17 +87,7 @@ struct media_entity {
> >>>  		struct {
> >>>  			u32 major;
> >>>  			u32 minor;
> >>> -		} v4l;
> >>> -		struct {
> >>> -			u32 major;
> >>> -			u32 minor;
> >>> -		} fb;
> >>> -		struct {
> >>> -			u32 card;
> >>> -			u32 device;
> >>> -			u32 subdevice;
> >>> -		} alsa;
> >> 
> >> I don't think the alsa entity information can be replaced by major/minor.
> >> In particular you will loose the subdevice information which you need as
> >> well. In addition, alsa devices are almost never referenced via major and
> >> minor numbers, but always by card/device/subdevice numbers.
> > 
> > For media-ctl, it is easier to handle major/minor, in order to identify
> > the associated devnode name. Btw, media-ctl currently assumes that all
> > devnode devices are specified by v4l.major/v4l.minor.
> > 
> > Ok, maybe for alsa we'll need also card/device/subdevice, but I think this
> > should be mapped elsewhere, if this can't be retrieved via its sysfs/udev
> > interface (with seems to be doubtful).
> 
> The card/device tuple can likely be mapped to major/minor, but not
> subdevice. And since everything inside alsa is based on card/device I
> wouldn't change that.

I think we'll likely need changes for ALSA, but I don't think we know which 
ones yet. For that reason I'd avoid creating any ALSA-specific API for now 
until we implement proper support for ALSA devices. I'm fine, however, with 
deprecating the ALSA API we have now, before it gets (ab)used.

> >>> -		int dvb;
> >>> +		} dev;
> >>> 
> >>>  		/* Sub-device specifications */
> >>>  		/* Nothing needed yet */
> >>> diff --git a/include/uapi/linux/media.h b/include/uapi/linux/media.h
> >>> index d847c760e8f0..418f4fec391a 100644
> >>> --- a/include/uapi/linux/media.h
> >>> +++ b/include/uapi/linux/media.h
> >>> @@ -78,6 +78,20 @@ struct media_entity_desc {
> >>>  		struct {
> >>>  			__u32 major;
> >>>  			__u32 minor;
> >>> +		} dev;
> >>> +
> >>> +#if 1
> >>> +		/*
> >>> +		 * DEPRECATED: previous node specifications. Kept just to
> >>> +		 * avoid breaking compilation, but media_entity_desc.dev
> >>> +		 * should be used instead. In particular, alsa and dvb
> >>> +		 * fields below are wrong: for all devnodes, there should
> >>> +		 * be just major/minor inside the struct, as this is enough
> >>> +		 * to represent any devnode, no matter what type.
> >>> +		 */
> >>> +		struct {
> >>> +			__u32 major;
> >>> +			__u32 minor;
> >>>  		} v4l;
> >>>  		struct {
> >>>  			__u32 major;
> >>> @@ -89,6 +103,7 @@ struct media_entity_desc {
> >>>  			__u32 subdevice;
> >>>  		} alsa;
> >>>  		int dvb;
> >> 
> >> I wouldn't merge all the v4l/fb/etc. structs into one struct. That will
> >> make it difficult in the future if you need to add a field for e.g. v4l
> >> entities.

This is something Hans and I have discussed at the FOSDEM, and ended up 
disagreeing.

I like the single struct dev here, but I believe it should be moved before the 
union, or, as Hans proposed, that the reserved bytes should be moved after the 
union. However, the reason why I like your proposal has probably nothing to do 
with your intents.

To start directly with the big news, I believe the MC device node entity types 
are bogus. They should never have been created in the first place, and your 
proposal for DVB support in MC clearly shows it in my opinion.

The media controller model is supposed to describe the device hardware 
topology, or at least a logical view of that topology mapped to the operating 
system from a hardware point of view. Device nodes, on the other hand, are 
purely Linux concepts, and have nothing to do with the hardware topology. What 
we have tried to describe with the MEDIA_ENT_T_DEVNODE_V4L entity type is 
essentially a DMA engine. As DMA engines are mapped 1:1 to device nodes in 
V4L2, we have mixed the two concepts and have called those entities device 
nodes, while they should really have been called DMA engines.

One additional clue that made me realize this is that we now have device nodes 
for subdevs, which we don't report through the MC API. That's wrong, and 
should be fixed. We could simply consider that V4L2 subdevs, as they're V4L2 
entities, can use the media_entity_desc v4l union and report the device node 
major and minor there, but I think that would be a short-term solution.

Instead, I believe we should start describing our media entities based on the 
hardware point of view, and consider device nodes to be an entity property 
instead of an entity identity. In other words, a V4L2 DMA engine is not a 
device node by hardware nature, but has the property of exposing a device node 
to userspace. The same applies to other entities, such as V4L2 subdevices or 
DVB entities.

I've used the word property for a reason, as we've discussed numerous times in 
the past an MC API extension to report detailed entity properties. However, 
considering that exposing a device node is a very common property for 
entities, I believe it makes sense to report that information through 
media_entity_desc instead of requiring a separate, not implemented yet ioctl.

For those reasons, I believe that we should have a single struct dev in struct 
media_entity_desc used to report the "entity exposes a device node" property. 
This isn't linked to v4l, fb, dvb or alsa, and is independent from the entity 
type as such, so I wouldn't describe that property with subsystem-specific 
structures.

This doesn't preclude struct media_entity_desc from exposing subsystem-
specific information in v4l, fb, dvb or alsa structures, even though I 
currently believe that only core properties should be exposed through that 
structure, and subsystem-specific properties would likely not qualified, but 
that should be decided on a case-by-case basis. The struct dev, however, isn't 
a subsystem-specific property, and I would thus like it to be reported as 
such. A flag could also be used to report whether the entity has a device 
node, although that could be inferred from major:minor not being equal to 0:0.

With the above explanation hopefully clear, I believe that the following DVB 
entity types are wrong.

#define MEDIA_ENT_T_DEVNODE_DVB_FE      (MEDIA_ENT_T_DEVNODE + 4)
#define MEDIA_ENT_T_DEVNODE_DVB_DEMUX   (MEDIA_ENT_T_DEVNODE + 5)
#define MEDIA_ENT_T_DEVNODE_DVB_DVR     (MEDIA_ENT_T_DEVNODE + 6)
#define MEDIA_ENT_T_DEVNODE_DVB_CA      (MEDIA_ENT_T_DEVNODE + 7)
#define MEDIA_ENT_T_DEVNODE_DVB_NET     (MEDIA_ENT_T_DEVNODE + 8)

We should get rid of DEVNODE_ there.

Let's take http://linuxtv.org/downloads/presentations/cx231xx.ps as an 
example. If I understand it correctly, the frontend has a dedicated device 
node for control purposes (it has no DMA engine, no data can be transferred 
through that device node), while the demux and dvr have DMA engines. The demux 
can pass data to the dvr DMA engine, and also capture "raw" data directly. 
Please correct me if that's not correct.

The frontend should in my opinion be represented as a green box, not a yellow 
box, as it's a processing entity and not a DMA engine. It should, however, 
expose its device node property through struct dev.

The dvr, on the other side, is a DMA engine, for which a yellow box is thus 
correct. Its device node used for data transmission (and I suppose also some 
DMA-related configuration) should also be exposed through struct dev.

The demux is, if my understanding is correct, a hybrid beast, with both 
processing and DMA capabilities. I lack proper DVB knowledge here, but based 
on what I understand it might make sense to split it in two entities, one to 
handle the data processing capability, and the other one to handle the DMA 
engine. The processing entity could have one sink pad and two source pads, one 
connected to the dvr and the other one connected to the demux DMA engine 
entity.

This model departs from your proposal, but I think it would be a better base 
for future developments, for DVB but also other subsystems.

> > No. You could just create another union for the API-specific bits, using
> > the reserved bytes.
> > 
> >> So I would keep the v4l, fb and alsa structs, and just add a new struct
> >> for dvb. I wonder if the dvb field can't just be replaced since I doubt
> >> anyone is using it. And even if someone does, then it can't be right
> >> since a single int isn't enough and never worked anyway.
> > 
> > All devnodes have major/minor. Making it standard for all devices makes
> > easy for userspace to properly get the data it requires to work.
> 
> I think you are making assumptions here that may not be true. I don't see
> any reason to make a 'dev' struct here. The real problem is the dvb int, so
> that's what needs to be addressed. Changing anything else will cause API
> headaches for no good reason.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* Re: [RFC PATCH 1/3] eeprom: Add a simple EEPROM framework
From: Stephen Boyd @ 2015-02-23 23:11 UTC (permalink / raw)
  To: Rob Herring, Maxime Ripard
  Cc: Srinivas Kandagatla,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Rob Herring, Pawel Moll, Kumar Gala,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Arnd Bergmann,
	Mark Brown, Greg Kroah-Hartman
In-Reply-To: <CAL_JsqLqqw5DyUqCaOTSKPwF-Ms5EU4f_OPoW9YVpMCo8A_bbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 02/22/15 16:57, Rob Herring wrote:
> On Sun, Feb 22, 2015 at 8:32 AM, Maxime Ripard
> <maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org> wrote:
>> On Fri, Feb 20, 2015 at 04:01:55PM -0600, Rob Herring wrote:
>>> [...]
>>>
>>>>>> += Data consumers =
>>>>>> +
>>>>>> +Required properties:
>>>>>> +
>>>>>> +eeproms: List of phandle and data cell specifier triplet, one triplet
>>>>>> +        for each data cell the device might be interested in. The
>>>>>> +        triplet consists of the phandle to the eeprom provider, then
>>>>>> +        the offset in byte within that storage device, and the length
>>>>>> +        in byte of the data we care about.
>>>>>
>>>>> The problem with this is it assumes you know who the consumer is and
>>>>> that it is a DT node. For example, how would you describe a serial
>>>>> number?
>>>> Correct me if I miss understood.
>>>> Is serial number any different?
>>>> Am hoping that the eeprom consumer would be aware of offset and size of
>>>> serial number in the eeprom
>>>>
>>>> Cant the consumer do:
>>>>
>>>> eeprom-consumer {
>>>>         eeproms = <&at24 0 4>;
>>>>         eeprom-names = "device-serial-number";
>>> Yes, but who is "eeprom-consumer"?
>> Any device that should lookup values in one of the EEPROM.
>>
>>> DT nodes generally describe a h/w block, but it this case, the
>>> consumer depends on the OS, not the h/w.
>> Not really, or at least, not more than any similar binding we
>> currently have.
>>
>> The fact that a MAC-address for the first ethernet chip is stored at a
>> given offset in a given eeprom has nothing to do with the OS.
> So MAC address would be a valid use to link to a h/w device, but there
> are certainly cases that don't correspond to a device.
>
>>> I'm not saying you can't describe where things are, but I don't
>>> think you should imply who is the consumer and doing so is
>>> unnecessarily complicated.
>> If you only take the case of a serial number, indeed. If you take
>> other usage into account, you can't really do without it.
>>
>>> Also, the layout of EEPROM is likely very much platform specific.
>> Indeed, which is why it should be in the DT.
> Agreed. I'm not saying the layout should not be.
>
>>> Some could have a more complex structure perhaps with key ids and
>>> linked list structure.
>> Then just request the size of the whole list, and parse it afterwards
>> in your driver?
>>
>>> I would do something more simple that is just a list of keys and their
>>> location like this:
>>>
>>> device-serial-number = <start size>;
>>> key1 = <start size>;
>>> key2 = <start size>;
>> I'm sorry, but what's the difference?
> It can describe the layout completely whether the fields are tied to a
> h/w device or not.
>
> What I would like to see here is the entire layout described covering
> both types of fields.
>

I was thinking the DT might be like this on the provider side:

   qfprom@1000000 {
      reg = <0x1000000 0x1000>;
      ranges = <0 0x1000000 0x1000>;
      compatible = "qcom,qfprom-msm8960"

      pvs-data: pvs-data@40 {
            compatible = "qcom,pvs-a";
            reg = <0x40 0x20>,
	    #eeprom-cells = <0>;
      };

       tsens-data: tmdata@10 {
            compatible = "qcom,tsens-data-msm8960";
            reg = <0x10 4>, <0x16 4>;
	    #eeprom-cells = <0>;

      };

      serial-number: serial@50 {
            compatible = "qcom,serial-msm8960";
            reg = <0x50 4>, <0x60 4>;
	    #eeprom-cells = <0>;

      };
   };

and then on the consumer side:

	device {
		eeproms = <&serial-number>;
		eeprom-names = "soc-rev-id";
	};


This would solve a problem where the consumer device is some standard
off-the-shelf IP block that needs to get some SoC specific calibration
data from the eeprom. I may want to interpret the bits differently
depending on which eeprom is connected to my SoC. Sometimes that data
format may be the same across many variations of the SoC (e.g. the
qcom,pvs-a node) or it may be completely different for a given SoC (e.g.
qcom,serial-msm8960 node). I imagine for other SoCs out there it could
be different depending on which eeprom the board manufacturer decides to
wire onto their board and how they choose to program the data.

So this is where I think the eeprom-cells and offset + length starts to
fall apart. It forces us to make up a bunch of different compatible
strings for our consumer device just so that we can parse the eeprom
that we decided to use for some SoC/board specific data. Instead I'd
like to see some framework that expresses exactly which eeprom is on my
board and how to interpret the bits in a way that doesn't require me to
keep refining the compatible string for my generic IP block.

I worry that if we put all those details in DT we'll end up having to
describe individual bits like serial-number-bit-2, serial-number-bit-3,
etc. because sometimes these pieces of data are scattered all around the
eeprom and aren't contiguous or aligned on a byte boundary. It may be
easier to just have a way to express that this is an eeprom with this
specific layout and my device has data stored in there. Then the driver
can be told what layout it is (via compatible or some other string based
means if we're not using DT?) and match that up with some driver data if
it needs to know how to understand the bits it can read with the
eeprom_read() API.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Andy Lutomirski @ 2015-02-23 23:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jarkko Sakkinen, Ted Ts'o, linux-kernel@vger.kernel.org,
	Andrew G. Morgan, Andrew Morton, LSM List, Michael Kerrisk,
	Linux API, Mimi Zohar, Austin S Hemmelgarn, Aaron Jones,
	Serge Hallyn, Markku Savela, Jonathan Corbet
In-Reply-To: <alpine.DEB.2.11.1502231040270.21784@gentwo.org>

On Feb 23, 2015 8:41 AM, "Christoph Lameter" <cl@linux.com> wrote:
>
> On Mon, 23 Feb 2015, Andy Lutomirski wrote:
>
> > If you set ambient caps and then run a setuid program (without
> > no_new_privs), then the ambient set *must* be cleared by the kernel
> > because that's what the setuid program expects.  Yes, the whole
>
> Why would a setuid program expect that? I'd say we expect the ambient set
> to remain in effect. What would break if the ambient set would stay
> active?
>

On a total guess: exim, sendmail, sudo, Apache suexec, etc.  Basically
anything that expects setresuid(nonzero values); execve to drop caps.

I haven't checked how many of the examples above actually do this.

--Andy

^ permalink raw reply

* Re: [PATCH 1/3] media: Fix ALSA and DVB representation at media controller API
From: Mauro Carvalho Chehab @ 2015-02-24  2:51 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Devin Heitmueller, Hans Verkuil, Linux Media Mailing List,
	Mauro Carvalho Chehab, Hans Verkuil, Sakari Ailus,
	Antti Palosaari, Ricardo Ribalda, Marek Szyprowski,
	Ramakrishnan Muthukrishnan, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <2073043.Yp0nzaPjZq@avalon>

Em Mon, 23 Feb 2015 23:20:15 +0200
Laurent Pinchart <laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org> escreveu:

> Hi Mauro,
> 
> On Monday 23 February 2015 10:55:08 Mauro Carvalho Chehab wrote:
> > Em Mon, 26 Jan 2015 09:41:41 -0500 Devin Heitmueller escreveu:
> > >> It is actually trivial to get the device nodes once you have the
> > >> major/minor. The media-ctl library does that for you. See:
> > >
> > > No objection then.
> > > 
> > > On a related note, you would be very well served to consider testing
> > > your dvb changes with a device that has more than one DVB tuner (such
> > > as the hvr-2200/2250).  That will help you shake out any edge cases
> > > related to ensuring that the different DVB nodes appear in different
> > > groups.
> > 
> > Hi Devin,
> > 
> > I did some tests (and fixes) for WinTV Nova-TD, with has two adapters.
> > 
> > I saw two alternatives for it:
> > 
> > 1) to create a media controller device for each adapter;
> > 2) to create just one media controller.
> > 
> > I actually implemented (1), as, in the case of this device, AFAIKT, the
> > two devices are indepentent, e. g. it is not possible to, for example,
> > share the same tuner with two demods:
> > 
> > $ ls -la /dev/media?
> > crw-rw----. 1 root video 249, 0 Fev 23 10:02 /dev/media0
> > crw-rw----. 1 root video 249, 1 Fev 23 10:02 /dev/media1
> > 
> > The adapter 0 corresponds to /dev/media0, and the adapter 1
> > to /dev/media1:
> > 
> > $ media-ctl --print-dot -d /dev/media0
> > digraph board {
> > 	rankdir=TB
> > 	n00000001 [label="dvb-demux\n/dev/dvb/adapter0/demux0", shape=box,
> > style=filled, fillcolor=yellow] n00000001 -> n00000002
> > 	n00000002 [label="dvb-dvr\n/dev/dvb/adapter0/dvr0", shape=box,
> > style=filled, fillcolor=yellow] n00000003
> > [label="dvb-net\n/dev/dvb/adapter0/net0", shape=box, style=filled,
> > fillcolor=yellow] n00000004 [label="DiBcom
> > 7000PC\n/dev/dvb/adapter0/frontend0", shape=box, style=filled,
> > fillcolor=yellow] n00000004 -> n00000001
> > }
> > 
> > $ media-ctl --print-dot -d /dev/media1
> > digraph board {
> > 	rankdir=TB
> > 	n00000001 [label="dvb-demux\n/dev/dvb/adapter1/demux0", shape=box,
> > style=filled, fillcolor=yellow] n00000001 -> n00000002
> > 	n00000002 [label="dvb-dvr\n/dev/dvb/adapter1/dvr0", shape=box,
> > style=filled, fillcolor=yellow] n00000003
> > [label="dvb-net\n/dev/dvb/adapter1/net0", shape=box, style=filled,
> > fillcolor=yellow] n00000004 [label="DiBcom
> > 7000PC\n/dev/dvb/adapter1/frontend0", shape=box, style=filled,
> > fillcolor=yellow] n00000004 -> n00000001
> > }
> > 
> > On a more complex hardware where some components may be rewired
> > on a different way, however, using just one media controller
> > would be a better approach.
> > 
> > Comments?
> 
> A few, yes :-)
> 
> There's surprisingly many "details" that are not fully specified in MC, and 
> this is one of them. Historically the design idea was to create one media 
> device per instance of master video device. For PCI devices that's roughly one 
> media device per card, for platform devices one media device per master IP 
> core (or group of IP cores) instance, and for USB devices one media device per 
> USB control interface.
> 
> We can depart from that model in two ways.
> 
> As you mentioned above, it could make sense to create separate media device 
> instances for a single master device (USB in this case) when the master device 
> contains several completely independent pipelines. Completely independent 
> means that no data link connects the two pipelines, and that no resource is 
> shared between them in a way that affects the device's behaviour.
> 
> In the example above the only shared hardware resource seems to be the USB 
> bridge chip, which implements two independent DMA channels through the same 
> USB interface. If the DMA channels are really independent, in the sense that 
> they have no influence on each other such as e.g. bandwidth sharing, then two 
> media devices could be created. Note that there's no requirement to do so, 
> creating a single media device in this case is still perfectly valid.

Ok, so both ways can be applied on this case. I'll think a little bit more
about it.

> It should also be noted that split pipelines could be difficult to represent 
> as independent media devices when an external chip is shared between the two 
> pipelines, even if that chip is itself split in two independent parts. This is 
> caused by how the kernel APIs used to manage composite devices (v4l2-async and 
> the component framework) handle components. For instance, in the DT case, we 
> can't use v4l2-async to bind to one of the subdevs create for a single I2C 
> chip or IP core, as they would share the same hardware identifier (device 
> name, DT node, ...) used to match subdevs. Arguably that's a framework issue 
> that should be fixed, but it wouldn't be trivial.
> 
> The second way to depart from the existing model is unrelated to the DVB 
> examples above, but is still worth mentioning for completeness. When several 
> master devices share resources (either on-chip or off-chip), a single media 
> device is required. I've discussed such a case with Jean-Michel Hautbois and 
> Hans Verkuil at the FOSDEM for the i.MX6 SoC. The chip includes two 
> independent IPUs (Image Processing Units), with two independent parallel 
> receivers but one shared CSI-2 receiver. On the system Jean-Michel is working 
> with, an external FPGA is connected to the two parallel receivers. The 
> resulting media pipeline thus includes the FPGA, the CSI-2 receiver and the 
> two IPUs, requiring a single media device to cover both IPUs. As the IPU 
> driver is the master driver in this case, it would normally create one media 
> device per hardware device instance.
> 
> A similar situation occurs when video capture devices, memory to memory 
> devices and/or display devices are connected together through deep pipelining. 
> Such an example can be found in the Renesas R-Car Gen2 SoCs, where a memory to 
> memory video processing device (VSP1) has one output connected directly to the 
> display engine (DU). The VSP1 and DU are otherwise completely separate IP 
> cores, with the former supported by a V4L2 driver and the latter by a DRM/KMS 
> driver. We also need a single media device to cover the whole graph, without 
> intrusive changes in either the VSP1 or the DU driver as the two devices can 
> operate completely independently and without both being present.

We have the same case on some DVB hardware, used on TV sets and Set Top Boxes,
but the ones currently mapped provide an independent graph for DVB than the one
for DRM/KMS.

This is something that we could discuss further during the Media Summit.

Regards,
Mauro

^ permalink raw reply

* Re: [PATCH 1/3] media: Fix ALSA and DVB representation at media controller API
From: Mauro Carvalho Chehab @ 2015-02-24  3:51 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Hans Verkuil, Linux Media Mailing List, Mauro Carvalho Chehab,
	Hans Verkuil, Sakari Ailus, Antti Palosaari, Ricardo Ribalda,
	Marek Szyprowski, Ramakrishnan Muthukrishnan, linux-api
In-Reply-To: <2131864.KjrDFSJfqE@avalon>

Em Tue, 24 Feb 2015 00:58:23 +0200
Laurent Pinchart <laurent.pinchart@ideasonboard.com> escreveu:

> Hi Mauro and Hans,
> 
> On Monday 26 January 2015 14:41:53 Hans Verkuil wrote:
> > On 01/26/2015 02:34 PM, Mauro Carvalho Chehab wrote:
> > > Em Mon, 26 Jan 2015 14:11:50 +0100 Hans Verkuil escreveu:
> > >> On 01/26/2015 01:47 PM, Mauro Carvalho Chehab wrote:
> > >>> The previous provision for DVB media controller support were to
> > >>> define an ID (likely meaning the adapter number) for the DVB
> > >>> devnodes.
> > >>> 
> > >>> This is just plain wrong. Just like V4L, DVB devices (and ALSA,
> > >>> or whatever) are identified via a (major, minor) tuple.
> > >>> 
> > >>> This is enough to uniquely identify a devnode, no matter what
> > >>> API it implements.
> > >>> 
> > >>> So, before we go too far, let's mark the old v4l, dvb and alsa
> > >>> "devnode" info as deprecated, and just call it as "dev".
> > >>> 
> > >>> As we don't want to break compilation on already existing apps,
> > >>> let's just keep the old definitions as-is, adding a note that
> > >>> those are deprecated at media-entity.h.
> > >>> 
> > >>> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
> 
> [snip]
> 
> > >>> diff --git a/include/media/media-entity.h b/include/media/media-entity.h
> > >>> index e00459185d20..d6d74bcfe183 100644
> > >>> --- a/include/media/media-entity.h
> > >>> +++ b/include/media/media-entity.h
> > >>> @@ -87,17 +87,7 @@ struct media_entity {
> > >>>  		struct {
> > >>>  			u32 major;
> > >>>  			u32 minor;
> > >>> -		} v4l;
> > >>> -		struct {
> > >>> -			u32 major;
> > >>> -			u32 minor;
> > >>> -		} fb;
> > >>> -		struct {
> > >>> -			u32 card;
> > >>> -			u32 device;
> > >>> -			u32 subdevice;
> > >>> -		} alsa;
> > >> 
> > >> I don't think the alsa entity information can be replaced by major/minor.
> > >> In particular you will loose the subdevice information which you need as
> > >> well. In addition, alsa devices are almost never referenced via major and
> > >> minor numbers, but always by card/device/subdevice numbers.
> > > 
> > > For media-ctl, it is easier to handle major/minor, in order to identify
> > > the associated devnode name. Btw, media-ctl currently assumes that all
> > > devnode devices are specified by v4l.major/v4l.minor.
> > > 
> > > Ok, maybe for alsa we'll need also card/device/subdevice, but I think this
> > > should be mapped elsewhere, if this can't be retrieved via its sysfs/udev
> > > interface (with seems to be doubtful).
> > 
> > The card/device tuple can likely be mapped to major/minor, but not
> > subdevice. And since everything inside alsa is based on card/device I
> > wouldn't change that.
> 
> I think we'll likely need changes for ALSA, but I don't think we know which 
> ones yet. For that reason I'd avoid creating any ALSA-specific API for now 
> until we implement proper support for ALSA devices. I'm fine, however, with 
> deprecating the ALSA API we have now, before it gets (ab)used.

That's my opinion too: deprecate it before it gets (ab)used.

> 
> > >>> -		int dvb;
> > >>> +		} dev;
> > >>> 
> > >>>  		/* Sub-device specifications */
> > >>>  		/* Nothing needed yet */
> > >>> diff --git a/include/uapi/linux/media.h b/include/uapi/linux/media.h
> > >>> index d847c760e8f0..418f4fec391a 100644
> > >>> --- a/include/uapi/linux/media.h
> > >>> +++ b/include/uapi/linux/media.h
> > >>> @@ -78,6 +78,20 @@ struct media_entity_desc {
> > >>>  		struct {
> > >>>  			__u32 major;
> > >>>  			__u32 minor;
> > >>> +		} dev;
> > >>> +
> > >>> +#if 1
> > >>> +		/*
> > >>> +		 * DEPRECATED: previous node specifications. Kept just to
> > >>> +		 * avoid breaking compilation, but media_entity_desc.dev
> > >>> +		 * should be used instead. In particular, alsa and dvb
> > >>> +		 * fields below are wrong: for all devnodes, there should
> > >>> +		 * be just major/minor inside the struct, as this is enough
> > >>> +		 * to represent any devnode, no matter what type.
> > >>> +		 */
> > >>> +		struct {
> > >>> +			__u32 major;
> > >>> +			__u32 minor;
> > >>>  		} v4l;
> > >>>  		struct {
> > >>>  			__u32 major;
> > >>> @@ -89,6 +103,7 @@ struct media_entity_desc {
> > >>>  			__u32 subdevice;
> > >>>  		} alsa;
> > >>>  		int dvb;
> > >> 
> > >> I wouldn't merge all the v4l/fb/etc. structs into one struct. That will
> > >> make it difficult in the future if you need to add a field for e.g. v4l
> > >> entities.
> 
> This is something Hans and I have discussed at the FOSDEM, and ended up 
> disagreeing.
> 
> I like the single struct dev here, but I believe it should be moved before the 
> union, or, as Hans proposed, that the reserved bytes should be moved after the 
> union. However, the reason why I like your proposal has probably nothing to do 
> with your intents.

We can't move it before the union, as it would otherwise break userspace.
The union should contain both the new name and the legacy ones.

Moving the reserved fields out of the union seems fine for me.

> To start directly with the big news, I believe the MC device node entity types 
> are bogus. They should never have been created in the first place, and your 
> proposal for DVB support in MC clearly shows it in my opinion.
> 
> The media controller model is supposed to describe the device hardware 
> topology, or at least a logical view of that topology mapped to the operating 
> system from a hardware point of view. Device nodes, on the other hand, are 
> purely Linux concepts, and have nothing to do with the hardware topology.

That's somewhat true, but all device nodes are the points where the hardware
can be controlled. Any userspace software that needs to control the hardware
need some way to detect the device nodes, in order to use them

> What 
> we have tried to describe with the MEDIA_ENT_T_DEVNODE_V4L entity type is 
> essentially a DMA engine. As DMA engines are mapped 1:1 to device nodes in 
> V4L2, we have mixed the two concepts and have called those entities device 
> nodes, while they should really have been called DMA engines.

As I said, it is, in practice, more than just the DMA, even for V4L2. It
is the control point of the device to where an specific API is used to
control the hardware.

Btw, calling them as DMA is a simplification. On simple devices, yes
there's a DMA directly associated with the devnode, but on devices like
USB, the DMA is actually hidden, and a piece of software at the Kernel
filters out the payload from the USB headers.

So, a devnode is actually several blocks:

	[hw control] or, eventually, [kernel control]
	[DMA] ---> [payload filter] ---> [userspace stream I/O]

> One additional clue that made me realize this is that we now have device nodes 
> for subdevs, which we don't report through the MC API. That's wrong, and 
> should be fixed.

A subdev node is actually:
	[hw control] or, eventually, [kernel control]

I agree that they should be reported via the MC API.

> We could simply consider that V4L2 subdevs, as they're V4L2 
> entities, can use the media_entity_desc v4l union and report the device node 
> major and minor there, but I think that would be a short-term solution.

> Instead, I believe we should start describing our media entities based on the 
> hardware point of view, and consider device nodes to be an entity property 
> instead of an entity identity. In other words, a V4L2 DMA engine is not a 
> device node by hardware nature, but has the property of exposing a device node 
> to userspace. The same applies to other entities, such as V4L2 subdevices or 
> DVB entities.

Well, for me a devnode has always an API associated to it, as all devnodes
allow some set of ioctls to use to control it.

So, for me, each media entity that is a devnode should have a property or
identity that will tell what API is supported there. This is so important
for userspace that I would keep using "type" for that.

Ok, some devnodes have more than just [hw control].

> I've used the word property for a reason, as we've discussed numerous times in 
> the past an MC API extension to report detailed entity properties. However, 
> considering that exposing a device node is a very common property for 
> entities, I believe it makes sense to report that information through 
> media_entity_desc instead of requiring a separate, not implemented yet ioctl.

Agreed.

> For those reasons, I believe that we should have a single struct dev in struct 
> media_entity_desc used to report the "entity exposes a device node" property. 
> This isn't linked to v4l, fb, dvb or alsa, and is independent from the entity 
> type as such, so I wouldn't describe that property with subsystem-specific 
> structures.

Agreed.

> This doesn't preclude struct media_entity_desc from exposing subsystem-
> specific information in v4l, fb, dvb or alsa structures, even though I 
> currently believe that only core properties should be exposed through that 
> structure, and subsystem-specific properties would likely not qualified, but 
> that should be decided on a case-by-case basis. The struct dev, however, isn't 
> a subsystem-specific property, and I would thus like it to be reported as 
> such. A flag could also be used to report whether the entity has a device 
> node, although that could be inferred from major:minor not being equal to 0:0.

Fully agreed.

> With the above explanation hopefully clear, I believe that the following DVB 
> entity types are wrong.
> 
> #define MEDIA_ENT_T_DEVNODE_DVB_FE      (MEDIA_ENT_T_DEVNODE + 4)
> #define MEDIA_ENT_T_DEVNODE_DVB_DEMUX   (MEDIA_ENT_T_DEVNODE + 5)
> #define MEDIA_ENT_T_DEVNODE_DVB_DVR     (MEDIA_ENT_T_DEVNODE + 6)
> #define MEDIA_ENT_T_DEVNODE_DVB_CA      (MEDIA_ENT_T_DEVNODE + 7)
> #define MEDIA_ENT_T_DEVNODE_DVB_NET     (MEDIA_ENT_T_DEVNODE + 8)
> 
> We should get rid of DEVNODE_ there.

I'm fine with such change.

> Let's take http://linuxtv.org/downloads/presentations/cx231xx.ps as an 
> example. If I understand it correctly, the frontend has a dedicated device 
> node for control purposes (it has no DMA engine, no data can be transferred 
> through that device node), while the demux and dvr have DMA engines. The demux 
> can pass data to the dvr DMA engine, and also capture "raw" data directly. 
> Please correct me if that's not correct.

Well, this is actually a simplification. It works fine for now, but
we'll need to go deeper in some future to properly describe the DVB
pipeline.

The frontend is actually composed of two parts:
	- A tuner;
	- A digital TV demodulator.

(On some devices, it is impossible to distinguish between them,
as the tuner is controlled by the firmware inside the demod)

The demodulator has a serial or parallel bus that either sends
the data to a hardware demux or sends it to the Kernel via DMA.

In the first case, the diagram is:

	[tuner] ---> [demod] --> [hw demux] --> [DMA] --> Kernel --> userspace (either via dmx or dvr devnode)

In the second case, it is:

	[tuner] ---> [demod] --> [DMA] --> Kernel --> [sw demux in Kernel] --> userspace (either via dmx or dvr devnode)

The DMA is thus hidden in the middle of the pipeline.

For userspace, it doesn't matter if the hardware has or not a demux.
The logic is that the DVB Demux API is used to set a filter at the
demux.

The demux filter will select the packet IDs (PID) that userspace
wants from the MPEG-TS stream, and either dynamically wire the hardware
to send the selected program via DMA (for hardware demux) or it will
program the Kernel to filter it in kernelspace.

Btw, the above diagrams are for PC typical hardware. For Set Top Boxes
and TV sets, in order to meet DRM (Digital Rights Management)
requirements at the DVB specs, they may not have any DMA engine. In
such case, the demuxed video should be passed to a CA module (CAM), that
handles the decrypt, and then the output should go to the video adapter
for display.

So, the diagram would be something like:

	[tuner] ---> [demod] --> [hw demux] --> [CAM] --> [/dev/fb0]

In such case, all device nodes are just for hardware control.

To make things a little worse, it is not [hw demux] but, actually
[hw demux filter #0] ... [[hw demux filter #n].

A typical PC hardware with hw demux implements something like 32 filters.

A TV set hardware supports a way more filters and some can even have
dynamically created filters that can be shared between different DVB adapters.

> The frontend should in my opinion be represented as a green box, not a yellow 
> box, as it's a processing entity and not a DMA engine. It should, however, 
> expose its device node property through struct dev.

I see your point.

> The dvr, on the other side, is a DMA engine, for which a yellow box is thus 
> correct. Its device node used for data transmission (and I suppose also some 
> DMA-related configuration) should also be exposed through struct dev.

A dvr interface is always associated with the stream output to userspace.

> 
> The demux is, if my understanding is correct, a hybrid beast, with both 
> processing and DMA capabilities.

Yes. it is hw/sw control and can be stream output too.

> I lack proper DVB knowledge here, but based 
> on what I understand it might make sense to split it in two entities, one to 
> handle the data processing capability, and the other one to handle the DMA 
> engine. The processing entity could have one sink pad and two source pads, one 
> connected to the dvr and the other one connected to the demux DMA engine 
> entity.

Actually, the processing unity can have one sink pad with the TS input, and
a dynamically created number of PADs, one for each filter.

Please notice that a filter is associated with a file descriptor, and not
with a devnode.

So, from devnode perspective, we have:

	/dev/adapter0/frontend0 - to control the tuner/demod
	/dev/adapter0/demux0 - to create the filters
	/dev/adapter0/dvr0 - to receive the data
	/dev/adapter0/ca0 - to control data decrypt

but, from the data flux, we have:

	frontend0 -> demux0 input pad

	demux0 output pad filter 0 --> DMA --> dvr0 fd 5 
	demux0 output pad filter 1 --> DMA --> dvr0 fd 6
	demux0 output pad filter 2 --> DMA --> dvr0 fd 7
	demux0 output pad filter 3 --> ca0 --> DMA --> dvr0 fd 8
	demux0 output pad filter 4 --> /dev/fb0
	...

So, what we're really mapping right now is the hardware/software
control points associated with the device nodes. So, the way I see
is that we'll still need to represent at the media controller
the device nodes as such, as this is the point where the hardware
(and the Kernel) can be controlled.

The data flux can be simplified as we're doing right now, but we'll
need to add more features at the media controller (and to better
discuss) how to implement the DVB data flux.

In other words, for DVB, we'll need to represent:

- all devnodes used to control the DVB API (that's the goal of the
  current patchset);

- the demux processing unit with their filter outputs and the userspace
  file descriptors. I intend to discuss this further during the
  Media summit, and should be object of another set of patches.

> This model departs from your proposal, but I think it would be a better base 
> for future developments, for DVB but also other subsystems.
> 
> > > No. You could just create another union for the API-specific bits, using
> > > the reserved bytes.
> > > 
> > >> So I would keep the v4l, fb and alsa structs, and just add a new struct
> > >> for dvb. I wonder if the dvb field can't just be replaced since I doubt
> > >> anyone is using it. And even if someone does, then it can't be right
> > >> since a single int isn't enough and never worked anyway.
> > > 
> > > All devnodes have major/minor. Making it standard for all devices makes
> > > easy for userspace to properly get the data it requires to work.
> > 
> > I think you are making assumptions here that may not be true. I don't see
> > any reason to make a 'dev' struct here. The real problem is the dvb int, so
> > that's what needs to be addressed. Changing anything else will cause API
> > headaches for no good reason.
> 

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Serge E. Hallyn @ 2015-02-24  5:19 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Christoph Lameter, Serge E. Hallyn, Serge Hallyn, Andy Lutomirski,
	Aaron Jones, Ted Ts'o, linux-security-module, akpm,
	Andrew G. Morgan, Mimi Zohar, Austin S Hemmelgarn, Markku Savela,
	Jarkko Sakkinen, linux-kernel, linux-api, Michael Kerrisk,
	Jonathan Corbet
In-Reply-To: <20150223181553.GE25477@ubuntumail>

On Mon, Feb 23, 2015 at 06:15:53PM +0000, Serge Hallyn wrote:
> Quoting Christoph Lameter (cl@linux.com):
> > On Mon, 23 Feb 2015, Serge E. Hallyn wrote:
> > 
> > > > I do not see a problem with dropping privilege since the ambient set
> > > > is supposed to be preserved across a drop of priviledge.
> > >
> > > Because you're tricking the program into thinking it has dropped
> > > the privilege, when in fact it has not.
> > 
> > So the cap was dropped from the cap perm set but it is still active
> > in the ambient set?
> 
> Right, and the legacy program doesn't know to check the new set.

we've been assuming the ambient set must be like fP.  is there any
reason why it doesn't suffice for them to be or'ed with fI instead at
exec?  then the bits would need to ne in pI. this might sufice for
Christoph's use case, as pI will generally not change.  and for programs
that really care, they can check pI.

^ permalink raw reply

* Re: [RFC PATCH 1/3] eeprom: Add a simple EEPROM framework
From: Srinivas Kandagatla @ 2015-02-24  7:08 UTC (permalink / raw)
  To: Stephen Boyd, Rob Herring, Maxime Ripard
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Rob Herring, Pawel Moll, Kumar Gala,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Arnd Bergmann,
	Mark Brown, Greg Kroah-Hartman
In-Reply-To: <54EBB3AC.30000-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>

Thanks Stephen for the comments.

On 23/02/15 23:11, Stephen Boyd wrote:
> On 02/22/15 16:57, Rob Herring wrote:
>> On Sun, Feb 22, 2015 at 8:32 AM, Maxime Ripard
>> <maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org> wrote:
>>> On Fri, Feb 20, 2015 at 04:01:55PM -0600, Rob Herring wrote:
>>>> [...]
>>>>
>>>>>>> += Data consumers =
>>>>>>> +
>>>>>>> +Required properties:
>>>>>>> +
>>>>>>> +eeproms: List of phandle and data cell specifier triplet, one triplet
>>>>>>> +        for each data cell the device might be interested in. The
>>>>>>> +        triplet consists of the phandle to the eeprom provider, then
>>>>>>> +        the offset in byte within that storage device, and the length
>>>>>>> +        in byte of the data we care about.
>>>>>>
>>>>>> The problem with this is it assumes you know who the consumer is and
>>>>>> that it is a DT node. For example, how would you describe a serial
>>>>>> number?
>>>>> Correct me if I miss understood.
>>>>> Is serial number any different?
>>>>> Am hoping that the eeprom consumer would be aware of offset and size of
>>>>> serial number in the eeprom
>>>>>
>>>>> Cant the consumer do:
>>>>>
>>>>> eeprom-consumer {
>>>>>          eeproms = <&at24 0 4>;
>>>>>          eeprom-names = "device-serial-number";
>>>> Yes, but who is "eeprom-consumer"?
>>> Any device that should lookup values in one of the EEPROM.
>>>
>>>> DT nodes generally describe a h/w block, but it this case, the
>>>> consumer depends on the OS, not the h/w.
>>> Not really, or at least, not more than any similar binding we
>>> currently have.
>>>
>>> The fact that a MAC-address for the first ethernet chip is stored at a
>>> given offset in a given eeprom has nothing to do with the OS.
>> So MAC address would be a valid use to link to a h/w device, but there
>> are certainly cases that don't correspond to a device.
>>
>>>> I'm not saying you can't describe where things are, but I don't
>>>> think you should imply who is the consumer and doing so is
>>>> unnecessarily complicated.
>>> If you only take the case of a serial number, indeed. If you take
>>> other usage into account, you can't really do without it.
>>>
>>>> Also, the layout of EEPROM is likely very much platform specific.
>>> Indeed, which is why it should be in the DT.
>> Agreed. I'm not saying the layout should not be.
>>
>>>> Some could have a more complex structure perhaps with key ids and
>>>> linked list structure.
>>> Then just request the size of the whole list, and parse it afterwards
>>> in your driver?
>>>
>>>> I would do something more simple that is just a list of keys and their
>>>> location like this:
>>>>
>>>> device-serial-number = <start size>;
>>>> key1 = <start size>;
>>>> key2 = <start size>;
>>> I'm sorry, but what's the difference?
>> It can describe the layout completely whether the fields are tied to a
>> h/w device or not.
>>
>> What I would like to see here is the entire layout described covering
>> both types of fields.
>>
>
> I was thinking the DT might be like this on the provider side:
>
>     qfprom@1000000 {
>        reg = <0x1000000 0x1000>;
>        ranges = <0 0x1000000 0x1000>;
>        compatible = "qcom,qfprom-msm8960"
>
>        pvs-data: pvs-data@40 {
>              compatible = "qcom,pvs-a";

These compatibles could be optional. As it might not be required for 
devices which are simple and do not require any special interpretation 
of eeprom data.

>              reg = <0x40 0x20>,
> 	    #eeprom-cells = <0>;

Do we still need eeprom-cells if we are moving to use reg property here?

>        };
>
>         tsens-data: tmdata@10 {
>              compatible = "qcom,tsens-data-msm8960";
>              reg = <0x10 4>, <0x16 4>;
> 	    #eeprom-cells = <0>;
>
>        };
>
>        serial-number: serial@50 {
>              compatible = "qcom,serial-msm8960";
>              reg = <0x50 4>, <0x60 4>;
> 	    #eeprom-cells = <0>;
>
>        };
>     };
>
> and then on the consumer side:
>
> 	device {
> 		eeproms = <&serial-number>;
> 		eeprom-names = "soc-rev-id";
> 	};
>

Yes, this looks good, and Sasha was also recommending something on these 
lines too. Also this addresses the use cases like serial number too.

>
> This would solve a problem where the consumer device is some standard
> off-the-shelf IP block that needs to get some SoC specific calibration
> data from the eeprom. I may want to interpret the bits differently
> depending on which eeprom is connected to my SoC. Sometimes that data
> format may be the same across many variations of the SoC (e.g. the
> qcom,pvs-a node) or it may be completely different for a given SoC (e.g.
> qcom,serial-msm8960 node). I imagine for other SoCs out there it could
> be different depending on which eeprom the board manufacturer decides to
> wire onto their board and how they choose to program the data.
>
> So this is where I think the eeprom-cells and offset + length starts to
> fall apart. It forces us to make up a bunch of different compatible
> strings for our consumer device just so that we can parse the eeprom
> that we decided to use for some SoC/board specific data. Instead I'd
> like to see some framework that expresses exactly which eeprom is on my
> board and how to interpret the bits in a way that doesn't require me to
> keep refining the compatible string for my generic IP block.
>
> I worry that if we put all those details in DT we'll end up having to
> describe individual bits like serial-number-bit-2, serial-number-bit-3,
> etc. because sometimes these pieces of data are scattered all around the
> eeprom and aren't contiguous or aligned on a byte boundary. It may be
> easier to just have a way to express that this is an eeprom with this
> specific layout and my device has data stored in there. Then the driver
> can be told what layout it is (via compatible or some other string based
> means if we're not using DT?) and match that up with some driver data if
> it needs to know how to understand the bits it can read with the
> eeprom_read() API.
Yes this sounds more sensible to let the consumer driver interpret the 
eeprom data than the eeprom framework itself.


--srini
>

^ permalink raw reply

* Re: [RFC PATCH 1/3] eeprom: Add a simple EEPROM framework
From: Maxime Ripard @ 2015-02-24  9:21 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Rob Herring, Srinivas Kandagatla,
	linux-arm-kernel@lists.infradead.org, Rob Herring, Pawel Moll,
	Kumar Gala, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	Arnd Bergmann, Mark Brown, Greg Kroah-Hartman
In-Reply-To: <54EBB3AC.30000@codeaurora.org>

[-- Attachment #1: Type: text/plain, Size: 4924 bytes --]

On Mon, Feb 23, 2015 at 03:11:40PM -0800, Stephen Boyd wrote:
> >>> I would do something more simple that is just a list of keys and
> >>> their location like this:
> >>>
> >>> device-serial-number = <start size>;
> >>> key1 = <start size>;
> >>> key2 = <start size>;
> >> I'm sorry, but what's the difference?
> > It can describe the layout completely whether the fields are tied to a
> > h/w device or not.
> >
> > What I would like to see here is the entire layout described covering
> > both types of fields.
> >
> 
> I was thinking the DT might be like this on the provider side:
> 
>    qfprom@1000000 {
>       reg = <0x1000000 0x1000>;
>       ranges = <0 0x1000000 0x1000>;
>       compatible = "qcom,qfprom-msm8960"
> 
>       pvs-data: pvs-data@40 {
>             compatible = "qcom,pvs-a";
>             reg = <0x40 0x20>,
> 	    #eeprom-cells = <0>;
>       };
> 
>        tsens-data: tmdata@10 {
>             compatible = "qcom,tsens-data-msm8960";
>             reg = <0x10 4>, <0x16 4>;
> 	    #eeprom-cells = <0>;
> 
>       };
> 
>       serial-number: serial@50 {
>             compatible = "qcom,serial-msm8960";
>             reg = <0x50 4>, <0x60 4>;
> 	    #eeprom-cells = <0>;
> 
>       };
>    };

I'm not sure the compatible is really needed.

A label of some sort, just like the MTD partitions do would do just
fine, and wouldn't have the implicit expectation that a driver will be
probed from that node.

> and then on the consumer side:
> 
> 	device {
> 		eeproms = <&serial-number>;
> 		eeprom-names = "soc-rev-id";
> 	};
> 
> 
> This would solve a problem where the consumer device is some standard
> off-the-shelf IP block that needs to get some SoC specific calibration
> data from the eeprom. I may want to interpret the bits differently
> depending on which eeprom is connected to my SoC. Sometimes that data
> format may be the same across many variations of the SoC (e.g. the
> qcom,pvs-a node) or it may be completely different for a given SoC (e.g.
> qcom,serial-msm8960 node). I imagine for other SoCs out there it could
> be different depending on which eeprom the board manufacturer decides to
> wire onto their board and how they choose to program the data.

Oh, so you'd like to infer the data format it's stored in from the
compatible?

AFAICT, this format will be highly depending on the board itself,
rather than on the SoC, do you think it will scale enough?

> So this is where I think the eeprom-cells and offset + length starts to
> fall apart. It forces us to make up a bunch of different compatible
> strings for our consumer device just so that we can parse the eeprom
> that we decided to use for some SoC/board specific data. Instead I'd
> like to see some framework that expresses exactly which eeprom is on my
> board and how to interpret the bits in a way that doesn't require me to
> keep refining the compatible string for my generic IP block.

Hmmmm, apparently you don't :)

> I worry that if we put all those details in DT we'll end up having to
> describe individual bits like serial-number-bit-2, serial-number-bit-3,
> etc. because sometimes these pieces of data are scattered all around the
> eeprom and aren't contiguous or aligned on a byte boundary. It may be
> easier to just have a way to express that this is an eeprom with this
> specific layout and my device has data stored in there. Then the driver
> can be told what layout it is (via compatible or some other string based
> means if we're not using DT?) and match that up with some driver data if
> it needs to know how to understand the bits it can read with the
> eeprom_read() API.

I'm half convinced that the layout information will actually work for
more complex cases, like the linked list Rob described.

If such a structure is ever to be found, it would feel wrong to have
that in the EEPROM driver, but it would feel just as wrong to put that
in the client driver, that would have to handle the parsing of raw
data coming flashed by one single crazy board vendor.

Maybe we can have each cell carry a property that defines the format
it's stored in, and match that to some parsers plugins, starting with
the generic and trivial cases but still allowing for custom parsers to
be defined?

Something like

eeprom@42 {
	compatible = "atmel,at24something";
	reg = <0x42>;

	serial@0 {
		label = "board serial";
		reg = <0x0 0x10>;
		format = "packed";
	};

	opps@10 {
		label = "board serial";
		reg = <0x10 0x10>, <0x40 0x10>, <0x80 0x10>;
		format = "random-vendor,opp-linked-list";
	};
};

That would make eeprom_read always return the same format of data to
the client drivers, without cripling the generic EEPROM drivers
either.

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Serge E. Hallyn @ 2015-02-24 15:47 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, Christoph Lameter, Serge Hallyn, Andy Lutomirski,
	Aaron Jones, Ted Ts'o,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	akpm-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	Jonathan Corbet
In-Reply-To: <20150224051928.GA14755-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>

On Mon, Feb 23, 2015 at 11:19:29PM -0600, Serge E. Hallyn wrote:
> On Mon, Feb 23, 2015 at 06:15:53PM +0000, Serge Hallyn wrote:
> > Quoting Christoph Lameter (cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org):
> > > On Mon, 23 Feb 2015, Serge E. Hallyn wrote:
> > > 
> > > > > I do not see a problem with dropping privilege since the ambient set
> > > > > is supposed to be preserved across a drop of priviledge.
> > > >
> > > > Because you're tricking the program into thinking it has dropped
> > > > the privilege, when in fact it has not.
> > > 
> > > So the cap was dropped from the cap perm set but it is still active
> > > in the ambient set?
> > 
> > Right, and the legacy program doesn't know to check the new set.
> 
> we've been assuming the ambient set must be like fP.  is there any
> reason why it doesn't suffice for them to be or'ed with fI instead at
> exec?  then the bits would need to ne in pI. this might sufice for
> Christoph's use case, as pI will generally not change.  and for programs
> that really care, they can check pI.

The other way to look at it then is that it's basically as though the
privileged task (which has CAP_SETFCAP) could've just added fI=full to
all binaries on the filesystem;  instead it's using the ambient set
so that the risk from fI=full is contained to its own process tree.

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Christoph Lameter @ 2015-02-24 15:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jarkko Sakkinen, Ted Ts'o, linux-kernel@vger.kernel.org,
	Andrew G. Morgan, Andrew Morton, LSM List, Michael Kerrisk,
	Linux API, Mimi Zohar, Austin S Hemmelgarn, Aaron Jones,
	Serge Hallyn, Markku Savela, Jonathan Corbet
In-Reply-To: <CALCETrXnBvxG7i9A1Q8oS7kr8TG+FuGJMRVBQ_gFsxM6u=qdDA@mail.gmail.com>

On Mon, 23 Feb 2015, Andy Lutomirski wrote:

> On Feb 23, 2015 8:41 AM, "Christoph Lameter" <cl@linux.com> wrote:
> >
> > On Mon, 23 Feb 2015, Andy Lutomirski wrote:
> >
> > > If you set ambient caps and then run a setuid program (without
> > > no_new_privs), then the ambient set *must* be cleared by the kernel
> > > because that's what the setuid program expects.  Yes, the whole
> >
> > Why would a setuid program expect that? I'd say we expect the ambient set
> > to remain in effect. What would break if the ambient set would stay
> > active?
> >
>
> On a total guess: exim, sendmail, sudo, Apache suexec, etc.  Basically
> anything that expects setresuid(nonzero values); execve to drop caps.

Really? We have been running these things for years with the approach of
leaving these caps active.

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Christoph Lameter @ 2015-02-24 15:58 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, Serge Hallyn, Andy Lutomirski, Aaron Jones,
	Ted Ts'o, linux-security-module, akpm, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-kernel, linux-api, Michael Kerrisk, Jonathan Corbet
In-Reply-To: <20150224154715.GA20682@mail.hallyn.com>

On Tue, 24 Feb 2015, Serge E. Hallyn wrote:

> The other way to look at it then is that it's basically as though the
> privileged task (which has CAP_SETFCAP) could've just added fI=full to
> all binaries on the filesystem;  instead it's using the ambient set
> so that the risk from fI=full is contained to its own process tree.

The way that our internal patch works is to leave these things alone and
just check the ambient mask in the *capable*() functions. That way the
behavior of the existing cap bits does not change but the ambient caps
stay available. Apps have no surprises.

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Serge Hallyn @ 2015-02-24 16:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Serge E. Hallyn, Serge Hallyn, Andy Lutomirski, Aaron Jones,
	Ted Ts'o, linux-security-module, akpm, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-kernel, linux-api, Michael Kerrisk, Jonathan Corbet
In-Reply-To: <alpine.DEB.2.11.1502240956400.29006@gentwo.org>

Quoting Christoph Lameter (cl@linux.com):
> On Tue, 24 Feb 2015, Serge E. Hallyn wrote:
> 
> > The other way to look at it then is that it's basically as though the
> > privileged task (which has CAP_SETFCAP) could've just added fI=full to
> > all binaries on the filesystem;  instead it's using the ambient set
> > so that the risk from fI=full is contained to its own process tree.
> 
> The way that our internal patch works is to leave these things alone and
> just check the ambient mask in the *capable*() functions. That way the
> behavior of the existing cap bits does not change but the ambient caps
> stay available. Apps have no surprises.

Unless I'm misunderstanding what you are saying, apps do have surprises.
They drop capabilities, execute a file, and the result has capabilities
which the app couldn't have expected.  At least if the bits have to be
in fI to become part of pP', the app has a clue.

To be clear, I'm suggesting that the rules at exec become:

pI' = pI
pA' = pA  (pA is ambient)
pP' = (X & fP) | (pI & (fI | pA))
pE' = pP' & fE

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V1
From: Christoph Lameter @ 2015-02-24 17:28 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Serge E. Hallyn, Serge Hallyn, Andy Lutomirski, Aaron Jones,
	Ted Ts'o, linux-security-module, akpm, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-kernel, linux-api, Michael Kerrisk, Jonathan Corbet
In-Reply-To: <20150224164429.GB29685@ubuntumail>

On Tue, 24 Feb 2015, Serge Hallyn wrote:

> Unless I'm misunderstanding what you are saying, apps do have surprises.
> They drop capabilities, execute a file, and the result has capabilities
> which the app couldn't have expected.  At least if the bits have to be
> in fI to become part of pP', the app has a clue.

Well yes but the surprises do not occur in the cap bits they are
manipulating or inspecting via prctl.

> To be clear, I'm suggesting that the rules at exec become:
>
> pI' = pI

Ok that is new and on its own may solve the issue?

> pA' = pA  (pA is ambient)

Thats what this patch does

> pP' = (X & fP) | (pI & (fI | pA))

Hmmm... fP is empty for the file not having caps. so

pP' = pI & pA

> pE' = pP' & fE

fE? So the inherited caps are not effective? fE would be empty for a file
not having caps thus the ambient caps would not be available in the child.

^ permalink raw reply

* [PATCH v3 0/3] epoll: introduce round robin wakeup mode
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api

Hi,

When we are sharing a wakeup source among multiple epoll fds, we end up with
thundering herd wakeups, since there is currently no way to add to the
wakeup source exclusively. This series introduces a new EPOLL_ROTATE flag
to allow for round robin exclusive wakeups.

I believe this patch series addresses the two main concerns that were raised in
prior postings. Namely, that it affected code (and potentially performance)
of the core kernel wakeup functions, even in cases where it was not strictly
needed, and that it could lead to wakeup starvation (since we were are no
longer waking up all waiters). It does so by adding an extra layer of
indirection, whereby waiters are attached to a 'psuedo' epoll fd, which in turn
is attached directly to the wakeup source.

Patch 1 introduces the required wakeup hooks. This could be restricted to just
the epoll code, but I added them to the generic code in case other ppl might
find them useful.

Patch 2 adds an optimization to the epoll wakeup code that allows EPOLL_ROTATE
to work optimally, however it could be its own standalone patch.

Finally, patch 3 adds the EPOLL_ROTATE, and documents the API usage.

I'm also inlining test code making use of this interface, which shows roughly
a 50% speedup, similar to my previous results: http://lwn.net/Articles/632590/.

Sample epoll_create1 manpage text:

EPOLL_ROTATE
	Set the 'exclusive rotation' rotation flag on the new file descriptor.
	This new file descriptor can be added via epoll_ctl() to at most 1
	non-epoll file descriptors. Any epoll fds addeded directory to the
	new file descriptor via epoll_ctl() will be woken up in a round robin
	exclusive manner.

Thanks,

-Jason

v3:
-restrict epoll exclusive rotate wakeups to within the epoll code
-Add epoll optimization for overflow list

Jason Baron (3):
  sched/wait: add __wake_up_rotate()
  epoll: limit wakeups to the overflow list
  epoll: Add EPOLL_ROTATE mode

 fs/eventpoll.c                 | 52 +++++++++++++++++++++++++++++++++++-------
 include/linux/wait.h           |  1 +
 include/uapi/linux/eventpoll.h |  4 ++++
 kernel/sched/wait.c            | 27 ++++++++++++++++++++++
 4 files changed, 76 insertions(+), 8 deletions(-)

-- 
1.8.2.rc2

#include <unistd.h>
#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NUM_THREADS 100
#define NUM_EVENTS 20000
#define EPOLLEXCLUSIVE (1 << 28)
#define EPOLLBALANCED (1 << 27)

int optimize, exclusive;
int p[2];
int ep_src_fd;
pthread_t threads[NUM_THREADS];
int event_count[NUM_THREADS];

struct epoll_event evt = {
	.events = EPOLLIN 
};

void die(const char *msg) {
    perror(msg);
    exit(-1);
}

void *run_func(void *ptr)
{
	int i = 0;
	int j = 0;
	int ret;
	int epfd;
	char buf[4];
	int id = *(int *)ptr;
	int *contents;

	if ((epfd = epoll_create(1)) < 0)
		die("create");

	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, ep_src_fd, &evt);
	if (ret)
		perror("epoll_ctl add error!\n");

	while (1) { 
    		ret = epoll_wait(epfd, &evt, 10000, -1);
		ret = read(p[0], buf, sizeof(int));
		if (ret == 4)
			event_count[id]++;
	}
}

#define EPOLL_ROTATE 1

int main(int argc, char *argv[])
{
	int ret, i, j;
	int id[NUM_THREADS];
	int total = 0;
	int nohit = 0;
	int extra_wakeups = 0;

	if (argc == 2) {
		if (strcmp(argv[1], "-o") == 0)
			optimize = 1;
		if (strcmp(argv[1], "-e") == 0)
			exclusive = 1;
	}

	if (pipe(p) < 0)
		die("pipe");
	if (optimize) {
		if ((ep_src_fd = epoll_create1(EPOLL_ROTATE)) < 0)
			die("create");
	} else {
		if ((ep_src_fd = epoll_create1(0)) < 0)
			die("create");
	}

	ret = epoll_ctl(ep_src_fd, EPOLL_CTL_ADD, p[0], &evt);
	if (ret)
		perror("epoll_ctl add core error!\n");

	for (i = 0; i < NUM_THREADS; i++) {
		id[i] = i;
		pthread_create(&threads[i], NULL, run_func, &id[i]);
	} 

	for (j = 0; j < NUM_EVENTS; j++) {
		write(p[1], p, sizeof(int));
		usleep(100);
	}

	for (i = 0; i < NUM_THREADS; i++) {
		pthread_cancel(threads[i]);
		printf("joined: %d\n", i);
		printf("event count: %d\n", event_count[i]);
		total += event_count[i];
		if (!event_count[i])
			nohit++;
	} 

	printf("total events is: %d\n", total);
	printf("nohit is: %d\n", nohit);
}

^ permalink raw reply

* [PATCH v3 1/3] sched/wait: add __wake_up_rotate()
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api
In-Reply-To: <cover.1424805740.git.jbaron@akamai.com>

Create a special queue where waiters are 'rotated' to the end of the queue
after they are woken up. Waiters are expected to be added 'exclusively'
to this queue, and the wakeup must occur with __wake_up_rotate().

The current issue with just adding a waiter as exclusive is that it that often
results in the same thread woken up again and again. The first intended user of
this functionality is epoll.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/linux/wait.h |  1 +
 kernel/sched/wait.c  | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2232ed1..86f06f4 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -152,6 +152,7 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_rotate(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int wake_flags, void *key);
 int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_action_f *, unsigned);
 int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_action_f *, unsigned);
 void wake_up_bit(void *, int);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 852143a..2ceed03 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -157,6 +157,33 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
 
 /*
+ * Special wait queue were anything added as excluive will be rotated to the
+ * back of the queue in order to balance the wakeups.
+ */
+void __wake_up_rotate(wait_queue_head_t *q, unsigned int mode,
+		      int nr_exclusive, int wake_flags, void *key)
+{
+	unsigned long flags;
+	wait_queue_t *curr, *next;
+	LIST_HEAD(rotate_list);
+
+	spin_lock_irqsave(&q->lock, flags);
+	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
+		unsigned wq_flags = curr->flags;
+
+		if (curr->func(curr, mode, wake_flags, key) &&
+					(wq_flags & WQ_FLAG_EXCLUSIVE)) {
+			if (nr_exclusive > 0)
+				list_move_tail(&curr->task_list, &rotate_list);
+			if (!--nr_exclusive)
+				break;
+		}
+	}
+	list_splice_tail(&rotate_list, &q->task_list);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+
+/*
  * Note: we use "set_current_state()" _after_ the wait-queue add,
  * because we need a memory barrier there on SMP, so that any
  * wake-function that tests for the wait-queue being active
-- 
1.8.2.rc2

^ permalink raw reply related

* [PATCH v3 2/3] epoll: restrict wakeups to the overflow list
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api
In-Reply-To: <cover.1424805740.git.jbaron@akamai.com>

During ep_scan_ready_list(), when the ep->mtx is dropped, we queue new
events to the ep->ovflist. However, instead of just issuing wakeup for these
newly encountered events, we instead proceed to issue wakeups even if
nothing new is being propagated.

Normally, this simply results in unnecessary calls to wakeup. However,
now that we want to add wakeup queues that have 'state', this results in
unnecessary state transitions. That is, with the current default behavior
of always waking up all threads, the extra calls to wakeup do not affect
things adversely (besides the extra call overheads). However, we wish to
add policies that are stateful (for example rotating wakeups among epoll
sets), and these unnecessary wakeups cause unwanted transitions.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..da84712 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -594,7 +594,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 					   struct list_head *, void *),
 			      void *priv, int depth, bool ep_locked)
 {
-	int error, pwake = 0;
+	int error, pwake = 0, newly_ready = 0;
 	unsigned long flags;
 	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);
@@ -634,6 +634,13 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 	for (nepi = ep->ovflist; (epi = nepi) != NULL;
 	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
 		/*
+		 * We only need to perform wakeups if new events have arrived
+		 * while the ep->lock was dropped. We should have already
+		 * issued the wakeups for an existing events.
+		 */
+		if (!newly_ready)
+			newly_ready = 1;
+		/*
 		 * We need to check if the item is already in the list.
 		 * During the "sproc" callback execution time, items are
 		 * queued into ->ovflist but the "txlist" might already
@@ -657,7 +664,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 	list_splice(&txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
 
-	if (!list_empty(&ep->rdllist)) {
+	if (newly_ready) {
 		/*
 		 * Wake up (if active) both the eventpoll wait list and
 		 * the ->poll() wait list (delayed after we release the lock).
-- 
1.8.2.rc2

^ permalink raw reply related

* [PATCH v3 3/3] epoll: Add EPOLL_ROTATE mode
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api
In-Reply-To: <cover.1424805740.git.jbaron@akamai.com>

Epoll file descriptors that are added to a shared wakeup source are always
added in a non-exclusive manner. That means that when we have multiple epoll
fds attached to a shared wakeup source they are all woken up. This can
lead to excessive cpu usage and uneven load distribution.

This patch introduces a new 'EPOLL_ROTATE' flag that can be passed as
an argument to epoll_create1(). The 'EPOLL_ROTATE' flag creates a 'special'
epfd that, such that it will round robin its wakeups among any epfds attached
to it. This special epfd can then be attached to at most 1 wakeup sources.

For example:

1. ep1 = epoll_create1(EPOLL_ROTATE);
2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
   epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
   epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
     .
     .
     .

The usage of the indirect 'special' epfd is designed such that we don't have to
attach the epfds in step #3, directly to the wakeup source. This address two
issues. First, it ensures that competing process do not miss wakeups when they
are attached directly to the wakeup source. Because of the round robin nature
of the wakeups, we want to ensure that competing threads can't starve out or
insert themselves into our wakeup queue. Since, this 'special' fd is local to
our process we can control its exposure. And second, we didn't want the
implementation to affect the core kernel wake up paths, even when unused. By
creating a separate 'special' epfd we have isolated the wakeup paths.

An implementation note is that in the epoll wakeup routine,
'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
wakeup, only when there are current waiters. The idea is to use this additional
heuristic in order minimize wakeup latencies.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c                 | 41 +++++++++++++++++++++++++++++++++++------
 include/uapi/linux/eventpoll.h |  4 ++++
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index da84712..a8b06a1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -219,6 +219,9 @@ struct eventpoll {
 	/* used to optimize loop detection check */
 	int visited;
 	struct list_head visited_list_link;
+
+	/* wakeup policy type */
+	int wakeup_policy;
 };
 
 /* Wait structure used by the poll hooks */
@@ -1009,6 +1012,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1073,8 +1077,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
 		wake_up_locked(&ep->wq);
+		ewake = 1;
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1082,9 +1088,15 @@ out_unlock:
 	spin_unlock_irqrestore(&ep->lock, flags);
 
 	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
-
+	if (pwake) {
+		if (ep->wakeup_policy == EPOLL_ROTATE)
+			__wake_up_rotate(&ep->poll_wait, TASK_NORMAL, 1, 0,
+					 (void *)POLLIN);
+		else
+			ep_poll_safewake(&ep->poll_wait);
+	}
+	if (epi->event.events & EPOLLROTATE)
+		return ewake;
 	return 1;
 }
 
@@ -1097,12 +1109,19 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 {
 	struct epitem *epi = ep_item_from_epqueue(pt);
 	struct eppoll_entry *pwq;
+	struct eventpoll *ep;
 
 	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (is_file_epoll(epi->ffd.file))
+			ep = epi->ffd.file->private_data;
+		if (ep && (ep->wakeup_policy == EPOLL_ROTATE)) {
+			__add_wait_queue_exclusive(whead, &pwq->wait);
+			epi->event.events |= EPOLLROTATE;
+		} else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1777,7 +1796,7 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
 	/* Check the EPOLL_* constant for consistency.  */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
 
-	if (flags & ~EPOLL_CLOEXEC)
+	if (flags & ~(EPOLL_CLOEXEC | EPOLL_ROTATE))
 		return -EINVAL;
 	/*
 	 * Create the internal data structure ("struct eventpoll").
@@ -1802,6 +1821,8 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
 	}
 	ep->file = file;
 	fd_install(fd, file);
+	if (flags & EPOLL_ROTATE)
+		ep->wakeup_policy = EPOLL_ROTATE;
 	return fd;
 
 out_free_fd:
@@ -1874,6 +1895,9 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 */
 	ep = f.file->private_data;
 
+	if ((ep->wakeup_policy == EPOLL_ROTATE) && is_file_epoll(tf.file))
+		goto error_tgt_fput;
+
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
 	 * descriptor, there is the change of creating closed loops, which are
@@ -1911,6 +1935,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 				mutex_lock_nested(&tep->mtx, 1);
 			}
 		}
+		error = -EINVAL;
+		if ((ep->wakeup_policy == EPOLL_ROTATE) &&
+					(!RB_EMPTY_ROOT(&ep->rbr)))
+			goto error_unlock;
 	}
 
 	/*
@@ -1945,6 +1973,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 			error = -ENOENT;
 		break;
 	}
+error_unlock:
 	if (tep != NULL)
 		mutex_unlock(&tep->mtx);
 	mutex_unlock(&ep->mtx);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..b0a3b41 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -20,12 +20,16 @@
 
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
+#define EPOLL_ROTATE 0x1
 
 /* Valid opcodes to issue to sys_epoll_ctl() */
 #define EPOLL_CTL_ADD 1
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Marks struct epitem that are attached to wakeup policy type EPOLL_ROTATE */
+#define EPOLLROTATE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
1.8.2.rc2

^ permalink raw reply related

* Notification and Proposals to changes of the Scheduling subsystem
From: Mitchell Erblich @ 2015-02-25  1:20 UTC (permalink / raw)
  To: linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: mingo-DgEjT+Ai2ygdnm+yROfE0A, jbeulich-UKVsQwr1y4U,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, riel-H+wXaHxf7aLQT0dZR+AlfA

Please note that this engineer only receives memory subsystem patches and thus please include my email on responses.

Please note that this proposal is from this engineer.

This also fulfills any legal notification of work done, but not submitted.


	Notification & Proposal of Feasibility to Support System V Release 4 Defacto Standard Scheduling Extensions, etc within Linux


SCHED_IA
	Over 10 years ago, System V Release 4 was enhanced with additional features by Sun Microsystems. One of the more minor extensions dealt with the subdivision of process’s scheduling characteristics and was known as he INTERACTIVE /IA scheduling class. This scheduling class was targeted to frequent sleepers, with the mouse icon being one the first processes/tasks..

	Linux has no explicit SCHED_IA scheduling policy, but does alter scheduling characteristics based on some sleep behavior (example: GENTLE_FAIR_SLEEPERS) within the fair share scheduling configuration option. Processes / tasks that are CPU bound that fit into a SLEEPER behavior can have a hybrid behavior over time where during any one scheduling period, it may consume its variable length allocated time. This can alter its expected short latency to be current / ONPROC. To simplify the implementation, it is suggested that SCHED_IA be a sub scheduling policy of SCHED_NORMAL. Shouldn’t an administrator be able to classify that the NORMAL long term behavior of a task, be one as a FIXED sleeper? 

	Thus, the first Proposal is to explicitly support the SCHED_IA scheduling policy within the Linux kernel. After kernel support, any application that has the same functionality as priocntl(1) then needs to be altered to support this new scheduling policy.


Note: Administrator in this context should be a task with a UID, EUID, GID, EGID, etc, that has the proper CAPABILITY to alter scheduling behavior.


SCHED_UNFAIR
		UNIX / Linux scheduling has in the most part attempt to achieve some level of process / task scheduling fairness with Linux using the fair share scheduling class. Exceptions do exist, but are not being discussed below. In general this type of scheduling is acceptable in a generic implementation, but has weaknesses when UNIX / Linux is moved into a different environment. Many companies use UNIX / Linux in heavy networking environments where one or more tasks can infrequently attempt to consume more than its fair share.

		This proposal is to acknowledge that “nice” and a few other scheduling workarounds do no always suffice to allow this temporary inequality to exist. Yes, a cpumask could be set that allows only certain tasks to run on specific nodes, however the implied assumption is that they only infrequently need to have inequality / greater “time slice” than their fair share. A network protocol example is a convergence task that needs to run until it is finished and until that happens needed routing changes will not occur. The time window in which all tasks need to be run, SHOULD not need this task in consecutive time windows, thus over longer periods of time, it still fulfills the fair scheduling policy. Again the proper CAPABILITY needs to be specified with the priocntl(1) like application as running many tasks per CPU COULD then effect the performance of the system.

		Thus, explicit support for a new SCHED_UNFAIR scheduling policy is proposed / suggested within the Linux kernel. Again it can be a sub scheduling policy of SCHED_NORMAL.

Thank you,
		Mitchell Erblich
		UNIX Kernel Engineer
	


	


	
			--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 1/3] eeprom: Add a simple EEPROM framework
From: Stephen Boyd @ 2015-02-25  1:30 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Rob Herring, Srinivas Kandagatla,
	linux-arm-kernel@lists.infradead.org, Rob Herring, Pawel Moll,
	Kumar Gala, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	Arnd Bergmann, Mark Brown, Greg Kroah-Hartman
In-Reply-To: <20150224092155.GO25269@lukather>

On 02/24, Maxime Ripard wrote:
> On Mon, Feb 23, 2015 at 03:11:40PM -0800, Stephen Boyd wrote:
> > >>> I would do something more simple that is just a list of keys and
> > >>> their location like this:
> > >>>
> > >>> device-serial-number = <start size>;
> > >>> key1 = <start size>;
> > >>> key2 = <start size>;
> > >> I'm sorry, but what's the difference?
> > > It can describe the layout completely whether the fields are tied to a
> > > h/w device or not.
> > >
> > > What I would like to see here is the entire layout described covering
> > > both types of fields.
> > >
> > 
> > I was thinking the DT might be like this on the provider side:
> > 
> >    qfprom@1000000 {
> >       reg = <0x1000000 0x1000>;
> >       ranges = <0 0x1000000 0x1000>;
> >       compatible = "qcom,qfprom-msm8960"
> > 
> >       pvs-data: pvs-data@40 {
> >             compatible = "qcom,pvs-a";
> >             reg = <0x40 0x20>,
> > 	    #eeprom-cells = <0>;
> >       };
> > 
> >        tsens-data: tmdata@10 {
> >             compatible = "qcom,tsens-data-msm8960";
> >             reg = <0x10 4>, <0x16 4>;
> > 	    #eeprom-cells = <0>;
> > 
> >       };
> > 
> >       serial-number: serial@50 {
> >             compatible = "qcom,serial-msm8960";
> >             reg = <0x50 4>, <0x60 4>;
> > 	    #eeprom-cells = <0>;
> > 
> >       };
> >    };
> 
> I'm not sure the compatible is really needed.
> 
> A label of some sort, just like the MTD partitions do would do just
> fine, and wouldn't have the implicit expectation that a driver will be
> probed from that node.

I wasn't aware that compatible meant driver probe. I thought
compatible just meant some software entity can understand what
I've described within this node. For example, compatible for
reserved-memory nodes doesn't mean we're going to probe a device.

> 
> > and then on the consumer side:
> > 
> > 	device {
> > 		eeproms = <&serial-number>;
> > 		eeprom-names = "soc-rev-id";
> > 	};
> > 
> > 
> > This would solve a problem where the consumer device is some standard
> > off-the-shelf IP block that needs to get some SoC specific calibration
> > data from the eeprom. I may want to interpret the bits differently
> > depending on which eeprom is connected to my SoC. Sometimes that data
> > format may be the same across many variations of the SoC (e.g. the
> > qcom,pvs-a node) or it may be completely different for a given SoC (e.g.
> > qcom,serial-msm8960 node). I imagine for other SoCs out there it could
> > be different depending on which eeprom the board manufacturer decides to
> > wire onto their board and how they choose to program the data.
> 
> Oh, so you'd like to infer the data format it's stored in from the
> compatible?
> 
> AFAICT, this format will be highly depending on the board itself,
> rather than on the SoC, do you think it will scale enough?
> 
> > So this is where I think the eeprom-cells and offset + length starts to
> > fall apart. It forces us to make up a bunch of different compatible
> > strings for our consumer device just so that we can parse the eeprom
> > that we decided to use for some SoC/board specific data. Instead I'd
> > like to see some framework that expresses exactly which eeprom is on my
> > board and how to interpret the bits in a way that doesn't require me to
> > keep refining the compatible string for my generic IP block.
> 
> Hmmmm, apparently you don't :)
> 
> > I worry that if we put all those details in DT we'll end up having to
> > describe individual bits like serial-number-bit-2, serial-number-bit-3,
> > etc. because sometimes these pieces of data are scattered all around the
> > eeprom and aren't contiguous or aligned on a byte boundary. It may be
> > easier to just have a way to express that this is an eeprom with this
> > specific layout and my device has data stored in there. Then the driver
> > can be told what layout it is (via compatible or some other string based
> > means if we're not using DT?) and match that up with some driver data if
> > it needs to know how to understand the bits it can read with the
> > eeprom_read() API.
> 
> I'm half convinced that the layout information will actually work for
> more complex cases, like the linked list Rob described.
> 
> If such a structure is ever to be found, it would feel wrong to have
> that in the EEPROM driver, but it would feel just as wrong to put that
> in the client driver, that would have to handle the parsing of raw
> data coming flashed by one single crazy board vendor.
> 
> Maybe we can have each cell carry a property that defines the format
> it's stored in, and match that to some parsers plugins, starting with
> the generic and trivial cases but still allowing for custom parsers to
> be defined?
> 
> Something like
> 
> eeprom@42 {
> 	compatible = "atmel,at24something";
> 	reg = <0x42>;
> 
> 	serial@0 {
> 		label = "board serial";
> 		reg = <0x0 0x10>;
> 		format = "packed";
> 	};
> 
> 	opps@10 {
> 		label = "board serial";
> 		reg = <0x10 0x10>, <0x40 0x10>, <0x80 0x10>;
> 		format = "random-vendor,opp-linked-list";
> 	};
> };
> 
> That would make eeprom_read always return the same format of data to
> the client drivers, without cripling the generic EEPROM drivers
> either.
> 

Is the goal here to make eeprom_read() figure out how to return
the next byte of data and hide the parsing logic behind the
eeprom APIs? I imagine "random-vendor,opp-linked-list" would be
handled by the eeprom driver and that would return OPPs byte by
byte across the different reg properties to the eeprom consumer?

This approach concerns me because every eeprom_read() call needs
to fit the format that the client driver is expecting. How do we
validate that? What do we do if we have a random OPP client #1
that expects to get the data from eeprom_read() with OPPs in
ascending order and random OPP client #2 that expects to get the
data from eeprom_read() with OPPs in descending order?

It feels like we're making the eeprom framework too smart without
a well defined abstraction. If we were to make it so that
eeprom_get_opps() knew what to do and parsed/populated the OPPs,
it might work. But if we're just exporting raw data across a
read/write API with some implementation specific mangling it
sounds like it's going to get messy fast. And if the API is well
defined, it would start to become rather large with many
different types of data that need to be parsed and sometimes data
that's only specific to a single SoC.

I wonder how much we could get away with this approach though. If
the eeprom driver probed and populated OPPs, made a serial number
available via the soc device, and then we made up framework(s)
for things like our thermal sensor calibration data and display
panel calibration data, I would guess that covers most of my
use-cases. The client drivers would need some sort of 'wait for
eeprom to populate things' API or we'd need to work that into the
new calibration framework.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox