Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v2 0/2] Add epoll round robin wakeup mode
From: Jason Baron @ 2015-02-18  3:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong,
	Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linux FS Devel, Linux API, Linus Torvalds, Mathieu Desnoyers,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA
In-Reply-To: <CALCETrWg9sdyoKg0-BkwKQgyANvJybQ_wqjTfvYEGW1+S1J5Bw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 02/17/2015 04:09 PM, Andy Lutomirski wrote:
> On Tue, Feb 17, 2015 at 12:33 PM, Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:
>> On 02/17/2015 02:46 PM, Andy Lutomirski wrote:
>>> On Tue, Feb 17, 2015 at 11:33 AM, Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:
>>>> When we are sharing a wakeup source among multiple epoll fds, we end up with
>>>> thundering herd wakeups, since there is currently no way to add to the
>>>> wakeup source exclusively. This series introduces 2 new epoll flags,
>>>> EPOLLEXCLUSIVE for adding to a wakeup source exclusively. And EPOLLROUNDROBIN
>>>> which is to be used in conjunction to EPOLLEXCLUSIVE to evenly
>>>> distribute the wakeups. This patch was originally motivated by a desire to
>>>> improve wakeup balance and cpu usage for a listen socket() shared amongst
>>>> multiple epoll fd sets.
>>>>
>>>> See: http://lwn.net/Articles/632590/ for previous test program and testing
>>>> resutls.
>>>>
>>>> Epoll manpage text:
>>>>
>>>> EPOLLEXCLUSIVE
>>>>         Provides exclusive wakeups when attaching multiple epoll fds to a
>>>>         shared wakeup source. Must be specified with an EPOLL_CTL_ADD operation.
>>>>
>>>> EPOLLROUNDROBIN
>>>>         Provides balancing for exclusive wakeups when attaching multiple epoll
>>>>         fds to a shared wakeup soruce. Depends on EPOLLEXCLUSIVE being set and
>>>>         must be specified with an EPOLL_CTL_ADD operation.
>>>>
>>>> Thanks,
>>> What permissions do you need on the file descriptor to do this?  This
>>> will be the first case where a poll-like operation has side effects,
>>> and that's rather weird IMO.
>>>
>> So in the case where you have both non-exclusive and exclusive
>> waiters, all of the non-exclusive waiters will continue to get woken
>> up. However, I think you're getting at having multiple exclusive
>> waiters and potentially 'starving' out other exclusive waiters.
>>
>> In general, I think wait queues are associated with a 'struct file',
>> so I think unless you are sharing your fd table, this isn't an issue.
>> However, there may be cases where this is not true? In which
>> case, perhaps, we could limit this to CAP_SYS_ADMIN...
> There's also SCM_RIGHTS, which can be used in conjunction with file
> sealing and such.
>
> In general, I feel like this patch series solves a problem that isn't
> well understood and does it by adding a rather strange new mechanism.
> Is there really a problem that can't be addressed by more normal epoll
> features?
>
> --Andy

hmm....so I dug through some of the Linux archives a bit and this
problem seems to crop up every so often without resolution.
So I do believe that its an issue that ppl are more generally
interested in.

See:

http://lkml.iu.edu/hypermail/linux/kernel/1201.1/02620.html
http://marc.info/?l=linux-kernel&m=128638781921073&w=2

In the latter thread, Linus suggests adding it to the "requested events"
field to poll: http://marc.info/?l=linux-kernel&m=128639416832335&w=2

So, I think that this series at least moves in that suggested direction.

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Ingo Molnar @ 2015-02-18  8:07 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, normalperson-rMlxZR9MS24,
	davidel-AhlLAIvw+VEjIGhXcJzhZg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra
In-Reply-To: <7956874bfdc7403f37afe8a75e50c24221039bd2.1424200151.git.jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>


* Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:

> Epoll file descriptors that are added to a shared wakeup 
> source are always added in a non-exclusive manner. That 
> means that when we have multiple epoll fds attached to a 
> shared wakeup source they are all woken up. This can lead 
> to excessive cpu usage and uneven load distribution.
> 
> This patch introduces two new 'events' flags that are 
> intended to be used with EPOLL_CTL_ADD operations. 
> EPOLLEXCLUSIVE, adds the epoll fd to the event source in 
> an exclusive manner such that the minimum number of 
> threads are woken. EPOLLROUNDROBIN, which depends on 
> EPOLLEXCLUSIVE also being set, can also be added to the 
> 'events' flag, such that we round robin through the set 
> of waiting threads.
> 
> An implementation note is that in the epoll wakeup 
> routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, 
> we return 1, for a successful wakeup, only when there are 
> current waiters. The idea is to use this additional 
> heuristic in order minimize wakeup latencies.
> 
> Signed-off-by: Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
>  include/uapi/linux/eventpoll.h |  6 ++++++
>  2 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index d77f944..382c832 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -92,7 +92,8 @@
>   */
>  
>  /* Epoll private bits inside the event mask */
> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
> +			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>  
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	unsigned long flags;
>  	struct epitem *epi = ep_item_from_wait(wait);
>  	struct eventpoll *ep = epi->ep;
> +	int ewake = 0;
>  
>  	if ((unsigned long)key & POLLFREE) {
>  		ep_pwq_from_wait(wait)->whead = NULL;
> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>  	 * wait list.
>  	 */
> -	if (waitqueue_active(&ep->wq))
> +	if (waitqueue_active(&ep->wq)) {
> +		ewake = 1;
>  		wake_up_locked(&ep->wq);
> +	}
>  	if (waitqueue_active(&ep->poll_wait))
>  		pwake++;
>  
> @@ -1078,6 +1082,8 @@ out_unlock:
>  	if (pwake)
>  		ep_poll_safewake(&ep->poll_wait);
>  
> +	if (epi->event.events & EPOLLROUNDROBIN)
> +		return ewake;
>  	return 1;
>  }
>  
> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>  		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>  		pwq->whead = whead;
>  		pwq->base = epi;
> -		add_wait_queue(whead, &pwq->wait);
> +		if (epi->event.events & EPOLLROUNDROBIN)
> +			add_wait_queue_rr(whead, &pwq->wait);
> +		else if (epi->event.events & EPOLLEXCLUSIVE)
> +			add_wait_queue_exclusive(whead, &pwq->wait);
> +		else
> +			add_wait_queue(whead, &pwq->wait);
>  		list_add_tail(&pwq->llink, &epi->pwqlist);
>  		epi->nwait++;
>  	} else {
> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>  		struct epoll_event __user *, event)
>  {
> -	int error;
> -	int full_check = 0;
> +	int error, full_check = 0, wait_flags = 0;
>  	struct fd f, tf;
>  	struct eventpoll *ep;
>  	struct epitem *epi;
> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>  	if (f.file == tf.file || !is_file_epoll(f.file))
>  		goto error_tgt_fput;
>  
> +	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
> +	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
> +	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
> +		goto error_tgt_fput;
> +
>  	/*
>  	 * At this point it is safe to assume that the "private_data" contains
>  	 * our own data structure.
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index bc81fb2..10260a1 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,12 @@
>  #define EPOLL_CTL_DEL 2
>  #define EPOLL_CTL_MOD 3
>  
> +/* Balance wakeups for a shared event source */
> +#define EPOLLROUNDROBIN (1 << 27)
> +
> +/* Add exclusively */
> +#define EPOLLEXCLUSIVE (1 << 28)
> +
>  /*
>   * Request the handling of system wakeup events so as to prevent system suspends
>   * from happening while those events are being processed.

So let me rephrase the justification of your two patches:

Unlike regular waitqueue usage (where threads remove 
themselves from the waitqueue once they receive a wakeup), 
epoll waitqueues are static and 'persistent': epoll target 
threads are on the poll waitqueue indefinitely, only 
register/unregister removes threads from them.

So they are not really 'wait queues', but static 'task 
lists', and are thus exposed to classic thundering herd 
scheduling problems and scheduling assymetries: a single 
event on a shared event source will wake all epoll 
'task-lists', and not only will it wake them, but due to 
the static nature of the lists, even an exclusive wakeup 
will iterate along the list with O(N) overhead, until it 
finds a wakeable thread.

As the number of lists and the number of threads in the 
lists increases this scales suboptimally, and it also looks 
slightly odd that a random set of epoll worker threads is 
'more equal' than the others, in receiving a comparably 
higher proportion of events.

The solution is to add this new ABI to allow epoll events 
to be actively load-balanced both between the persistent 
'task lists', and to also allow the individual task lists 
to act as dynamic runqueues: the head of the list is likely 
to be sleeping, newly woken tasks get moved to the tail of 
the list.

This has two main advantages: firstly it solves the O(N) 
(micro-)problem, but it also more evenly distributes events 
both between task-lists and within epoll groups as tasks as 
well.

The disadvantages: slightly higher management micro-costs, 
plus a global waitqueue list, which used to be read-mostly, 
is now actively dirtied by every event, adding more global 
serialization. The latter is somewhat muted by the fact 
that the waitqueue lock itself is already a global 
serialization point today and got dirtied by every event, 
and the list head is next to it, usually in the same 
cacheline.

Did I get it right?

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
From: Arnd Bergmann @ 2015-02-18 11:06 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Oleg Nesterov, Andrew Morton,
	Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
In-Reply-To: <20150217213313.GB7091-yYYamFZzV1regbzhZkK2zA@public.gmane.org>

On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > > 
> > > A request is described by the task_diag_pid structure:
> > > 
> > > struct task_diag_pid {
> > >        __u64   show_flags;      /* specify which information are required */
> > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > 
> > >        __u32   pid;
> > > };
> > 
> > Can you explain how the interface relates to the 'taskstats' genetlink
> > API? Did you consider extending that interface to provide the
> > information you need instead of basing on the socket-diag?
> 
> It isn't based on the socket-diag, it looks like socket-diag.
> 
> Current task_diag registers a new genl family, but we can use the taskstats
> family and add task_diag commands to it.

What I meant was more along the lines of making it look like taskstats
by adding new fields to 'struct taskstat' for what you want return.
I don't know if that is possible or a good idea for the information
you want to get out of the kernel, but it seems like a more natural
interface, as it already has some of the same data (comm, gid, pid,
ppid, ...).

	Arnd

^ permalink raw reply

* Re: [Xen-devel] [PATCH 25/45] gntalloc.h: include stdint.h in userspace
From: David Vrabel @ 2015-02-18 11:53 UTC (permalink / raw)
  To: Mikko Rapeli, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: David Vrabel, linux-api-u79uwXL29TY76Z2rM5mHXA,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b, Boris Ostrovsky
In-Reply-To: <1424127948-22484-26-git-send-email-mikko.rapeli-X3B1VOXEql0@public.gmane.org>

On 16/02/15 23:05, Mikko Rapeli wrote:
> Fixes compilation error:
> 
> xen/gntalloc.h:22:2: error: unknown type name ‘uint16_t’
> 
> Signed-off-by: Mikko Rapeli <mikko.rapeli-X3B1VOXEql0@public.gmane.org>
> ---
>  include/uapi/xen/gntalloc.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/uapi/xen/gntalloc.h b/include/uapi/xen/gntalloc.h
> index 76bd580..184df7e 100644
> --- a/include/uapi/xen/gntalloc.h
> +++ b/include/uapi/xen/gntalloc.h
> @@ -11,6 +11,12 @@
>  #ifndef __LINUX_PUBLIC_GNTALLOC_H__
>  #define __LINUX_PUBLIC_GNTALLOC_H__
>  
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif

I think it would be preferrable to #include <linux/types.h> only and
switch to using the __u32 etc. types (as others have suggested).

David

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
From: Andrew Vagin @ 2015-02-18 12:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrey Vagin, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Oleg Nesterov, Andrew Morton,
	Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
In-Reply-To: <2631974.u59Ona1mDt@wuerfel>

On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > > 
> > > > A request is described by the task_diag_pid structure:
> > > > 
> > > > struct task_diag_pid {
> > > >        __u64   show_flags;      /* specify which information are required */
> > > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > > 
> > > >        __u32   pid;
> > > > };
> > > 
> > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > API? Did you consider extending that interface to provide the
> > > information you need instead of basing on the socket-diag?
> > 
> > It isn't based on the socket-diag, it looks like socket-diag.
> > 
> > Current task_diag registers a new genl family, but we can use the taskstats
> > family and add task_diag commands to it.
> 
> What I meant was more along the lines of making it look like taskstats
> by adding new fields to 'struct taskstat' for what you want return.
> I don't know if that is possible or a good idea for the information
> you want to get out of the kernel, but it seems like a more natural
> interface, as it already has some of the same data (comm, gid, pid,
> ppid, ...).

Now I see what you mean. task_diag has more flexible and universal
interface than taskstat. A response of taskstat only contains a
taskstats structure. A response of taskdiag can contains a few types of
properties. Each type is described by its own structure.

Curently here are only two groups of parameters: task_diag_msg and
task_diag_creds.

task_diag_msg contains a few basic parameters.
task_diag_creds contains credentials.

I'm going to add other groups to describe all kind of task properties
which currently are presented in procfs (e.g. /proc/pid/maps,
/proc/pid/fding/*, /proc/pid/status, etc).

One of features of task_diag is an ability to choose which information
are required. This allows to minimize a response size and a time, which
is requred to fill this response.

struct task_diag_msg {
        __u32   tgid;
        __u32   pid;
        __u32   ppid;
        __u32   tpid;
        __u32   sid;
        __u32   pgid;
        __u8    state;
        char    comm[TASK_DIAG_COMM_LEN];
};

struct task_diag_creds {
        struct task_diag_caps cap_inheritable;
        struct task_diag_caps cap_permitted;
        struct task_diag_caps cap_effective;
        struct task_diag_caps cap_bset;

        __u32 uid;
        __u32 euid;
        __u32 suid;
        __u32 fsuid;
        __u32 gid;
        __u32 egid;
        __u32 sgid;
        __u32 fsgid;
};

Thanks,
Andrew
> 
> 	Arnd

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
From: Andrew Vagin @ 2015-02-18 14:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrey Vagin, Pavel Emelyanov, Roger Luethi, Oleg Nesterov,
	Cyrill Gorcunov, Andrew Morton, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CALCETrWyQpr-x=No4mK_95gSANL-_fTr3qC7WjT_5TyFQb_rGw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> >
> > Here is a preview version. It provides restricted set of functionality.
> > I would like to collect feedback about this idea.
> >
> > Currently we use the proc file system, where all information are
> > presented in text files, what is convenient for humans.  But if we need
> > to get information about processes from code (e.g. in C), the procfs
> > doesn't look so cool.
> >
> > From code we would prefer to get information in binary format and to be
> > able to specify which information and for which tasks are required. Here
> > is a new interface with all these features, which is called task_diag.
> > In addition it's much faster than procfs.
> >
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> >        __u64   show_flags;      /* specify which information are required */
> >        __u64   dump_stratagy;   /* specify a group of processes */
> >
> >        __u32   pid;
> > };
> >
> > A respone is a set of netlink messages. Each message describes one task.
> > All task properties are divided on groups. A message contains the
> > TASK_DIAG_MSG group and other groups if they have been requested in
> > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > response will contain the TASK_DIAG_CRED group which is described by the
> > task_diag_creds structure.
> >
> > struct task_diag_msg {
> >         __u32   tgid;
> >         __u32   pid;
> >         __u32   ppid;
> >         __u32   tpid;
> >         __u32   sid;
> >         __u32   pgid;
> >         __u8    state;
> >         char    comm[TASK_DIAG_COMM_LEN];
> > };
> >
> > Another good feature of task_diag is an ability to request information
> > for a few processes. Currently here are two stratgies
> > TASK_DIAG_DUMP_ALL      - get information for all tasks
> > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> >                           tasks
> >
> > The task diag is much faster than the proc file system. We don't need to
> > create a new file descriptor for each task. We need to send a request
> > and get a response. It allows to get information for a few task in one
> > request-response iteration.
> >
> > I have compared performance of procfs and task-diag for the
> > "ps ax -o pid,ppid" command.
> >
> > A test stand contains 10348 processes.
> > $ ps ax -o pid,ppid | wc -l
> > 10348
> >
> > $ time ps ax -o pid,ppid > /dev/null
> >
> > real    0m1.073s
> > user    0m0.086s
> > sys     0m0.903s
> >
> > $ time ./task_diag_all > /dev/null
> >
> > real    0m0.037s
> > user    0m0.004s
> > sys     0m0.020s
> >
> > And here are statistics about syscalls which were called by each
> > command.
> > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> >             20,713      syscalls:sys_exit_open
> >             20,710      syscalls:sys_exit_close
> >             20,708      syscalls:sys_exit_read
> >             10,348      syscalls:sys_exit_newstat
> >                 31      syscalls:sys_exit_write
> >
> > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> >                114      syscalls:sys_exit_recvfrom
> >                 49      syscalls:sys_exit_write
> >                  8      syscalls:sys_exit_mmap
> >                  4      syscalls:sys_exit_mprotect
> >                  3      syscalls:sys_exit_newfstat
> >
> > You can find the test program from this experiment in the last patch.
> >
> > The idea of this functionality was suggested by Pavel Emelyanov
> > (xemul@), when he found that operations with /proc forms a significant
> > part of a checkpointing time.
> >
> > Ten years ago here was attempt to add a netlink interface to access to /proc
> > information:
> > http://lwn.net/Articles/99600/
> 
> I don't suppose this could use real syscalls instead of netlink.  If
> nothing else, netlink seems to conflate pid and net namespaces.

What do you mean by "conflate pid and net namespaces"?

> 
> Also, using an asynchronous interface (send, poll?, recv) for
> something that's inherently synchronous (as the kernel a local
> question) seems awkward to me.

Actually all requests are handled synchronously. We call sendmsg to send
a request and it is handled in this syscall.
 2)               |  netlink_sendmsg() {
 2)               |    netlink_unicast() {
 2)               |      taskdiag_doit() {
 2)   2.153 us    |        task_diag_fill();
 2)               |        netlink_unicast() {
 2)   0.185 us    |          netlink_attachskb();
 2)   0.291 us    |          __netlink_sendskb();
 2)   2.452 us    |        }
 2) + 33.625 us   |      }
 2) + 54.611 us   |    }
 2) + 76.370 us   |  }
 2)               |  netlink_recvmsg() {
 2)   1.178 us    |    skb_recv_datagram();
 2) + 46.953 us   |  }

If we request information for a group of tasks (NLM_F_DUMP), a first
portion of data is filled from the sendmsg syscall. And then when we read
it, the kernel fills the next portion.

 3)               |  netlink_sendmsg() {
 3)               |    __netlink_dump_start() {
 3)               |      netlink_dump() {
 3)               |        taskdiag_dumpid() {
 3)   0.685 us    |          task_diag_fill();
...
 3)   0.224 us    |          task_diag_fill();
 3) + 74.028 us   |        }
 3) + 88.757 us   |      }
 3) + 89.296 us   |    }
 3) + 98.705 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)               |      taskdiag_dumpid() {
 3)   0.594 us    |        task_diag_fill();
...
 3)   0.242 us    |        task_diag_fill();
 3) + 60.634 us   |      }
 3) + 72.803 us   |    }
 3) + 88.005 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)   2.403 us    |      taskdiag_dumpid();
 3) + 26.236 us   |    }
 3) + 40.522 us   |  }
 0) + 20.407 us   |  netlink_recvmsg();


netlink is really good for this type of tasks.  It allows to create an
extendable interface which can be easy customized for different needs.

I don't think that we would want to create another similar interface
just to be independent from network subsystem.

Thanks,
Andrew

> 
> --Andy

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
From: Arnd Bergmann @ 2015-02-18 14:46 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Oleg Nesterov, Andrew Morton,
	Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
In-Reply-To: <20150218123659.GA24098-yYYamFZzV1regbzhZkK2zA@public.gmane.org>

On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > is used to get information about sockets.
> > > > > 
> > > > > A request is described by the task_diag_pid structure:
> > > > > 
> > > > > struct task_diag_pid {
> > > > >        __u64   show_flags;      /* specify which information are required */
> > > > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > > > 
> > > > >        __u32   pid;
> > > > > };
> > > > 
> > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > API? Did you consider extending that interface to provide the
> > > > information you need instead of basing on the socket-diag?
> > > 
> > > It isn't based on the socket-diag, it looks like socket-diag.
> > > 
> > > Current task_diag registers a new genl family, but we can use the taskstats
> > > family and add task_diag commands to it.
> > 
> > What I meant was more along the lines of making it look like taskstats
> > by adding new fields to 'struct taskstat' for what you want return.
> > I don't know if that is possible or a good idea for the information
> > you want to get out of the kernel, but it seems like a more natural
> > interface, as it already has some of the same data (comm, gid, pid,
> > ppid, ...).
> 
> Now I see what you mean. task_diag has more flexible and universal
> interface than taskstat. A response of taskstat only contains a
> taskstats structure. A response of taskdiag can contains a few types of
> properties. Each type is described by its own structure.

Right, so the question is whether that flexibility is actually required
here. Independent of which design you personally prefer, what are the
downsides of extending the existing but less flexible interface?

If it's good enough, that would seem to provide a more consistent
API, which in turn helps users understand the interface and use it
correctly.

> Curently here are only two groups of parameters: task_diag_msg and
> task_diag_creds.
> 
> task_diag_msg contains a few basic parameters.
> task_diag_creds contains credentials.
> 
> I'm going to add other groups to describe all kind of task properties
> which currently are presented in procfs (e.g. /proc/pid/maps,
> /proc/pid/fding/*, /proc/pid/status, etc).
> 
> One of features of task_diag is an ability to choose which information
> are required. This allows to minimize a response size and a time, which
> is requred to fill this response.

I realize that you are trying to optimize for performance, but it
would be nice to quantify this if you want to argue for requiring
a split interface.

> struct task_diag_msg {
>         __u32   tgid;
>         __u32   pid;
>         __u32   ppid;
>         __u32   tpid;
>         __u32   sid;
>         __u32   pgid;
>         __u8    state;
>         char    comm[TASK_DIAG_COMM_LEN];
> };

I guess this part would be a very natural extension to the
existing taskstats structure, and we should only add a new
one here if there are extremely good reasons for it.

> struct task_diag_creds {
>         struct task_diag_caps cap_inheritable;
>         struct task_diag_caps cap_permitted;
>         struct task_diag_caps cap_effective;
>         struct task_diag_caps cap_bset;
> 
>         __u32 uid;
>         __u32 euid;
>         __u32 suid;
>         __u32 fsuid;
>         __u32 gid;
>         __u32 egid;
>         __u32 sgid;
>         __u32 fsgid;
> };

while this part could well be kept separate so you can query
it individually from the rest of taskstats, but through a
related interface.

	Arnd

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Jason Baron @ 2015-02-18 15:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, normalperson-rMlxZR9MS24,
	davidel-AhlLAIvw+VEjIGhXcJzhZg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra
In-Reply-To: <20150218080740.GA10199-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 02/18/2015 03:07 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:
>
>> Epoll file descriptors that are added to a shared wakeup 
>> source are always added in a non-exclusive manner. That 
>> means that when we have multiple epoll fds attached to a 
>> shared wakeup source they are all woken up. This can lead 
>> to excessive cpu usage and uneven load distribution.
>>
>> This patch introduces two new 'events' flags that are 
>> intended to be used with EPOLL_CTL_ADD operations. 
>> EPOLLEXCLUSIVE, adds the epoll fd to the event source in 
>> an exclusive manner such that the minimum number of 
>> threads are woken. EPOLLROUNDROBIN, which depends on 
>> EPOLLEXCLUSIVE also being set, can also be added to the 
>> 'events' flag, such that we round robin through the set 
>> of waiting threads.
>>
>> An implementation note is that in the epoll wakeup 
>> routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, 
>> we return 1, for a successful wakeup, only when there are 
>> current waiters. The idea is to use this additional 
>> heuristic in order minimize wakeup latencies.
>>
>> Signed-off-by: Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>
>> ---
>>  fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
>>  include/uapi/linux/eventpoll.h |  6 ++++++
>>  2 files changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index d77f944..382c832 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -92,7 +92,8 @@
>>   */
>>  
>>  /* Epoll private bits inside the event mask */
>> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
>> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
>> +			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>>  
>>  /* Maximum number of nesting allowed inside epoll sets */
>>  #define EP_MAX_NESTS 4
>> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>>  	unsigned long flags;
>>  	struct epitem *epi = ep_item_from_wait(wait);
>>  	struct eventpoll *ep = epi->ep;
>> +	int ewake = 0;
>>  
>>  	if ((unsigned long)key & POLLFREE) {
>>  		ep_pwq_from_wait(wait)->whead = NULL;
>> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>>  	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>>  	 * wait list.
>>  	 */
>> -	if (waitqueue_active(&ep->wq))
>> +	if (waitqueue_active(&ep->wq)) {
>> +		ewake = 1;
>>  		wake_up_locked(&ep->wq);
>> +	}
>>  	if (waitqueue_active(&ep->poll_wait))
>>  		pwake++;
>>  
>> @@ -1078,6 +1082,8 @@ out_unlock:
>>  	if (pwake)
>>  		ep_poll_safewake(&ep->poll_wait);
>>  
>> +	if (epi->event.events & EPOLLROUNDROBIN)
>> +		return ewake;
>>  	return 1;
>>  }
>>  
>> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>>  		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>>  		pwq->whead = whead;
>>  		pwq->base = epi;
>> -		add_wait_queue(whead, &pwq->wait);
>> +		if (epi->event.events & EPOLLROUNDROBIN)
>> +			add_wait_queue_rr(whead, &pwq->wait);
>> +		else if (epi->event.events & EPOLLEXCLUSIVE)
>> +			add_wait_queue_exclusive(whead, &pwq->wait);
>> +		else
>> +			add_wait_queue(whead, &pwq->wait);
>>  		list_add_tail(&pwq->llink, &epi->pwqlist);
>>  		epi->nwait++;
>>  	} else {
>> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>>  		struct epoll_event __user *, event)
>>  {
>> -	int error;
>> -	int full_check = 0;
>> +	int error, full_check = 0, wait_flags = 0;
>>  	struct fd f, tf;
>>  	struct eventpoll *ep;
>>  	struct epitem *epi;
>> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>>  	if (f.file == tf.file || !is_file_epoll(f.file))
>>  		goto error_tgt_fput;
>>  
>> +	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
>> +	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
>> +	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
>> +		goto error_tgt_fput;
>> +
>>  	/*
>>  	 * At this point it is safe to assume that the "private_data" contains
>>  	 * our own data structure.
>> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
>> index bc81fb2..10260a1 100644
>> --- a/include/uapi/linux/eventpoll.h
>> +++ b/include/uapi/linux/eventpoll.h
>> @@ -26,6 +26,12 @@
>>  #define EPOLL_CTL_DEL 2
>>  #define EPOLL_CTL_MOD 3
>>  
>> +/* Balance wakeups for a shared event source */
>> +#define EPOLLROUNDROBIN (1 << 27)
>> +
>> +/* Add exclusively */
>> +#define EPOLLEXCLUSIVE (1 << 28)
>> +
>>  /*
>>   * Request the handling of system wakeup events so as to prevent system suspends
>>   * from happening while those events are being processed.
> So let me rephrase the justification of your two patches:
>
> Unlike regular waitqueue usage (where threads remove 
> themselves from the waitqueue once they receive a wakeup), 
> epoll waitqueues are static and 'persistent': epoll target 
> threads are on the poll waitqueue indefinitely, only 
> register/unregister removes threads from them.
>
> So they are not really 'wait queues', but static 'task 
> lists', and are thus exposed to classic thundering herd 
> scheduling problems and scheduling assymetries: a single 
> event on a shared event source will wake all epoll 
> 'task-lists', and not only will it wake them, but due to 
> the static nature of the lists, even an exclusive wakeup 
> will iterate along the list with O(N) overhead, until it 
> finds a wakeable thread.
>
> As the number of lists and the number of threads in the 
> lists increases this scales suboptimally, and it also looks 
> slightly odd that a random set of epoll worker threads is 
> 'more equal' than the others, in receiving a comparably 
> higher proportion of events.

yes,  in fact we are currently working around these imbalances
by doing register/unregister (EPOLL_CTL_ADD/EPOLL_CTL_DEL),
periodically to re-set the order of the queues. This resolves the
balancing to an extent, but not the spurious wakeups.

>
> The solution is to add this new ABI to allow epoll events 
> to be actively load-balanced both between the persistent 
> 'task lists', and to also allow the individual task lists 
> to act as dynamic runqueues: the head of the list is likely 
> to be sleeping, newly woken tasks get moved to the tail of 
> the list.
>
> This has two main advantages: firstly it solves the O(N) 
> (micro-)problem, but it also more evenly distributes events 
> both between task-lists and within epoll groups as tasks as 
> well.

Its solving 2 issues - spurious wakeups, and more
even loading of threads. The event distribution is more
even between 'epoll groups' with this patch, however,
if multiple threads are blocking on a single 'epoll group',
this patch does not affect the the event distribution there.
Currently, threads are added to 'epoll group' as exclusive
already, so when you have multiple threads blocking on
an epoll group, only one wakes up. In our use case, we
have a 1-to-1 mapping b/w threads and epoll groups, so
we don't have spurious or un-balanced wakeups there.

That suggests the alternative user-space model for
addressing this problem. That is, to have a single epoll
group added with EPOLLONESHOT. In this way threads
can pull work or events off of a single queue, work on
the event and then re-arm (such that other threads don't
see events from that source in the meantime).
This, however, means all threads work on all events, and
they all have to synchronize to an extent on the single queue.
That is all register/unregister and re-arm event to that queue
have to be visible or synchronized for all the waiters. This
model also doesn't allow userspace to partition events that
are naturally local to thread, since there a single epoll group.

The second userspace model is to have worker threads with
their own separate epoll groups, and then have separate
thread(s) to address the shared wakeup sources. Then the
threads that are waiting on the shared wakeup sources can
distribute the events fairly to the worker threads. This involves
extra context switching for shared events, and I think ends up
degenerating back into the original problem.

> The disadvantages: slightly higher management micro-costs, 
> plus a global waitqueue list, which used to be read-mostly, 
> is now actively dirtied by every event, adding more global 
> serialization. The latter is somewhat muted by the fact 
> that the waitqueue lock itself is already a global 
> serialization point today and got dirtied by every event, 
> and the list head is next to it, usually in the same 
> cacheline.

Yes, I'm a bit concerned about the changes to the core
wakeup function, however in the non-rotate case, the
only additional write is to initialize the 'rotate_list' on entry.
I measured the latency of the __wake_up_common() for
the case where this code was added and we were not
doing 'rotate', and I didn't measure any additional latency
with ftrace, but it perhaps warrants more careful testing.

The outstanding issues I have are:

1) Does epoll need 2 new flags here - EPOLLEXCLUSIVE and
EPOLLROUNDROBIN. IE should they just be combined since
EPOLLROUNDROBIN depends on EPOLLEXCLUSIVE, or is there
a valid use for just EPOLLEXCLUSIVE (wake up the first waiter
but don't do the balancing)?

2) The concern Andy raised regarding potential starvation.
That is could a adversarial thread cause us to miss wakeups
if it can add itself exclusively to the shared wakeup source.
Currently, the adversarial thread would need to simply be
able to open the file in question. For things like pipe, this is
not an issue, but perhaps it is for files in the global
namespace...

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH 20/45] sctp.h: include stdint.h in userspace
From: Daniel Borkmann @ 2015-02-18 15:48 UTC (permalink / raw)
  To: Mikko Rapeli, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Vlad Yasevich, Neil Horman, linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1424127948-22484-21-git-send-email-mikko.rapeli-X3B1VOXEql0@public.gmane.org>

On 02/17/2015 12:05 AM, Mikko Rapeli wrote:
> Fixes compilation error:
>
> linux/sctp.h:652:2: error: unknown type name ‘uint32_t’
>
> Signed-off-by: Mikko Rapeli <mikko.rapeli-X3B1VOXEql0@public.gmane.org>

I have no idea where the rest of this series went (neither netdev, nor
lkml, nor linux-sctp) and what user space application making use of the
header you are trying to fix !?

> ---
>   include/uapi/linux/sctp.h | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
> index ce70fe6..9fd31cf 100644
> --- a/include/uapi/linux/sctp.h
> +++ b/include/uapi/linux/sctp.h
> @@ -53,7 +53,11 @@
>   #ifndef _UAPI_SCTP_H
>   #define _UAPI_SCTP_H
>
> +#ifdef __KERNEL__
>   #include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
>   #include <linux/socket.h>

Looks ugly, but could also be resolved in user space.

^ permalink raw reply

* Re: [PATCH 20/45] sctp.h: include stdint.h in userspace
From: Mikko Rapeli @ 2015-02-18 16:05 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Vlad Yasevich, Neil Horman,
	linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <54E4B434.3030302-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>

On Wed, Feb 18, 2015 at 04:48:04PM +0100, Daniel Borkmann wrote:
> On 02/17/2015 12:05 AM, Mikko Rapeli wrote:
> >Fixes compilation error:
> >
> >linux/sctp.h:652:2: error: unknown type name ‘uint32_t’
> >
> >Signed-off-by: Mikko Rapeli <mikko.rapeli-X3B1VOXEql0@public.gmane.org>
> 
> I have no idea where the rest of this series went (neither netdev, nor
> lkml, nor linux-sctp) and what user space application making use of the
> header you are trying to fix !?

There is a test written to test compilation of exported kernel headers
one by one in userspace. lkml thread is here:

https://lkml.org/lkml/2015/2/16/521

The series should have gone to the right addresses with:

$ git send-email --to linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
	--cc-cmd "scripts/get_maintainer.pl --norolestats"

-Mikko

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Ingo Molnar @ 2015-02-18 16:33 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra
In-Reply-To: <54E4B2D0.8020706@akamai.com>


* Jason Baron <jbaron@akamai.com> wrote:

> > This has two main advantages: firstly it solves the 
> > O(N) (micro-)problem, but it also more evenly 
> > distributes events both between task-lists and within 
> > epoll groups as tasks as well.
> 
> Its solving 2 issues - spurious wakeups, and more even 
> loading of threads. The event distribution is more even 
> between 'epoll groups' with this patch, however, if 
> multiple threads are blocking on a single 'epoll group', 
> this patch does not affect the the event distribution 
> there. [...]

Regarding your last point, are you sure about that?

If we have say 16 epoll threads registered, and if the list 
is static (no register/unregister activity), then the 
wakeup pattern is in strict order of the list: threads 
closer to the list head will be woken more frequently, in a 
wake-once fashion. So if threads do just quick work and go 
back to sleep quickly, then typically only the first 2-3 
threads will get any runtime in practice - the wakeup 
iteration never gets 'deep' into the list.

With the round-robin shuffling of the list, the threads get 
shuffled to the tail on wakeup, which distributes events 
evenly: all 16 epoll threads will accumulate an even 
distribution of runtime, statistically.

Have I misunderstood this somehow?

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Jason Baron @ 2015-02-18 17:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra
In-Reply-To: <20150218163300.GA28007@gmail.com>

On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>>> This has two main advantages: firstly it solves the 
>>> O(N) (micro-)problem, but it also more evenly 
>>> distributes events both between task-lists and within 
>>> epoll groups as tasks as well.
>> Its solving 2 issues - spurious wakeups, and more even 
>> loading of threads. The event distribution is more even 
>> between 'epoll groups' with this patch, however, if 
>> multiple threads are blocking on a single 'epoll group', 
>> this patch does not affect the the event distribution 
>> there. [...]
> Regarding your last point, are you sure about that?
>
> If we have say 16 epoll threads registered, and if the list 
> is static (no register/unregister activity), then the 
> wakeup pattern is in strict order of the list: threads 
> closer to the list head will be woken more frequently, in a 
> wake-once fashion. So if threads do just quick work and go 
> back to sleep quickly, then typically only the first 2-3 
> threads will get any runtime in practice - the wakeup 
> iteration never gets 'deep' into the list.
>
> With the round-robin shuffling of the list, the threads get 
> shuffled to the tail on wakeup, which distributes events 
> evenly: all 16 epoll threads will accumulate an even 
> distribution of runtime, statistically.
>
> Have I misunderstood this somehow?
>
>

So in the case of multiple threads per epoll set, we currently
add to the head of wakeup queue exclusively in 'epoll_wait()',
and then subsequently remove from the queue once
'epoll_wait()' returns. So I don't think this patch addresses
balancing on a per epoll set basis.

I think we could address the case you describe by simply doing
__add_wait_queue_tail_exclusive() instead of
__add_wait_queue_exclusive() in epoll_wait(). However, I think
the userspace API change is less clear since epoll_wait() doesn't
currently have an 'input' events argument as epoll_ctl() does.

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Ingo Molnar @ 2015-02-18 17:45 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, normalperson-rMlxZR9MS24,
	davidel-AhlLAIvw+VEjIGhXcJzhZg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra
In-Reply-To: <54E4CE14.5010708-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>


* Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:

> So in the case of multiple threads per epoll set, we 
> currently add to the head of wakeup queue exclusively in 
> 'epoll_wait()', and then subsequently remove from the 
> queue once 'epoll_wait()' returns. So I don't think this 
> patch addresses balancing on a per epoll set basis.

Okay, so I was confused about how the code works.

> I think we could address the case you describe by simply 
> doing __add_wait_queue_tail_exclusive() instead of 
> __add_wait_queue_exclusive() in epoll_wait(). [...]

Yes.

> [...] However, I think the userspace API change is less 
> clear since epoll_wait() doesn't currently have an 
> 'input' events argument as epoll_ctl() does.

... but the change would be a bit clearer and somewhat more 
flexible: LIFO or FIFO queueing, right?

But having the queueing model as part of the epoll context 
is a legitimate approach as well.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Ingo Molnar @ 2015-02-18 17:51 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra
In-Reply-To: <20150218174533.GB31566@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

> > [...] However, I think the userspace API change is less 
> > clear since epoll_wait() doesn't currently have an 
> > 'input' events argument as epoll_ctl() does.
> 
> ... but the change would be a bit clearer and somewhat 
> more flexible: LIFO or FIFO queueing, right?
> 
> But having the queueing model as part of the epoll 
> context is a legitimate approach as well.

Btw., there's another optimization that the networking code 
already does when processing incoming packets: waking up a 
thread on the local CPU, where the wakeup is running.

Doing the same on epoll would have real scalability 
advantages where incoming events are IRQ driven and are 
distributed amongst multiple CPUs.

Where events are task driven the scheduler will already try 
to pair up waker and wakee so it might not show up in 
measurements that markedly.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
From: Ingo Molnar @ 2015-02-18 18:49 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Jonathan Corbet, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	Andy Lutomirski, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers
In-Reply-To: <20150216010224.GA32421@ad.nay.redhat.com>


* Fam Zheng <famz@redhat.com> wrote:

> On Sun, 02/15 15:00, Jonathan Corbet wrote:
> > On Fri, 13 Feb 2015 17:03:56 +0800
> > Fam Zheng <famz@redhat.com> wrote:
> > 
> > > SYNOPSIS
> > > 
> > >        #include <sys/epoll.h>
> > > 
> > >        int epoll_pwait1(int epfd, int flags,
> > >                         struct epoll_event *events,
> > >                         int maxevents,
> > >                         struct epoll_wait_params *params);
> > 
> > Quick, possibly dumb question: might it make sense to also pass in 
> > sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
> > another parameter in the future, the kernel can tell which version is in
> > use and they won't have to do an epoll_pwait2()?
> > 
> 
> Flags can be used for that, if the change is not 
> radically different.

Passing in size is generally better than flags, because 
that way an extension of the ABI (new field[s]) 
automatically signals towards the kernel what to do with 
old binaries - while extending the functionality of new 
binaries, without sacrificing functionality.

With flags you are either limited to the same structure 
size - or have to decode a 'size' value from the flags 
value - which is fragile (and in which case a real 'size' 
parameter is better).

in the perf ABI we use something like that: there's a 
perf_attr.size parameter that iterates the ABI forward, 
while still being binary compatible with older software.

If old binaries pass in a smaller structure to a newer 
kernel then the kernel pads the new fields with zero by 
default - that way the kernel internals are never burdened 
with compatibility details and data format versions.

If new user-space passes in a large structure than the 
kernel can handle then the kernel returns an error - this 
way user-space can transparently support conditional 
features and fallback logic.

It works really well, we've done literally a hundred perf 
ABI extensions this way in the last 4+ years, in a pretty 
natural fashion, without littering the kernel (or 
user-space) with version legacies and without breaking 
existing perf tooling.

Other syscall ABIs already get painful when trying to 
handle 2-3 data structure versions, so people either give 
up, or add flags kludges or go to new syscall entries: 
which is painful in its own fashion and adds unnecessary 
latency to feature introduction as well.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Eric Wong @ 2015-02-18 22:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Baron, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davidel-AhlLAIvw+VEjIGhXcJzhZg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra
In-Reply-To: <20150218175123.GA31878-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> 
> * Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> 
> > > [...] However, I think the userspace API change is less 
> > > clear since epoll_wait() doesn't currently have an 
> > > 'input' events argument as epoll_ctl() does.
> > 
> > ... but the change would be a bit clearer and somewhat 
> > more flexible: LIFO or FIFO queueing, right?
> > 
> > But having the queueing model as part of the epoll 
> > context is a legitimate approach as well.
> 
> Btw., there's another optimization that the networking code 
> already does when processing incoming packets: waking up a 
> thread on the local CPU, where the wakeup is running.
> 
> Doing the same on epoll would have real scalability 
> advantages where incoming events are IRQ driven and are 
> distributed amongst multiple CPUs.

Right.  One thing in the back of my mind has been to have CPU
affinity for epoll.  Either having everything in an epoll set
favor a certain CPU or even having affinity down to the epitem
level (so concurrent epoll_wait callers end up favoring the
same epitems).

I'm not convinced this series is worth doing without a
comparison against my previous suggestion to use a dedicated
thread which only makes blocking accept4 + EPOLL_CTL_ADD calls.

The majority of epoll events in a typical server should not be
for listen sockets, so I'd rather not bloat existing code paths
for them.  For web servers nowadays, the benefits of maintaining
long-lived connections to avoid handshakes is even more
beneficial with increasing HTTPS and HTTP2 adoption; so
listen socket events should become less common.

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Andy Lutomirski @ 2015-02-18 23:12 UTC (permalink / raw)
  To: Jason Baron
  Cc: Davide Libenzi, Michael Kerrisk, Linux FS Devel, Eric Wong,
	Andrew Morton, Peter Zijlstra, linux-kernel@vger.kernel.org,
	Al Viro, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Linux API, Ingo Molnar
In-Reply-To: <54E4CE14.5010708@akamai.com>

On Feb 18, 2015 9:38 AM, "Jason Baron" <jbaron@akamai.com> wrote:
>
> On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> > * Jason Baron <jbaron@akamai.com> wrote:
> >
> >>> This has two main advantages: firstly it solves the
> >>> O(N) (micro-)problem, but it also more evenly
> >>> distributes events both between task-lists and within
> >>> epoll groups as tasks as well.
> >> Its solving 2 issues - spurious wakeups, and more even
> >> loading of threads. The event distribution is more even
> >> between 'epoll groups' with this patch, however, if
> >> multiple threads are blocking on a single 'epoll group',
> >> this patch does not affect the the event distribution
> >> there. [...]
> > Regarding your last point, are you sure about that?
> >
> > If we have say 16 epoll threads registered, and if the list
> > is static (no register/unregister activity), then the
> > wakeup pattern is in strict order of the list: threads
> > closer to the list head will be woken more frequently, in a
> > wake-once fashion. So if threads do just quick work and go
> > back to sleep quickly, then typically only the first 2-3
> > threads will get any runtime in practice - the wakeup
> > iteration never gets 'deep' into the list.
> >
> > With the round-robin shuffling of the list, the threads get
> > shuffled to the tail on wakeup, which distributes events
> > evenly: all 16 epoll threads will accumulate an even
> > distribution of runtime, statistically.
> >
> > Have I misunderstood this somehow?
> >
> >
>
> So in the case of multiple threads per epoll set, we currently
> add to the head of wakeup queue exclusively in 'epoll_wait()',
> and then subsequently remove from the queue once
> 'epoll_wait()' returns. So I don't think this patch addresses
> balancing on a per epoll set basis.
>
> I think we could address the case you describe by simply doing
> __add_wait_queue_tail_exclusive() instead of
> __add_wait_queue_exclusive() in epoll_wait(). However, I think
> the userspace API change is less clear since epoll_wait() doesn't
> currently have an 'input' events argument as epoll_ctl() does.

FWIW there's currently discussion about adding a new epoll API for
batch epoll_ctl.  It could be with coordinating with that effort if
some variant could address both use cases.

I'm still nervous about changing the per-fd wakeup stuff to do
anything other than waking everything.  After all, epoll and poll can
be used concurrently.

What about a slightly different approach: could an epoll fd support
multiple contexts?  For example, an fd could be set (with epoll_ctl or
the new batch stuff) to wake an any epoll waiter, one specific epoll
waiter, an epoll waiter preferably on the waking cpu, etc.

This would have the benefit of keeping the wakeup changes localized to
the epoll code.

--Andy

>
> Thanks,
>
> -Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
From: Andy Lutomirski @ 2015-02-19  1:18 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andrew Morton, Linux API, Andrey Vagin
In-Reply-To: <20150218142718.GA30542-yYYamFZzV1regbzhZkK2zA@public.gmane.org>

On Feb 18, 2015 6:27 AM, "Andrew Vagin" <avagin-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
>
> On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> > >
> > > Here is a preview version. It provides restricted set of functionality.
> > > I would like to collect feedback about this idea.
> > >
> > > Currently we use the proc file system, where all information are
> > > presented in text files, what is convenient for humans.  But if we need
> > > to get information about processes from code (e.g. in C), the procfs
> > > doesn't look so cool.
> > >
> > > From code we would prefer to get information in binary format and to be
> > > able to specify which information and for which tasks are required. Here
> > > is a new interface with all these features, which is called task_diag.
> > > In addition it's much faster than procfs.
> > >
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > >
> > > A request is described by the task_diag_pid structure:
> > >
> > > struct task_diag_pid {
> > >        __u64   show_flags;      /* specify which information are required */
> > >        __u64   dump_stratagy;   /* specify a group of processes */
> > >
> > >        __u32   pid;
> > > };
> > >
> > > A respone is a set of netlink messages. Each message describes one task.
> > > All task properties are divided on groups. A message contains the
> > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > response will contain the TASK_DIAG_CRED group which is described by the
> > > task_diag_creds structure.
> > >
> > > struct task_diag_msg {
> > >         __u32   tgid;
> > >         __u32   pid;
> > >         __u32   ppid;
> > >         __u32   tpid;
> > >         __u32   sid;
> > >         __u32   pgid;
> > >         __u8    state;
> > >         char    comm[TASK_DIAG_COMM_LEN];
> > > };
> > >
> > > Another good feature of task_diag is an ability to request information
> > > for a few processes. Currently here are two stratgies
> > > TASK_DIAG_DUMP_ALL      - get information for all tasks
> > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > >                           tasks
> > >
> > > The task diag is much faster than the proc file system. We don't need to
> > > create a new file descriptor for each task. We need to send a request
> > > and get a response. It allows to get information for a few task in one
> > > request-response iteration.
> > >
> > > I have compared performance of procfs and task-diag for the
> > > "ps ax -o pid,ppid" command.
> > >
> > > A test stand contains 10348 processes.
> > > $ ps ax -o pid,ppid | wc -l
> > > 10348
> > >
> > > $ time ps ax -o pid,ppid > /dev/null
> > >
> > > real    0m1.073s
> > > user    0m0.086s
> > > sys     0m0.903s
> > >
> > > $ time ./task_diag_all > /dev/null
> > >
> > > real    0m0.037s
> > > user    0m0.004s
> > > sys     0m0.020s
> > >
> > > And here are statistics about syscalls which were called by each
> > > command.
> > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> > >             20,713      syscalls:sys_exit_open
> > >             20,710      syscalls:sys_exit_close
> > >             20,708      syscalls:sys_exit_read
> > >             10,348      syscalls:sys_exit_newstat
> > >                 31      syscalls:sys_exit_write
> > >
> > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> > >                114      syscalls:sys_exit_recvfrom
> > >                 49      syscalls:sys_exit_write
> > >                  8      syscalls:sys_exit_mmap
> > >                  4      syscalls:sys_exit_mprotect
> > >                  3      syscalls:sys_exit_newfstat
> > >
> > > You can find the test program from this experiment in the last patch.
> > >
> > > The idea of this functionality was suggested by Pavel Emelyanov
> > > (xemul@), when he found that operations with /proc forms a significant
> > > part of a checkpointing time.
> > >
> > > Ten years ago here was attempt to add a netlink interface to access to /proc
> > > information:
> > > http://lwn.net/Articles/99600/
> >
> > I don't suppose this could use real syscalls instead of netlink.  If
> > nothing else, netlink seems to conflate pid and net namespaces.
>
> What do you mean by "conflate pid and net namespaces"?

A netlink socket is bound to a network namespace, but you should be
returning data specific to a pid namespace.

On a related note, how does this interact with hidepid?  More
generally, what privileges are you requiring to obtain what data?

>
> >
> > Also, using an asynchronous interface (send, poll?, recv) for
> > something that's inherently synchronous (as the kernel a local
> > question) seems awkward to me.
>
> Actually all requests are handled synchronously. We call sendmsg to send
> a request and it is handled in this syscall.
>  2)               |  netlink_sendmsg() {
>  2)               |    netlink_unicast() {
>  2)               |      taskdiag_doit() {
>  2)   2.153 us    |        task_diag_fill();
>  2)               |        netlink_unicast() {
>  2)   0.185 us    |          netlink_attachskb();
>  2)   0.291 us    |          __netlink_sendskb();
>  2)   2.452 us    |        }
>  2) + 33.625 us   |      }
>  2) + 54.611 us   |    }
>  2) + 76.370 us   |  }
>  2)               |  netlink_recvmsg() {
>  2)   1.178 us    |    skb_recv_datagram();
>  2) + 46.953 us   |  }
>
> If we request information for a group of tasks (NLM_F_DUMP), a first
> portion of data is filled from the sendmsg syscall. And then when we read
> it, the kernel fills the next portion.
>
>  3)               |  netlink_sendmsg() {
>  3)               |    __netlink_dump_start() {
>  3)               |      netlink_dump() {
>  3)               |        taskdiag_dumpid() {
>  3)   0.685 us    |          task_diag_fill();
> ...
>  3)   0.224 us    |          task_diag_fill();
>  3) + 74.028 us   |        }
>  3) + 88.757 us   |      }
>  3) + 89.296 us   |    }
>  3) + 98.705 us   |  }
>  3)               |  netlink_recvmsg() {
>  3)               |    netlink_dump() {
>  3)               |      taskdiag_dumpid() {
>  3)   0.594 us    |        task_diag_fill();
> ...
>  3)   0.242 us    |        task_diag_fill();
>  3) + 60.634 us   |      }
>  3) + 72.803 us   |    }
>  3) + 88.005 us   |  }
>  3)               |  netlink_recvmsg() {
>  3)               |    netlink_dump() {
>  3)   2.403 us    |      taskdiag_dumpid();
>  3) + 26.236 us   |    }
>  3) + 40.522 us   |  }
>  0) + 20.407 us   |  netlink_recvmsg();
>
>
> netlink is really good for this type of tasks.  It allows to create an
> extendable interface which can be easy customized for different needs.
>
> I don't think that we would want to create another similar interface
> just to be independent from network subsystem.

I guess this is a bit streamy in that you ask one question and get
multiple answers.

>
> Thanks,
> Andrew
>
> >
> > --Andy

^ permalink raw reply

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Jason Baron @ 2015-02-19  3:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra,
	luto@amacapital.net >> Andy Lutomirski
In-Reply-To: <20150218175123.GA31878@gmail.com>

On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>>> [...] However, I think the userspace API change is less 
>>> clear since epoll_wait() doesn't currently have an 
>>> 'input' events argument as epoll_ctl() does.
>> ... but the change would be a bit clearer and somewhat 
>> more flexible: LIFO or FIFO queueing, right?
>>
>> But having the queueing model as part of the epoll 
>> context is a legitimate approach as well.
> Btw., there's another optimization that the networking code 
> already does when processing incoming packets: waking up a 
> thread on the local CPU, where the wakeup is running.
>
> Doing the same on epoll would have real scalability 
> advantages where incoming events are IRQ driven and are 
> distributed amongst multiple CPUs.
>
> Where events are task driven the scheduler will already try 
> to pair up waker and wakee so it might not show up in 
> measurements that markedly.
>

Right, so this makes me think that we may want to potentially
support a variety of wakeup policies. Adding these to the
generic wake up code is just going to be too messy. So, perhaps
a better approach here would be to register a single
wait_queue_t with the event source queue that will always
be woken up, and then layer any epoll balancing/irq affinity
policies on top of that. So in essence we end up with sort of
two queues layers, but I think it provides much nicer isolation
between layers. Also, the bulk of the changes are going to be
isolated to the epoll code, and we avoid Andy's concern about
missing, or starving out wakeups.

So here's a stab at how this API could look:

1. ep1 = epoll_create1(EPOLL_POLICY);

So EPOLL_POLICY here could the round robin policy described
here, or the irq affinity or other ideas. The idea is to create
an fd that is local to the process, such that other processes
can not subsequently attach to it and affect our policy.

2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);

This associates ep1 with the event source. ep1 can be
associated with or added to at most 1 wakeup source. This call
would largely just form the association, but not queue anything
to the fd_source wait queue.

3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
    epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
    epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
     .
     .
     .

Finally, we add the epoll sets to the event source (indirectly via
ep1). So the first add would actually queue the callback to the
fd_source. While the subsequent calls would simply queue things
to the 'nested' wakeup queue associated with ep1.

So any existing epoll/poll/select calls could be queued as well
to fd_source and will operate independenly from this mechanism,
as the fd_source queue continues to be 'wake all'. Also, there
should be no changes necessary to __wake_up_common(), other
than potentially passing more back though the
wait_queue_func_t, such as 'nr_exclusive'.

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
From: Andrew Vagin @ 2015-02-19 14:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
In-Reply-To: <2264653.BOMoOiAVWc@wuerfel>

On Wed, Feb 18, 2015 at 03:46:31PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> > On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > > is used to get information about sockets.
> > > > > > 
> > > > > > A request is described by the task_diag_pid structure:
> > > > > > 
> > > > > > struct task_diag_pid {
> > > > > >        __u64   show_flags;      /* specify which information are required */
> > > > > >        __u64   dump_stratagy;   /* specify a group of processes */
> > > > > > 
> > > > > >        __u32   pid;
> > > > > > };
> > > > > 
> > > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > > API? Did you consider extending that interface to provide the
> > > > > information you need instead of basing on the socket-diag?
> > > > 
> > > > It isn't based on the socket-diag, it looks like socket-diag.
> > > > 
> > > > Current task_diag registers a new genl family, but we can use the taskstats
> > > > family and add task_diag commands to it.
> > > 
> > > What I meant was more along the lines of making it look like taskstats
> > > by adding new fields to 'struct taskstat' for what you want return.
> > > I don't know if that is possible or a good idea for the information
> > > you want to get out of the kernel, but it seems like a more natural
> > > interface, as it already has some of the same data (comm, gid, pid,
> > > ppid, ...).
> > 
> > Now I see what you mean. task_diag has more flexible and universal
> > interface than taskstat. A response of taskstat only contains a
> > taskstats structure. A response of taskdiag can contains a few types of
> > properties. Each type is described by its own structure.
> 
> Right, so the question is whether that flexibility is actually required
> here. Independent of which design you personally prefer, what are the
> downsides of extending the existing but less flexible interface?

I have looked at taskstat once again.

The format of response messages for taskstat and taskdiag are the same.
It's a netlink message with a set of nested attributes. New attributes
can be added without breaking backward compatibility.

The request can be expanded to be able to specified which information is
required and for which tasks.

These two features allow to significantly improve performance, because
in this case we don't need to do a system call for each task.

I have done a few experiments to prove these words.

task_proc_all reads /proc/pid/stat for each tast
$ time ./task_proc_all > /dev/null

real	0m1.528s
user	0m0.016s
sys	0m1.341s

task_diag uses task_diag and requests information for each task
separately.
$ time ./task_diag > /dev/null

real	0m1.166s
user	0m0.024s
sys	0m1.127s

task_diag_all uses task_diag and requests information for all tasks in
one request.
$ time ./task_diag_all > /dev/null

real	0m0.077s
user	0m0.018s
sys	0m0.053s

So you can see that the ability to request information for a group of
tasks allows to be more effective.

The summary of this message is that we can use the interface of
taskstats with some extensions.

Arnd, thank you for your opinion and suggestions.

> 
> If it's good enough, that would seem to provide a more consistent
> API, which in turn helps users understand the interface and use it
> correctly.
> 
> > Curently here are only two groups of parameters: task_diag_msg and
> > task_diag_creds.
> > 
> > task_diag_msg contains a few basic parameters.
> > task_diag_creds contains credentials.
> > 
> > I'm going to add other groups to describe all kind of task properties
> > which currently are presented in procfs (e.g. /proc/pid/maps,
> > /proc/pid/fding/*, /proc/pid/status, etc).
> > 
> > One of features of task_diag is an ability to choose which information
> > are required. This allows to minimize a response size and a time, which
> > is requred to fill this response.
> 
> I realize that you are trying to optimize for performance, but it
> would be nice to quantify this if you want to argue for requiring
> a split interface.
> 
> > struct task_diag_msg {
> >         __u32   tgid;
> >         __u32   pid;
> >         __u32   ppid;
> >         __u32   tpid;
> >         __u32   sid;
> >         __u32   pgid;
> >         __u8    state;
> >         char    comm[TASK_DIAG_COMM_LEN];
> > };
> 
> I guess this part would be a very natural extension to the
> existing taskstats structure, and we should only add a new
> one here if there are extremely good reasons for it.

The task_diag_msg structure contains properties which are used more
frequently than statistics from the taststats structure.

The size of the task_diag_msg structure is 44 bytes, the size of the
taststats structure 328.  If we have more data, we need to do more
system calls. So I have done one more experiment to look how it affects
perfomance:

If we use the task_diag_msg structure:
$ time ./task_diag_all > /dev/null

real	0m0.077s
user	0m0.018s
sys	0m0.053s

If we use the taststats structure:
$ time ./task_diag_all > /dev/null

real	0m0.117s
user	0m0.029s
sys	0m0.085s

Thanks,
Andrew

^ permalink raw reply

* Re: [PATCH 02/14] ARM: ARMv7M: Enlarge vector table to 256 entries
From: Maxime Coquelin @ 2015-02-19 16:13 UTC (permalink / raw)
  To: Rob Herring
  Cc: Geert Uytterhoeven, Jonathan Corbet, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, Philipp Zabel,
	Russell King, Daniel Lezcano, Thomas Gleixner, Linus Walleij,
	Greg Kroah-Hartman, Jiri Slaby, Arnd Bergmann, Andrew Morton,
	David S. Miller, Mauro Carvalho Chehab, Joe Perches,
	Antti Palosaari, Tejun Heo, Will Deacon, Nikolay Borisov
In-Reply-To: <CAL_Jsq+rHFPXmWqByvdG435tpbO_+ODByiMcdxboULYpUfDUmg@mail.gmail.com>

Hi Rob,

2015-02-15 23:42 GMT+01:00 Rob Herring <robherring2@gmail.com>:
> On Fri, Feb 13, 2015 at 2:42 AM, Maxime Coquelin
> <mcoquelin.stm32@gmail.com> wrote:
>> Hi Geert,
>>
>> 2015-02-12 21:34 GMT+01:00 Geert Uytterhoeven <geert@linux-m68k.org>:
>>> On Thu, Feb 12, 2015 at 6:45 PM, Maxime Coquelin
>>> <mcoquelin.stm32@gmail.com> wrote:
>>>> From Cortex-M4 and M7 reference manuals, the nvic supports up to 240
>>>> interrupts. So the number of entries in vectors table is 256.
>>>>
>>>> This patch adds the missing entries, and change the alignement, so that
>>>> vector_table remains naturally aligned.
>>>
>>> Shouldn't this depend on ARCH_STM32, or some other M4 or M7 specific
>>> Kconfig option, to avoid wasting the space on other CPUs?
>>
>> Actually, the STM32F429 has 90 interrupts, so it would need 106
>> entries in the vector table.
>> The maximum of supported interrupts is not only for Cortex-M4 and M7,
>> this is also true for Cortex-M3.
>>
>> I see two possibilities:
>>  1 - We declare the vector table for the maximum supported number of
>> IRQs, as this patch does.
>>         - Pro: it will be functionnal with all Cortex-M MCUs
>>         - Con: Waste of less than 1KB for memory
>
> The waste depends on the alignment size as well and could be up to
> almost 2KB worst case. It varies depending on the padding. We should
> try to place it so it always aligned and the wasted space is
> minimized.

Sorry, I just notice I didn't replied to all. That was my question:

Do you mean by forcing its location in the arch/arm/kernel/vmlinux.lds.S file?

Regards,
Maxime

>
> Rob
>
>>  2 - We introduce a config flag that provides the number of interrupts
>>         - Pro: No more memory waste
>>         - Con: Need to declare a per MCU model config flag.
>>
>> Then, regarding the natural alignment, is there a way to ensure it
>> depending on the value of a config flag?
>> Or we should keep it at the maximum value possible?
>>
>> Any feedback will be appreciated, especially from Uwe who maintains
>> the efm32 machine.
>>
>> Kind regards,
>> Maxime

^ permalink raw reply

* Re: [PATCH 02/14] ARM: ARMv7M: Enlarge vector table to 256 entries
From: Rob Herring @ 2015-02-19 16:35 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Geert Uytterhoeven, Jonathan Corbet, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, Philipp Zabel,
	Russell King, Daniel Lezcano, Thomas Gleixner, Linus Walleij,
	Greg Kroah-Hartman, Jiri Slaby, Arnd Bergmann, Andrew Morton,
	David S. Miller, Mauro Carvalho Chehab, Joe Perches,
	Antti Palosaari, Tejun Heo, Will Deacon, Nikolay Borisov
In-Reply-To: <CALszF6DyVqaE6DLKSQHeuYWgSChzhEtCbmOsbjoTkvipquBivw@mail.gmail.com>

On Thu, Feb 19, 2015 at 10:13 AM, Maxime Coquelin
<mcoquelin.stm32@gmail.com> wrote:
> Hi Rob,
>
> 2015-02-15 23:42 GMT+01:00 Rob Herring <robherring2@gmail.com>:
>> On Fri, Feb 13, 2015 at 2:42 AM, Maxime Coquelin
>> <mcoquelin.stm32@gmail.com> wrote:
>>> Hi Geert,
>>>
>>> 2015-02-12 21:34 GMT+01:00 Geert Uytterhoeven <geert@linux-m68k.org>:
>>>> On Thu, Feb 12, 2015 at 6:45 PM, Maxime Coquelin
>>>> <mcoquelin.stm32@gmail.com> wrote:
>>>>> From Cortex-M4 and M7 reference manuals, the nvic supports up to 240
>>>>> interrupts. So the number of entries in vectors table is 256.
>>>>>
>>>>> This patch adds the missing entries, and change the alignement, so that
>>>>> vector_table remains naturally aligned.
>>>>
>>>> Shouldn't this depend on ARCH_STM32, or some other M4 or M7 specific
>>>> Kconfig option, to avoid wasting the space on other CPUs?
>>>
>>> Actually, the STM32F429 has 90 interrupts, so it would need 106
>>> entries in the vector table.
>>> The maximum of supported interrupts is not only for Cortex-M4 and M7,
>>> this is also true for Cortex-M3.
>>>
>>> I see two possibilities:
>>>  1 - We declare the vector table for the maximum supported number of
>>> IRQs, as this patch does.
>>>         - Pro: it will be functionnal with all Cortex-M MCUs
>>>         - Con: Waste of less than 1KB for memory
>>
>> The waste depends on the alignment size as well and could be up to
>> almost 2KB worst case. It varies depending on the padding. We should
>> try to place it so it always aligned and the wasted space is
>> minimized.
>
> Sorry, I just notice I didn't replied to all. That was my question:
>
> Do you mean by forcing its location in the arch/arm/kernel/vmlinux.lds.S file?

Yes, that is one way. Or we might be able to just be smarter about how
we arrange the code. The first thing to do is figure out how much
space we waste and what comes before it.

Rob

>
> Regards,
> Maxime
>
>>
>> Rob
>>
>>>  2 - We introduce a config flag that provides the number of interrupts
>>>         - Pro: No more memory waste
>>>         - Con: Need to declare a per MCU model config flag.
>>>
>>> Then, regarding the natural alignment, is there a way to ensure it
>>> depending on the value of a config flag?
>>> Or we should keep it at the maximum value possible?
>>>
>>> Any feedback will be appreciated, especially from Uwe who maintains
>>> the efm32 machine.
>>>
>>> Kind regards,
>>> Maxime

^ permalink raw reply

* [RFC PATCH 0/3] Add simple EEPROM Framework via regmap.
From: Srinivas Kandagatla @ 2015-02-19 17:07 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Maxime Ripard, Rob Herring, Pawel Moll, Kumar Gala, linux-api,
	linux-kernel, devicetree, Stephen Boyd, Arnd Bergmann, broonie,
	Greg Kroah-Hartman, Srinivas Kandagatla

This patchset adds a new simple EEPROM framework to kernel.

Up until now, EEPROM drivers were stored in drivers/misc, where they all had to
duplicate pretty much the same code to register a sysfs file, allow in-kernel
users to access the content of the devices they were driving, etc.
    
This was also a problem as far as other in-kernel users were involved, since
the solutions used were pretty much different from on driver to another, there
was a rather big abstraction leak.
    
This introduction of this framework aims at solving this. It also introduces DT
representation for consumer devices to go get the data they require (MAC
Addresses, SoC/Revision ID, part numbers, and so on) from the EEPROMs.
    
Having regmap interface to this framework would give much better
abstraction for eeproms on different buses.

patch 1 Introduces the EEPROM framework.
Patch 2 migrates an existing driver to eeprom framework.
Patch 3 Adds Qualcomm specific qfprom driver.

Its also possible to migrate other eeprom drivers to this framework.
Patch 3 can also be made a generic mmio-eeprom driver.

Providers APIs:
	eeprom_register/unregister();

Consumers APIs:
	eeprom_cell_get()/eeprom_cell_get_byname();
	eeprom_cell_read()/eeprom_cell_write();

Device Tree:
	qfprom: qfprom@00700000 {
		#eeprom-cells = <2>;
		compatible 	= "qcom,qfprom";
		reg		= <0x00700000 0x1000>;
	};

	tsens: tsens {
		...
		eeproms = <&qfprom 0x404 0x10>, <&qfprom 0x414 0x10>;
		eeprom-names = "calib", "backup_calib";
		...
	};

userspace interface:

	hexdump /sys/class/eeprom/eeprom0/eeprom
                                                                                                                                                                                                                  
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00000a0 db10 2240 0000 e000 0c00 0c00 0000 0c00
0000000 0000 0000 0000 0000 0000 0000 0000 0000
...
*
0001000


Thanks,
srini

Maxime Ripard (2):
  eeprom: Add a simple EEPROM framework
  eeprom: sunxi: Move the SID driver to the eeprom framework

Srinivas Kandagatla (1):
  eeprom: qfprom: Add Qualcomm QFPROM support.

 Documentation/ABI/testing/sysfs-driver-sunxi-sid   |  22 --
 .../devicetree/bindings/eeprom/eeprom.txt          |  48 ++++
 drivers/Kconfig                                    |   2 +
 drivers/Makefile                                   |   1 +
 drivers/eeprom/Kconfig                             |  35 +++
 drivers/eeprom/Makefile                            |  11 +
 drivers/eeprom/core.c                              | 290 +++++++++++++++++++++
 drivers/eeprom/eeprom-sunxi-sid.c                  | 125 +++++++++
 drivers/eeprom/qfprom.c                            |  75 ++++++
 drivers/misc/eeprom/Kconfig                        |  13 -
 drivers/misc/eeprom/Makefile                       |   1 -
 drivers/misc/eeprom/sunxi_sid.c                    | 156 -----------
 include/linux/eeprom-consumer.h                    |  73 ++++++
 include/linux/eeprom-provider.h                    |  51 ++++
 14 files changed, 711 insertions(+), 192 deletions(-)
 delete mode 100644 Documentation/ABI/testing/sysfs-driver-sunxi-sid
 create mode 100644 Documentation/devicetree/bindings/eeprom/eeprom.txt
 create mode 100644 drivers/eeprom/Kconfig
 create mode 100644 drivers/eeprom/Makefile
 create mode 100644 drivers/eeprom/core.c
 create mode 100644 drivers/eeprom/eeprom-sunxi-sid.c
 create mode 100644 drivers/eeprom/qfprom.c
 delete mode 100644 drivers/misc/eeprom/sunxi_sid.c
 create mode 100644 include/linux/eeprom-consumer.h
 create mode 100644 include/linux/eeprom-provider.h

-- 
1.9.1

^ permalink raw reply

* [RFC PATCH 1/3] eeprom: Add a simple EEPROM framework
From: Srinivas Kandagatla @ 2015-02-19 17:08 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Maxime Ripard, Rob Herring, Pawel Moll, Kumar Gala, linux-api,
	linux-kernel, devicetree, Stephen Boyd, Arnd Bergmann, broonie,
	Greg Kroah-Hartman, Srinivas Kandagatla
In-Reply-To: <1424365639-26634-1-git-send-email-srinivas.kandagatla@linaro.org>

From: Maxime Ripard <maxime.ripard@free-electrons.com>

Up until now, EEPROM drivers were stored in drivers/misc, where they all had to
duplicate pretty much the same code to register a sysfs file, allow in-kernel
users to access the content of the devices they were driving, etc.

This was also a problem as far as other in-kernel users were involved, since
the solutions used were pretty much different from on driver to another, there
was a rather big abstraction leak.

This introduction of this framework aims at solving this. It also introduces DT
representation for consumer devices to go get the data they require (MAC
Addresses, SoC/Revision ID, part numbers, and so on) from the EEPROMs.

Having regmap interface to this framework would give much better
abstraction for eeproms on different buses.

Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
[srinivas.kandagatla: Moved to regmap based and cleanedup apis]
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
---
 .../devicetree/bindings/eeprom/eeprom.txt          |  48 ++++
 drivers/Kconfig                                    |   2 +
 drivers/Makefile                                   |   1 +
 drivers/eeprom/Kconfig                             |  19 ++
 drivers/eeprom/Makefile                            |   9 +
 drivers/eeprom/core.c                              | 290 +++++++++++++++++++++
 include/linux/eeprom-consumer.h                    |  73 ++++++
 include/linux/eeprom-provider.h                    |  51 ++++
 8 files changed, 493 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/eeprom/eeprom.txt
 create mode 100644 drivers/eeprom/Kconfig
 create mode 100644 drivers/eeprom/Makefile
 create mode 100644 drivers/eeprom/core.c
 create mode 100644 include/linux/eeprom-consumer.h
 create mode 100644 include/linux/eeprom-provider.h

diff --git a/Documentation/devicetree/bindings/eeprom/eeprom.txt b/Documentation/devicetree/bindings/eeprom/eeprom.txt
new file mode 100644
index 0000000..9ec1ec2
--- /dev/null
+++ b/Documentation/devicetree/bindings/eeprom/eeprom.txt
@@ -0,0 +1,48 @@
+= EEPROM Data Device Tree Bindings =
+
+This binding is intended to represent the location of hardware
+configuration data stored in EEPROMs.
+
+On a significant proportion of boards, the manufacturer has stored
+some data on an EEPROM-like device, for the OS to be able to retrieve
+these information and act upon it. Obviously, the OS has to know
+about where to retrieve these data from, and where they are stored on
+the storage device.
+
+This document is here to document this.
+
+= Data providers =
+
+Required properties:
+#eeprom-cells:	Number of cells in an eeprom specifier; The common
+		case is 2.
+
+For example:
+
+	at24: eeprom@42 {
+		#eeprom-cells = <2>;
+	};
+
+= Data consumers =
+
+Required properties:
+
+eeproms: List of phandle and data cell specifier triplet, one triplet
+	 for each data cell the device might be interested in. The
+	 triplet consists of the phandle to the eeprom provider, then
+	 the offset in byte within that storage device, and the length
+	 in byte of the data we care about.
+
+Optional properties:
+
+eeprom-names: List of data cell name strings sorted in the same order
+ 	      as the resets property. Consumers drivers will use
+ 	      eeprom-names to differentiate between multiple cells,
+ 	      and hence being able to know what these cells are for.
+
+For example:
+
+	device {
+		eeproms = <&at24 14 42>;
+		eeprom-names = "soc-rev-id";
+	};
diff --git a/drivers/Kconfig b/drivers/Kconfig
index c70d6e4..d7afc82 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -184,4 +184,6 @@ source "drivers/thunderbolt/Kconfig"
 
 source "drivers/android/Kconfig"
 
+source "drivers/eeprom/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 527a6da..57eb5b0 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -165,3 +165,4 @@ obj-$(CONFIG_RAS)		+= ras/
 obj-$(CONFIG_THUNDERBOLT)	+= thunderbolt/
 obj-$(CONFIG_CORESIGHT)		+= coresight/
 obj-$(CONFIG_ANDROID)		+= android/
+obj-$(CONFIG_EEPROM)		+= eeprom/
diff --git a/drivers/eeprom/Kconfig b/drivers/eeprom/Kconfig
new file mode 100644
index 0000000..2c5452a
--- /dev/null
+++ b/drivers/eeprom/Kconfig
@@ -0,0 +1,19 @@
+menuconfig EEPROM
+	bool "EEPROM Support"
+	depends on OF
+	help
+	  Support for EEPROM alike devices.
+
+	  This framework is designed to provide a generic interface to EEPROM
+	  from both the Linux Kernel and the userspace.
+
+	  If unsure, say no.
+
+if EEPROM
+
+config EEPROM_DEBUG
+	bool "EEPROM debug support"
+	help
+	  Say yes here to enable debugging support.
+
+endif
diff --git a/drivers/eeprom/Makefile b/drivers/eeprom/Makefile
new file mode 100644
index 0000000..e130079
--- /dev/null
+++ b/drivers/eeprom/Makefile
@@ -0,0 +1,9 @@
+#
+# Makefile for eeprom drivers.
+#
+
+ccflags-$(CONFIG_EEPROM_DEBUG) += -DDEBUG
+
+obj-$(CONFIG_EEPROM)		+= core.o
+
+# Devices
diff --git a/drivers/eeprom/core.c b/drivers/eeprom/core.c
new file mode 100644
index 0000000..bc877a6
--- /dev/null
+++ b/drivers/eeprom/core.c
@@ -0,0 +1,290 @@
+/*
+ * EEPROM framework core.
+ *
+ * Copyright (C) 2015 Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
+ * Copyright (C) 2013 Maxime Ripard <maxime.ripard@free-electrons.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2.  This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/device.h>
+#include <linux/eeprom-provider.h>
+#include <linux/eeprom-consumer.h>
+#include <linux/export.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+struct eeprom_cell {
+	struct eeprom_device *eeprom;
+	loff_t offset;
+	size_t count;
+};
+
+static DEFINE_MUTEX(eeprom_list_mutex);
+static LIST_HEAD(eeprom_list);
+static DEFINE_IDA(eeprom_ida);
+
+static ssize_t bin_attr_eeprom_read_write(struct kobject *kobj,
+				    char *buf, loff_t offset,
+				    size_t count, bool read)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct eeprom_device *eeprom = container_of(dev, struct eeprom_device,
+						    edev);
+	int rc;
+
+	if (offset > eeprom->size)
+		return 0;
+
+	if (offset + count > eeprom->size)
+		count = eeprom->size - offset;
+
+	if (read)
+		rc = regmap_bulk_read(eeprom->regmap, offset,
+				      buf, count/eeprom->stride);
+	else
+		rc = regmap_bulk_write(eeprom->regmap, offset,
+				       buf, count/eeprom->stride);
+
+	if (IS_ERR_VALUE(rc))
+		return 0;
+
+	return count;
+}
+
+static ssize_t bin_attr_eeprom_read(struct file *filp, struct kobject *kobj,
+				    struct bin_attribute *attr,
+				    char *buf, loff_t offset, size_t count)
+{
+	return bin_attr_eeprom_read_write(kobj, buf, offset, count, true);
+}
+
+static ssize_t bin_attr_eeprom_write(struct file *filp, struct kobject *kobj,
+				     struct bin_attribute *attr,
+				     char *buf, loff_t offset, size_t count)
+{
+	return bin_attr_eeprom_read_write(kobj, buf, offset, count, false);
+}
+
+static struct bin_attribute bin_attr_eeprom = {
+	.attr	= {
+		.name	= "eeprom",
+		.mode	= 0660,
+	},
+	.read	= bin_attr_eeprom_read,
+	.write	= bin_attr_eeprom_write,
+};
+
+static struct bin_attribute *eeprom_bin_attributes[] = {
+	&bin_attr_eeprom,
+	NULL,
+};
+
+static const struct attribute_group eeprom_bin_group = {
+	.bin_attrs	= eeprom_bin_attributes,
+};
+
+static const struct attribute_group *eeprom_dev_groups[] = {
+	&eeprom_bin_group,
+	NULL,
+};
+
+static struct class eeprom_class = {
+	.name		= "eeprom",
+	.dev_groups	= eeprom_dev_groups,
+};
+
+int eeprom_register(struct eeprom_device *eeprom)
+{
+	int rval;
+
+	if (!eeprom->regmap || !eeprom->size) {
+		dev_err(eeprom->dev, "Regmap not found\n");
+		return -EINVAL;
+	}
+
+	eeprom->id = ida_simple_get(&eeprom_ida, 0, 0, GFP_KERNEL);
+	if (eeprom->id < 0)
+		return eeprom->id;
+
+	eeprom->edev.class = &eeprom_class;
+	eeprom->edev.parent = eeprom->dev;
+	eeprom->edev.of_node = eeprom->dev ? eeprom->dev->of_node : NULL;
+	dev_set_name(&eeprom->edev, "eeprom%d", eeprom->id);
+
+	device_initialize(&eeprom->edev);
+
+	dev_dbg(&eeprom->edev, "Registering eeprom device %s\n",
+		dev_name(&eeprom->edev));
+
+	rval = device_add(&eeprom->edev);
+	if (rval)
+		return rval;
+
+	mutex_lock(&eeprom_list_mutex);
+	list_add(&eeprom->list, &eeprom_list);
+	mutex_unlock(&eeprom_list_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL(eeprom_register);
+
+int eeprom_unregister(struct eeprom_device *eeprom)
+{
+	device_del(&eeprom->edev);
+
+	mutex_lock(&eeprom_list_mutex);
+	list_del(&eeprom->list);
+	mutex_unlock(&eeprom_list_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL(eeprom_unregister);
+
+static struct eeprom_cell *__eeprom_cell_get(struct device_node *node,
+					     int index)
+{
+	struct of_phandle_args args;
+	struct eeprom_cell *cell;
+	struct eeprom_device *e, *eeprom = NULL;
+	int ret;
+
+	ret = of_parse_phandle_with_args(node, "eeproms",
+					 "#eeprom-cells", index, &args);
+	if (ret)
+		return ERR_PTR(ret);
+
+	if (args.args_count != 2)
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&eeprom_list_mutex);
+
+	list_for_each_entry(e, &eeprom_list, list) {
+		if (args.np == e->edev.of_node) {
+			eeprom = e;
+			break;
+		}
+	}
+	mutex_unlock(&eeprom_list_mutex);
+
+	if (!eeprom)
+		return ERR_PTR(-EPROBE_DEFER);
+
+	cell = kzalloc(sizeof(*cell), GFP_KERNEL);
+	if (!cell)
+		return ERR_PTR(-ENOMEM);
+
+	cell->eeprom = eeprom;
+	cell->offset = args.args[0];
+	cell->count = args.args[1];
+
+	return cell;
+}
+
+static struct eeprom_cell *__eeprom_cell_get_byname(struct device_node *node,
+						    const char *id)
+{
+	int index = 0;
+
+	if (id)
+		index = of_property_match_string(node,
+						 "eeprom-names",
+						 id);
+	return __eeprom_cell_get(node, index);
+
+}
+
+struct eeprom_cell *eeprom_cell_get(struct device *dev, int index)
+{
+	if (!dev)
+		return ERR_PTR(-EINVAL);
+
+	/* First, attempt to retrieve the cell through the DT */
+	if (dev->of_node)
+		return __eeprom_cell_get(dev->of_node, index);
+
+	/* We don't support anything else yet */
+	return ERR_PTR(-ENODEV);
+}
+EXPORT_SYMBOL(eeprom_cell_get);
+
+struct eeprom_cell *eeprom_cell_get_byname(struct device *dev, const char *id)
+{
+	if (!dev)
+		return ERR_PTR(-EINVAL);
+
+	if (id && dev->of_node)
+		return __eeprom_cell_get_byname(dev->of_node, id);
+
+	/* We don't support anything else yet */
+	return ERR_PTR(-ENODEV);
+}
+EXPORT_SYMBOL(eeprom_cell_get_byname);
+
+void eeprom_cell_put(struct eeprom_cell *cell)
+{
+	kfree(cell);
+}
+EXPORT_SYMBOL(eeprom_cell_put);
+
+char *eeprom_cell_read(struct eeprom_cell *cell, ssize_t *len)
+{
+	struct eeprom_device *eeprom = cell->eeprom;
+	char *buf;
+	int rc;
+
+	if (!eeprom || !eeprom->regmap)
+		return ERR_PTR(-EINVAL);
+
+	buf = kzalloc(cell->count, GFP_KERNEL);
+	if (!buf)
+		return ERR_PTR(-ENOMEM);
+
+	rc = regmap_bulk_read(eeprom->regmap, cell->offset,
+			      buf, cell->count/eeprom->stride);
+	if (IS_ERR_VALUE(rc)) {
+		kfree(buf);
+		return ERR_PTR(rc);
+	}
+
+	*len = cell->count;
+
+	return buf;
+}
+EXPORT_SYMBOL(eeprom_cell_read);
+
+int eeprom_cell_write(struct eeprom_cell *cell, const char *buf, ssize_t len)
+{
+	struct eeprom_device *eeprom = cell->eeprom;
+
+	if (!eeprom || !eeprom->regmap)
+		return -EINVAL;
+
+	return regmap_bulk_write(eeprom->regmap, cell->offset,
+				 buf, cell->count/eeprom->stride);
+}
+EXPORT_SYMBOL(eeprom_cell_write);
+
+static int eeprom_init(void)
+{
+	return class_register(&eeprom_class);
+}
+
+static void eeprom_exit(void)
+{
+	class_unregister(&eeprom_class);
+}
+
+subsys_initcall(eeprom_init);
+module_exit(eeprom_exit);
+
+MODULE_AUTHOR("Maxime Ripard <maxime.ripard@free-electrons.com");
+MODULE_DESCRIPTION("EEPROM Driver Core");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/eeprom-consumer.h b/include/linux/eeprom-consumer.h
new file mode 100644
index 0000000..706ae9d
--- /dev/null
+++ b/include/linux/eeprom-consumer.h
@@ -0,0 +1,73 @@
+/*
+ * EEPROM framework consumer.
+ *
+ * Copyright (C) 2015 Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
+ * Copyright (C) 2013 Maxime Ripard <maxime.ripard@free-electrons.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2.  This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _LINUX_EEPROM_CONSUMER_H
+#define _LINUX_EEPROM_CONSUMER_H
+
+struct eeprom_cell;
+
+/**
+ * eeprom_cell_get(): Get eeprom cell of device form a given index.
+ *
+ * @dev: Device that will be interacted with
+ * @index: Index of the eeprom cell.
+ *
+ * The return value will be an ERR_PTR() on error or a valid pointer
+ * to a struct eeprom_cell.  The eeprom_cell will be freed by the
+ * eeprom_cell_put().
+ */
+struct eeprom_cell *eeprom_cell_get(struct device *dev, int index);
+
+/**
+ * eeprom_cell_get(): Get eeprom cell of device form a given name.
+ *
+ * @dev: Device that will be interacted with
+ * @name: Name of the eeprom cell.
+ *
+ * The return value will be an ERR_PTR() on error or a valid pointer
+ * to a struct eeprom_cell.  The eeprom_cell will be freed by the
+ * eeprom_cell_put().
+ */
+struct eeprom_cell *eeprom_cell_get_byname(struct device *dev,
+					   const char *name);
+
+/**
+ * eeprom_cell_put(): Release previously allocated eeprom cell.
+ *
+ * @cell: Previously allocated eeprom cell by eeprom_cell_get()
+ * or eeprom_cell_get_byname().
+ */
+void eeprom_cell_put(struct eeprom_cell *cell);
+
+/**
+ * eeprom_cell_read(): Read a given eeprom cell
+ *
+ * @cell: eeprom cell to be read.
+ * @len: pointer to length of cell which will be populated on successful read.
+ *
+ * The return value will be an ERR_PTR() on error or a valid pointer
+ * to a char * bufffer.  The buffer should be freed by the consumer with a
+ * kfree().
+ */
+char *eeprom_cell_read(struct eeprom_cell *cell, ssize_t *len);
+
+/**
+ * eeprom_cell_write(): Write to a given eeprom cell
+ *
+ * @cell: eeprom cell to be written.
+ * @buf: Buffer to be written.
+ * @len: length of buffer to be written to eeprom cell.
+ *
+ * The return value will be an non zero on error or a zero on successful write.
+ */
+int eeprom_cell_write(struct eeprom_cell *cell, const char *buf, ssize_t len);
+
+#endif  /* ifndef _LINUX_EEPROM_CONSUMER_H */
diff --git a/include/linux/eeprom-provider.h b/include/linux/eeprom-provider.h
new file mode 100644
index 0000000..3943c2f
--- /dev/null
+++ b/include/linux/eeprom-provider.h
@@ -0,0 +1,51 @@
+/*
+ * EEPROM framework provider.
+ *
+ * Copyright (C) 2015 Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
+ * Copyright (C) 2013 Maxime Ripard <maxime.ripard@free-electrons.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2.  This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _LINUX_EEPROM_PROVIDER_H
+#define _LINUX_EEPROM_PROVIDER_H
+
+#include <linux/device.h>
+#include <linux/regmap.h>
+#include <linux/list.h>
+
+struct eeprom_device {
+	struct regmap		*regmap;
+	int			stride;
+	size_t			size;
+	struct device		*dev;
+
+	/* Internal to framework */
+	struct device		edev;
+	int			id;
+	struct list_head	list;
+};
+
+/**
+ * eeprom_register(): Register a eeprom device for given eeprom.
+ * Also creates an binary entry in /sys/class/eeprom/eeprom[id]/eeprom
+ *
+ * @eeprom: eeprom device that needs to be created
+ *
+ * The return value will be an error code on error or a zero on success.
+ * The eeprom_device and sysfs entery will be freed by the eeprom_unregister().
+ */
+int eeprom_register(struct eeprom_device *eeprom);
+
+/**
+ * eeprom_unregister(): Unregister previously registered eeprom device
+ *
+ * @eeprom: Pointer to previously registered eeprom device.
+ *
+ * The return value will be an non zero on error or a zero on success.
+ */
+int eeprom_unregister(struct eeprom_device *eeprom);
+
+#endif  /* ifndef _LINUX_EEPROM_PROVIDER_H */
-- 
1.9.1

^ permalink raw reply related

* [RFC PATCH 2/3] eeprom: sunxi: Move the SID driver to the eeprom framework
From: Srinivas Kandagatla @ 2015-02-19 17:08 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Maxime Ripard, Rob Herring, Pawel Moll, Kumar Gala, linux-api,
	linux-kernel, devicetree, Stephen Boyd, Arnd Bergmann, broonie,
	Greg Kroah-Hartman, Srinivas Kandagatla
In-Reply-To: <1424365639-26634-1-git-send-email-srinivas.kandagatla@linaro.org>

From: Maxime Ripard <maxime.ripard@free-electrons.com>

Now that we have the EEPROM framework, we can consolidate the common driver
code. Move the driver to the framework, and hopefully, it will fix the sysfs
file creation race.

Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
[srinivas.kandagatla: Moved to regmap based EEPROM framework]
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
---
 Documentation/ABI/testing/sysfs-driver-sunxi-sid |  22 ----
 drivers/eeprom/Kconfig                           |  10 ++
 drivers/eeprom/Makefile                          |   1 +
 drivers/eeprom/eeprom-sunxi-sid.c                | 125 ++++++++++++++++++
 drivers/misc/eeprom/Kconfig                      |  13 --
 drivers/misc/eeprom/Makefile                     |   1 -
 drivers/misc/eeprom/sunxi_sid.c                  | 156 -----------------------
 7 files changed, 136 insertions(+), 192 deletions(-)
 delete mode 100644 Documentation/ABI/testing/sysfs-driver-sunxi-sid
 create mode 100644 drivers/eeprom/eeprom-sunxi-sid.c
 delete mode 100644 drivers/misc/eeprom/sunxi_sid.c

diff --git a/Documentation/ABI/testing/sysfs-driver-sunxi-sid b/Documentation/ABI/testing/sysfs-driver-sunxi-sid
deleted file mode 100644
index ffb9536..0000000
--- a/Documentation/ABI/testing/sysfs-driver-sunxi-sid
+++ /dev/null
@@ -1,22 +0,0 @@
-What:		/sys/devices/*/<our-device>/eeprom
-Date:		August 2013
-Contact:	Oliver Schinagl <oliver@schinagl.nl>
-Description:	read-only access to the SID (Security-ID) on current
-		A-series SoC's from Allwinner. Currently supports A10, A10s, A13
-		and A20 CPU's. The earlier A1x series of SoCs exports 16 bytes,
-		whereas the newer A20 SoC exposes 512 bytes split into sections.
-		Besides the 16 bytes of SID, there's also an SJTAG area,
-		HDMI-HDCP key and some custom keys. Below a quick overview, for
-		details see the user manual:
-		0x000  128 bit root-key (sun[457]i)
-		0x010  128 bit boot-key (sun7i)
-		0x020   64 bit security-jtag-key (sun7i)
-		0x028   16 bit key configuration (sun7i)
-		0x02b   16 bit custom-vendor-key (sun7i)
-		0x02c  320 bit low general key (sun7i)
-		0x040   32 bit read-control access (sun7i)
-		0x064  224 bit low general key (sun7i)
-		0x080 2304 bit HDCP-key (sun7i)
-		0x1a0  768 bit high general key (sun7i)
-Users:		any user space application which wants to read the SID on
-		Allwinner's A-series of CPU's.
diff --git a/drivers/eeprom/Kconfig b/drivers/eeprom/Kconfig
index 2c5452a..39235a9 100644
--- a/drivers/eeprom/Kconfig
+++ b/drivers/eeprom/Kconfig
@@ -16,4 +16,14 @@ config EEPROM_DEBUG
 	help
 	  Say yes here to enable debugging support.
 
+config EEPROM_SUNXI_SID
+	depends on ARCH_SUNXI
+	tristate "Allwinner SoCs SID support"
+	help
+	  This is a driver for the 'security ID' available on various Allwinner
+	  devices.
+
+	  This driver can also be built as a module. If so, the module
+	  will be called sunxi_sid.
+
 endif
diff --git a/drivers/eeprom/Makefile b/drivers/eeprom/Makefile
index e130079..661422c 100644
--- a/drivers/eeprom/Makefile
+++ b/drivers/eeprom/Makefile
@@ -7,3 +7,4 @@ ccflags-$(CONFIG_EEPROM_DEBUG) += -DDEBUG
 obj-$(CONFIG_EEPROM)		+= core.o
 
 # Devices
+obj-$(CONFIG_EEPROM_SUNXI_SID)	+= eeprom-sunxi-sid.o
diff --git a/drivers/eeprom/eeprom-sunxi-sid.c b/drivers/eeprom/eeprom-sunxi-sid.c
new file mode 100644
index 0000000..f2939b9
--- /dev/null
+++ b/drivers/eeprom/eeprom-sunxi-sid.c
@@ -0,0 +1,125 @@
+/*
+ * Allwinner sunXi SoCs Security ID support.
+ *
+ * Copyright (c) 2013 Oliver Schinagl <oliver@schinagl.nl>
+ * Copyright (C) 2014 Maxime Ripard <maxime.ripard@free-electrons.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2.  This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/device.h>
+#include <linux/eeprom-provider.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/platform_device.h>
+
+struct eeprom_sid {
+	void __iomem	*membase;
+	struct eeprom_device eeprom;
+};
+
+/* We read the entire key, due to a 32 bit read alignment requirement. Since we
+ * want to return the requested byte, this results in somewhat slower code and
+ * uses 4 times more reads as needed but keeps code simpler. Since the SID is
+ * only very rarely probed, this is not really an issue.
+ */
+static int sunxi_sid_reg_read(void *context,
+			      unsigned int offset, unsigned int *val)
+{
+	struct eeprom_sid *sid  = context;
+	u32 sid_key;
+
+	sid_key = ioread32be(sid->membase + round_down(offset, 4));
+	sid_key >>= (offset % 4) * 8;
+
+	*val = sid_key;
+
+	return 0;
+}
+
+static bool sunxi_sid_writeable_reg(struct device *dev, unsigned int reg)
+{
+	return false;
+}
+
+static const struct of_device_id sunxi_sid_of_match[] = {
+	{ .compatible = "allwinner,sun4i-a10-sid", .data = (void *)16},
+	{ .compatible = "allwinner,sun7i-a20-sid", .data = (void *)512},
+	{/* sentinel */},
+};
+MODULE_DEVICE_TABLE(of, sunxi_sid_of_match);
+
+static struct regmap_config sunxi_sid_regmap_config = {
+	.reg_bits = 32,
+	.val_bits = 8,
+	.reg_stride = 1,
+	.reg_read = sunxi_sid_reg_read,
+	.writeable_reg = sunxi_sid_writeable_reg,
+};
+
+static int sunxi_sid_probe(struct platform_device *pdev)
+{
+	const struct of_device_id *device;
+	struct eeprom_sid *sid;
+	struct resource *res;
+	struct eeprom_device *eeprom;
+	struct device *dev = &pdev->dev;
+	int rval;
+
+	sid = devm_kzalloc(dev, sizeof(*sid), GFP_KERNEL);
+	if (!sid)
+		return -ENOMEM;
+
+	eeprom = &sid->eeprom;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	sid->membase = devm_ioremap_resource(dev, res);
+	if (IS_ERR(sid->membase))
+		return PTR_ERR(sid->membase);
+
+	device = of_match_device(sunxi_sid_of_match, dev);
+	if (!device)
+		return -ENODEV;
+
+	sunxi_sid_regmap_config.max_register = (unsigned int)device->data - 1;
+
+	eeprom->regmap = devm_regmap_init(dev, NULL,
+					  sid, &sunxi_sid_regmap_config);
+	if (IS_ERR(eeprom->regmap))
+		return PTR_ERR(eeprom->regmap);
+
+	eeprom->dev = dev;
+	eeprom->stride = sunxi_sid_regmap_config.reg_stride;
+	eeprom->size = sunxi_sid_regmap_config.max_register;
+	rval = eeprom_register(eeprom);
+	if (rval)
+		return rval;
+
+	platform_set_drvdata(pdev, eeprom);
+
+	return 0;
+}
+
+static int sunxi_sid_remove(struct platform_device *pdev)
+{
+	struct eeprom_device *eeprom = platform_get_drvdata(pdev);
+
+	return eeprom_unregister(eeprom);
+}
+
+static struct platform_driver sunxi_sid_driver = {
+	.probe = sunxi_sid_probe,
+	.remove = sunxi_sid_remove,
+	.driver = {
+		.name = "sunxi-sid",
+		.of_match_table = sunxi_sid_of_match,
+	},
+};
+module_platform_driver(sunxi_sid_driver);
+
+MODULE_AUTHOR("Oliver Schinagl <oliver@schinagl.nl>");
+MODULE_DESCRIPTION("Allwinner sunxi security id driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/misc/eeprom/Kconfig b/drivers/misc/eeprom/Kconfig
index 9536852f..04f2e1f 100644
--- a/drivers/misc/eeprom/Kconfig
+++ b/drivers/misc/eeprom/Kconfig
@@ -96,17 +96,4 @@ config EEPROM_DIGSY_MTC_CFG
 
 	  If unsure, say N.
 
-config EEPROM_SUNXI_SID
-	tristate "Allwinner sunxi security ID support"
-	depends on ARCH_SUNXI && SYSFS
-	help
-	  This is a driver for the 'security ID' available on various Allwinner
-	  devices.
-
-	  Due to the potential risks involved with changing e-fuses,
-	  this driver is read-only.
-
-	  This driver can also be built as a module. If so, the module
-	  will be called sunxi_sid.
-
 endmenu
diff --git a/drivers/misc/eeprom/Makefile b/drivers/misc/eeprom/Makefile
index 9507aec..fc1e81d 100644
--- a/drivers/misc/eeprom/Makefile
+++ b/drivers/misc/eeprom/Makefile
@@ -4,5 +4,4 @@ obj-$(CONFIG_EEPROM_LEGACY)	+= eeprom.o
 obj-$(CONFIG_EEPROM_MAX6875)	+= max6875.o
 obj-$(CONFIG_EEPROM_93CX6)	+= eeprom_93cx6.o
 obj-$(CONFIG_EEPROM_93XX46)	+= eeprom_93xx46.o
-obj-$(CONFIG_EEPROM_SUNXI_SID)	+= sunxi_sid.o
 obj-$(CONFIG_EEPROM_DIGSY_MTC_CFG) += digsy_mtc_eeprom.o
diff --git a/drivers/misc/eeprom/sunxi_sid.c b/drivers/misc/eeprom/sunxi_sid.c
deleted file mode 100644
index 8385177..0000000
--- a/drivers/misc/eeprom/sunxi_sid.c
+++ /dev/null
@@ -1,156 +0,0 @@
-/*
- * Copyright (c) 2013 Oliver Schinagl <oliver@schinagl.nl>
- * http://www.linux-sunxi.org
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * This driver exposes the Allwinner security ID, efuses exported in byte-
- * sized chunks.
- */
-
-#include <linux/compiler.h>
-#include <linux/device.h>
-#include <linux/err.h>
-#include <linux/export.h>
-#include <linux/fs.h>
-#include <linux/io.h>
-#include <linux/kernel.h>
-#include <linux/kobject.h>
-#include <linux/module.h>
-#include <linux/of_device.h>
-#include <linux/platform_device.h>
-#include <linux/random.h>
-#include <linux/slab.h>
-#include <linux/stat.h>
-#include <linux/sysfs.h>
-#include <linux/types.h>
-
-#define DRV_NAME "sunxi-sid"
-
-struct sunxi_sid_data {
-	void __iomem *reg_base;
-	unsigned int keysize;
-};
-
-/* We read the entire key, due to a 32 bit read alignment requirement. Since we
- * want to return the requested byte, this results in somewhat slower code and
- * uses 4 times more reads as needed but keeps code simpler. Since the SID is
- * only very rarely probed, this is not really an issue.
- */
-static u8 sunxi_sid_read_byte(const struct sunxi_sid_data *sid_data,
-			      const unsigned int offset)
-{
-	u32 sid_key;
-
-	if (offset >= sid_data->keysize)
-		return 0;
-
-	sid_key = ioread32be(sid_data->reg_base + round_down(offset, 4));
-	sid_key >>= (offset % 4) * 8;
-
-	return sid_key; /* Only return the last byte */
-}
-
-static ssize_t sid_read(struct file *fd, struct kobject *kobj,
-			struct bin_attribute *attr, char *buf,
-			loff_t pos, size_t size)
-{
-	struct platform_device *pdev;
-	struct sunxi_sid_data *sid_data;
-	int i;
-
-	pdev = to_platform_device(kobj_to_dev(kobj));
-	sid_data = platform_get_drvdata(pdev);
-
-	if (pos < 0 || pos >= sid_data->keysize)
-		return 0;
-	if (size > sid_data->keysize - pos)
-		size = sid_data->keysize - pos;
-
-	for (i = 0; i < size; i++)
-		buf[i] = sunxi_sid_read_byte(sid_data, pos + i);
-
-	return i;
-}
-
-static struct bin_attribute sid_bin_attr = {
-	.attr = { .name = "eeprom", .mode = S_IRUGO, },
-	.read = sid_read,
-};
-
-static int sunxi_sid_remove(struct platform_device *pdev)
-{
-	device_remove_bin_file(&pdev->dev, &sid_bin_attr);
-	dev_dbg(&pdev->dev, "driver unloaded\n");
-
-	return 0;
-}
-
-static const struct of_device_id sunxi_sid_of_match[] = {
-	{ .compatible = "allwinner,sun4i-a10-sid", .data = (void *)16},
-	{ .compatible = "allwinner,sun7i-a20-sid", .data = (void *)512},
-	{/* sentinel */},
-};
-MODULE_DEVICE_TABLE(of, sunxi_sid_of_match);
-
-static int sunxi_sid_probe(struct platform_device *pdev)
-{
-	struct sunxi_sid_data *sid_data;
-	struct resource *res;
-	const struct of_device_id *of_dev_id;
-	u8 *entropy;
-	unsigned int i;
-
-	sid_data = devm_kzalloc(&pdev->dev, sizeof(struct sunxi_sid_data),
-				GFP_KERNEL);
-	if (!sid_data)
-		return -ENOMEM;
-
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	sid_data->reg_base = devm_ioremap_resource(&pdev->dev, res);
-	if (IS_ERR(sid_data->reg_base))
-		return PTR_ERR(sid_data->reg_base);
-
-	of_dev_id = of_match_device(sunxi_sid_of_match, &pdev->dev);
-	if (!of_dev_id)
-		return -ENODEV;
-	sid_data->keysize = (int)of_dev_id->data;
-
-	platform_set_drvdata(pdev, sid_data);
-
-	sid_bin_attr.size = sid_data->keysize;
-	if (device_create_bin_file(&pdev->dev, &sid_bin_attr))
-		return -ENODEV;
-
-	entropy = kzalloc(sizeof(u8) * sid_data->keysize, GFP_KERNEL);
-	for (i = 0; i < sid_data->keysize; i++)
-		entropy[i] = sunxi_sid_read_byte(sid_data, i);
-	add_device_randomness(entropy, sid_data->keysize);
-	kfree(entropy);
-
-	dev_dbg(&pdev->dev, "loaded\n");
-
-	return 0;
-}
-
-static struct platform_driver sunxi_sid_driver = {
-	.probe = sunxi_sid_probe,
-	.remove = sunxi_sid_remove,
-	.driver = {
-		.name = DRV_NAME,
-		.of_match_table = sunxi_sid_of_match,
-	},
-};
-module_platform_driver(sunxi_sid_driver);
-
-MODULE_AUTHOR("Oliver Schinagl <oliver@schinagl.nl>");
-MODULE_DESCRIPTION("Allwinner sunxi security id driver");
-MODULE_LICENSE("GPL");
-- 
1.9.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox