Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Omar Sandoval @ 2015-01-21  7:56 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra <peterz@
In-Reply-To: <1421747878-30744-6-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, Jan 20, 2015 at 05:57:57PM +0800, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
> 
> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.
> 
> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;
> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> +			return -EFAULT;
This should probably be goto out, or you'll leak kcmds.

> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
Same here...

> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
and here.

> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);
> +	}
> +
> +out:
> +	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
> +		return -EFAULT;
This will also leak kcmds, it should be ret = -EFAULT. This case, however, seems
to lead to a weird corner case: if cmds is read-only, we'll end up executing
every command but fail to copy out the return values, so when userspace gets the
EFAULT, it won't know whether anything was executed. But, getting an EFAULT here
means you're probably doing something wrong anyways, so maybe not the biggest
concern.

> +	kfree(kcmds);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
>  		       struct epoll_event __user *, events,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 85893d7..7156c80 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -12,6 +12,8 @@
>  #define _LINUX_SYSCALLS_H
>  
>  struct epoll_event;
> +struct epoll_mod_cmd;
> +struct epoll_wait_spec;
>  struct iattr;
>  struct inode;
>  struct iocb;
> @@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
>  				int maxevents, int timeout,
>  				const sigset_t __user *sigmask,
>  				size_t sigsetsize);
> +asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
> +				   int ncmds, struct epoll_mod_cmd __user * cmds,
> +				   struct epoll_wait_spec __user * spec);
>  asmlinkage long sys_gethostname(char __user *name, int len);
>  asmlinkage long sys_sethostname(char __user *name, int len);
>  asmlinkage long sys_setdomainname(char __user *name, int len);
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Omar

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-21  8:57 UTC (permalink / raw)
  To: Daniel Mack, Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <54BE957A.60305-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

Hi Daniel,

On 01/20/2015 06:50 PM, Daniel Mack wrote:
> Hi Michael,
> 
> Thanks a lot for for intense review of the documentation. Much appreciated.
> 
> I've addressed all but the below issues, following your suggestions.

Are your changes already visible somewhere?

> On 01/20/2015 02:58 PM, Michael Kerrisk (man-pages) wrote:
>>> +    and KDBUS_CMD_ENDPOINT_MAKE (see above).
>>> +
>>> +    Following items are expected for KDBUS_CMD_BUS_MAKE:
>>> +    KDBUS_ITEM_MAKE_NAME
>>> +      Contains a string to identify the bus name.
>>
>> So, up to here, I've seen no definition of 'kdbus_item', which leaves me 
>> asking questions like: what subfield is KDBUS_ITEM_MAKE_NAME stored in?
>> which subfield holds the pointer to the string?
>>
>> Somewhere earlier,  kdbus_item needs to be exaplained in more detail, 
>> I think.
> 
> Hmm, you're quoting text from section 5, and section 4 actually
> describes the concept of items quite well I believe?

Well, Section 4 is pretty short. My point is that most of the various 
blob formats (e.g., kdbus_pids, kdbus_caps, kdbus_memfd) are not
documented in kdbus.txt. They all should be, IMO.

>>> +  __s64 priority;
>>> +    With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
>>> +    the queue with at least the given priority. If no such message is waiting
>>> +    in the queue, -ENOMSG is returned.
>>
>> ###
>> How do I simply select the highest priority message, without knowing what 
>> its priority is?
> 
> The wording is indeed unclear here. KDBUS_RECV_USE_PRIORITY causes the
> messages to be dequeued by their priority. The 'priority' field is
> simply a filter that request a minimum priority. By setting this field
> to the highest possible value, you effectively bypass the filter. I've
> added a better description now.

Thanks for the clarification.

>>> +  -ENOMEM	The kernel memory is exhausted
>>> +  -ENOTTY	Illegal ioctl command issued for the file descriptor
>>
>> Why ENOTTY here, rather than EINVAL? The latter is, I beleive, the usual 
>> ioctl() error for invalid commands, I believe (If you keep ENOTTY, add an
>> explanation in this document.)
> 
> Hmm, no. -ENOTTY is commonly used as return code when calling ioctls
> that can't be handled by the FDs they're called on. 'man errno(3)' even
> states: "ENOTTY   Inappropriate I/O control operation (POSIX.1)".

Okay.

>>> +  -EINVAL	Unsupported item attached to command
>>> +
>>> +For all ioctls that carry a struct as payload:
>>> +
>>> +  -EFAULT	The supplied data pointer was not 64-bit aligned, or was
>>> +		inaccessible from the kernel side.
>>> +  -EINVAL	The size inside the supplied struct was smaller than expected
>>> +  -EMSGSIZE	The size inside the supplied struct was bigger than expected
>>
>> Why two different errors for smaller and larger than expected? (If you keep things this
>> way, pelase explain the reason in this document.)
> 
> Providing a struct that is smaller than the minimum doesn't give the
> ioctl a valid set of information to process the request. Hence, I think
> -EINVAL is appropriate. However, -EMSGSIZE is something that users might
> hit when they make message payloads too big, and I think it's good to
> have a change to distinguish the two cases in error handling. I added
> something in the document now.

Thanks.

> Again, thanks a lot for reading the documentation so accurately!

You're welcome.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Fam Zheng @ 2015-01-21  8:58 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Andy Lutomirski
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel
In-Reply-To: <54BF5ACE.6030206@gmail.com>

On Wed, 01/21 08:52, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 01/21/2015 05:59 AM, Fam Zheng wrote:
> > On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
> >> Hello Fam Zheng,
> >>
> >> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> >>> This syscall is a sequence of
> >>>
> >>> 1) a number of epoll_ctl calls
> >>> 2) a epoll_pwait, with timeout enhancement.
> >>>
> >>> The epoll_ctl operations are embeded so that application doesn't have to use
> >>> separate syscalls to insert/delete/update the fds before poll. It is more
> >>> efficient if the set of fds varies from one poll to another, which is the
> >>> common pattern for certain applications. 
> >>
> >> Which applications? Could we have some specific examples? This is a 
> >> complex API, and it needs good justification.
> > 
> > OK, I'll explain more in v2.
> 
> Okay.
> 
> [...]
> 
> >>> The only complexity is returning the result of each operation.  For each
> >>> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> >>> the return code *iff* the command is executed (0 for success and -errno of the
> >>> equivalent epoll_ctl call), and will be left unchanged if the command is not
> >>> executed because some earlier error, for example due to failure of
> >>> copy_from_user to copy the array.
> >>>
> >>> Applications can utilize this fact to do error handling: they could initialize
> >>> all the epoll_mod_wait.error to a positive value, which is by definition not a
> >>> possible output value from epoll_mod_wait. Then when the syscall returned, they
> >>> know whether or not the command is executed by comparing each error with the
> >>> init value, if they're different, they have the result of the command.
> >>> More roughly, they can put any non-zero and not distinguish "not run" from
> >>> failure.
> >>
> >> The "cmds' are not executed in a specified order plus the need to
> >> initialize the 'errors' fields to a positive value feels a bit ugly.
> >> And indeed the whole "command list was only partially run" case
> >> is not pretty. Am I correct to understand that if an error is found
> >> during execution of one of the "epoll_ctl" commands in 'cmds' then
> >> the system call will return -1 with errno set, indicating an error,
> >> even though the epoll interest list may have changed because some
> >> of the earlier 'cmds' executed successfully? This all seems a bit of
> >> a headache for user space.
> > 
> > This is the trade-off for batching. The best we can do is probably make this
> > transactional: none or all of the commands succeeds. It will require a much
> > more complex implementation, though. 
> 
> Transactional would be more comfortable for user-space, and while I 
> can see that it would be complex to implement, perhaps the greater
> point might be that the implementation is CPU expensive.

Good point.

> 
> > But even with that, the error reporting on
> > which command failed is a complication.
> 
> My suggestions below could make the error reporting much simpler...
> 
> >> I have a couple of questions:
> >>
> >> Q1. I can see that batching "epoll_ctl" commands might be useful,
> >> since it results in fewer systems calls. But, does it really
> >> need to be bound together with the "epoll_pwait" functionality?
> >> (Perhaps this point was covered in previous discussions, but
> >> neither the message accompanying this patch nor the 0/6 man page
> >> provide a compelling rationale for the need to bind these two
> >> operations together.)
> >>
> >> Yes, I realize you might save a system call, but it makes for a
> >> cumbersome API that has the above headache, and also forces the 
> >> need for double pointer indirection in the 'spec' argument (i.e., 
> >> spec is a pointer to an array of structures where each element
> >> in turn includes an 'events' pointer that points to another array).
> >>
> >> Why not a simpler API with two syscalls such as:
> >>
> >> epoll_ctl_batch(int epfd, int flags,
> >>                 int ncmds, struct epoll_mod_cmd *cmds);
> >>
> >> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
> >>              struct timespec *timeout, int clock_id, 
> >>              const sigset_t *sigmask, size_t sigsetsize);
> > 
> > The problem is that there is no room for flags field in epoll_pwait1, which is
> > asked for, in previous discussion thread [1].
> 
> Ahh yes, I certainly should not have forgotten that. But that's easily solved.
> Do as for pselect6():
> 
> strcut sigargs {
>     const sigset_t *ss;
>     size_t          ss_len; /* Size (in bytes) of object pointed
>                                to by 'ss' */
> }
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>              struct timespec *timeout, int clock_id, int flags,
>              int timeout,
>              const struct sigargs *sargs);
> 
> > I don't see epoll_mod_wait as a *significantly more* complicated interface
> > compared to epoll_ctl_batch and epoll_pwait1 above. 
> 
> My biggest problem with epoll_ctl_wait() is the complexity of error 
> handling. epoll_ctl_batch and epoll_pwait1 would simplify that, as I 
> note below.
> 
> Aside from that, I do think that epoll_ctl_wait() passes a certain
> threshold of complexity that warrants good justification, which
> I haven't really seen yet.

OK, see below.

> 
> > In epoll_mod_wait, if you
> > leave out ncmds and cmds, it is effectively a poll without batch; and if
> > leaving out spec, it is effectively a batch without poll.
> > 
> > The most important change here is the timeout. IMO I wouldn't mind leaving out
> > batching. Integrating it is requested by Andy:
> > 
> > [1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591
> > 
> > which also made sense to me; I do have a patch in QEMU to call epoll_ctl for a
> > number of times right before epoll_wait.
> > 
> > [Sorry for not putting anything into cover letter changelog, but it is also
> > interesting to see people's reaction on the patch itself without bias of
> > others' opinions. This indeed brings in more points. :]
> 
> But it also has the downside that the same discussions
> may be repeated.

Yes, that's why I definitely do it for any version that is > v1.

> 
> >> This gives us much of the benefit of reducing system calls, but 
> >> with greater simplicity. And epoll_ctl_batch() could simply return
> >> the number of 'cmds' that were successfully executed.)
> >>
> >> Q2. In the man page in 0/6 you said that the 'cmds' were not 
> >> guaranteed to be executed in order. Why not? If you did provide
> >> such a guarantee, then, when using your current epoll_mod_wait(),
> >> user space could do the following:
> > 
> > I guess we can make a guarentee on that.
> 
> I'm puzzled by that response. Surely you *must* guarantee it.
> If there's no defined order, then if the batch includes multiple 
> commands that operate on the same FD, the result is undefined 
> unless you provide that guarantee. (Unless, of course, you want to
> explicitly specify that using the same FD multiple times in the
> batch gives undefined behavior.)

OK. Then let's guarentee it.

> 
> >> 1. Initialize the cmd.errors fields to zero.
> >> 2. Call epoll_ctl_mod()
> >> 3. Iterate through cmd.errors looking for the first nonzero 
> >>    field.
> > 
> > It's close, but zero is not good enough, if copy_from_user of cmds failed in
> > the first place. Impossible value, or error value, will be safer.
> 
> See my comment in the earlier mail. If you split this into two 
> APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
> then the return value of epoll_ctl_batch() could be used to tell
> user space how many commands succeeded. Much simpler!

Yes it is much simpler. However the reason to add batching in the first place is
to make epoll faster, by reducing syscalls. Splitting makes the result
sub-optimal: we still need at least 2 calls instead of 1.  Each one of the three
proposed new call *is* a step forward, but I don't think we will have everything
solved even by implementing them all. Compromise needed between performance or
complexity.

My take for simplicity will be leaving epoll_ctl as-is, and my take for
performance will be epoll_pwait1. And I don't really like putting my time on
epoll_ctl_batch, thinking it as a ambivalent compromise in between.

Andy, will you be OK with the above epoll_pwait1? Or do you have more
justifications for epoll_mod_and_wait?

> 
> >>> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> >>> scalar. This provides higher precision. 
> >>
> >> Yes, that change seemed inevitable. It slightly puzzled me at the time when
> >> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
> >> though pselect() already had demonstrated the need for higher precision.
> >> I should have called it out way back then :-{.
> >>
> >>> The parameter field in struct
> >>> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> >>> clock than the default when it makes more sense.
> >>>
> >>> Signed-off-by: Fam Zheng <famz@redhat.com>
> >>> ---
> >>>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  include/linux/syscalls.h |  5 ++++
> >>>  2 files changed, 65 insertions(+)
> >>>
> >>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> >>> index e7a116d..2cc22c9 100644
> >>> --- a/fs/eventpoll.c
> >>> +++ b/fs/eventpoll.c
> >>> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >>>  			      sigmask ? &ksigmask : NULL);
> >>>  }
> >>>  
> >>> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> >>> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> >>> +		struct epoll_wait_spec __user *, spec)
> >>> +{
> >>> +	struct epoll_mod_cmd *kcmds = NULL;
> >>> +	int i, ret = 0;
> >>> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> >>> +
> >>> +	if (flags)
> >>> +		return -EINVAL;
> >>> +	if (ncmds) {
> >>> +		if (!cmds)
> >>> +			return -EINVAL;
> >>> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> >>> +		if (!kcmds)
> >>> +			return -ENOMEM;
> >>> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> >>> +			ret = -EFAULT;
> >>> +			goto out;
> >>> +		}
> >>> +	}
> >>> +	for (i = 0; i < ncmds; i++) {
> >>> +		struct epoll_event ev = (struct epoll_event) {
> >>> +			.events = kcmds[i].events,
> >>> +			.data = kcmds[i].data,
> >>> +		};
> >>> +		if (kcmds[i].flags) {
> >>> +			kcmds[i].error = ret = -EINVAL;
> >>
> >> To make the 'ret' change a little more obvious, maybe it's better to write
> >>
> >> 			ret = kcmds[i].error = -EINVAL;
> >>
> >>> +			goto out;
> >>> +		}
> >>> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> >>
> >> Likewise:
> >> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> >>
> >>> +		if (ret)
> >>> +			goto out;
> >>> +	}
> >>> +	if (spec) {
> >>> +		sigset_t ksigmask;
> >>> +		struct epoll_wait_spec kspec;
> >>> +		ktime_t timeout;
> >>> +
> >>> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> >>
> >> Cosmetic point: s/if(/if (/
> >>
> >>> +			return -EFAULT;
> >>> +		if (kspec.sigmask) {
> >>> +			if (kspec.sigsetsize != sizeof(sigset_t))
> >>> +				return -EINVAL;
> >>> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> >>> +				return -EFAULT;
> >>> +		}
> >>> +		timeout = timespec_to_ktime(kspec.timeout);
> >>> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> >>> +				     kspec.clockid, timeout,
> >>> +				     kspec.sigmask ? &ksigmask : NULL);
> >>
> >> If I understand correctly, the implementation means that the
> >> 'size_t sigsetsize' field will probably need to be exposed to 
> >> applications. In the existing epoll_pwait() call (as in  ppoll()
> >> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
> >> However, unless we expect glibc to do some structure copying to/from
> >> a structure that hides this field, then we're going end up exposing
> >> 'size_t sigsetsize' to applications. (This could be avoided, if we
> >> split the API as I suggest above. glibc would do the same thing 
> >> in epoll_pwait1() that it currently does in epoll_pwait().)
> 
> You missed responding to this point; I think it matters.
> (There were also some other points to consider in my reply 
> to your 0/6 mail.)
> 

My bad, something I typed when replying went into /dev/null for unknown reasons,
what I had was:

This should be easy to solve: glibc can be responsible for building spec, and
applications will do something like:

	epoll_mod_wait(epfd, ncmds, cmds, maxevents, events, clockid,
		       timeout, sigmask);

Thanks,
Fam

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Fam Zheng @ 2015-01-21  8:59 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra <peter>
In-Reply-To: <20150121075617.GA21266@mew>

On Tue, 01/20 23:56, Omar Sandoval wrote:
> On Tue, Jan 20, 2015 at 05:57:57PM +0800, Fam Zheng wrote:
> > This syscall is a sequence of
> > 
> > 1) a number of epoll_ctl calls
> > 2) a epoll_pwait, with timeout enhancement.
> > 
> > The epoll_ctl operations are embeded so that application doesn't have to use
> > separate syscalls to insert/delete/update the fds before poll. It is more
> > efficient if the set of fds varies from one poll to another, which is the
> > common pattern for certain applications. For example, depending on the input
> > buffer status, a data reading program may decide to temporarily not polling an
> > fd.
> > 
> > Because the enablement of batching in this interface, even that regular
> > epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> > single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
> > 
> > The only complexity is returning the result of each operation.  For each
> > epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> > the return code *iff* the command is executed (0 for success and -errno of the
> > equivalent epoll_ctl call), and will be left unchanged if the command is not
> > executed because some earlier error, for example due to failure of
> > copy_from_user to copy the array.
> > 
> > Applications can utilize this fact to do error handling: they could initialize
> > all the epoll_mod_wait.error to a positive value, which is by definition not a
> > possible output value from epoll_mod_wait. Then when the syscall returned, they
> > know whether or not the command is executed by comparing each error with the
> > init value, if they're different, they have the result of the command.
> > More roughly, they can put any non-zero and not distinguish "not run" from
> > failure.
> > 
> > Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> > scalar. This provides higher precision. The parameter field in struct
> > epoll_wait_spec, "clockid", also makes it possible for users to use a different
> > clock than the default when it makes more sense.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/syscalls.h |  5 ++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e7a116d..2cc22c9 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >  			      sigmask ? &ksigmask : NULL);
> >  }
> >  
> > +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> > +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> > +		struct epoll_wait_spec __user *, spec)
> > +{
> > +	struct epoll_mod_cmd *kcmds = NULL;
> > +	int i, ret = 0;
> > +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +	if (ncmds) {
> > +		if (!cmds)
> > +			return -EINVAL;
> > +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > +		if (!kcmds)
> > +			return -ENOMEM;
> > +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +	}
> > +	for (i = 0; i < ncmds; i++) {
> > +		struct epoll_event ev = (struct epoll_event) {
> > +			.events = kcmds[i].events,
> > +			.data = kcmds[i].data,
> > +		};
> > +		if (kcmds[i].flags) {
> > +			kcmds[i].error = ret = -EINVAL;
> > +			goto out;
> > +		}
> > +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	if (spec) {
> > +		sigset_t ksigmask;
> > +		struct epoll_wait_spec kspec;
> > +		ktime_t timeout;
> > +
> > +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> > +			return -EFAULT;
> This should probably be goto out, or you'll leak kcmds.
> 
> > +		if (kspec.sigmask) {
> > +			if (kspec.sigsetsize != sizeof(sigset_t))
> > +				return -EINVAL;
> Same here...
> 
> > +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> > +				return -EFAULT;
> and here.
> 
> > +		}
> > +		timeout = timespec_to_ktime(kspec.timeout);
> > +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> > +				     kspec.clockid, timeout,
> > +				     kspec.sigmask ? &ksigmask : NULL);
> > +	}
> > +
> > +out:
> > +	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
> > +		return -EFAULT;
> This will also leak kcmds, it should be ret = -EFAULT. This case, however, seems
> to lead to a weird corner case: if cmds is read-only, we'll end up executing
> every command but fail to copy out the return values, so when userspace gets the
> EFAULT, it won't know whether anything was executed. But, getting an EFAULT here
> means you're probably doing something wrong anyways, so maybe not the biggest
> concern.

Yes, thanks! Will fix this.

Fam

> 
> > +	kfree(kcmds);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_COMPAT
> >  COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
> >  		       struct epoll_event __user *, events,
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 85893d7..7156c80 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -12,6 +12,8 @@
> >  #define _LINUX_SYSCALLS_H
> >  
> >  struct epoll_event;
> > +struct epoll_mod_cmd;
> > +struct epoll_wait_spec;
> >  struct iattr;
> >  struct inode;
> >  struct iocb;
> > @@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
> >  				int maxevents, int timeout,
> >  				const sigset_t __user *sigmask,
> >  				size_t sigsetsize);
> > +asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
> > +				   int ncmds, struct epoll_mod_cmd __user * cmds,
> > +				   struct epoll_wait_spec __user * spec);
> >  asmlinkage long sys_gethostname(char __user *name, int len);
> >  asmlinkage long sys_sethostname(char __user *name, int len);
> >  asmlinkage long sys_setdomainname(char __user *name, int len);
> > -- 
> > 1.9.3
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> -- 
> Omar

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Fam Zheng @ 2015-01-21  9:05 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra <peter>
In-Reply-To: <54BE4EA4.6080901-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Tue, 01/20 13:48, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> I know this API has been through a number of iterations, and there were 
> discussions about the design that led to it becoming more complex.
> But, let us assume that someone has not seen those discussions,
> or forgotten them, or is too lazy to go hunting list archives.
> 
> Then: this patch series should somewhere have an explanation of
> why the API is what it is, ideally with links to previous relevant
> discussions. I see that you do part of that in
> 
>     [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
> 
> There are however no links to previous discussions in that mail (I guess
> http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591 is most
> relevant, nor is there any sort of change log in the commit message 
> that explains the evolution of the API. Having those would ease the 
> task of reviewers.
> 
> Coming back to THIS mail, this man page should also include an
> explanation of why the API is what it is. That would include much
> of the detail from the 5/6 patch, and probably more info besides.
> 
> Some specific points below.
> 
> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> > 
> > NAME
> >        epoll_mod_wait - modify and wait for I/O events on an epoll file
> >                         descriptor
> > 
> > SYNOPSIS
> > 
> >        int epoll_mod_wait(int epfd, int flags,
> >                           int ncmds, struct epoll_mod_cmd *cmds,
> >                           struct epoll_wait_spec *spec);
> > 
> > DESCRIPTION
> > 
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >        
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >
> >        2) When you need higher precision than microsecond for wait timeout.
> 
> s/microsecond/millisecond/

Yes, thanks for pointing out.

> >        if all the commands are successfully executed (all the error fields are
> >        set to 0), events are polled.
> 
> Does the operation execute all commands, or stop when it encounters the first 
> error? In other words, when looping over the returned 'error' fields, what
> is the termination condition for the user-space application?
> 
> (Yes, I know I can trivially inspect the patch 5/6 to answer this question, 
> but the man page should explicitly state this so that I don't have to 
> read the source, and also because it is only if you explicitly document 
> the intended behavior that I can tell whether the actual implementation 
> matches the intention.)


> 
> >        The last parameter "spec" is a pointer to struct epoll_wait_spec, which
> >        contains the information about how to poll the events. If it's NULL, this
> >        call will immediately return after running all the commands in cmds.
> > 
> >        The structure is defined as below:
> > 
> >            struct epoll_wait_spec {
> > 
> >                   /* The same as "maxevents" in epoll_pwait() */
> >                   int maxevents;
> > 
> >                   /* The same as "events" in epoll_pwait() */
> >                   struct epoll_event *events;
> > 
> >                   /* Which clock to use for timeout */
> >                   int clockid;
> 
> Which clocks can be specified here?
> CLOCK_MONOTONIC?
> CLOCK_REALTIME?
> CLOCK_PROCESS_CPUTIME_ID?
> clock_getcpuclockid()?
> others?

At the moment we can limit it to CLOCK_MONOTONIC and CLOCK_REALTIME, I'm not
sure any application care about others. It's not checked in this series, but
should be done in v2.

> 
> >                   /* Maximum time to wait if there is no event */
> >                   struct timespec timeout;
> 
> Is this timeout relative or absolute?

Relative. I'll document it. Absolute timeout can be added later with new flags.

> 
> >                   /* The same as "sigmask" in epoll_pwait() */
> >                   sigset_t *sigmask;
> 
> I just want to confirm here that 'sigmask' can be NULL, meaning
> that we degenerate to epoll_wait() functionality, right?

Yes. Will document explicitly.

> 
> >                   /* The same as "sigsetsize" in epoll_pwait() */
> >                   size_t sigsetsize;
> >            } EPOLL_PACKED;
> 
> What is the "EPOLL_PACKED" here for?

Copy paste error. :)

> 
> > RETURN VALUE
> > 
> >        When any error occurs, epoll_mod_wait() returns -1 and errno is set
> >        appropriately. All the "error" fields in cmds are unchanged before they
> >        are executed, and if any cmds are executed, the "error" fields are set
> >        to a return code accordingly. See also epoll_ctl for more details of the
> >        return code.
> > 
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout milliseconds.
> 
> s/milliseconds//

OK.

> 
> > 
> >        If spec is NULL, it returns 0 if all the commands are successful, and -1
> >        if an error occured.
> 
> s/occured/occurred/

OK, thanks.

> 
> > ERRORS
> > 
> >        These errors apply on either the return value of epoll_mod_wait or error
> >        status for each command, respectively.
> >
> >        EBADF  epfd or fd is not a valid file descriptor.
> > 
> >        EFAULT The memory area pointed to by events is not accessible with write
> >               permissions.
> > 
> >        EINTR  The call was interrupted by a signal handler before either any of
> >               the requested events occurred or the timeout expired; see
> >               signal(7).
> > 
> >        EINVAL epfd is not an epoll file descriptor, or maxevents is less than
> >               or equal to zero, or fd is the same as epfd, or the requested
> >               operation op is not supported by this interface.
> 
> Add: Or 'flags' is nonzero. Or a 'cmds.flags' field is nonzero.

Yes.

> 
> >        EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
> >               already registered with this epoll instance.
> > 
> >        ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
> >               with this epoll instance.
> > 
> >        ENOMEM There was insufficient memory to handle the requested op control
> >               operation.
> > 
> >        ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
> >               encountered while trying to register (EPOLL_CTL_ADD) a new file
> >               descriptor on an epoll instance.  See epoll(7) for further
> >               details.
> > 
> >        EPERM  The target file fd does not support epoll.
> > 
> > CONFORMING TO
> > 
> >        epoll_mod_wait() is Linux-specific.
> > 
> > SEE ALSO
> > 
> >        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
> 
> Please add sigprocmask(2).

OK! Thanks for reviewing this.

Fam

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Fam Zheng @ 2015-01-21  9:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra <pe>
In-Reply-To: <CALCETrU4TeG1ShVLkQgqQ6usFm8pg_t0D8K=Mi_UJGSfxUwXtA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 01/20 14:40, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> >
> > NAME
> >        epoll_mod_wait - modify and wait for I/O events on an epoll file
> >                         descriptor
> >
> > SYNOPSIS
> >
> >        int epoll_mod_wait(int epfd, int flags,
> >                           int ncmds, struct epoll_mod_cmd *cmds,
> >                           struct epoll_wait_spec *spec);
> >
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >
> >        2) When you need higher precision than microsecond for wait timeout.
> >
> >        The epoll_ctl(2) operations are embedded into this call by with ncmds
> >        and cmds. The latter is an array of command structs:
> >
> >            struct epoll_mod_cmd {
> >
> >                   /* Reserved flags for future extension, must be 0 for now. */
> >                   int flags;
> >
> >                   /* The same as epoll_ctl() op parameter. */
> >                   int op;
> >
> >                   /* The same as epoll_ctl() fd parameter. */
> >                   int fd;
> >
> >                   /* The same as the "events" field in struct epoll_event. */
> >                   uint32_t events;
> >
> >                   /* The same as the "data" field in struct epoll_event. */
> >                   uint64_t data;
> >
> >                   /* Output field, will be set to the return code once this
> >                    * command is executed by kernel */
> >                   int error;
> >            };
> 
> I would add an extra u32 at the end so that the structure size will be
> a multiple of 8 bytes on all platforms.

OK, makes sense.

> 
> >
> >        There is no guartantee that all the commands are executed in order. Only
> >        if all the commands are successfully executed (all the error fields are
> >        set to 0), events are polled.
> 
> If this doesn't happen, what error is returned?

The last error in executing commands.

> 
> >            struct epoll_wait_spec {
> >
> >                   /* The same as "maxevents" in epoll_pwait() */
> >                   int maxevents;
> >
> >                   /* The same as "events" in epoll_pwait() */
> >                   struct epoll_event *events;
> >
> >                   /* Which clock to use for timeout */
> >                   int clockid;
> >
> >                   /* Maximum time to wait if there is no event */
> >                   struct timespec timeout;
> >
> >                   /* The same as "sigmask" in epoll_pwait() */
> >                   sigset_t *sigmask;
> >
> >                   /* The same as "sigsetsize" in epoll_pwait() */
> >                   size_t sigsetsize;
> >            } EPOLL_PACKED;
> 
> I think the convention is to align the structure's fields manually
> rather than declaring it to be packed.

OK.

> 
> >
> > RETURN VALUE
> >
> >        When any error occurs, epoll_mod_wait() returns -1 and errno is set
> >        appropriately. All the "error" fields in cmds are unchanged before they
> >        are executed, and if any cmds are executed, the "error" fields are set
> >        to a return code accordingly. See also epoll_ctl for more details of the
> >        return code.
> 
> Does this mean that callers should initialize the error fields to an
> impossible value first so they can tell which commands were executed?

Yes.

> 
> >
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout milliseconds.
> >
> >        If spec is NULL, it returns 0 if all the commands are successful, and -1
> >        if an error occured.
> >
> > ERRORS
> >
> >        These errors apply on either the return value of epoll_mod_wait or error
> >        status for each command, respectively.
> 
> Please clarify which errors are returned overall and which are per-command.

OK.

Fam

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-01-21  9:07 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Greg Kroah-Hartman, arnd, ebiederm,
	gnomes, teg, jkosina, luto, linux-api, linux-kernel
  Cc: daniel, dh.herrmann, tixxdz
In-Reply-To: <54BF6A13.2090008@gmail.com>

Hi Michael,

On 01/21/2015 09:57 AM, Michael Kerrisk (man-pages) wrote:
> On 01/20/2015 06:50 PM, Daniel Mack wrote:

>> I've addressed all but the below issues, following your suggestions.
> 
> Are your changes already visible somewhere?

Yes, in the upstream repo for the standalone module, which we also use
to build the patch set from:

  https://code.google.com/p/d-bus/source/browse/kdbus.txt

>> Hmm, you're quoting text from section 5, and section 4 actually
>> describes the concept of items quite well I believe?
> 
> Well, Section 4 is pretty short. My point is that most of the various 
> blob formats (e.g., kdbus_pids, kdbus_caps, kdbus_memfd) are not
> documented in kdbus.txt. They all should be, IMO.

Okay, I'll add some text about them.


Best regards,
Daniel

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-21  9:07 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Daniel Mack
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <54BE5F1F.8070703-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Daniel,

On 01/20/2015 02:58 PM, Michael Kerrisk (man-pages) wrote:
> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>>

[...]

>> +offset field contains the location of the new message inside the receiver's
>> +pool. The message is stored as struct kdbus_msg at this offset, and can be
>> +interpreted with the semantics described above.
>> +
>> +Also, if the connection allowed for file descriptor to be passed
>> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
>> +installed into the receiving process after the KDBUS_CMD_RECV ioctl returns.
> 
> ###
> "after"??? When exactly?

By the way, what was the answer to this question?

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-01-21  9:12 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <54BF6C67.7090909-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 01/21/2015 10:07 AM, Michael Kerrisk (man-pages) wrote:
> On 01/20/2015 02:58 PM, Michael Kerrisk (man-pages) wrote:

>>> +Also, if the connection allowed for file descriptor to be passed
>>> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
>>> +installed into the receiving process after the KDBUS_CMD_RECV ioctl returns.
>>
>> ###
>> "after"??? When exactly?
> 
> By the way, what was the answer to this question?

I've corrected the wording on this. The file descriptors are installed
when the RECV ioctl is called, so they are ready to use when the call
returns.


Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] selftests/exec: Check if the syscall exists and bail if not
From: David Drysdale @ 2015-01-21 10:22 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Geert Uytterhoeven, Shuah Khan,
	davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL
In-Reply-To: <1421826060-2237-1-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>

On Wed, Jan 21, 2015 at 7:41 AM, Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org> wrote:
> On systems which don't implement sys_execveat(), this test produces a
> lot of output.
>
> Add a check at the beginning to see if the syscall is present, and if
> not just note one error and return.

Good point, thanks.

> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
> ---
>  tools/testing/selftests/exec/execveat.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
> index e238c9559caf..b87e4a843bea 100644
> --- a/tools/testing/selftests/exec/execveat.c
> +++ b/tools/testing/selftests/exec/execveat.c
> @@ -234,6 +234,14 @@ static int run_tests(void)
>         int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC);
>         int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC);
>
> +       /* Check if we have execveat at all, and bail early if not */
> +       errno = 0;
> +       execveat_(-1, NULL, NULL, NULL, 0);
> +       if (errno == -ENOSYS) {

Could we change this to ENOSYS (no minus) and also change
the execveat_() function similarly, so that a binary built where
__NR_execveat is available but running where it isn't also exits
early?  (My bad for having the minus sign in execveat_() in the
first place -- fingers too used to kernel mode.)

Thanks!

> +               printf("[FAIL] ENOSYS calling execveat - no kernel support?\n");
> +               return 1;
> +       }
> +
>         /* Change file position to confirm it doesn't affect anything */
>         lseek(fd, 10, SEEK_SET);
>
> --
> 2.1.0
>

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-21 10:28 UTC (permalink / raw)
  To: David Herrmann
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Greg Kroah-Hartman,
	Arnd Bergmann, Eric W. Biederman, One Thousand Gnomes,
	Tom Gundersen, Jiri Kosina, Andy Lutomirski, Linux API,
	linux-kernel, Daniel Mack, Djalal Harouni, Daniel Mack,
	Johannes Stezenbach
In-Reply-To: <CANq1E4SjfZOKqTsdkt519vKc1Poeah5McVJBb_spdHjbKv4=7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi David,

On 01/20/2015 03:31 PM, David Herrmann wrote:
> Hi Michael
> 
> On Tue, Jan 20, 2015 at 2:53 PM, Michael Kerrisk (man-pages)
> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
>>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>>>
>>> kdbus is a system for low-latency, low-overhead, easy to use
>>> interprocess communication (IPC).
>>>
>>> The interface to all functions in this driver is implemented via ioctls
>>> on files exposed through a filesystem called 'kdbusfs'. The default
>>> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
>>> documentation about the kernel level API design.
>>
>> I have some details feedback on the contents of this file, and some
>> bigger questions. I'll split them out into separate mails.
>>
>> So here, the bigger, general questions to start with. I've arrived late
>> to this, so sorry if they've already been discussed, but the answers to
>> some of the questions should actually be in this file, I would have
>> expected.
>>
>> This is an enormous and complex API. Why is the API ioctl() based,
>> rather than system-call-based? Have we learned nothing from the hydra
>> that the futex() multiplexing syscall became? (And kdbus is an order
>> of magnitude more complex, by the look of things.) At the very least,
>> a *good* justification of why the API is ioctl()-based should be part
>> of this documentation file.
>>
>> An observation: The documentation below is substantial, but this API is
>> enormous, so the documentation still feels rather thin. What would
>> really help would be some example code in the doc.
>>
>> And on the subject of code examples... Are there any (prototype)
>> working user-space applications that exercise the current kdbus
>> implementation? That is, can I install these kdbus patches, and
>> then find a simple example application somewhere that does
>> something to exercise kdbus?
> 
> If you run a 3.18 kernel, you can install kdbus.ko from our repository
> and boot a full Fedora system running Gnome3 with kdbus, given that
> you compiled systemd with --enable-kdbus (which is still
> experimental). No legacy dbus1 daemon is running. Instead, we have a
> bus-proxy that converts classic dbus1 to kdbus, so all
> bus-communication runs on kdbus.

Good to hear.  I think that some info like this should go out in 
the "00/" covering mails for future patch revisions, so that people
can get some sense of the testing that has been done.

>> And then: is there any substantial real-world application (e.g., a
>> full D-Bus port) that is being developed in tandem with this kernel
>> side patch? (I don't mean a user-space library; I mean a seriously
>> large application.) This is an incredibly complex API whose
>> failings are only going to become evident through real-world use.
>> Solidifying an API in the kernel and then discovering the API
>> problems later when writing real-world applications would make for
>> a sad story. A story something like that of inotify, an API which
>> is an order of magnitude less complex than kdbus. (I can't help but
>> feel that many of inotify problems that I discuss at
>> https://lwn.net/Articles/605128/ might have been fixed or mitigated
>> if a few real-world applications had been implemented before the
>> API  was set in stone.)
> 
> I think running a whole Gnome3 stack counts as "substantial real-world
> application", right? 

Yes, I'll give you that ;-).

>  Sure, it's a dbus1-to-kdbus layer, but all the
> systemd tools use kdbus natively and it works just fine. In fact, we
> all run kdbus on our main-systems every day.
> 
> We've spent over a year fixing races and API misdesigns, we've talked
> to other toolkit developers (glib, qt, ..) and made sure we're
> backwards compatible to dbus1. I don't think the API is perfect,
> everyone makes mistakes. But with bus-proxy and systemd we have two
> huge users of kdbus that put a lot of pressure on API design.

I'll say more about that in another mail in a moment. I'm not enthusiastic
about the API.

>>> +For a kdbus specific userspace library implementation please refer to:
>>> +  http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h
>>
>> Is this library intended just for systemd? More generally, is there an
>> intention to provide a general purpose library API for kdbus? Or is the
>> intention that each application will roll a library suitable to its
>> needs? I think an answer to that question would be useful in this
>> Documentation file.
> 
> kdbus is in no way bound to systemd. There are ongoing efforts to port
> glib and qt to kdbus natively. The API is pretty simple 
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
I think you and I must have quite different definitions of "simple"...
(For more on this point, see my reply to Daniel in a moment.)

> and I don't
> see how a libkdbus would simplify things. In fact, even our tests only
> have slim wrappers around the ioctls to simplify error-handling in
> test-scenarios.

Again, the above info would be useful in the Documentation file.

> Note that most of the toolkit work is on the marshaling level, which
> is independent of kdbus. kdbus just provides the transport level. DBus
> is just one, yet significant, application-layer on top of kdbus. Our
> test-cases use kdbus exclusively to transport raw byte streams.

Okay. Thanks for the info.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-21 10:32 UTC (permalink / raw)
  To: Daniel Mack, Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	Johannes Stezenbach
In-Reply-To: <54BE9D08.7010804-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

Hello Daniel,

On 01/20/2015 07:23 PM, Daniel Mack wrote:
> On 01/20/2015 02:53 PM, Michael Kerrisk (man-pages) wrote:
>> This is an enormous and complex API. Why is the API ioctl() based,
>> rather than system-call-based? Have we learned nothing from the hydra
>> that the futex() multiplexing syscall became? (And kdbus is an order
>> of magnitude more complex, by the look of things.) At the very least,
>> a *good* justification of why the API is ioctl()-based should be part
>> of this documentation file.
> 
> I think the simplest reason is because we want to be able to build kdbus
> as a module. 

This isn't any _good_ justification...

> It's rather an optional driver than a core kernel feature.

Given the various things that I've seen said about kdbus, the
preceding sentence makes little sense to me:

* kdbus will be the framework supporting user-space D-Bus in the
  future, and also used by systemd, and so on pretty much every 
  desktop system.
* kdbus solves much of the bandwidth problem of D-Bus1. That,
  along with a host of other features mean that there will be
  a lot of user-space developers interested in using this API.
* Various parties in user space are already expressing strong 
  interest in kdbus.

My guess from the above? This will NOT be an "optional driver". 
It *will be* a core kernel feature.

> IMO, kernel primitives should be syscalls, but kdbus is not a primitive
> but an elaborate subsystem.

Agreed. It's an elaborate subsystem. But that fact doesn't in itself
dictate any particular API design choice.

> Also, the context the kdbus commands operate on originate from a
> mountable special-purpose file system.

It's not clear to me how this point implies any particular API design
choice.

> Hence, we decided not to use a
> global kernel interface but specific ioctls on the nodes exposed by kdbusfs.

I don't follow the reasoning here at all. Here's what we have, if I
have grasped it roughly correctly:

* 16 ioctls exposed to user space.
* some 20 different structures exchanged between kernel and user space
* about 14k lines of kernel code implement the above
* some rather thin documentation of the whole lot

Sorry if that last point seems rather harsh. I know that you personally 
have done a lot of work on the kdbus.txt file. David Herrmann asserts
that this is a simple API. It is not. He also suggests that there is
no need for a libkdbus. I don't know whether that's right or not, but the
point is then that there's an expectation that the raw kernel API is what
user space will need to work with. 

Notwithstanding the fact that there's a lot of (good) information in
kdbus.txt, there's not nearly enough for someone to create useful, 
robust applications that use that API (and not enough that I as a
reviewer feel comfortable about reviewing the API). As things stand,
user-space developers will be forced to decipher large amounts of kernel
code and existing applications in order to actually build things. And
when they do, they'll be using one of the worst APIs known to man: ioctl(),
an API that provides no type safety at all.

ioctl() is a get-out-of-jail free card when it comes to API design. Rather
than thinking carefully and long about a set of coherent, stable APIs that 
provide some degree of type-safety to user-space, one just adds/changes/removes
an ioctl. And that process seems to be frequent and ongoing even now. (And 
it's to your great credit that the API/ABI breaks are clearly and honestly 
marked in the kdbus.h changelog.) All of this lightens the burden of API
design for kernel developers, but I'm concerned that the long-term pain
for user-space developers who use an API which (in my estimation) may
come to be widely used will be enormous.

Concretely, I'd like to see the following in kdbus.txt:
* A lot more detail, adding the various pieces that are currently missing.
  I've mentioned already the absence of detail on the item blob structures, 
  but there's probably several other pieces as well. My problem is that the
  API is so big and hard to grok that it's hard to even begin to work out
  what's missing from the documentation.
* Fleshing out the API summaries with code snippets that illustrate the
  use of the APIs.
* At least one simple working example application, complete with a walk
  through of how it's built and run.

Yes, all of this is a big demand. But this is a big API that is being added 
to the kernel, and one that may become widely used and for a long time.
It's imperative that the API is well documented, and as well designed as
possible. Furthermore, with such improved documentation I feel we'd be in 
a better position to evaluate the merits of an ioctl()-based API versus
some other approach.

Thanks,

Michael




-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Paolo Bonzini @ 2015-01-21 10:34 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra <peter>
In-Reply-To: <54BF5ACE.6030206@gmail.com>



On 21/01/2015 08:52, Michael Kerrisk (man-pages) wrote:
>> > The problem is that there is no room for flags field in epoll_pwait1, which is
>> > asked for, in previous discussion thread [1].
> Ahh yes, I certainly should not have forgotten that. But that's easily solved.
> Do as for pselect6():
> 
> strcut sigargs {
>     const sigset_t *ss;
>     size_t          ss_len; /* Size (in bytes) of object pointed
>                                to by 'ss' */
> }
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>              struct timespec *timeout, int clock_id, int flags,
>              int timeout,
>              const struct sigargs *sargs);
> 

Alternatively, place the clock_id in the lower 16 bits of flags.
MAX_CLOCKS is 16 right now, so there's room.

Paolo

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Paolo Bonzini @ 2015-01-21 10:37 UTC (permalink / raw)
  To: Fam Zheng, Michael Kerrisk (man-pages), Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra, linux-fsdevel-fy+rA21nqHI
In-Reply-To: <20150121085827.GB23024-ZfWej9ACyHUXGNroddHbYwC/G2K4zDHf@public.gmane.org>



On 21/01/2015 09:58, Fam Zheng wrote:
>> > See my comment in the earlier mail. If you split this into two 
>> > APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
>> > then the return value of epoll_ctl_batch() could be used to tell
>> > user space how many commands succeeded. Much simpler!
> Yes it is much simpler. However the reason to add batching in the first place is
> to make epoll faster, by reducing syscalls. Splitting makes the result
> sub-optimal: we still need at least 2 calls instead of 1.  Each one of the three
> proposed new call *is* a step forward, but I don't think we will have everything
> solved even by implementing them all. Compromise needed between performance or
> complexity.
> 
> My take for simplicity will be leaving epoll_ctl as-is, and my take for
> performance will be epoll_pwait1. And I don't really like putting my time on
> epoll_ctl_batch, thinking it as a ambivalent compromise in between.

I agree with Michael actually.  The big change is going from O(n)
epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
a fraction of a microsecond.

Paolo

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Fam Zheng @ 2015-01-21 11:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael Kerrisk (man-pages), Andy Lutomirski, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu
In-Reply-To: <54BF814F.7090703@redhat.com>

On Wed, 01/21 11:37, Paolo Bonzini wrote:
> 
> 
> On 21/01/2015 09:58, Fam Zheng wrote:
> >> > See my comment in the earlier mail. If you split this into two 
> >> > APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
> >> > then the return value of epoll_ctl_batch() could be used to tell
> >> > user space how many commands succeeded. Much simpler!
> > Yes it is much simpler. However the reason to add batching in the first place is
> > to make epoll faster, by reducing syscalls. Splitting makes the result
> > sub-optimal: we still need at least 2 calls instead of 1.  Each one of the three
> > proposed new call *is* a step forward, but I don't think we will have everything
> > solved even by implementing them all. Compromise needed between performance or
> > complexity.
> > 
> > My take for simplicity will be leaving epoll_ctl as-is, and my take for
> > performance will be epoll_pwait1. And I don't really like putting my time on
> > epoll_ctl_batch, thinking it as a ambivalent compromise in between.
> 
> I agree with Michael actually.  The big change is going from O(n)
> epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
> Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
> a fraction of a microsecond.
> 

Maybe I'm missing something, but in common cases, the set of fds for epoll_wait
doesn't change that radically from one iteration to another, does it?

Fam

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Paolo Bonzini @ 2015-01-21 11:50 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Michael Kerrisk (man-pages), Andy Lutomirski,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu
In-Reply-To: <20150121111404.GA3804-ZfWej9ACyHUXGNroddHbYwC/G2K4zDHf@public.gmane.org>



On 21/01/2015 12:14, Fam Zheng wrote:
> > My take for simplicity will be leaving epoll_ctl as-is, and my take for
> > performance will be epoll_pwait1. And I don't really like putting my time on
> > epoll_ctl_batch, thinking it as a ambivalent compromise in between.
> 
> > I agree with Michael actually.  The big change is going from O(n)
> > epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
> > Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
> > a fraction of a microsecond.
> 
> Maybe I'm missing something, but in common cases, the set of fds for epoll_wait
> doesn't change that radically from one iteration to another, does it?

That depends on the application.

Paolo

^ permalink raw reply

* [PATCH v12 02/18] vfio: platform: probe to devices on the platform bus
From: Baptiste Reynal @ 2015-01-21 12:49 UTC (permalink / raw)
  To: kvmarm, iommu, alex.williamson
  Cc: will.deacon, tech, christoffer.dall, eric.auger, kim.phillips,
	marc.zyngier, Antonios Motakis, open list, open list:VFIO DRIVER,
	open list:ABI/API
In-Reply-To: <1421844606-24751-1-git-send-email-b.reynal@virtualopensystems.com>

From: Antonios Motakis <a.motakis@virtualopensystems.com>

Driver to bind to Linux platform devices, and callbacks to discover their
resources to be used by the main VFIO PLATFORM code.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
---
 drivers/vfio/platform/vfio_platform.c | 103 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h             |   1 +
 2 files changed, 104 insertions(+)
 create mode 100644 drivers/vfio/platform/vfio_platform.c

diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
new file mode 100644
index 0000000..cef645c
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -0,0 +1,103 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis <a.motakis@virtualopensystems.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <linux/platform_device.h>
+
+#include "vfio_platform_private.h"
+
+#define DRIVER_VERSION  "0.10"
+#define DRIVER_AUTHOR   "Antonios Motakis <a.motakis@virtualopensystems.com>"
+#define DRIVER_DESC     "VFIO for platform devices - User Level meta-driver"
+
+/* probing devices from the linux platform bus */
+
+static struct resource *get_platform_resource(struct vfio_platform_device *vdev,
+					      int num)
+{
+	struct platform_device *dev = (struct platform_device *) vdev->opaque;
+	int i;
+
+	for (i = 0; i < dev->num_resources; i++) {
+		struct resource *r = &dev->resource[i];
+
+		if (resource_type(r) & (IORESOURCE_MEM|IORESOURCE_IO)) {
+			if (!num)
+				return r;
+
+			num--;
+		}
+	}
+	return NULL;
+}
+
+static int get_platform_irq(struct vfio_platform_device *vdev, int i)
+{
+	struct platform_device *pdev = (struct platform_device *) vdev->opaque;
+
+	return platform_get_irq(pdev, i);
+}
+
+static int vfio_platform_probe(struct platform_device *pdev)
+{
+	struct vfio_platform_device *vdev;
+	int ret;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->opaque = (void *) pdev;
+	vdev->name = pdev->name;
+	vdev->flags = VFIO_DEVICE_FLAGS_PLATFORM;
+	vdev->get_resource = get_platform_resource;
+	vdev->get_irq = get_platform_irq;
+
+	ret = vfio_platform_probe_common(vdev, &pdev->dev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+static int vfio_platform_remove(struct platform_device *pdev)
+{
+	struct vfio_platform_device *vdev;
+
+	vdev = vfio_platform_remove_common(&pdev->dev);
+	if (vdev) {
+		kfree(vdev);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static struct platform_driver vfio_platform_driver = {
+	.probe		= vfio_platform_probe,
+	.remove		= vfio_platform_remove,
+	.driver	= {
+		.name	= "vfio-platform",
+		.owner	= THIS_MODULE,
+	},
+};
+
+module_platform_driver(vfio_platform_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9ade02b..4e93a97 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -159,6 +159,7 @@ struct vfio_device_info {
 	__u32	flags;
 #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
-- 
2.2.2


^ permalink raw reply related

* Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)
From: Milosz Tanski @ 2015-01-21 14:55 UTC (permalink / raw)
  To: Volker Lendecke
  Cc: Andrew Morton, LKML, Christoph Hellwig,
	linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, Mel Gorman,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch
In-Reply-To: <E1Xwo5O-00GRl4-Cc@intern.SerNet.DE>

On Fri, Dec 5, 2014 at 3:17 AM, Volker Lendecke
<Volker.Lendecke@sernet.de> wrote:
>
> On Thu, Dec 04, 2014 at 03:11:02PM -0800, Andrew Morton wrote:
> > I can see all that, but it's handwaving.  Yes, preadv2() will perform
> > better in some circumstances than fincore+pread.  But how much better?
> > Enough to justify this approach, or not?
> >
> > Alas, the only way to really settle that is to implement fincore() and
> > to subject it to a decent amount of realistic quantitative testing.
> >
> > Ho hum.
> >
> > Could you please hunt down some libuv developers, see if we can solicit
> > some quality input from them?  As I said, we really don't want to merge
> > this then find that people don't use it for some reason, or that it
> > needs changes.
>
> All I can say from a Samba perspective is that none of the ARM based
> Storage boxes I have seen so far do AIO because of the base footprint
> for every read. For sequential reads kernel-level readahead could kick
> in properly and we should be able to give them the best of both worlds:
> No context switches in the default case but also good parallel behaviour
> for other workloads.  The most important benchmark for those guys is to
> read a DVD image, whether it makes sense or not.


I just made wanted to share some progress on this. And I apologize for
for all these different threads (this, LSF/FS and then Jermey and
Volker).

I recently implemented cifs support (via libsmbcli) for FIO so I can
have some hard numbers on the benchmarks. So all you guys will be
seeing more data soon enough. It's going to take a bit of time to put
it together because it takes a lot of time to benchmark to make sure
we have correct and non-noisy numbers.

In the meantime I have some numbers from my first run here:
http://i.imgur.com/05SMu8d.jpg

Sorry for the link to the image, it was easier. The test case is a
single FIO client doing 4K random reads, on localhost smbd server, on
a fully cached file for 10 minutes with a 1 minute warm up.

Threadpool + preadv2 for fast read does much better in terms of
bandwidth and a bit better in terms of latency. Sync is still the
fastest, but the gap is narrowed. Not a bad improvement for (Volker's)
9 line change to samba code.

Also, I look into why the gap between sync and threadpool + preadv2 is
not even smaller. From my preliminary investigation it looks like the
async threadpool code path does a lot more work then the sync call...
even in the case we do the fast read. According to perf the hotest
code userspace (smbd+ library) is malloc + free. So I imagine the
optimizing the fast read case to avoid a bunch of extra request
allocations will bring us even closer to sync.

Again, I'll have and more complex test cases soon just wanted to share
progress. I imagine that they'll the gap between threadpool + preadv2
and just threadpool is going to get wider as add more blocking calls
into the queue. I'll have number on that as soon as week can.


diff --git a/source3/modules/vfs_default.c b/source3/modules/vfs_default.c
index 5634cc0..90348d8 100644
--- a/source3/modules/vfs_default.c
+++ b/source3/modules/vfs_default.c
@@ -718,6 +741,7 @@ static struct tevent_req
*vfswrap_pread_send(struct vfs_handle_struct *handle,
        struct tevent_req *req;
        struct vfswrap_asys_state *state;
        int ret;
+       ssize_t nread;

        req = tevent_req_create(mem_ctx, &state, struct vfswrap_asys_state);
        if (req == NULL) {
@@ -730,6 +754,14 @@ static struct tevent_req
*vfswrap_pread_send(struct vfs_handle_struct *handle,
        state->asys_ctx = handle->conn->sconn->asys_ctx;
        state->req = req;

+       nread = pread2(fsp->fh->fd, data, n, offset, RWF_NONBLOCK);
+       // TODO: partial reads
+       if (nread == n) {
+               state->ret = nread;
+               tevent_req_done(req);
+               return tevent_req_post(req, ev);
+       }
+
        SMBPROFILE_BYTES_ASYNC_START(syscall_asys_pread, profile_p,
                                     state->profile_bytes, n);
        ret = asys_pread(state->asys_ctx, fsp->fh->fd, data, n, offset, req);


-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* [PATCH pre-squash 02/14] virtio-pci: define layout for virtio 1.0
From: Michael S. Tsirkin @ 2015-01-21 15:00 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Rusty Russell,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Gerd Hoffmann, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1421852375-22604-1-git-send-email-mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

From: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

Based on patches by Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, but I found it
hard to follow so changed to use structures which are more
self-documenting.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Tested-by: Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/virtio_pci.h | 62 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 509d630..4e05423 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -99,4 +99,66 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+#ifndef VIRTIO_PCI_NO_MODERN
+
+/* IDs for different capabilities.  Must all exist. */
+
+/* Common configuration */
+#define VIRTIO_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
+/* ISR access */
+#define VIRTIO_PCI_CAP_ISR_CFG		3
+/* Device specific confiuration */
+#define VIRTIO_PCI_CAP_DEVICE_CFG	4
+
+/* This is the PCI capability header: */
+struct virtio_pci_cap {
+	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
+	__u8 cap_next;		/* Generic PCI field: next ptr. */
+	__u8 cap_len;		/* Generic PCI field: capability length */
+	__u8 type_and_bar;	/* Upper 3 bits: bar.
+				 * Lower 3 is VIRTIO_PCI_CAP_*_CFG. */
+	__le32 offset;		/* Offset within bar. */
+	__le32 length;		/* Length. */
+};
+
+#define VIRTIO_PCI_CAP_BAR_SHIFT	5
+#define VIRTIO_PCI_CAP_BAR_MASK		0x7
+#define VIRTIO_PCI_CAP_TYPE_SHIFT	0
+#define VIRTIO_PCI_CAP_TYPE_MASK	0x7
+
+struct virtio_pci_notify_cap {
+	struct virtio_pci_cap cap;
+	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
+};
+
+/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
+struct virtio_pci_common_cfg {
+	/* About the whole device. */
+	__le32 device_feature_select;	/* read-write */
+	__le32 device_feature;		/* read-only */
+	__le32 guest_feature_select;	/* read-write */
+	__le32 guest_feature;		/* read-write */
+	__le16 msix_config;		/* read-write */
+	__le16 num_queues;		/* read-only */
+	__u8 device_status;		/* read-write */
+	__u8 config_generation;		/* read-only */
+
+	/* About a specific virtqueue. */
+	__le16 queue_select;		/* read-write */
+	__le16 queue_size;		/* read-write, power of 2. */
+	__le16 queue_msix_vector;	/* read-write */
+	__le16 queue_enable;		/* read-write */
+	__le16 queue_notify_off;	/* read-only */
+	__le32 queue_desc_lo;		/* read-write */
+	__le32 queue_desc_hi;		/* read-write */
+	__le32 queue_avail_lo;		/* read-write */
+	__le32 queue_avail_hi;		/* read-write */
+	__le32 queue_used_lo;		/* read-write */
+	__le32 queue_used_hi;		/* read-write */
+};
+
+#endif /* VIRTIO_PCI_NO_MODERN */
+
 #endif
-- 
MST

^ permalink raw reply related

* [PATCH pre-squash 03/14] fixup! virtio-pci: define layout for virtio 1.0
From: Michael S. Tsirkin @ 2015-01-21 15:00 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Rusty Russell,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Gerd Hoffmann, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1421852375-22604-1-git-send-email-mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

virtio-pci: fix up virtio 1.0 vendor capability

Gerd Hoffmann noticed that we implemented
capability layout from an old draft.
Unfortunately the code was copied to host as well,
so we didn't notice.

Luckily we caught this in time.

This fixes commit "virtio-pci: define layout for virtio 1.0"
and should be smashed with it.

Reported-by: Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Tested-by: Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/virtio_pci.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 4e05423..a2b2e13 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -117,10 +117,11 @@ struct virtio_pci_cap {
 	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
 	__u8 cap_next;		/* Generic PCI field: next ptr. */
 	__u8 cap_len;		/* Generic PCI field: capability length */
-	__u8 type_and_bar;	/* Upper 3 bits: bar.
-				 * Lower 3 is VIRTIO_PCI_CAP_*_CFG. */
+	__u8 cfg_type;		/* Identifies the structure. */
+	__u8 bar;		/* Where to find it. */
+	__u8 padding[3];	/* Pad to full dword. */
 	__le32 offset;		/* Offset within bar. */
-	__le32 length;		/* Length. */
+	__le32 length;		/* Length of the structure, in bytes. */
 };
 
 #define VIRTIO_PCI_CAP_BAR_SHIFT	5
-- 
MST

^ permalink raw reply related

* [PATCH pre-squash 07/14] virtio_pci: macros for PCI layout offsets
From: Michael S. Tsirkin @ 2015-01-21 15:00 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Rusty Russell,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Gerd Hoffmann, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1421852375-22604-1-git-send-email-mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

From: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

QEMU wants it, so why not?  Trust, but verify.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Tested-by: Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/virtio_pci.h    | 30 ++++++++++++++++++++
 drivers/virtio/virtio_pci_modern.c | 58 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index a2b2e13..0911c62 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -160,6 +160,36 @@ struct virtio_pci_common_cfg {
 	__le32 queue_used_hi;		/* read-write */
 };
 
+/* Macro versions of offsets for the Old Timers! */
+#define VIRTIO_PCI_CAP_VNDR		0
+#define VIRTIO_PCI_CAP_NEXT		1
+#define VIRTIO_PCI_CAP_LEN		2
+#define VIRTIO_PCI_CAP_TYPE_AND_BAR	3
+#define VIRTIO_PCI_CAP_OFFSET		4
+#define VIRTIO_PCI_CAP_LENGTH		8
+
+#define VIRTIO_PCI_NOTIFY_CAP_MULT	12
+
+#define VIRTIO_PCI_COMMON_DFSELECT	0
+#define VIRTIO_PCI_COMMON_DF		4
+#define VIRTIO_PCI_COMMON_GFSELECT	8
+#define VIRTIO_PCI_COMMON_GF		12
+#define VIRTIO_PCI_COMMON_MSIX		16
+#define VIRTIO_PCI_COMMON_NUMQ		18
+#define VIRTIO_PCI_COMMON_STATUS	20
+#define VIRTIO_PCI_COMMON_CFGGENERATION	21
+#define VIRTIO_PCI_COMMON_Q_SELECT	22
+#define VIRTIO_PCI_COMMON_Q_SIZE	24
+#define VIRTIO_PCI_COMMON_Q_MSIX	26
+#define VIRTIO_PCI_COMMON_Q_ENABLE	28
+#define VIRTIO_PCI_COMMON_Q_NOFF	30
+#define VIRTIO_PCI_COMMON_Q_DESCLO	32
+#define VIRTIO_PCI_COMMON_Q_DESCHI	36
+#define VIRTIO_PCI_COMMON_Q_AVAILLO	40
+#define VIRTIO_PCI_COMMON_Q_AVAILHI	44
+#define VIRTIO_PCI_COMMON_Q_USEDLO	48
+#define VIRTIO_PCI_COMMON_Q_USEDHI	52
+
 #endif /* VIRTIO_PCI_NO_MODERN */
 
 #endif
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index a3d8101..c86594e 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -464,9 +464,65 @@ static void virtio_pci_release_dev(struct device *_d)
 	kfree(vp_dev);
 }
 
-/* TODO: validate the ABI statically. */
+/* This is part of the ABI.  Don't screw with it. */
 static inline void check_offsets(void)
 {
+	/* Note: disk space was harmed in compilation of this function. */
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_VNDR !=
+		     offsetof(struct virtio_pci_cap, cap_vndr));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_NEXT !=
+		     offsetof(struct virtio_pci_cap, cap_next));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
+		     offsetof(struct virtio_pci_cap, cap_len));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_TYPE_AND_BAR !=
+		     offsetof(struct virtio_pci_cap, type_and_bar));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
+		     offsetof(struct virtio_pci_cap, offset));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
+		     offsetof(struct virtio_pci_cap, length));
+	BUILD_BUG_ON(VIRTIO_PCI_NOTIFY_CAP_MULT !=
+		     offsetof(struct virtio_pci_notify_cap,
+			      notify_off_multiplier));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      device_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DF !=
+		     offsetof(struct virtio_pci_common_cfg, device_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      guest_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GF !=
+		     offsetof(struct virtio_pci_common_cfg, guest_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, msix_config));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_NUMQ !=
+		     offsetof(struct virtio_pci_common_cfg, num_queues));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_STATUS !=
+		     offsetof(struct virtio_pci_common_cfg, device_status));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_CFGGENERATION !=
+		     offsetof(struct virtio_pci_common_cfg, config_generation));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SELECT !=
+		     offsetof(struct virtio_pci_common_cfg, queue_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SIZE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_size));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, queue_msix_vector));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_ENABLE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_enable));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_NOFF !=
+		     offsetof(struct virtio_pci_common_cfg, queue_notify_off));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_hi));
 }
 
 /* the PCI probing function */
-- 
MST

^ permalink raw reply related

* [PATCH pre-squash 08/14] fixup! virtio_pci: macros for PCI layout offsets
From: Michael S. Tsirkin @ 2015-01-21 15:00 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Rusty Russell,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Gerd Hoffmann, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1421852375-22604-1-git-send-email-mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

virtio_pci_modern: fix up vendor capability macros

Gerd Hoffmann noticed that we implemented
capability layout from an old draft.
Unfortunately the code was copied to host as well,
so we didn't notice.

Luckily we caught this in time.

This fixes commit "virtio_pci: macros for PCI layout offsets"
and should be smashed with it.

Reported-by: Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Tested-by: Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/virtio_pci.h    | 14 +++++---------
 drivers/virtio/virtio_pci_modern.c |  6 ++++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 0911c62..3b7e4d2 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -124,11 +124,6 @@ struct virtio_pci_cap {
 	__le32 length;		/* Length of the structure, in bytes. */
 };
 
-#define VIRTIO_PCI_CAP_BAR_SHIFT	5
-#define VIRTIO_PCI_CAP_BAR_MASK		0x7
-#define VIRTIO_PCI_CAP_TYPE_SHIFT	0
-#define VIRTIO_PCI_CAP_TYPE_MASK	0x7
-
 struct virtio_pci_notify_cap {
 	struct virtio_pci_cap cap;
 	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
@@ -164,11 +159,12 @@ struct virtio_pci_common_cfg {
 #define VIRTIO_PCI_CAP_VNDR		0
 #define VIRTIO_PCI_CAP_NEXT		1
 #define VIRTIO_PCI_CAP_LEN		2
-#define VIRTIO_PCI_CAP_TYPE_AND_BAR	3
-#define VIRTIO_PCI_CAP_OFFSET		4
-#define VIRTIO_PCI_CAP_LENGTH		8
+#define VIRTIO_PCI_CAP_CFG_TYPE		3
+#define VIRTIO_PCI_CAP_BAR		4
+#define VIRTIO_PCI_CAP_OFFSET		8
+#define VIRTIO_PCI_CAP_LENGTH		12
 
-#define VIRTIO_PCI_NOTIFY_CAP_MULT	12
+#define VIRTIO_PCI_NOTIFY_CAP_MULT	16
 
 #define VIRTIO_PCI_COMMON_DFSELECT	0
 #define VIRTIO_PCI_COMMON_DF		4
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index c86594e..b2e707ad 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -474,8 +474,10 @@ static inline void check_offsets(void)
 		     offsetof(struct virtio_pci_cap, cap_next));
 	BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
 		     offsetof(struct virtio_pci_cap, cap_len));
-	BUILD_BUG_ON(VIRTIO_PCI_CAP_TYPE_AND_BAR !=
-		     offsetof(struct virtio_pci_cap, type_and_bar));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_CFG_TYPE !=
+		     offsetof(struct virtio_pci_cap, cfg_type));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_BAR !=
+		     offsetof(struct virtio_pci_cap, bar));
 	BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
 		     offsetof(struct virtio_pci_cap, offset));
 	BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
-- 
MST

^ permalink raw reply related

* [PATCH post-squash 2/9] virtio-pci: define layout for virtio 1.0
From: Michael S. Tsirkin @ 2015-01-21 15:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1421852670-23742-1-git-send-email-mst@redhat.com>

From: Rusty Russell <rusty@rustcorp.com.au>

Based on patches by Michael S. Tsirkin <mst@redhat.com>, but I found it
hard to follow so changed to use structures which are more
self-documenting.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Gerd Hoffmann <kraxel@redhat.com>
---
 include/uapi/linux/virtio_pci.h | 63 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 509d630..a2b2e13 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -99,4 +99,67 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+#ifndef VIRTIO_PCI_NO_MODERN
+
+/* IDs for different capabilities.  Must all exist. */
+
+/* Common configuration */
+#define VIRTIO_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
+/* ISR access */
+#define VIRTIO_PCI_CAP_ISR_CFG		3
+/* Device specific confiuration */
+#define VIRTIO_PCI_CAP_DEVICE_CFG	4
+
+/* This is the PCI capability header: */
+struct virtio_pci_cap {
+	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
+	__u8 cap_next;		/* Generic PCI field: next ptr. */
+	__u8 cap_len;		/* Generic PCI field: capability length */
+	__u8 cfg_type;		/* Identifies the structure. */
+	__u8 bar;		/* Where to find it. */
+	__u8 padding[3];	/* Pad to full dword. */
+	__le32 offset;		/* Offset within bar. */
+	__le32 length;		/* Length of the structure, in bytes. */
+};
+
+#define VIRTIO_PCI_CAP_BAR_SHIFT	5
+#define VIRTIO_PCI_CAP_BAR_MASK		0x7
+#define VIRTIO_PCI_CAP_TYPE_SHIFT	0
+#define VIRTIO_PCI_CAP_TYPE_MASK	0x7
+
+struct virtio_pci_notify_cap {
+	struct virtio_pci_cap cap;
+	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
+};
+
+/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
+struct virtio_pci_common_cfg {
+	/* About the whole device. */
+	__le32 device_feature_select;	/* read-write */
+	__le32 device_feature;		/* read-only */
+	__le32 guest_feature_select;	/* read-write */
+	__le32 guest_feature;		/* read-write */
+	__le16 msix_config;		/* read-write */
+	__le16 num_queues;		/* read-only */
+	__u8 device_status;		/* read-write */
+	__u8 config_generation;		/* read-only */
+
+	/* About a specific virtqueue. */
+	__le16 queue_select;		/* read-write */
+	__le16 queue_size;		/* read-write, power of 2. */
+	__le16 queue_msix_vector;	/* read-write */
+	__le16 queue_enable;		/* read-write */
+	__le16 queue_notify_off;	/* read-only */
+	__le32 queue_desc_lo;		/* read-write */
+	__le32 queue_desc_hi;		/* read-write */
+	__le32 queue_avail_lo;		/* read-write */
+	__le32 queue_avail_hi;		/* read-write */
+	__le32 queue_used_lo;		/* read-write */
+	__le32 queue_used_hi;		/* read-write */
+};
+
+#endif /* VIRTIO_PCI_NO_MODERN */
+
 #endif
-- 
MST

^ permalink raw reply related

* [PATCH post-squash 4/9] virtio_pci: macros for PCI layout offsets
From: Michael S. Tsirkin @ 2015-01-21 15:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1421852670-23742-1-git-send-email-mst@redhat.com>

From: Rusty Russell <rusty@rustcorp.com.au>

QEMU wants it, so why not?  Trust, but verify.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Gerd Hoffmann <kraxel@redhat.com>
---
 include/uapi/linux/virtio_pci.h    | 36 +++++++++++++++++++----
 drivers/virtio/virtio_pci_modern.c | 60 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index a2b2e13..3b7e4d2 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -124,11 +124,6 @@ struct virtio_pci_cap {
 	__le32 length;		/* Length of the structure, in bytes. */
 };
 
-#define VIRTIO_PCI_CAP_BAR_SHIFT	5
-#define VIRTIO_PCI_CAP_BAR_MASK		0x7
-#define VIRTIO_PCI_CAP_TYPE_SHIFT	0
-#define VIRTIO_PCI_CAP_TYPE_MASK	0x7
-
 struct virtio_pci_notify_cap {
 	struct virtio_pci_cap cap;
 	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
@@ -160,6 +155,37 @@ struct virtio_pci_common_cfg {
 	__le32 queue_used_hi;		/* read-write */
 };
 
+/* Macro versions of offsets for the Old Timers! */
+#define VIRTIO_PCI_CAP_VNDR		0
+#define VIRTIO_PCI_CAP_NEXT		1
+#define VIRTIO_PCI_CAP_LEN		2
+#define VIRTIO_PCI_CAP_CFG_TYPE		3
+#define VIRTIO_PCI_CAP_BAR		4
+#define VIRTIO_PCI_CAP_OFFSET		8
+#define VIRTIO_PCI_CAP_LENGTH		12
+
+#define VIRTIO_PCI_NOTIFY_CAP_MULT	16
+
+#define VIRTIO_PCI_COMMON_DFSELECT	0
+#define VIRTIO_PCI_COMMON_DF		4
+#define VIRTIO_PCI_COMMON_GFSELECT	8
+#define VIRTIO_PCI_COMMON_GF		12
+#define VIRTIO_PCI_COMMON_MSIX		16
+#define VIRTIO_PCI_COMMON_NUMQ		18
+#define VIRTIO_PCI_COMMON_STATUS	20
+#define VIRTIO_PCI_COMMON_CFGGENERATION	21
+#define VIRTIO_PCI_COMMON_Q_SELECT	22
+#define VIRTIO_PCI_COMMON_Q_SIZE	24
+#define VIRTIO_PCI_COMMON_Q_MSIX	26
+#define VIRTIO_PCI_COMMON_Q_ENABLE	28
+#define VIRTIO_PCI_COMMON_Q_NOFF	30
+#define VIRTIO_PCI_COMMON_Q_DESCLO	32
+#define VIRTIO_PCI_COMMON_Q_DESCHI	36
+#define VIRTIO_PCI_COMMON_Q_AVAILLO	40
+#define VIRTIO_PCI_COMMON_Q_AVAILHI	44
+#define VIRTIO_PCI_COMMON_Q_USEDLO	48
+#define VIRTIO_PCI_COMMON_Q_USEDHI	52
+
 #endif /* VIRTIO_PCI_NO_MODERN */
 
 #endif
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index a3d8101..b2e707ad 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -464,9 +464,67 @@ static void virtio_pci_release_dev(struct device *_d)
 	kfree(vp_dev);
 }
 
-/* TODO: validate the ABI statically. */
+/* This is part of the ABI.  Don't screw with it. */
 static inline void check_offsets(void)
 {
+	/* Note: disk space was harmed in compilation of this function. */
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_VNDR !=
+		     offsetof(struct virtio_pci_cap, cap_vndr));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_NEXT !=
+		     offsetof(struct virtio_pci_cap, cap_next));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
+		     offsetof(struct virtio_pci_cap, cap_len));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_CFG_TYPE !=
+		     offsetof(struct virtio_pci_cap, cfg_type));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_BAR !=
+		     offsetof(struct virtio_pci_cap, bar));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
+		     offsetof(struct virtio_pci_cap, offset));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
+		     offsetof(struct virtio_pci_cap, length));
+	BUILD_BUG_ON(VIRTIO_PCI_NOTIFY_CAP_MULT !=
+		     offsetof(struct virtio_pci_notify_cap,
+			      notify_off_multiplier));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      device_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DF !=
+		     offsetof(struct virtio_pci_common_cfg, device_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      guest_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GF !=
+		     offsetof(struct virtio_pci_common_cfg, guest_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, msix_config));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_NUMQ !=
+		     offsetof(struct virtio_pci_common_cfg, num_queues));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_STATUS !=
+		     offsetof(struct virtio_pci_common_cfg, device_status));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_CFGGENERATION !=
+		     offsetof(struct virtio_pci_common_cfg, config_generation));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SELECT !=
+		     offsetof(struct virtio_pci_common_cfg, queue_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SIZE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_size));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, queue_msix_vector));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_ENABLE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_enable));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_NOFF !=
+		     offsetof(struct virtio_pci_common_cfg, queue_notify_off));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_hi));
 }
 
 /* the PCI probing function */
-- 
MST

^ permalink raw reply related

* Re: [PATCH 01/13] kdbus: add documentation
From: Theodore Ts'o @ 2015-01-21 15:19 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Daniel Mack, Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	Johannes Stezenbach
In-Reply-To: <54BF805B.4000609-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Wed, Jan 21, 2015 at 11:32:59AM +0100, Michael Kerrisk (man-pages) wrote:
> It's rather an optional driver than a core kernel feature.
> 
> Given the various things that I've seen said about kdbus, the
> preceding sentence makes little sense to me:
> 
> * kdbus will be the framework supporting user-space D-Bus in the
>   future, and also used by systemd, and so on pretty much every 
>   desktop system.

I have to agree with Michael here; it's really, **really**
disengenuous to say that that if you don't want kdbus, you can just
#ifconfig it out.  The fact that it systemd will be using it means
that it will very shortly become a core kernel feature which is
absolutely mandatory.  Sure, maybe it can be configured out for "tiny
kernels", just as in theory we can configure out the VM system for
really tiny embedded systems.  But we should be treating this as
something that is not optional, because the reality is that's the way
it's going to be in very short order.  So if that means to use proper
system calls instead of ioctls, we should do that.

       	     	     		   - Ted

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox