From: Fam Zheng <famz@redhat.com>
To: Omar Sandoval <osandov@osandov.com>
Cc: linux-kernel@vger.kernel.org,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
x86@kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
Kees Cook <keescook@chromium.org>,
Andy Lutomirski <luto@amacapital.net>,
David Herrmann <dh.herrmann@gmail.com>,
Alexei Starovoitov <ast@plumgrid.com>,
Miklos Szeredi <mszeredi@suse.cz>,
David Drysdale <drysdale@google.com>,
Oleg Nesterov <oleg@redhat.com>,
"David S. Miller" <davem@davemloft.net>,
Vivek Goyal <vgoyal@redhat.com>,
Mike Frysinger <vapier@gentoo.org>,
"Theodore Ts'o" <tytso@mit.edu>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Rasmus Villemoes <linux@rasmusvillemoes.dk>,
Rashika Kheria <rashika.kheria@gmail.com>,
Hugh Dickins <hughd@google.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Peter Zijlstra <peter
Subject: Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
Date: Wed, 21 Jan 2015 16:59:00 +0800
Message-ID: <20150121085900.GA25434@ad.nay.redhat.com>
In-Reply-To: <20150121075617.GA21266@mew>
On Tue, 01/20 23:56, Omar Sandoval wrote:
> On Tue, Jan 20, 2015 at 05:57:57PM +0800, Fam Zheng wrote:
> > This syscall is a sequence of
> >
> > 1) a number of epoll_ctl calls
> > 2) an epoll_pwait, with timeout enhancement.
> >
> > The epoll_ctl operations are embedded so that the application doesn't have to use
> > separate syscalls to insert/delete/update the fds before polling. This is more
> > efficient if the set of fds varies from one poll to another, which is a
> > common pattern for certain applications. For example, depending on the input
> > buffer status, a data reading program may decide to temporarily stop polling an
> > fd.
> >
> > Because this interface enables batching, even a regular
> > epoll_ctl call sequence that manipulates several fds can be optimized into one
> > single epoll_mod_wait call (specifying spec=NULL to skip the poll part).
> >
> > The only complexity is returning the result of each operation. For each
> > epoll_mod_cmd in cmds, the field "error" is an output field that stores
> > the return code *iff* the command is executed (0 for success, or the -errno of the
> > equivalent epoll_ctl call). It is left unchanged if the command is not
> > executed because of some earlier error, for example a failure of
> > copy_from_user to copy the array.
> >
> > Applications can use this fact for error handling: they can initialize
> > every epoll_mod_cmd.error to a positive value, which is by definition not a
> > possible output value from epoll_mod_wait. When the syscall returns, they
> > know whether each command was executed by comparing its error field with the
> > initial value; if the two differ, the field holds the result of the command.
> > More crudely, they can store any non-zero value and not distinguish "not run" from
> > failure.
> >
> > Also, the timeout parameter is enhanced: a timespec is used instead of the old
> > millisecond scalar, which provides higher precision. The "clockid" field in struct
> > epoll_wait_spec also makes it possible for users to use a different
> > clock than the default when that makes more sense.
> >
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> > fs/eventpoll.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/syscalls.h | 5 ++++
> > 2 files changed, 65 insertions(+)
> >
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e7a116d..2cc22c9 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> > sigmask ? &ksigmask : NULL);
> > }
> >
> > +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> > + int, ncmds, struct epoll_mod_cmd __user *, cmds,
> > + struct epoll_wait_spec __user *, spec)
> > +{
> > + struct epoll_mod_cmd *kcmds = NULL;
> > + int i, ret = 0;
> > + int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> > +
> > + if (flags)
> > + return -EINVAL;
> > + if (ncmds) {
> > + if (!cmds)
> > + return -EINVAL;
> > + kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > + if (!kcmds)
> > + return -ENOMEM;
> > + if (copy_from_user(kcmds, cmds, cmd_size)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > + }
> > + for (i = 0; i < ncmds; i++) {
> > + struct epoll_event ev = (struct epoll_event) {
> > + .events = kcmds[i].events,
> > + .data = kcmds[i].data,
> > + };
> > + if (kcmds[i].flags) {
> > + kcmds[i].error = ret = -EINVAL;
> > + goto out;
> > + }
> > + kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> > + if (ret)
> > + goto out;
> > + }
> > + if (spec) {
> > + sigset_t ksigmask;
> > + struct epoll_wait_spec kspec;
> > + ktime_t timeout;
> > +
> > + if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> > + return -EFAULT;
> This should probably be goto out, or you'll leak kcmds.
>
> > + if (kspec.sigmask) {
> > + if (kspec.sigsetsize != sizeof(sigset_t))
> > + return -EINVAL;
> Same here...
>
> > + if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> > + return -EFAULT;
> and here.
>
> > + }
> > + timeout = timespec_to_ktime(kspec.timeout);
> > + ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> > + kspec.clockid, timeout,
> > + kspec.sigmask ? &ksigmask : NULL);
> > + }
> > +
> > +out:
> > + if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
> > + return -EFAULT;
> This will also leak kcmds, it should be ret = -EFAULT. This case, however, seems
> to lead to a weird corner case: if cmds is read-only, we'll end up executing
> every command but fail to copy out the return values, so when userspace gets the
> EFAULT, it won't know whether anything was executed. But, getting an EFAULT here
> means you're probably doing something wrong anyways, so maybe not the biggest
> concern.
Yes, thanks! Will fix this.
Fam
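
For reference, a minimal user-space sketch of the control flow the review
asks for: every failure after the allocation sets ret and jumps to out, so
the buffer is freed on all paths. This is not the revised patch. The names
fake_copy, mod_wait_sketch and live_allocs are hypothetical, malloc/free
stand in for kmalloc/kfree, and the "fault" flags simulate a failing
copy_from_user/copy_to_user.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int live_allocs;	/* outstanding buffers; nonzero after a call means a leak */

/* Hypothetical stand-in for copy_from_user()/copy_to_user(): returns
 * nonzero on failure. "fault" simulates an inaccessible user pointer. */
static int fake_copy(void *dst, const void *src, size_t n, int fault)
{
	if (fault)
		return 1;
	memcpy(dst, src, n);
	return 0;
}

/* Mirrors the control flow of the syscall body, not its semantics. */
static int mod_wait_sketch(int *ucmds, int ncmds, const int *uspec,
			   int spec_faults, int copy_out_faults)
{
	int *kcmds = NULL;
	int kspec = 0, ret = 0;
	size_t size;

	assert(ncmds >= 0);
	size = sizeof(*kcmds) * (size_t)ncmds;

	if (ncmds) {
		kcmds = calloc((size_t)ncmds, sizeof(*kcmds));
		if (!kcmds)
			return -ENOMEM;	/* nothing allocated, plain return is fine */
		live_allocs++;
		if (fake_copy(kcmds, ucmds, size, 0)) {
			ret = -EFAULT;
			goto out;
		}
	}
	if (uspec) {
		if (fake_copy(&kspec, uspec, sizeof(kspec), spec_faults)) {
			ret = -EFAULT;	/* a bare "return -EFAULT" here would leak kcmds */
			goto out;
		}
		ret = kspec;	/* pretend the wait returned this many events */
	}
out:
	if (ncmds && fake_copy(ucmds, kcmds, size, copy_out_faults))
		ret = -EFAULT;	/* set ret instead of returning, so the free still runs */
	if (kcmds) {
		free(kcmds);
		live_allocs--;
	}
	return ret;
}
```

With this shape, live_allocs is back to zero after every call, whether the
spec copy or the final copy-out fails.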
>
> > + kfree(kcmds);
> > + return ret;
> > +}
> > +
> > #ifdef CONFIG_COMPAT
> > COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
> > struct epoll_event __user *, events,
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 85893d7..7156c80 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -12,6 +12,8 @@
> > #define _LINUX_SYSCALLS_H
> >
> > struct epoll_event;
> > +struct epoll_mod_cmd;
> > +struct epoll_wait_spec;
> > struct iattr;
> > struct inode;
> > struct iocb;
> > @@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
> > int maxevents, int timeout,
> > const sigset_t __user *sigmask,
> > size_t sigsetsize);
> > +asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
> > + int ncmds, struct epoll_mod_cmd __user * cmds,
> > + struct epoll_wait_spec __user * spec);
> > asmlinkage long sys_gethostname(char __user *name, int len);
> > asmlinkage long sys_sethostname(char __user *name, int len);
> > asmlinkage long sys_setdomainname(char __user *name, int len);
> > --
> > 1.9.3
> >
>
> --
> Omar