Re: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
       [not found] <54DE71E4.6070304@gmail.com>
@ 2015-02-14  0:06 ` Dan Rosenberg
  2015-02-15  5:31   ` Fam Zheng
  0 siblings, 1 reply; 3+ messages in thread
From: Dan Rosenberg @ 2015-02-14  0:06 UTC (permalink / raw)
  To: famz; +Cc: linux-kernel@vger.kernel.org


> +	if (ncmds <= 0 || !cmds)
> +		return -EINVAL;
> +	cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> +	kcmds = kmalloc(cmd_size, GFP_KERNEL);
You should probably fix the integer overflow in the calculation of the
cmd_size variable, unless you like root vulnerabilities.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
  2015-02-14  0:06 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Dan Rosenberg
@ 2015-02-15  5:31   ` Fam Zheng
  0 siblings, 0 replies; 3+ messages in thread
From: Fam Zheng @ 2015-02-15  5:31 UTC (permalink / raw)
  To: Dan Rosenberg; +Cc: linux-kernel@vger.kernel.org

On Fri, 02/13 19:06, Dan Rosenberg wrote:
> 
> > +	if (ncmds <= 0 || !cmds)
> > +		return -EINVAL;
> > +	cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> > +	kcmds = kmalloc(cmd_size, GFP_KERNEL);
> You should probably fix the integer overflow in the calculation of the
> cmd_size variable, unless you like root vulnerabilities.
> 

Thanks! In the case of multiply overflow, we allocate a buffer that is smaller
than we think, and consequent writings will corrupt kernel memory after it.
That is the root vulnerabilities here. Will fix!

Fam

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
@ 2015-02-13  9:03 Fam Zheng
  2015-02-13  9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
  0 siblings, 1 reply; 3+ messages in thread
From: Fam Zheng @ 2015-02-13  9:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Hi all,

This is the updated series for the new epoll system calls, with the cover
letter rewritten which includes some more explanation. Comments are very
welcome!

Original Motivation
===================

QEMU, and probably many more select/poll based applications, will consider
epoll as an alternative, when its event loop needs to handle a big number of
fds. However, there are currently two concerns with epoll which prevents the
switching from ppoll to epoll. 

The major one is the timeout precision.

For example in QEMU, the main loop takes care of calling callbacks at a
specific timeout - the QEMU timer API. The timeout value in ppoll depends on
the next firing timer. epoll_pwait's millisecond timeout is so coarse that
rounding up the timeout will hurt performance badly.

The minor one is the number of system call to update fd set.

While epoll can handle a large number of fds quickly, it still requires one
epoll_ctl per fd update, compared to the one-shot call to select/poll with an
fd array. This may as well make epoll inferior to ppoll in the cases where a
small, but frequently changing set of fds are polled by the event loop.

This series introduces two new epoll APIs to address them respectively. The
idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
suggested clockid as a parameter in epoll_pwait1.

Discussion
==========

[Note: This is the question part regarding the interface contract of
epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch yet,
please skip this part and probably start with the man page style documentation.
You can resume to this section later.]

[Thanks to Omar Sandoval <osandov@osandov.com>, who pointed out this in
reviewing v1]

We try to report status for each command in epoll_ctl_batch, by writting to
user provided command array (pointed to cmds). The tricky thing in the
implementation is that, copying the results back to userspace comes last, after
the commands are executed. At this point, if the copy_to_user fails, the
effects are done and no return - or if we add some mechanism to revert it, the
code will be too complicated and slow.

In above corner case, the return value of epoll_ctl_batch is smaller than
ncmds, which assures our caller that last N commands failed, where N = ncmds -
ret.  But they'll also find that cmd.result is not changed to error code.

I suppose this is not a big deal, because 1) it should happen very rarely. 2)
user does know the actual number of commands that succeed.

So, do we leave it as is? Or is there any way to improve?

One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
before we execute the commands. If it succeeds, it's even less likely the last
copy_to_user could fail, so that we can even probably assert it won't. The
testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
performance impact, especially when @ncmds is big.

Links
=====

[1]: http://lists.openwall.net/linux-kernel/2015/01/08/542

Changelog
=========

Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
----------------------------------------------------

  - Rename epoll_ctl_cmd.error_hint to "result". [Michael]

  - Add background introduction in cover letter. [Michael]

  - Expand the last struct of epoll_pwait1, add clockid and timespec.
  
  - Update man page in cover letter accordingly:

    * "error_hint" -> "result".
    * The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.

  Please review!

Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
-----------------------------------------------------

  - As discussed in previous thread [1], split the call to epoll_ctl_batch and
    epoll_pwait. [Michael]

  - Fix memory leaks. [Omar]

  - Add a short comment about the ignored copy_to_user failure. [Omar]

  - Cover letter rewritten.

Documentation of the new system calls
=====================================

[I tried to write in the familiar man page style, but I am not proficient on
this. Thanks for Michael's help and suggestions in helping me improve it
through previous discussions.]

1) epoll_ctl_batch
------------------

NAME
       epoll_ctl_batch - modify an epoll descriptor in batch

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_ctl_batch(int epfd, int flags,
                           int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

       The system call performs a batch of epoll_ctl operations. It allows
       efficient update of events on this epoll descriptor.

       Flags is reserved and must be 0.

       Each operation is specified as an element in the cmds array, defined as:

           struct epoll_ctl_cmd {

                  /* Reserved flags for future extension, must be 0. */
                  int flags;

                  /* The same as epoll_ctl() op parameter. */
                  int op;

                  /* The same as epoll_ctl() fd parameter. */
                  int fd;

                  /* The same as the "events" field in struct epoll_event. */
                  uint32_t events;

                  /* The same as the "data" field in struct epoll_event. */
                  uint64_t data;

                  /* Output field, will be set to the return code after this
                   * command is executed by kernel */
                  int result;
           };

       Commands are executed in their order in cmds, and if one of them failed,
       the rest after it are not tried.

       Not that this call isn't atomic in terms of updating the epoll
       descriptor, which means a second epoll_ctl or epoll_ctl_batch call
       during the first epoll_ctl_batch may make the operation sequence
       interleaved. However, each single epoll_ctl_cmd operation has the same
       semantics as a epoll_ctl call.

RETURN VALUE

       If one or more of the parameters are incorrect, -1 is returned and errno
       is set appropriately. Otherwise, the number of succeeded commands is
       returned.

       Each 'result' field may be set to the error code or 0, depending on the
       result of the command. If the kernel fails to set the field after the
       commands are completed or failed, it may also be unchanged, even though
       the effects of successful commands are done. In this case, it's still
       ensured that 1) the i-th (i = ret) command is the failed command; 2) all
       the preceeding commands are successfully executed; 3) all the subsequent
       ones are not executed.

ERRORS

       Errors for the overall system call (in errno) are:

       EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
              cmds is NULL.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       EFAULT The memory area pointed to by cmds is not accessible with write
              permissions.


       Errors for each command (in result field) are:

       EBADF  epfd or fd is not a valid file descriptor.

       EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
              already registered with this epoll instance.

       EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
              or the requested operation op is not supported by this interface.

       ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
              with this epoll instance.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
              encountered while trying to register (EPOLL_CTL_ADD) a new file
              descriptor on an epoll instance.  See epoll(7) for further
              details.

       EPERM  The target file fd does not support epoll.

CONFORMING TO

       epoll_ctl_batch() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)


2) epoll_pwait1
---------------

NAME
       epoll_pwait1 - wait for an I/O event on an epoll file descriptor

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_pwait1(int epfd, int flags,
                        struct epoll_event *events,
                        int maxevents,
                        struct epoll_wait_params *params);

DESCRIPTION

       The epoll_pwait1 system call differs from epoll_pwait only in
       parameter list. The epfd, events and maxevents parameters are the same
       as in epoll_wait and epoll_pwait. The flags and params are new.

       The flags is reserved and must be zero.

       The params is a pointer to a structure parameters for the epoll_pwait1,
       defined as:

           struct epoll_wait_params {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
               size_t sigsetsize;
           };

       The field clockid must be either CLOCK_REALTIME or CLOCK_MONOTONIC, to
       choose the clock type to use for timeout. Note that CLOCK_MONOTONIC is
       the implicit and only possible clock type for epoll_pwait.
       
       The timeout field specifies the minimum time that epoll_wait() will
       block.  (The effective length will be rounded up to the clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.) Specifying a nagative time
       lenght (for example, timeout.tv_sec = 0 and timeout.tv_nsec = -1, or the
       other way round) causes epoll_pwait1() to block indefinitely, while
       specifying a timeout equal to zero (both fields in timeout are zero)
       causes epoll_wait() to return immediately, even if no events are
       available.

       The sigmask and sigsetsize has the same semantics as epoll_pwait. The
       sigmask field may be specified as NULL, in which case epoll_pwait1()
       will behave like epoll_wait.

   User visibility of sigsetsize

       In epoll_pwait and other syscalls, sigsetsize is not visible to
       application developer as glibc has a wrapper around epoll_pwait system
       call. Now that we pack several parameters in epoll_wait_params, so in
       order to hide sigsetsize from application code, we can still wrap it,
       either by expanding parameters and build the structure in the wrapper
       function, or only ask application to provide the first half:

           struct epoll_wait_params_user {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
           };

      Then in the wrapper function copy to a full structure and fill in
      sigsetsize.

RETURN VALUE

       The same as said in epoll_pwait(2).

ERRORS

       The same as said in man epoll_pwait(2), plus:

       EINVAL flags is not zero, or clockid is neither CLOCK_REALTIME nor
              CLOCK_MONOTONIC.

       EFAULT The memory area pointed to by params is not accessible.


CONFORMING TO

       epoll_pwait1() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Fam Zheng (7):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Specify clockid explicitly
  epoll: Extract ep_ctl_do
  epoll: Add implementation for epoll_ctl_batch
  x86: Hook up epoll_ctl_batch syscall
  epoll: Add implementation for epoll_pwait1
  x86: Hook up epoll_pwait1 syscall

 arch/x86/syscalls/syscall_32.tbl |   2 +
 arch/x86/syscalls/syscall_64.tbl |   2 +
 fs/eventpoll.c                   | 258 +++++++++++++++++++++++++--------------
 include/linux/syscalls.h         |   9 ++
 include/uapi/linux/eventpoll.h   |  18 +++
 5 files changed, 199 insertions(+), 90 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
@ 2015-02-13  9:04 ` Fam Zheng
  0 siblings, 0 replies; 3+ messages in thread
From: Fam Zheng @ 2015-02-13  9:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

This new syscall is a batched version of epoll_ctl. It will execute each
command as specified in cmds in given order, and stop at first failure
or upon completion of all commands.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c                 | 48 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h       |  4 ++++
 include/uapi/linux/eventpoll.h | 11 ++++++++++
 3 files changed, 63 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6a2b0a4..12e2e63 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2069,6 +2069,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 			      sigmask ? &ksigmask : NULL);
 }
 
+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+		int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+	struct epoll_ctl_cmd *kcmds = NULL;
+	int i, r, ret = 0;
+	int cmd_size;
+
+	if (flags)
+		return -EINVAL;
+	if (ncmds <= 0 || !cmds)
+		return -EINVAL;
+	cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+	kcmds = kmalloc(cmd_size, GFP_KERNEL);
+	if (!kcmds)
+		return -ENOMEM;
+	if (copy_from_user(kcmds, cmds, cmd_size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	for (i = 0; i < ncmds; i++) {
+		struct epoll_event ev = (struct epoll_event) {
+			.events = kcmds[i].events,
+			.data = kcmds[i].data,
+		};
+		if (kcmds[i].flags) {
+			kcmds[i].result = -EINVAL;
+			goto copy;
+		}
+		kcmds[i].result = ep_ctl_do(epfd, kcmds[i].op,
+					    kcmds[i].fd, ev);
+		if (kcmds[i].result)
+			goto copy;
+		ret++;
+	}
+copy:
+	r = copy_to_user(cmds, kcmds,
+			 sizeof(struct epoll_ctl_cmd) * ncmds);
+	/* Failing to copy the command results back will leave
+	 * userspace no way to know the actual error code, but we still
+	 * report the number of succeeded commands with ret, so it's
+	 * not a big problem. Ignore it for now.
+	 */
+	(void) r;
+out:
+	kfree(kcmds);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..7d784e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
 #define _LINUX_SYSCALLS_H
 
 struct epoll_event;
+struct epoll_ctl_cmd;
 struct iattr;
 struct inode;
 struct iocb;
@@ -634,6 +635,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+				    int ncmds,
+				    struct epoll_ctl_cmd __user *cmds);
 asmlinkage long sys_gethostname(char __user *name, int len);
 asmlinkage long sys_sethostname(char __user *name, int len);
 asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..4e18b17 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
 #include <linux/fcntl.h>
 #include <linux/types.h>
 
+#include <linux/signal.h>
+
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
 
@@ -61,6 +63,15 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
+struct epoll_ctl_cmd {
+	int flags;
+	int op;
+	int fd;
+	__u32 events;
+	__u64 data;
+	int result;
+} EPOLL_PACKED;
+
 #ifdef CONFIG_PM_SLEEP
 static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-02-15  5:31 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <54DE71E4.6070304@gmail.com>
2015-02-14  0:06 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Dan Rosenberg
2015-02-15  5:31   ` Fam Zheng
2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-02-13  9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox