* Re: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
[not found] <54DE71E4.6070304@gmail.com>
@ 2015-02-14 0:06 ` Dan Rosenberg
2015-02-15 5:31 ` Fam Zheng
0 siblings, 1 reply; 3+ messages in thread
From: Dan Rosenberg @ 2015-02-14 0:06 UTC (permalink / raw)
To: famz; +Cc: linux-kernel@vger.kernel.org
> + if (ncmds <= 0 || !cmds)
> + return -EINVAL;
> + cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> + kcmds = kmalloc(cmd_size, GFP_KERNEL);
You should probably fix the integer overflow in the calculation of the
cmd_size variable, unless you like root vulnerabilities.
^ permalink raw reply [flat|nested] 3+ messages in thread* Re: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
2015-02-14 0:06 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Dan Rosenberg
@ 2015-02-15 5:31 ` Fam Zheng
0 siblings, 0 replies; 3+ messages in thread
From: Fam Zheng @ 2015-02-15 5:31 UTC (permalink / raw)
To: Dan Rosenberg; +Cc: linux-kernel@vger.kernel.org
On Fri, 02/13 19:06, Dan Rosenberg wrote:
>
> > + if (ncmds <= 0 || !cmds)
> > + return -EINVAL;
> > + cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> > + kcmds = kmalloc(cmd_size, GFP_KERNEL);
> You should probably fix the integer overflow in the calculation of the
> cmd_size variable, unless you like root vulnerabilities.
>
Thanks! In the case of multiply overflow, we allocate a buffer that is smaller
than we think, and consequent writings will corrupt kernel memory after it.
That is the root vulnerabilities here. Will fix!
Fam
^ permalink raw reply [flat|nested] 3+ messages in thread
* [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
@ 2015-02-13 9:03 Fam Zheng
2015-02-13 9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
0 siblings, 1 reply; 3+ messages in thread
From: Fam Zheng @ 2015-02-13 9:03 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval
Hi all,
This is the updated series for the new epoll system calls, with the cover
letter rewritten which includes some more explanation. Comments are very
welcome!
Original Motivation
===================
QEMU, and probably many more select/poll based applications, will consider
epoll as an alternative, when its event loop needs to handle a big number of
fds. However, there are currently two concerns with epoll which prevents the
switching from ppoll to epoll.
The major one is the timeout precision.
For example in QEMU, the main loop takes care of calling callbacks at a
specific timeout - the QEMU timer API. The timeout value in ppoll depends on
the next firing timer. epoll_pwait's millisecond timeout is so coarse that
rounding up the timeout will hurt performance badly.
The minor one is the number of system call to update fd set.
While epoll can handle a large number of fds quickly, it still requires one
epoll_ctl per fd update, compared to the one-shot call to select/poll with an
fd array. This may as well make epoll inferior to ppoll in the cases where a
small, but frequently changing set of fds are polled by the event loop.
This series introduces two new epoll APIs to address them respectively. The
idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
suggested clockid as a parameter in epoll_pwait1.
Discussion
==========
[Note: This is the question part regarding the interface contract of
epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet,
please skip this part and probably start with the man page style documentation.
You can resume to this section later.]
[Thanks to Omar Sandoval <osandov@osandov.com>, who pointed out this in
reviewing v1]
We try to report status for each command in epoll_ctl_batch, by writting to
user provided command array (pointed to cmds). The tricky thing in the
implementation is that, copying the results back to userspace comes last, after
the commands are executed. At this point, if the copy_to_user fails, the
effects are done and no return - or if we add some mechanism to revert it, the
code will be too complicated and slow.
In above corner case, the return value of epoll_ctl_batch is smaller than
ncmds, which assures our caller that last N commands failed, where N = ncmds -
ret. But they'll also find that cmd.result is not changed to error code.
I suppose this is not a big deal, because 1) it should happen very rarely. 2)
user does know the actual number of commands that succeed.
So, do we leave it as is? Or is there any way to improve?
One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
before we execute the commands. If it succeeds, it's even less likely the last
copy_to_user could fail, so that we can even probably assert it won't. The
testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
performance impact, especially when @ncmds is big.
Links
=====
[1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
Changelog
=========
Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
----------------------------------------------------
- Rename epoll_ctl_cmd.error_hint to "result". [Michael]
- Add background introduction in cover letter. [Michael]
- Expand the last struct of epoll_pwait1, add clockid and timespec.
- Update man page in cover letter accordingly:
* "error_hint" -> "result".
* The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.
Please review!
Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
-----------------------------------------------------
- As discussed in previous thread [1], split the call to epoll_ctl_batch and
epoll_pwait. [Michael]
- Fix memory leaks. [Omar]
- Add a short comment about the ignored copy_to_user failure. [Omar]
- Cover letter rewritten.
Documentation of the new system calls
=====================================
[I tried to write in the familiar man page style, but I am not proficient on
this. Thanks for Michael's help and suggestions in helping me improve it
through previous discussions.]
1) epoll_ctl_batch
------------------
NAME
epoll_ctl_batch - modify an epoll descriptor in batch
SYNOPSIS
#include <sys/epoll.h>
int epoll_ctl_batch(int epfd, int flags,
int ncmds, struct epoll_ctl_cmd *cmds);
DESCRIPTION
The system call performs a batch of epoll_ctl operations. It allows
efficient update of events on this epoll descriptor.
Flags is reserved and must be 0.
Each operation is specified as an element in the cmds array, defined as:
struct epoll_ctl_cmd {
/* Reserved flags for future extension, must be 0. */
int flags;
/* The same as epoll_ctl() op parameter. */
int op;
/* The same as epoll_ctl() fd parameter. */
int fd;
/* The same as the "events" field in struct epoll_event. */
uint32_t events;
/* The same as the "data" field in struct epoll_event. */
uint64_t data;
/* Output field, will be set to the return code after this
* command is executed by kernel */
int result;
};
Commands are executed in their order in cmds, and if one of them failed,
the rest after it are not tried.
Not that this call isn't atomic in terms of updating the epoll
descriptor, which means a second epoll_ctl or epoll_ctl_batch call
during the first epoll_ctl_batch may make the operation sequence
interleaved. However, each single epoll_ctl_cmd operation has the same
semantics as a epoll_ctl call.
RETURN VALUE
If one or more of the parameters are incorrect, -1 is returned and errno
is set appropriately. Otherwise, the number of succeeded commands is
returned.
Each 'result' field may be set to the error code or 0, depending on the
result of the command. If the kernel fails to set the field after the
commands are completed or failed, it may also be unchanged, even though
the effects of successful commands are done. In this case, it's still
ensured that 1) the i-th (i = ret) command is the failed command; 2) all
the preceeding commands are successfully executed; 3) all the subsequent
ones are not executed.
ERRORS
Errors for the overall system call (in errno) are:
EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
cmds is NULL.
ENOMEM There was insufficient memory to handle the requested op control
operation.
EFAULT The memory area pointed to by cmds is not accessible with write
permissions.
Errors for each command (in result field) are:
EBADF epfd or fd is not a valid file descriptor.
EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
already registered with this epoll instance.
EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
or the requested operation op is not supported by this interface.
ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
with this epoll instance.
ENOMEM There was insufficient memory to handle the requested op control
operation.
ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
encountered while trying to register (EPOLL_CTL_ADD) a new file
descriptor on an epoll instance. See epoll(7) for further
details.
EPERM The target file fd does not support epoll.
CONFORMING TO
epoll_ctl_batch() is Linux-specific.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
2) epoll_pwait1
---------------
NAME
epoll_pwait1 - wait for an I/O event on an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>
int epoll_pwait1(int epfd, int flags,
struct epoll_event *events,
int maxevents,
struct epoll_wait_params *params);
DESCRIPTION
The epoll_pwait1 system call differs from epoll_pwait only in
parameter list. The epfd, events and maxevents parameters are the same
as in epoll_wait and epoll_pwait. The flags and params are new.
The flags is reserved and must be zero.
The params is a pointer to a structure parameters for the epoll_pwait1,
defined as:
struct epoll_wait_params {
int clockid;
struct timespec timeout;
sigset_t *sigmask;
size_t sigsetsize;
};
The field clockid must be either CLOCK_REALTIME or CLOCK_MONOTONIC, to
choose the clock type to use for timeout. Note that CLOCK_MONOTONIC is
the implicit and only possible clock type for epoll_pwait.
The timeout field specifies the minimum time that epoll_wait() will
block. (The effective length will be rounded up to the clock
granularity, and kernel scheduling delays mean that the blocking
interval may overrun by a small amount.) Specifying a nagative time
lenght (for example, timeout.tv_sec = 0 and timeout.tv_nsec = -1, or the
other way round) causes epoll_pwait1() to block indefinitely, while
specifying a timeout equal to zero (both fields in timeout are zero)
causes epoll_wait() to return immediately, even if no events are
available.
The sigmask and sigsetsize has the same semantics as epoll_pwait. The
sigmask field may be specified as NULL, in which case epoll_pwait1()
will behave like epoll_wait.
User visibility of sigsetsize
In epoll_pwait and other syscalls, sigsetsize is not visible to
application developer as glibc has a wrapper around epoll_pwait system
call. Now that we pack several parameters in epoll_wait_params, so in
order to hide sigsetsize from application code, we can still wrap it,
either by expanding parameters and build the structure in the wrapper
function, or only ask application to provide the first half:
struct epoll_wait_params_user {
int clockid;
struct timespec timeout;
sigset_t *sigmask;
};
Then in the wrapper function copy to a full structure and fill in
sigsetsize.
RETURN VALUE
The same as said in epoll_pwait(2).
ERRORS
The same as said in man epoll_pwait(2), plus:
EINVAL flags is not zero, or clockid is neither CLOCK_REALTIME nor
CLOCK_MONOTONIC.
EFAULT The memory area pointed to by params is not accessible.
CONFORMING TO
epoll_pwait1() is Linux-specific.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
Fam Zheng (7):
epoll: Extract epoll_wait_do and epoll_pwait_do
epoll: Specify clockid explicitly
epoll: Extract ep_ctl_do
epoll: Add implementation for epoll_ctl_batch
x86: Hook up epoll_ctl_batch syscall
epoll: Add implementation for epoll_pwait1
x86: Hook up epoll_pwait1 syscall
arch/x86/syscalls/syscall_32.tbl | 2 +
arch/x86/syscalls/syscall_64.tbl | 2 +
fs/eventpoll.c | 258 +++++++++++++++++++++++++--------------
include/linux/syscalls.h | 9 ++
include/uapi/linux/eventpoll.h | 18 +++
5 files changed, 199 insertions(+), 90 deletions(-)
--
1.9.3
^ permalink raw reply [flat|nested] 3+ messages in thread* [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
@ 2015-02-13 9:04 ` Fam Zheng
0 siblings, 0 replies; 3+ messages in thread
From: Fam Zheng @ 2015-02-13 9:04 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval
This new syscall is a batched version of epoll_ctl. It will execute each
command as specified in cmds in given order, and stop at first failure
or upon completion of all commands.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 48 ++++++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 4 ++++
include/uapi/linux/eventpoll.h | 11 ++++++++++
3 files changed, 63 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6a2b0a4..12e2e63 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2069,6 +2069,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
sigmask ? &ksigmask : NULL);
}
+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+ int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+ struct epoll_ctl_cmd *kcmds = NULL;
+ int i, r, ret = 0;
+ int cmd_size;
+
+ if (flags)
+ return -EINVAL;
+ if (ncmds <= 0 || !cmds)
+ return -EINVAL;
+ cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+ kcmds = kmalloc(cmd_size, GFP_KERNEL);
+ if (!kcmds)
+ return -ENOMEM;
+ if (copy_from_user(kcmds, cmds, cmd_size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ for (i = 0; i < ncmds; i++) {
+ struct epoll_event ev = (struct epoll_event) {
+ .events = kcmds[i].events,
+ .data = kcmds[i].data,
+ };
+ if (kcmds[i].flags) {
+ kcmds[i].result = -EINVAL;
+ goto copy;
+ }
+ kcmds[i].result = ep_ctl_do(epfd, kcmds[i].op,
+ kcmds[i].fd, ev);
+ if (kcmds[i].result)
+ goto copy;
+ ret++;
+ }
+copy:
+ r = copy_to_user(cmds, kcmds,
+ sizeof(struct epoll_ctl_cmd) * ncmds);
+ /* Failing to copy the command results back will leave
+ * userspace no way to know the actual error code, but we still
+ * report the number of succeeded commands with ret, so it's
+ * not a big problem. Ignore it for now.
+ */
+ (void) r;
+out:
+ kfree(kcmds);
+ return ret;
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..7d784e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
#define _LINUX_SYSCALLS_H
struct epoll_event;
+struct epoll_ctl_cmd;
struct iattr;
struct inode;
struct iocb;
@@ -634,6 +635,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+ int ncmds,
+ struct epoll_ctl_cmd __user *cmds);
asmlinkage long sys_gethostname(char __user *name, int len);
asmlinkage long sys_sethostname(char __user *name, int len);
asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..4e18b17 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
#include <linux/fcntl.h>
#include <linux/types.h>
+#include <linux/signal.h>
+
/* Flags for epoll_create1. */
#define EPOLL_CLOEXEC O_CLOEXEC
@@ -61,6 +63,15 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;
+struct epoll_ctl_cmd {
+ int flags;
+ int op;
+ int fd;
+ __u32 events;
+ __u64 data;
+ int result;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3
^ permalink raw reply related [flat|nested] 3+ messages in thread
end of thread, other threads:[~2015-02-15 5:31 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <54DE71E4.6070304@gmail.com>
2015-02-14 0:06 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Dan Rosenberg
2015-02-15 5:31 ` Fam Zheng
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-02-13 9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox