* [PATCH RFC v3 2/7] epoll: Specify clockid explicitly
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
@ 2015-02-13 9:03 ` Fam Zheng
2015-02-13 9:03 ` [PATCH RFC v3 3/7] epoll: Extract ep_ctl_do Fam Zheng
` (6 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-02-13 9:03 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
Later we will add a clockid parameter to the interface, so start using an
explicit clockid internally. For now we pass CLOCK_MONOTONIC, which matches
the previous behaviour.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4cf359d..5840b03 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1578,6 +1578,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
wait_queue_t wait;
ktime_t expires, *to = NULL;
+ if (clockid != CLOCK_MONOTONIC && clockid != CLOCK_REALTIME)
+ return -EINVAL;
if (!ktime_to_ns(timeout)) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
@@ -1624,7 +1626,8 @@ fetch_events:
}
spin_unlock_irqrestore(&ep->lock, flags);
- if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+ if (!schedule_hrtimeout_range_clock(to, slack,
+ HRTIMER_MODE_ABS, clockid))
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1948,8 @@ error_return:
}
static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid,
+ const ktime_t timeout)
{
int error;
struct fd f;
@@ -1979,7 +1983,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
ep = f.file->private_data;
/* Time to fish for events ... */
- error = ep_poll(ep, events, maxevents, timeout);
+ error = ep_poll(ep, events, maxevents, clockid, timeout);
error_fput:
fdput(f);
@@ -1994,12 +1998,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout)
{
ktime_t kt = ms_to_ktime(timeout);
- return epoll_wait_do(epfd, events, maxevents, kt);
+ return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
}
static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
- int maxevents, ktime_t timeout,
- sigset_t *sigmask, size_t sigsetsize)
+ int maxevents,
+ int clockid, ktime_t timeout,
+ sigset_t *sigmask)
{
int error;
sigset_t sigsaved;
@@ -2013,7 +2018,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
set_current_blocked(sigmask);
}
- error = epoll_wait_do(epfd, events, maxevents, timeout);
+ error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);
/*
* If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2055,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
return -EFAULT;
}
- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}
#ifdef CONFIG_COMPAT
@@ -2073,8 +2078,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
sigset_from_compat(&ksigmask, &csigmask);
}
- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}
#endif
--
1.9.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC v3 3/7] epoll: Extract ep_ctl_do
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-02-13 9:03 ` [PATCH RFC v3 2/7] epoll: Specify clockid explicitly Fam Zheng
@ 2015-02-13 9:03 ` Fam Zheng
[not found] ` <1423818243-15410-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (5 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-02-13 9:03 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
Extract the common part of the epoll_ctl implementation into ep_ctl_do, which
will be shared with the upcoming epoll_ctl_batch.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 5840b03..6a2b0a4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1810,22 +1810,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
* the eventpoll file that enables the insertion/removal/change of
* file descriptors inside the interest set.
*/
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
- struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
{
int error;
int full_check = 0;
struct fd f, tf;
struct eventpoll *ep;
struct epitem *epi;
- struct epoll_event epds;
struct eventpoll *tep = NULL;
- error = -EFAULT;
- if (ep_op_has_event(op) &&
- copy_from_user(&epds, event, sizeof(struct epoll_event)))
- goto error_return;
-
error = -EBADF;
f = fdget(epfd);
if (!f.file)
@@ -1947,6 +1940,23 @@ error_return:
return error;
}
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+ struct epoll_event __user *, event)
+{
+ struct epoll_event epds;
+
+ if (ep_op_has_event(op) &&
+ copy_from_user(&epds, event, sizeof(struct epoll_event)))
+ return -EFAULT;
+
+ return ep_ctl_do(epfd, op, fd, epds);
+}
+
static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
int maxevents, int clockid,
const ktime_t timeout)
--
1.9.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
[parent not found: <1423818243-15410-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do
[not found] ` <1423818243-15410-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-13 9:03 ` Fam Zheng
2015-02-13 9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
1 sibling, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-02-13 9:03 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro, Andrew Morton,
Kees Cook, Andy Lutomirski, David Herrmann, Alexei Starovoitov,
Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
Fam Zheng, Peter Zijlstra <peter>
In preparation for the new epoll syscalls, extract epoll_wait_do and
epoll_pwait_do so that the existing epoll_wait/epoll_pwait code can be reused.
The new helpers take a ktime_t timeout for better precision.
Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
fs/eventpoll.c | 162 ++++++++++++++++++++++++++-------------------------------
1 file changed, 75 insertions(+), 87 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}
-static inline struct timespec ep_set_mstimeout(long ms)
-{
- struct timespec now, ts = {
- .tv_sec = ms / MSEC_PER_SEC,
- .tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
- };
-
- ktime_get_ts(&now);
- return timespec_add_safe(now, ts);
-}
-
/**
* ep_poll - Retrieves ready events, and delivers them to the caller supplied
* event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
* @events: Pointer to the userspace buffer where the ready events should be
* stored.
* @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- * milliseconds. If the @timeout is zero, the function will not block,
- * while if the @timeout is less than zero, the function will block
- * until at least one event has been retrieved (or an error
- * occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation. If 0, the
+ * function will not block. If negative, the function will block until
+ * at least one event has been retrieved (or an error occurred).
*
* Returns: Returns the number of ready events which have been fetched, or an
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, long timeout)
+ int maxevents, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
wait_queue_t wait;
ktime_t expires, *to = NULL;
- if (timeout > 0) {
- struct timespec end_time = ep_set_mstimeout(timeout);
-
- slack = select_estimate_accuracy(&end_time);
- to = &expires;
- *to = timespec_to_ktime(end_time);
- } else if (timeout == 0) {
+ if (!ktime_to_ns(timeout)) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
* caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
goto check_events;
+ } else if (ktime_to_ns(timeout) > 0) {
+ struct timespec now, end_time;
+
+ ktime_get_ts(&now);
+ end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+ slack = select_estimate_accuracy(&end_time);
+ to = &expires;
+ *to = timespec_to_ktime(end_time);
}
fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
return error;
}
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
- int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, const ktime_t timeout)
{
int error;
struct fd f;
@@ -2002,91 +1988,93 @@ error_fput:
/*
* Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_wait(2).
+ */
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+ int, maxevents, int, timeout)
+{
+ ktime_t kt = ms_to_ktime(timeout);
+ return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, ktime_t timeout,
+ sigset_t *sigmask, size_t sigsetsize)
+{
+ int error;
+ sigset_t sigsaved;
+
+ /*
+ * If the caller wants a certain signal mask to be set during the wait,
+ * we apply it here.
+ */
+ if (sigmask) {
+ sigsaved = current->blocked;
+ set_current_blocked(sigmask);
+ }
+
+ error = epoll_wait_do(epfd, events, maxevents, timeout);
+
+ /*
+ * If we changed the signal mask, we need to restore the original one.
+ * In case we've got a signal while waiting, we do not restore the
+ * signal mask yet, and we allow do_signal() to deliver the signal on
+ * the way back to userspace, before the signal mask is restored.
+ */
+ if (sigmask) {
+ if (error == -EINTR) {
+ memcpy(&current->saved_sigmask, &sigsaved,
+ sizeof(sigsaved));
+ set_restore_sigmask();
+ } else
+ set_current_blocked(&sigsaved);
+ }
+
+ return error;
+}
+
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
* part of the user space epoll_pwait(2).
*/
SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout, const sigset_t __user *, sigmask,
size_t, sigsetsize)
{
- int error;
- sigset_t ksigmask, sigsaved;
+ ktime_t kt = ms_to_ktime(timeout);
+ sigset_t ksigmask;
- /*
- * If the caller wants a certain signal mask to be set during the wait,
- * we apply it here.
- */
if (sigmask) {
if (sigsetsize != sizeof(sigset_t))
return -EINVAL;
if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
return -EFAULT;
- sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
}
-
- error = sys_epoll_wait(epfd, events, maxevents, timeout);
-
- /*
- * If we changed the signal mask, we need to restore the original one.
- * In case we've got a signal while waiting, we do not restore the
- * signal mask yet, and we allow do_signal() to deliver the signal on
- * the way back to userspace, before the signal mask is restored.
- */
- if (sigmask) {
- if (error == -EINTR) {
- memcpy(&current->saved_sigmask, &sigsaved,
- sizeof(sigsaved));
- set_restore_sigmask();
- } else
- set_current_blocked(&sigsaved);
- }
-
- return error;
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
}
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
- struct epoll_event __user *, events,
- int, maxevents, int, timeout,
- const compat_sigset_t __user *, sigmask,
- compat_size_t, sigsetsize)
+ struct epoll_event __user *, events,
+ int, maxevents, int, timeout,
+ const compat_sigset_t __user *, sigmask,
+ compat_size_t, sigsetsize)
{
- long err;
compat_sigset_t csigmask;
- sigset_t ksigmask, sigsaved;
+ sigset_t ksigmask;
+ ktime_t kt = ms_to_ktime(timeout);
- /*
- * If the caller wants a certain signal mask to be set during the wait,
- * we apply it here.
- */
if (sigmask) {
if (sigsetsize != sizeof(compat_sigset_t))
return -EINVAL;
if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
return -EFAULT;
sigset_from_compat(&ksigmask, &csigmask);
- sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
}
- err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
- /*
- * If we changed the signal mask, we need to restore the original one.
- * In case we've got a signal while waiting, we do not restore the
- * signal mask yet, and we allow do_signal() to deliver the signal on
- * the way back to userspace, before the signal mask is restored.
- */
- if (sigmask) {
- if (err == -EINTR) {
- memcpy(&current->saved_sigmask, &sigsaved,
- sizeof(sigsaved));
- set_restore_sigmask();
- } else
- set_current_blocked(&sigsaved);
- }
-
- return err;
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
}
#endif
--
1.9.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
[not found] ` <1423818243-15410-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-13 9:03 ` [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
@ 2015-02-13 9:04 ` Fam Zheng
1 sibling, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-02-13 9:04 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro, Andrew Morton,
Kees Cook, Andy Lutomirski, David Herrmann, Alexei Starovoitov,
Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
Fam Zheng, Peter Zijlstra <peter>
This new syscall is a batched version of epoll_ctl. It executes the commands
given in cmds in order, stopping at the first failure or upon completion of
all commands.
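For illustration, a minimal userspace sketch of the intended calling convention
(hypothetical: there is no libc wrapper, so it invokes the syscall number
directly, and it assumes the patched uapi header exports struct epoll_ctl_cmd
and that __NR_epoll_ctl_batch is defined):
    #include <sys/epoll.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    /* Hypothetical sketch: add two descriptors to an epoll set in one call. */
    static int add_two(int epfd, int fd1, int fd2)
    {
            struct epoll_ctl_cmd cmds[2] = {
                    { .op = EPOLL_CTL_ADD, .fd = fd1, .events = EPOLLIN,  .data = 1 },
                    { .op = EPOLL_CTL_ADD, .fd = fd2, .events = EPOLLOUT, .data = 2 },
            };
            int done = syscall(__NR_epoll_ctl_batch, epfd, 0 /* flags */, 2, cmds);
            if (done < 0)
                    return -1;                 /* whole call failed; see errno */
            if (done < 2)
                    return cmds[done].result;  /* first failed command's error */
            return 0;
    }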
Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
fs/eventpoll.c | 48 ++++++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 4 ++++
include/uapi/linux/eventpoll.h | 11 ++++++++++
3 files changed, 63 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6a2b0a4..12e2e63 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2069,6 +2069,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
sigmask ? &ksigmask : NULL);
}
+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+ int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+ struct epoll_ctl_cmd *kcmds = NULL;
+ int i, r, ret = 0;
+ int cmd_size;
+
+ if (flags)
+ return -EINVAL;
+ if (ncmds <= 0 || !cmds)
+ return -EINVAL;
+ cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+ kcmds = kmalloc(cmd_size, GFP_KERNEL);
+ if (!kcmds)
+ return -ENOMEM;
+ if (copy_from_user(kcmds, cmds, cmd_size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ for (i = 0; i < ncmds; i++) {
+ struct epoll_event ev = (struct epoll_event) {
+ .events = kcmds[i].events,
+ .data = kcmds[i].data,
+ };
+ if (kcmds[i].flags) {
+ kcmds[i].result = -EINVAL;
+ goto copy;
+ }
+ kcmds[i].result = ep_ctl_do(epfd, kcmds[i].op,
+ kcmds[i].fd, ev);
+ if (kcmds[i].result)
+ goto copy;
+ ret++;
+ }
+copy:
+ r = copy_to_user(cmds, kcmds,
+ sizeof(struct epoll_ctl_cmd) * ncmds);
+ /* Failing to copy the command results back will leave
+ * userspace no way to know the actual error code, but we still
+ * report the number of succeeded commands with ret, so it's
+ * not a big problem. Ignore it for now.
+ */
+ (void) r;
+out:
+ kfree(kcmds);
+ return ret;
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..7d784e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
#define _LINUX_SYSCALLS_H
struct epoll_event;
+struct epoll_ctl_cmd;
struct iattr;
struct inode;
struct iocb;
@@ -634,6 +635,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+ int ncmds,
+ struct epoll_ctl_cmd __user *cmds);
asmlinkage long sys_gethostname(char __user *name, int len);
asmlinkage long sys_sethostname(char __user *name, int len);
asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..4e18b17 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
#include <linux/fcntl.h>
#include <linux/types.h>
+#include <linux/signal.h>
+
/* Flags for epoll_create1. */
#define EPOLL_CLOEXEC O_CLOEXEC
@@ -61,6 +63,15 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;
+struct epoll_ctl_cmd {
+ int flags;
+ int op;
+ int fd;
+ __u32 events;
+ __u64 data;
+ int result;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC v3 5/7] x86: Hook up epoll_ctl_batch syscall
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (2 preceding siblings ...)
[not found] ` <1423818243-15410-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-13 9:04 ` Fam Zheng
2015-02-13 9:04 ` [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
` (3 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-02-13 9:04 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..fe809f6 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 epoll_ctl_batch sys_epoll_ctl_batch
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..67b2ea4 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 64 epoll_ctl_batch sys_epoll_ctl_batch
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (3 preceding siblings ...)
2015-02-13 9:04 ` [PATCH RFC v3 5/7] x86: Hook up epoll_ctl_batch syscall Fam Zheng
@ 2015-02-13 9:04 ` Fam Zheng
2015-02-13 9:04 ` [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
` (2 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-02-13 9:04 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
This is the new wait implementation: it takes a flags argument and packs the
remaining parameters into a structure.
Its main advantage over the existing epoll_pwait is the timeout handling:
epoll_pwait expects a relative timeout in milliseconds, while epoll_pwait1
accepts 1) a timespec, which gives nanosecond granularity, and 2) a clockid,
which allows using a clock other than CLOCK_MONOTONIC.
The 'flags' argument is reserved for now and must be zero. A next step would
be to allow an absolute timeout value.
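A hedged userspace sketch of the call (hypothetical: it assumes
__NR_epoll_pwait1 is wired up and that the patched uapi header exports
struct epoll_wait_params):
    #include <signal.h>
    #include <sys/epoll.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>
    /* Hypothetical sketch: wait up to 500 microseconds on CLOCK_MONOTONIC
     * without changing the signal mask.
     */
    static int wait_500us(int epfd, struct epoll_event *events, int maxevents)
    {
            struct epoll_wait_params p = {
                    .clockid    = CLOCK_MONOTONIC,
                    .timeout    = { .tv_sec = 0, .tv_nsec = 500 * 1000 },
                    .sigmask    = NULL,
                    .sigsetsize = sizeof(sigset_t),
            };
            return syscall(__NR_epoll_pwait1, epfd, 0 /* flags */, events,
                           maxevents, &p);
    }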
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 27 +++++++++++++++++++++++++++
include/linux/syscalls.h | 5 +++++
include/uapi/linux/eventpoll.h | 7 +++++++
3 files changed, 39 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 12e2e63..cc51e8c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2117,6 +2117,33 @@ out:
return ret;
}
+SYSCALL_DEFINE5(epoll_pwait1, int, epfd, int, flags,
+ struct epoll_event __user *, events,
+ int, maxevents,
+ struct epoll_wait_params __user *, params)
+{
+ struct epoll_wait_params p;
+ ktime_t kt = { 0 };
+ sigset_t sigmask;
+
+ if (flags)
+ return -EINVAL;
+ if (!params)
+ return -EINVAL;
+ if (copy_from_user(&p, params, sizeof(p)))
+ return -EFAULT;
+ if (p.sigmask) {
+ if (copy_from_user(&sigmask, p.sigmask, sizeof(sigmask)))
+ return -EFAULT;
+ if (p.sigsetsize != sizeof(p.sigmask))
+ return -EINVAL;
+ }
+ kt = timespec_to_ktime(p.timeout);
+
+ return epoll_pwait_do(epfd, events, maxevents, p.clockid,
+ kt, p.sigmask ? &sigmask : NULL);
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7d784e3..a4823d9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -13,6 +13,7 @@
struct epoll_event;
struct epoll_ctl_cmd;
+struct epoll_wait_params;
struct iattr;
struct inode;
struct iocb;
@@ -635,6 +636,10 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_pwait1(int epfd, int flags,
+ struct epoll_event __user *events,
+ int maxevents,
+ struct epoll_wait_params __user *params);
asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
int ncmds,
struct epoll_ctl_cmd __user *cmds);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 4e18b17..d35c591 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -72,6 +72,13 @@ struct epoll_ctl_cmd {
int result;
} EPOLL_PACKED;
+struct epoll_wait_params {
+ int clockid;
+ struct timespec timeout;
+ sigset_t *sigmask;
+ size_t sigsetsize;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (4 preceding siblings ...)
2015-02-13 9:04 ` [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
@ 2015-02-13 9:04 ` Fam Zheng
2015-02-13 9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
2015-02-15 22:00 ` Jonathan Corbet
7 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-02-13 9:04 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index fe809f6..bf912d8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -366,3 +366,4 @@
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
359 i386 epoll_ctl_batch sys_epoll_ctl_batch
+360 i386 epoll_pwait1 sys_epoll_pwait1
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 67b2ea4..9246ad5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
321 common bpf sys_bpf
322 64 execveat stub_execveat
323 64 epoll_ctl_batch sys_epoll_ctl_batch
+324 64 epoll_pwait1 sys_epoll_pwait1
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (5 preceding siblings ...)
2015-02-13 9:04 ` [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
@ 2015-02-13 9:53 ` Omar Sandoval
2015-02-15 15:16 ` Michael Kerrisk (man-pages)
2015-02-15 22:00 ` Jonathan Corbet
7 siblings, 1 reply; 16+ messages in thread
From: Omar Sandoval @ 2015-02-13 9:53 UTC (permalink / raw)
To: Fam Zheng
Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
Peter Zijlstra <peterz@
On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
> Hi all,
>
> This is the updated series for the new epoll system calls, with the cover
> letter rewritten which includes some more explanation. Comments are very
> welcome!
>
> Original Motivation
> ===================
>
> QEMU, and probably many more select/poll based applications, will consider
> epoll as an alternative when their event loops need to handle a large number
> of fds. However, there are currently two concerns with epoll which prevent
> switching from ppoll to epoll.
>
> The major one is the timeout precision.
>
> For example in QEMU, the main loop takes care of calling callbacks at a
> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
> rounding up the timeout will hurt performance badly.
>
> The minor one is the number of system calls needed to update the fd set.
>
> While epoll can handle a large number of fds quickly, it still requires one
> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
> fd array. This may as well make epoll inferior to ppoll in the cases where a
> small, but frequently changing set of fds are polled by the event loop.
>
> This series introduces two new epoll APIs to address them respectively. The
> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
> suggested clockid as a parameter in epoll_pwait1.
>
> Discussion
> ==========
>
> [Note: This is the question part regarding the interface contract of
> epoll_ctl_batch. If you don't yet have the context of what epoll_ctl_batch is,
> please skip this part and start with the man-page-style documentation. You can
> return to this section later.]
>
> [Thanks to Omar Sandoval <osandov@osandov.com>, who pointed this out while
> reviewing v1]
>
> We try to report a status for each command in epoll_ctl_batch, by writing to
> the user-provided command array (pointed to by cmds). The tricky thing in the
> implementation is that copying the results back to userspace comes last, after
> the commands are executed. At this point, if the copy_to_user fails, the
> effects are already done and there is no going back - and if we added some
> mechanism to revert them, the code would be too complicated and slow.
>
> In the above corner case, the return value of epoll_ctl_batch is smaller than
> ncmds, which assures the caller that the last N commands failed, where
> N = ncmds - ret. But they'll also find that cmd.result has not been changed to
> an error code.
>
> I suppose this is not a big deal, because 1) it should happen very rarely, and
> 2) the user does know the actual number of commands that succeeded.
>
> So, do we leave it as is? Or is there any way to improve?
>
> One tiny improvement (not a complete fix) in my mind is a trial copy_to_user
> before we execute the commands. If it succeeds, it's even less likely that the
> final copy_to_user could fail, so we can probably even assert it won't. The
> trial 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if we do it
> right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
> performance impact, especially when @ncmds is big.
>
I don't think this is the right thing to do, since, for example, another
thread could unmap the memory region holding the buffer between the "check"
copy_to_user and the actual one.
The two alternatives that I see are:
1. If copy_to_user fails, ignore it and return the number of commands
that succeeded (i.e., what the code in your patch does).
2. If copy_to_user fails, return -EFAULT, regardless of how many
commands succeeded.
The problem with 1 is that it could potentially mask bugs in a user
program. You could imagine a buggy program that passes a read-only
buffer to epoll_ctl_batch and never finds out about it because they
don't get the error. (Then, when there's a real error in one of the
epoll_ctl_cmds, but .result is 0, someone will be very confused.)
So I think 2 is the better option. Sure, the user will have no idea how
many commands were executed, but when would EFAULT on this system call
be part of normal operation, anyways? You'll find the memory bug, fix
it, and rest easy knowing that nothing is silently failing behind your
back.
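For concreteness, a rough sketch of option 2 against the tail of the
epoll_ctl_batch implementation in patch 4 (illustrative only, not a tested
patch):
    copy:
            /* Option 2 sketch: report the copy failure instead of ignoring
             * it; the per-command results can no longer be trusted by the
             * caller anyway.
             */
            if (copy_to_user(cmds, kcmds, sizeof(struct epoll_ctl_cmd) * ncmds))
                    ret = -EFAULT;
    out:
            kfree(kcmds);
            return ret;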
> Links
> =====
>
> [1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
>
> Changelog
> =========
>
> Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
> ----------------------------------------------------
>
> - Rename epoll_ctl_cmd.error_hint to "result". [Michael]
>
> - Add background introduction in cover letter. [Michael]
>
> - Expand the last struct of epoll_pwait1, add clockid and timespec.
>
> - Update man page in cover letter accordingly:
>
> * "error_hint" -> "result".
> * The result field's caveat in the "RETURN VALUE" section of epoll_ctl_batch.
>
> Please review!
>
> Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
> -----------------------------------------------------
>
> - As discussed in previous thread [1], split the call to epoll_ctl_batch and
> epoll_pwait. [Michael]
>
> - Fix memory leaks. [Omar]
>
> - Add a short comment about the ignored copy_to_user failure. [Omar]
>
> - Cover letter rewritten.
>
[snip]
--
Omar
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
2015-02-13 9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
@ 2015-02-15 15:16 ` Michael Kerrisk (man-pages)
0 siblings, 0 replies; 16+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-02-15 15:16 UTC (permalink / raw)
To: Omar Sandoval, Fam Zheng
Cc: mtk.manpages, linux-kernel, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
Andy Lutomirski, David Herrmann, Alexei Starovoitov,
Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers
On 02/13/2015 10:53 AM, Omar Sandoval wrote:
> On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
>> Hi all,
>> [...]
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding the buffer between the "check"
> copy_to_user and the actual one.
>
> The two alternatives that I see are:
>
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
>
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
>
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.
What Omar says makes sense to me too. Best to have the user get a clear
error indication for this case.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
2015-02-13 9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (6 preceding siblings ...)
2015-02-13 9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
@ 2015-02-15 22:00 ` Jonathan Corbet
[not found] ` <20150215150011.0340686c-T1hC0tSOHrs@public.gmane.org>
7 siblings, 1 reply; 16+ messages in thread
From: Jonathan Corbet @ 2015-02-15 22:00 UTC (permalink / raw)
To: Fam Zheng
Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
Peter Zijlstra <peter>
On Fri, 13 Feb 2015 17:03:56 +0800
Fam Zheng <famz@redhat.com> wrote:
> SYNOPSIS
>
> #include <sys/epoll.h>
>
> int epoll_pwait1(int epfd, int flags,
> struct epoll_event *events,
> int maxevents,
> struct epoll_wait_params *params);
Quick, possibly dumb question: might it make sense to also pass in
sizeof(struct epoll_wait_params)? That way, when somebody wants to add
another parameter in the future, the kernel can tell which version is in
use and they won't have to do an epoll_pwait2()?
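For concreteness, a rough sketch of that pattern on the kernel side
(illustrative only; the helper name and the minimum-size constant are
hypothetical, in the style of sched_setattr's extensible struct):
    static int fetch_wait_params(struct epoll_wait_params *p,
                                 const void __user *uparams, size_t size)
    {
            memset(p, 0, sizeof(*p));   /* fields an old caller omits read as 0 */
            if (size < EPOLL_WAIT_PARAMS_SIZE_VER0) /* hypothetical v0 size */
                    return -EINVAL;
            if (size > sizeof(*p))
                    return -E2BIG;      /* or verify the extra bytes are zero */
            if (copy_from_user(p, uparams, size))
                    return -EFAULT;
            return 0;
    }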
Thanks,
jon
^ permalink raw reply [flat|nested] 16+ messages in thread