* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall [not found] <1420705550-24245-1-git-send-email-famz@redhat.com> @ 2015-01-08 9:12 ` Miklos Szeredi [not found] ` <1420708372.18399.15.camel-AlSwsSmVLrQ@public.gmane.org> 2015-01-08 17:57 ` Andy Lutomirski 0 siblings, 2 replies; 15+ messages in thread From: Miklos Szeredi @ 2015-01-08 9:12 UTC (permalink / raw) To: Fam Zheng Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers, Fabian Frederick, Josh Triplett On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote: > Applications could use epoll interface when then need to poll a big number of > files in their main loops, to achieve better performance than ppoll(2). Except > for one concern: epoll only takes timeout parameters in microseconds, rather > than nanoseconds. > > That is a drawback we should address. For a real case in QEMU, we run into a > scalability issue with ppoll(2) when many devices are attached to guest, in > which case many host fds, such as virtual disk images and sockets, need to be > polled by the main loop. As a result we are looking at switching to epoll, but > the coarse timeout precision is a trouble, as explained below. > > We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement > timers in the main loop; and we call ppoll(2) with the next firing timer as > timeout, so when ppoll(2) returns, we know that we have more work to do (either > handling IO events, or fire a timer callback). This is natual and efficient, > except that ppoll(2) itself is slow. > > Now that we want to switch to epoll, to speed up the polling. However the timer > slack setting will be effectively undone, because that way we will have to > round up the timeout to microseconds honoring timer contract. But consequently, > this hurts the general responsiveness. > > Note: there are two alternatives, without changing kernel: > > 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't > be slow as one fd is polled. No more scalability issue. And if there are > events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with > timeout=0; otherwise, there can't be events for the epoll, skip the following > epoll_wait and just continue with other work. > > 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1). > The timerfd will hopefully force epoll_wait to return when it timeouts, even if > no other events have arrived. This will inheritly give us timerfd's precision. > Note that for each poll, the desired timeout is different because the next > timer is different, so that, before each epoll_wait(2), there will be a > timerfd_settime syscall to set it to a proper value. > > Unfortunately, both approaches require one more syscall per iteration, compared > to the original single ppoll(2), cost of which is unneglectable when we talk > about nanosecond granularity. Please consider adding a "flags" argument to the new syscall (and returning EINVAL if non-zero). See this article, which shows that extended syscalls almost always want flags, and they often get it only on the second try: http://lwn.net/Articles/585415/ Thanks, Miklos P.S. stray apostrophes in To: and Cc: lines seems to be causing trouble. ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <1420708372.18399.15.camel-AlSwsSmVLrQ@public.gmane.org>]
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall [not found] ` <1420708372.18399.15.camel-AlSwsSmVLrQ@public.gmane.org> @ 2015-01-08 11:07 ` Michael Kerrisk (man-pages) 0 siblings, 0 replies; 15+ messages in thread From: Michael Kerrisk (man-pages) @ 2015-01-08 11:07 UTC (permalink / raw) To: Miklos Szeredi Cc: Fam Zheng, lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers On 8 January 2015 at 10:12, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote: > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote: >> Applications could use epoll interface when then need to poll a big number of >> files in their main loops, to achieve better performance than ppoll(2). Except >> for one concern: epoll only takes timeout parameters in microseconds, rather >> than nanoseconds. >> >> That is a drawback we should address. For a real case in QEMU, we run into a >> scalability issue with ppoll(2) when many devices are attached to guest, in >> which case many host fds, such as virtual disk images and sockets, need to be >> polled by the main loop. As a result we are looking at switching to epoll, but >> the coarse timeout precision is a trouble, as explained below. >> >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement >> timers in the main loop; and we call ppoll(2) with the next firing timer as >> timeout, so when ppoll(2) returns, we know that we have more work to do (either >> handling IO events, or fire a timer callback). This is natual and efficient, >> except that ppoll(2) itself is slow. >> >> Now that we want to switch to epoll, to speed up the polling. However the timer >> slack setting will be effectively undone, because that way we will have to >> round up the timeout to microseconds honoring timer contract. But consequently, >> this hurts the general responsiveness. >> >> Note: there are two alternatives, without changing kernel: >> >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't >> be slow as one fd is polled. No more scalability issue. And if there are >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with >> timeout=0; otherwise, there can't be events for the epoll, skip the following >> epoll_wait and just continue with other work. >> >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1). >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if >> no other events have arrived. This will inheritly give us timerfd's precision. >> Note that for each poll, the desired timeout is different because the next >> timer is different, so that, before each epoll_wait(2), there will be a >> timerfd_settime syscall to set it to a proper value. >> >> Unfortunately, both approaches require one more syscall per iteration, compared >> to the original single ppoll(2), cost of which is unneglectable when we talk >> about nanosecond granularity. > > Please consider adding a "flags" argument to the new syscall (and > returning EINVAL if non-zero). See this article, which shows that > extended syscalls almost always want flags, and they often get it only > on the second try: > > http://lwn.net/Articles/585415/ Yes, please ensure that the new call gets a flags argument. We've made this mistake way too many times in the past. Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-08 9:12 ` [PATCH 0/3] epoll: Add epoll_pwait1 syscall Miklos Szeredi [not found] ` <1420708372.18399.15.camel-AlSwsSmVLrQ@public.gmane.org> @ 2015-01-08 17:57 ` Andy Lutomirski [not found] ` <CALCETrVyPij1Zxwmw7p06UrZjoyYDXqEjmxyQ-KJ8Y7dx7mL3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2015-01-08 17:57 UTC (permalink / raw) To: Miklos Szeredi Cc: Fam Zheng, linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers, Fab On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi@suse.cz> wrote: > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote: >> Applications could use epoll interface when then need to poll a big number of >> files in their main loops, to achieve better performance than ppoll(2). Except >> for one concern: epoll only takes timeout parameters in microseconds, rather >> than nanoseconds. >> >> That is a drawback we should address. For a real case in QEMU, we run into a >> scalability issue with ppoll(2) when many devices are attached to guest, in >> which case many host fds, such as virtual disk images and sockets, need to be >> polled by the main loop. As a result we are looking at switching to epoll, but >> the coarse timeout precision is a trouble, as explained below. >> >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement >> timers in the main loop; and we call ppoll(2) with the next firing timer as >> timeout, so when ppoll(2) returns, we know that we have more work to do (either >> handling IO events, or fire a timer callback). This is natual and efficient, >> except that ppoll(2) itself is slow. >> >> Now that we want to switch to epoll, to speed up the polling. However the timer >> slack setting will be effectively undone, because that way we will have to >> round up the timeout to microseconds honoring timer contract. But consequently, >> this hurts the general responsiveness. >> >> Note: there are two alternatives, without changing kernel: >> >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't >> be slow as one fd is polled. No more scalability issue. And if there are >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with >> timeout=0; otherwise, there can't be events for the epoll, skip the following >> epoll_wait and just continue with other work. >> >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1). >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if >> no other events have arrived. This will inheritly give us timerfd's precision. >> Note that for each poll, the desired timeout is different because the next >> timer is different, so that, before each epoll_wait(2), there will be a >> timerfd_settime syscall to set it to a proper value. >> >> Unfortunately, both approaches require one more syscall per iteration, compared >> to the original single ppoll(2), cost of which is unneglectable when we talk >> about nanosecond granularity. I'd like to see a more ambitious change, since the timer isn't the only problem like this. Specifically, I'd like a syscall that does a list of epoll-related things and then waits. The list of things could include, at least: - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to want to turn on and off their requests for events on a somewhat regular basis. - timerfd_settime actions: this allows a single syscall to wait and adjust *both* monotonic and real-time wakeups. Would this make sense? It could look like: int epoll_mod_and_pwait(int epfd, struct epoll_event *events, int maxevents, struct epoll_command *commands, int ncommands, const sigset_t *sigmask); --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <CALCETrVyPij1Zxwmw7p06UrZjoyYDXqEjmxyQ-KJ8Y7dx7mL3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall [not found] ` <CALCETrVyPij1Zxwmw7p06UrZjoyYDXqEjmxyQ-KJ8Y7dx7mL3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-01-08 18:42 ` josh-iaAMLnmF4UmaiuxdJuQwMA 2015-01-08 19:31 ` Alexei Starovoitov 2015-01-08 19:42 ` Andy Lutomirski 2015-01-09 1:25 ` Fam Zheng 1 sibling, 2 replies; 15+ messages in thread From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-01-08 18:42 UTC (permalink / raw) To: Andy Lutomirski Cc: Miklos Szeredi, Fam Zheng, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers <math> On Thu, Jan 08, 2015 at 09:57:24AM -0800, Andy Lutomirski wrote: > On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote: > > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote: > >> Applications could use epoll interface when then need to poll a big number of > >> files in their main loops, to achieve better performance than ppoll(2). Except > >> for one concern: epoll only takes timeout parameters in microseconds, rather > >> than nanoseconds. > >> > >> That is a drawback we should address. For a real case in QEMU, we run into a > >> scalability issue with ppoll(2) when many devices are attached to guest, in > >> which case many host fds, such as virtual disk images and sockets, need to be > >> polled by the main loop. As a result we are looking at switching to epoll, but > >> the coarse timeout precision is a trouble, as explained below. > >> > >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement > >> timers in the main loop; and we call ppoll(2) with the next firing timer as > >> timeout, so when ppoll(2) returns, we know that we have more work to do (either > >> handling IO events, or fire a timer callback). This is natual and efficient, > >> except that ppoll(2) itself is slow. > >> > >> Now that we want to switch to epoll, to speed up the polling. However the timer > >> slack setting will be effectively undone, because that way we will have to > >> round up the timeout to microseconds honoring timer contract. But consequently, > >> this hurts the general responsiveness. > >> > >> Note: there are two alternatives, without changing kernel: > >> > >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't > >> be slow as one fd is polled. No more scalability issue. And if there are > >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with > >> timeout=0; otherwise, there can't be events for the epoll, skip the following > >> epoll_wait and just continue with other work. > >> > >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1). > >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if > >> no other events have arrived. This will inheritly give us timerfd's precision. > >> Note that for each poll, the desired timeout is different because the next > >> timer is different, so that, before each epoll_wait(2), there will be a > >> timerfd_settime syscall to set it to a proper value. > >> > >> Unfortunately, both approaches require one more syscall per iteration, compared > >> to the original single ppoll(2), cost of which is unneglectable when we talk > >> about nanosecond granularity. > > I'd like to see a more ambitious change, since the timer isn't the > only problem like this. Specifically, I'd like a syscall that does a > list of epoll-related things and then waits. The list of things could > include, at least: > > - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > want to turn on and off their requests for events on a somewhat > regular basis. > > - timerfd_settime actions: this allows a single syscall to wait and > adjust *both* monotonic and real-time wakeups. > > Would this make sense? It could look like: > > int epoll_mod_and_pwait(int epfd, > struct epoll_event *events, int maxevents, > struct epoll_command *commands, int ncommands, > const sigset_t *sigmask); That's a complicated syscall. (And it also doesn't have room for the flags argument.) At that point, why not just have a syscall like this: struct syscall { unsigned long num; unsigned long params[6]; }; int sys_many(size_t count, struct syscall *syscalls, int *results, unsigned long flags); I think that has been discussed in the past. Or, these days, that might be better done via eBPF, which would avoid the need for flags like "return on error"; an eBPF program could decide how to proceed after each call. - Josh Triplett ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-08 18:42 ` josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-01-08 19:31 ` Alexei Starovoitov 2015-01-08 19:42 ` Andy Lutomirski 1 sibling, 0 replies; 15+ messages in thread From: Alexei Starovoitov @ 2015-01-08 19:31 UTC (permalink / raw) To: Josh Triplett Cc: Andy Lutomirski, Miklos Szeredi, Fam Zheng, linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov On Thu, Jan 8, 2015 at 10:42 AM, <josh@joshtriplett.org> wrote: >> I'd like to see a more ambitious change, since the timer isn't the >> only problem like this. Specifically, I'd like a syscall that does a >> list of epoll-related things and then waits. The list of things could >> include, at least: >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to >> want to turn on and off their requests for events on a somewhat >> regular basis. >> >> - timerfd_settime actions: this allows a single syscall to wait and >> adjust *both* monotonic and real-time wakeups. >> >> Would this make sense? It could look like: >> >> int epoll_mod_and_pwait(int epfd, >> struct epoll_event *events, int maxevents, >> struct epoll_command *commands, int ncommands, >> const sigset_t *sigmask); > > That's a complicated syscall. (And it also doesn't have room for the > flags argument.) > > At that point, why not just have a syscall like this: > > struct syscall { > unsigned long num; > unsigned long params[6]; > }; > > int sys_many(size_t count, struct syscall *syscalls, int *results, unsigned long flags); > > I think that has been discussed in the past. > > Or, these days, that might be better done via eBPF, which would avoid > the need for flags like "return on error"; an eBPF program could decide > how to proceed after each call. I'm afraid that will break seccomp or will make it much more complicated. Also I think syscall latency with the latest improvements is actually quite fast, so chaining of syscalls is probably an overkill. Same goes to Andy's argument of doing immediate CTL_MOD. Let user space do it. It makes sense to combine things only when atomicity is needed. Here it's not the case. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-08 18:42 ` josh-iaAMLnmF4UmaiuxdJuQwMA 2015-01-08 19:31 ` Alexei Starovoitov @ 2015-01-08 19:42 ` Andy Lutomirski 1 sibling, 0 replies; 15+ messages in thread From: Andy Lutomirski @ 2015-01-08 19:42 UTC (permalink / raw) To: Josh Triplett Cc: Michael Kerrisk, Mathieu Desnoyers, Zach Brown, Fam Zheng, Vivek Goyal, Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Alexander Viro, Mike Frysinger, Linux FS Devel, Alexei Starovoitov, Rasmus Villemoes, Stefan Hajnoczi, Dario Faggioli, Paolo Bonzini, Juri Lelli, X86 ML, Oleg Nesterov, Heiko Carstens, H. Peter Anvin, Andrew Morton, Linux API, "David S. Miller" <dav> On Jan 8, 2015 10:42 AM, <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote: > > On Thu, Jan 08, 2015 at 09:57:24AM -0800, Andy Lutomirski wrote: > > On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote: > > > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote: > > >> Applications could use epoll interface when then need to poll a big number of > > >> files in their main loops, to achieve better performance than ppoll(2). Except > > >> for one concern: epoll only takes timeout parameters in microseconds, rather > > >> than nanoseconds. > > >> > > >> That is a drawback we should address. For a real case in QEMU, we run into a > > >> scalability issue with ppoll(2) when many devices are attached to guest, in > > >> which case many host fds, such as virtual disk images and sockets, need to be > > >> polled by the main loop. As a result we are looking at switching to epoll, but > > >> the coarse timeout precision is a trouble, as explained below. > > >> > > >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement > > >> timers in the main loop; and we call ppoll(2) with the next firing timer as > > >> timeout, so when ppoll(2) returns, we know that we have more work to do (either > > >> handling IO events, or fire a timer callback). This is natual and efficient, > > >> except that ppoll(2) itself is slow. > > >> > > >> Now that we want to switch to epoll, to speed up the polling. However the timer > > >> slack setting will be effectively undone, because that way we will have to > > >> round up the timeout to microseconds honoring timer contract. But consequently, > > >> this hurts the general responsiveness. > > >> > > >> Note: there are two alternatives, without changing kernel: > > >> > > >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't > > >> be slow as one fd is polled. No more scalability issue. And if there are > > >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with > > >> timeout=0; otherwise, there can't be events for the epoll, skip the following > > >> epoll_wait and just continue with other work. > > >> > > >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1). > > >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if > > >> no other events have arrived. This will inheritly give us timerfd's precision. > > >> Note that for each poll, the desired timeout is different because the next > > >> timer is different, so that, before each epoll_wait(2), there will be a > > >> timerfd_settime syscall to set it to a proper value. > > >> > > >> Unfortunately, both approaches require one more syscall per iteration, compared > > >> to the original single ppoll(2), cost of which is unneglectable when we talk > > >> about nanosecond granularity. > > > > I'd like to see a more ambitious change, since the timer isn't the > > only problem like this. Specifically, I'd like a syscall that does a > > list of epoll-related things and then waits. The list of things could > > include, at least: > > > > - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > > want to turn on and off their requests for events on a somewhat > > regular basis. > > > > - timerfd_settime actions: this allows a single syscall to wait and > > adjust *both* monotonic and real-time wakeups. > > > > Would this make sense? It could look like: > > > > int epoll_mod_and_pwait(int epfd, > > struct epoll_event *events, int maxevents, > > struct epoll_command *commands, int ncommands, > > const sigset_t *sigmask); > > That's a complicated syscall. (And it also doesn't have room for the > flags argument.) > > At that point, why not just have a syscall like this: > > struct syscall { > unsigned long num; > unsigned long params[6]; > }; Too minimize locking and such. The user / kernel transitions are very fast in some configurations, and I suspect that all the fd lookups, epoll data structure locks, etc are important, too. This particular pattern (EPOLL_CTL_MOD, then timerfd_settime, then epoll_pwait) is so common that optimizing it from three syscalls to one (as opposed to two as in your patch) seems like it could be worthwhile. syscall entry + exit latency is pretty good (~180 cycles total, maybe) these days, but only in some configurations. In others, it blows up pretty badly. I'm slowly working on improving that. Also, simplicity is a win. The multi-syscall thing has never gone smoothly, and, as Alexei pointed out, it's a big mess for seccomp. > > int sys_many(size_t count, struct syscall *syscalls, int *results, unsigned long flags); > > I think that has been discussed in the past. > > Or, these days, that might be better done via eBPF, which would avoid > the need for flags like "return on error"; an eBPF program could decide > how to proceed after each call. > > - Josh Triplett ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall [not found] ` <CALCETrVyPij1Zxwmw7p06UrZjoyYDXqEjmxyQ-KJ8Y7dx7mL3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-01-08 18:42 ` josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-01-09 1:25 ` Fam Zheng [not found] ` <20150109011608.GA2924-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org> 1 sibling, 1 reply; 15+ messages in thread From: Fam Zheng @ 2015-01-09 1:25 UTC (permalink / raw) To: Andy Lutomirski Cc: Miklos Szeredi, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers On Thu, 01/08 09:57, Andy Lutomirski wrote: > On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote: > > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote: > >> Applications could use epoll interface when then need to poll a big number of > >> files in their main loops, to achieve better performance than ppoll(2). Except > >> for one concern: epoll only takes timeout parameters in microseconds, rather > >> than nanoseconds. > >> > >> That is a drawback we should address. For a real case in QEMU, we run into a > >> scalability issue with ppoll(2) when many devices are attached to guest, in > >> which case many host fds, such as virtual disk images and sockets, need to be > >> polled by the main loop. As a result we are looking at switching to epoll, but > >> the coarse timeout precision is a trouble, as explained below. > >> > >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement > >> timers in the main loop; and we call ppoll(2) with the next firing timer as > >> timeout, so when ppoll(2) returns, we know that we have more work to do (either > >> handling IO events, or fire a timer callback). This is natual and efficient, > >> except that ppoll(2) itself is slow. > >> > >> Now that we want to switch to epoll, to speed up the polling. However the timer > >> slack setting will be effectively undone, because that way we will have to > >> round up the timeout to microseconds honoring timer contract. But consequently, > >> this hurts the general responsiveness. > >> > >> Note: there are two alternatives, without changing kernel: > >> > >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't > >> be slow as one fd is polled. No more scalability issue. And if there are > >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with > >> timeout=0; otherwise, there can't be events for the epoll, skip the following > >> epoll_wait and just continue with other work. > >> > >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1). > >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if > >> no other events have arrived. This will inheritly give us timerfd's precision. > >> Note that for each poll, the desired timeout is different because the next > >> timer is different, so that, before each epoll_wait(2), there will be a > >> timerfd_settime syscall to set it to a proper value. > >> > >> Unfortunately, both approaches require one more syscall per iteration, compared > >> to the original single ppoll(2), cost of which is unneglectable when we talk > >> about nanosecond granularity. > > I'd like to see a more ambitious change, since the timer isn't the > only problem like this. Specifically, I'd like a syscall that does a > list of epoll-related things and then waits. The list of things could > include, at least: > > - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > want to turn on and off their requests for events on a somewhat > regular basis. This sounds good to me. > > - timerfd_settime actions: this allows a single syscall to wait and > adjust *both* monotonic and real-time wakeups. I'm not sure, doesn't this break orthogonality between epoll and timerfd? > > Would this make sense? It could look like: > > int epoll_mod_and_pwait(int epfd, > struct epoll_event *events, int maxevents, > struct epoll_command *commands, int ncommands, > const sigset_t *sigmask); What about flags? Fam ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <20150109011608.GA2924-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org>]
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall [not found] ` <20150109011608.GA2924-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org> @ 2015-01-09 1:28 ` Andy Lutomirski 2015-01-09 1:52 ` Fam Zheng 0 siblings, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2015-01-09 1:28 UTC (permalink / raw) To: Fam Zheng Cc: Miklos Szeredi, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Thu, 01/08 09:57, Andy Lutomirski wrote: >> On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote: >> > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote: >> >> Applications could use epoll interface when then need to poll a big number of >> >> files in their main loops, to achieve better performance than ppoll(2). Except >> >> for one concern: epoll only takes timeout parameters in microseconds, rather >> >> than nanoseconds. >> >> >> >> That is a drawback we should address. For a real case in QEMU, we run into a >> >> scalability issue with ppoll(2) when many devices are attached to guest, in >> >> which case many host fds, such as virtual disk images and sockets, need to be >> >> polled by the main loop. As a result we are looking at switching to epoll, but >> >> the coarse timeout precision is a trouble, as explained below. >> >> >> >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement >> >> timers in the main loop; and we call ppoll(2) with the next firing timer as >> >> timeout, so when ppoll(2) returns, we know that we have more work to do (either >> >> handling IO events, or fire a timer callback). This is natual and efficient, >> >> except that ppoll(2) itself is slow. >> >> >> >> Now that we want to switch to epoll, to speed up the polling. However the timer >> >> slack setting will be effectively undone, because that way we will have to >> >> round up the timeout to microseconds honoring timer contract. But consequently, >> >> this hurts the general responsiveness. >> >> >> >> Note: there are two alternatives, without changing kernel: >> >> >> >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't >> >> be slow as one fd is polled. No more scalability issue. And if there are >> >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with >> >> timeout=0; otherwise, there can't be events for the epoll, skip the following >> >> epoll_wait and just continue with other work. >> >> >> >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1). >> >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if >> >> no other events have arrived. This will inheritly give us timerfd's precision. >> >> Note that for each poll, the desired timeout is different because the next >> >> timer is different, so that, before each epoll_wait(2), there will be a >> >> timerfd_settime syscall to set it to a proper value. >> >> >> >> Unfortunately, both approaches require one more syscall per iteration, compared >> >> to the original single ppoll(2), cost of which is unneglectable when we talk >> >> about nanosecond granularity. >> >> I'd like to see a more ambitious change, since the timer isn't the >> only problem like this. Specifically, I'd like a syscall that does a >> list of epoll-related things and then waits. The list of things could >> include, at least: >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to >> want to turn on and off their requests for events on a somewhat >> regular basis. > > This sounds good to me. > >> >> - timerfd_settime actions: this allows a single syscall to wait and >> adjust *both* monotonic and real-time wakeups. > > I'm not sure, doesn't this break orthogonality between epoll and timerfd? Yes. It's not very elegant, and more elegant ideas are welcome. > >> >> Would this make sense? It could look like: >> >> int epoll_mod_and_pwait(int epfd, >> struct epoll_event *events, int maxevents, >> struct epoll_command *commands, int ncommands, >> const sigset_t *sigmask); > > What about flags? > No room. Maybe it should just be a struct for everything instead of separate args. > Fam -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-09 1:28 ` Andy Lutomirski @ 2015-01-09 1:52 ` Fam Zheng [not found] ` <20150109015248.GA5034-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Fam Zheng @ 2015-01-09 1:52 UTC (permalink / raw) To: Andy Lutomirski Cc: Miklos Szeredi, linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers On Thu, 01/08 17:28, Andy Lutomirski wrote: > On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz@redhat.com> wrote: > > On Thu, 01/08 09:57, Andy Lutomirski wrote: > >> I'd like to see a more ambitious change, since the timer isn't the > >> only problem like this. Specifically, I'd like a syscall that does a > >> list of epoll-related things and then waits. The list of things could > >> include, at least: > >> > >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > >> want to turn on and off their requests for events on a somewhat > >> regular basis. > > > > This sounds good to me. > > > >> > >> - timerfd_settime actions: this allows a single syscall to wait and > >> adjust *both* monotonic and real-time wakeups. > > > > I'm not sure, doesn't this break orthogonality between epoll and timerfd? > > Yes. It's not very elegant, and more elegant ideas are welcome. What is the purpose of embedding timerfd operation here? Modifying timerfd for each poll doesn't sound a common pattern to me. > > > > >> > >> Would this make sense? It could look like: > >> > >> int epoll_mod_and_pwait(int epfd, > >> struct epoll_event *events, int maxevents, > >> struct epoll_command *commands, int ncommands, > >> const sigset_t *sigmask); > > > > What about flags? > > > > No room. Maybe it should just be a struct for everything instead of > separate args. Also no room for timeout. A single struct sounds the only way to go. Fam ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <20150109015248.GA5034-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org>]
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall [not found] ` <20150109015248.GA5034-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org> @ 2015-01-09 2:24 ` Andy Lutomirski 2015-01-09 4:49 ` Fam Zheng 0 siblings, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2015-01-09 2:24 UTC (permalink / raw) To: Fam Zheng Cc: Miklos Szeredi, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers On Thu, Jan 8, 2015 at 5:52 PM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Thu, 01/08 17:28, Andy Lutomirski wrote: >> On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Thu, 01/08 09:57, Andy Lutomirski wrote: >> >> I'd like to see a more ambitious change, since the timer isn't the >> >> only problem like this. Specifically, I'd like a syscall that does a >> >> list of epoll-related things and then waits. The list of things could >> >> include, at least: >> >> >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to >> >> want to turn on and off their requests for events on a somewhat >> >> regular basis. >> > >> > This sounds good to me. >> > >> >> >> >> - timerfd_settime actions: this allows a single syscall to wait and >> >> adjust *both* monotonic and real-time wakeups. >> > >> > I'm not sure, doesn't this break orthogonality between epoll and timerfd? >> >> Yes. It's not very elegant, and more elegant ideas are welcome. > > What is the purpose of embedding timerfd operation here? Modifying timerfd > for each poll doesn't sound a common pattern to me. Setting a timeout is definitely a common pattern, hence this thread. But the current timeout interface sucks, and people should really use absolute time. (My epoll software uses absolute time.) But then users need to decide whether to have their timeout based on the monotonic clock or the realtime clock (or something else entirely). Some bigger programs may want both -- they may have internal events queued for certain times and for certain timeouts, and those should use realtime and monotonic respectively. Heck, users may also want separate slack values on those. Timerfd is the only thing we have right now that is anywhere near flexible enough. Obviously if epoll became fancy enough, then we could do away with the timerfd entirely here. > >> >> > >> >> >> >> Would this make sense? It could look like: >> >> >> >> int epoll_mod_and_pwait(int epfd, >> >> struct epoll_event *events, int maxevents, >> >> struct epoll_command *commands, int ncommands, >> >> const sigset_t *sigmask); >> > >> > What about flags? >> > >> >> No room. Maybe it should just be a struct for everything instead of >> separate args. > > Also no room for timeout. A single struct sounds the only way to go. That's what timerfd is for. I think it would be a bit weird to support "timeout" and detailed timerfd control. --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-09 2:24 ` Andy Lutomirski @ 2015-01-09 4:49 ` Fam Zheng 2015-01-09 5:21 ` Josh Triplett 0 siblings, 1 reply; 15+ messages in thread From: Fam Zheng @ 2015-01-09 4:49 UTC (permalink / raw) To: Andy Lutomirski Cc: Miklos Szeredi, linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu Desnoyers On Thu, 01/08 18:24, Andy Lutomirski wrote: > On Thu, Jan 8, 2015 at 5:52 PM, Fam Zheng <famz@redhat.com> wrote: > > On Thu, 01/08 17:28, Andy Lutomirski wrote: > >> On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz@redhat.com> wrote: > >> > On Thu, 01/08 09:57, Andy Lutomirski wrote: > >> >> I'd like to see a more ambitious change, since the timer isn't the > >> >> only problem like this. Specifically, I'd like a syscall that does a > >> >> list of epoll-related things and then waits. The list of things could > >> >> include, at least: > >> >> > >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > >> >> want to turn on and off their requests for events on a somewhat > >> >> regular basis. > >> > > >> > This sounds good to me. > >> > > >> >> > >> >> - timerfd_settime actions: this allows a single syscall to wait and > >> >> adjust *both* monotonic and real-time wakeups. > >> > > >> > I'm not sure, doesn't this break orthogonality between epoll and timerfd? > >> > >> Yes. It's not very elegant, and more elegant ideas are welcome. > > > > What is the purpose of embedding timerfd operation here? Modifying timerfd > > for each poll doesn't sound a common pattern to me. > > Setting a timeout is definitely a common pattern, hence this thread. > But the current timeout interface sucks, and people should really use > absolute time. (My epoll software uses absolute time.) But then > users need to decide whether to have their timeout based on the > monotonic clock or the realtime clock (or something else entirely). > Some bigger programs may want both -- they may have internal events > queued for certain times and for certain timeouts, and those should > use realtime and monotonic respectively. Heck, users may also want > separate slack values on those. > > Timerfd is the only thing we have right now that is anywhere near > flexible enough. Obviously if epoll became fancy enough, then we > could do away with the timerfd entirely here. > > > > >> > >> > > >> >> > >> >> Would this make sense? It could look like: > >> >> > >> >> int epoll_mod_and_pwait(int epfd, > >> >> struct epoll_event *events, int maxevents, > >> >> struct epoll_command *commands, int ncommands, > >> >> const sigset_t *sigmask); > >> > > >> > What about flags? > >> > > >> > >> No room. Maybe it should just be a struct for everything instead of > >> separate args. > > > > Also no room for timeout. A single struct sounds the only way to go. > > That's what timerfd is for. I think it would be a bit weird to > support "timeout" and detailed timerfd control. I see what you mean. Thanks. I still don't like hooking timerfd in the interface. Besides the unclean interface, it also feels cubersome and overkill to let users setup and add a dedicated timerfd to implement timeout. How about this: int epoll_mod_wait(int epfd, struct epoll_mod_wait_data *data); struct epoll_mod_wait_data { struct epoll_event *events; int maxevents; struct epoll_mod_cmd { int op, int fd; void *data; } *cmds; int ncmds; int flags; sigset_t *sigmask; }; Commands ops are: EPOLL_CTL_ADD @fd is the fd to modify; @data is epoll_event. EPOLL_CTL_MOD @fd is the fd to modify; @data is epoll_event. EPOLL_CTL_DEL @fd is the fd to modify; @data is epoll_event. EPOLL_CTL_SET_TIMEOUT @fd is ignored, @data is timespec. Clock type and relative/absolute are selected by flags as below. Flags are given to override timeout defaults: EPOLL_FL_MONOTONIC_CLOCK If set, don't use realtime clock, use monotonic clock. EPOLL_FL_ABSOLUTE_TIMEOUT If set, don't use relative timeout, use absolute timeout. Thanks, Fam ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-09 4:49 ` Fam Zheng @ 2015-01-09 5:21 ` Josh Triplett 2015-01-12 8:24 ` Fam Zheng 0 siblings, 1 reply; 15+ messages in thread From: Josh Triplett @ 2015-01-09 5:21 UTC (permalink / raw) To: Fam Zheng Cc: Andy Lutomirski, Miklos Szeredi, linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu On Fri, Jan 09, 2015 at 12:49:08PM +0800, Fam Zheng wrote: > On Thu, 01/08 18:24, Andy Lutomirski wrote: > > On Thu, Jan 8, 2015 at 5:52 PM, Fam Zheng <famz@redhat.com> wrote: > > > On Thu, 01/08 17:28, Andy Lutomirski wrote: > > >> On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz@redhat.com> wrote: > > >> > On Thu, 01/08 09:57, Andy Lutomirski wrote: > > >> >> I'd like to see a more ambitious change, since the timer isn't the > > >> >> only problem like this. Specifically, I'd like a syscall that does a > > >> >> list of epoll-related things and then waits. The list of things could > > >> >> include, at least: > > >> >> > > >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > > >> >> want to turn on and off their requests for events on a somewhat > > >> >> regular basis. > > >> > > > >> > This sounds good to me. > > >> > > > >> >> > > >> >> - timerfd_settime actions: this allows a single syscall to wait and > > >> >> adjust *both* monotonic and real-time wakeups. > > >> > > > >> > I'm not sure, doesn't this break orthogonality between epoll and timerfd? > > >> > > >> Yes. It's not very elegant, and more elegant ideas are welcome. > > > > > > What is the purpose of embedding timerfd operation here? Modifying timerfd > > > for each poll doesn't sound a common pattern to me. > > > > Setting a timeout is definitely a common pattern, hence this thread. > > But the current timeout interface sucks, and people should really use > > absolute time. (My epoll software uses absolute time.) But then > > users need to decide whether to have their timeout based on the > > monotonic clock or the realtime clock (or something else entirely). > > Some bigger programs may want both -- they may have internal events > > queued for certain times and for certain timeouts, and those should > > use realtime and monotonic respectively. Heck, users may also want > > separate slack values on those. > > > > Timerfd is the only thing we have right now that is anywhere near > > flexible enough. Obviously if epoll became fancy enough, then we > > could do away with the timerfd entirely here. > > > > > > > >> > > >> > > > >> >> > > >> >> Would this make sense? It could look like: > > >> >> > > >> >> int epoll_mod_and_pwait(int epfd, > > >> >> struct epoll_event *events, int maxevents, > > >> >> struct epoll_command *commands, int ncommands, > > >> >> const sigset_t *sigmask); > > >> > > > >> > What about flags? > > >> > > > >> > > >> No room. Maybe it should just be a struct for everything instead of > > >> separate args. > > > > > > Also no room for timeout. A single struct sounds the only way to go. > > > > That's what timerfd is for. I think it would be a bit weird to > > support "timeout" and detailed timerfd control. > > I see what you mean. Thanks. > > I still don't like hooking timerfd in the interface. Besides the unclean > interface, it also feels cubersome and overkill to let users setup and add a > dedicated timerfd to implement timeout. > > How about this: > > int epoll_mod_wait(int epfd, struct epoll_mod_wait_data *data); > > struct epoll_mod_wait_data { > struct epoll_event *events; > int maxevents; > struct epoll_mod_cmd { > int op, > int fd; > void *data; > } *cmds; > int ncmds; > int flags; > sigset_t *sigmask; > }; > > Commands ops are: > > EPOLL_CTL_ADD > @fd is the fd to modify; @data is epoll_event. > EPOLL_CTL_MOD > @fd is the fd to modify; @data is epoll_event. > EPOLL_CTL_DEL > @fd is the fd to modify; @data is epoll_event. > > EPOLL_CTL_SET_TIMEOUT > @fd is ignored, @data is timespec. > Clock type and relative/absolute are selected by flags as below. > > Flags are given to override timeout defaults: > EPOLL_FL_MONOTONIC_CLOCK > If set, don't use realtime clock, use monotonic clock. > EPOLL_FL_ABSOLUTE_TIMEOUT > If set, don't use relative timeout, use absolute timeout. I'd suggest using an "int clockid" field instead, like timerfd_settime; even if it only accepts CLOCK_REALTIME and CLOCK_MONOTONIC, if it needs extending in the future, it'd be painful to have to remap new CLOCK_* constants into the EPOLL_FL_* namespace. (I do think dropping timeouts in favor of timerfds makes things more nicely orthogonal, but epoll_wait already has a timeout parameter, so *shrug*.) Also, I think that structure has too many levels of indirection; it'd produce many unnecessary cache misses; considering you're trying to eliminate the overhead of one or two extra syscalls, you don't want to introduce a pile of unnecessary cache misses in the processes. I'd suggest inlining cmds as an array at the end of the structure, and turning "void *data" into an inline epoll_event. (Or, you could use "events" as an in/out parameter.) You could drop EPOLL_CTL_SET_TIMEOUT, and just include a clockid and timespec directly in the top-level structure. And I'd suggest either making flags a top-level parameter or putting it at the start of the structure, to make future extension easier. </bikeshed> - Josh Triplett ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-09 5:21 ` Josh Triplett @ 2015-01-12 8:24 ` Fam Zheng 2015-01-12 10:08 ` Josh Triplett 0 siblings, 1 reply; 15+ messages in thread From: Fam Zheng @ 2015-01-12 8:24 UTC (permalink / raw) To: Josh Triplett Cc: Andy Lutomirski, Miklos Szeredi, linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu On Thu, 01/08 21:21, Josh Triplett wrote: > On Fri, Jan 09, 2015 at 12:49:08PM +0800, Fam Zheng wrote: > > On Thu, 01/08 18:24, Andy Lutomirski wrote: > > > On Thu, Jan 8, 2015 at 5:52 PM, Fam Zheng <famz@redhat.com> wrote: > > > > On Thu, 01/08 17:28, Andy Lutomirski wrote: > > > >> On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz@redhat.com> wrote: > > > >> > On Thu, 01/08 09:57, Andy Lutomirski wrote: > > > >> >> I'd like to see a more ambitious change, since the timer isn't the > > > >> >> only problem like this. Specifically, I'd like a syscall that does a > > > >> >> list of epoll-related things and then waits. The list of things could > > > >> >> include, at least: > > > >> >> > > > >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > > > >> >> want to turn on and off their requests for events on a somewhat > > > >> >> regular basis. > > > >> > > > > >> > This sounds good to me. > > > >> > > > > >> >> > > > >> >> - timerfd_settime actions: this allows a single syscall to wait and > > > >> >> adjust *both* monotonic and real-time wakeups. > > > >> > > > > >> > I'm not sure, doesn't this break orthogonality between epoll and timerfd? > > > >> > > > >> Yes. It's not very elegant, and more elegant ideas are welcome. > > > > > > > > What is the purpose of embedding timerfd operation here? Modifying timerfd > > > > for each poll doesn't sound a common pattern to me. > > > > > > Setting a timeout is definitely a common pattern, hence this thread. > > > But the current timeout interface sucks, and people should really use > > > absolute time. (My epoll software uses absolute time.) But then > > > users need to decide whether to have their timeout based on the > > > monotonic clock or the realtime clock (or something else entirely). > > > Some bigger programs may want both -- they may have internal events > > > queued for certain times and for certain timeouts, and those should > > > use realtime and monotonic respectively. Heck, users may also want > > > separate slack values on those. > > > > > > Timerfd is the only thing we have right now that is anywhere near > > > flexible enough. Obviously if epoll became fancy enough, then we > > > could do away with the timerfd entirely here. > > > > > > > > > > >> > > > >> > > > > >> >> > > > >> >> Would this make sense? It could look like: > > > >> >> > > > >> >> int epoll_mod_and_pwait(int epfd, > > > >> >> struct epoll_event *events, int maxevents, > > > >> >> struct epoll_command *commands, int ncommands, > > > >> >> const sigset_t *sigmask); > > > >> > > > > >> > What about flags? > > > >> > > > > >> > > > >> No room. Maybe it should just be a struct for everything instead of > > > >> separate args. > > > > > > > > Also no room for timeout. A single struct sounds the only way to go. > > > > > > That's what timerfd is for. I think it would be a bit weird to > > > support "timeout" and detailed timerfd control. > > > > I see what you mean. Thanks. > > > > I still don't like hooking timerfd in the interface. Besides the unclean > > interface, it also feels cubersome and overkill to let users setup and add a > > dedicated timerfd to implement timeout. > > > > How about this: > > > > int epoll_mod_wait(int epfd, struct epoll_mod_wait_data *data); > > > > struct epoll_mod_wait_data { > > struct epoll_event *events; > > int maxevents; > > struct epoll_mod_cmd { > > int op, > > int fd; > > void *data; > > } *cmds; > > int ncmds; > > int flags; > > sigset_t *sigmask; > > }; > > > > Commands ops are: > > > > EPOLL_CTL_ADD > > @fd is the fd to modify; @data is epoll_event. > > EPOLL_CTL_MOD > > @fd is the fd to modify; @data is epoll_event. > > EPOLL_CTL_DEL > > @fd is the fd to modify; @data is epoll_event. > > > > EPOLL_CTL_SET_TIMEOUT > > @fd is ignored, @data is timespec. > > Clock type and relative/absolute are selected by flags as below. > > > > Flags are given to override timeout defaults: > > EPOLL_FL_MONOTONIC_CLOCK > > If set, don't use realtime clock, use monotonic clock. > > EPOLL_FL_ABSOLUTE_TIMEOUT > > If set, don't use relative timeout, use absolute timeout. > > I'd suggest using an "int clockid" field instead, like timerfd_settime; > even if it only accepts CLOCK_REALTIME and CLOCK_MONOTONIC, if it needs > extending in the future, it'd be painful to have to remap new CLOCK_* > constants into the EPOLL_FL_* namespace. (I do think dropping timeouts > in favor of timerfds makes things more nicely orthogonal, but epoll_wait > already has a timeout parameter, so *shrug*.) > > Also, I think that structure has too many levels of indirection; it'd > produce many unnecessary cache misses; considering you're trying to > eliminate the overhead of one or two extra syscalls, you don't want to > introduce a pile of unnecessary cache misses in the processes. I'd > suggest inlining cmds as an array at the end of the structure, and > turning "void *data" into an inline epoll_event. (Or, you could use > "events" as an in/out parameter.) > > You could drop EPOLL_CTL_SET_TIMEOUT, and just include a clockid and > timespec directly in the top-level structure. > > And I'd suggest either making flags a top-level parameter or putting it > at the start of the structure, to make future extension easier. Makes sense to me, thanks. Also the number of cmds are undecided until we do a copy_from_user for the header fields before another one for specified number of cmds. So I think it's better to move ncmds and cmds to top level parameter. Fam > > </bikeshed> > > - Josh Triplett ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-12 8:24 ` Fam Zheng @ 2015-01-12 10:08 ` Josh Triplett 2015-01-12 13:23 ` Fam Zheng 0 siblings, 1 reply; 15+ messages in thread From: Josh Triplett @ 2015-01-12 10:08 UTC (permalink / raw) To: Fam Zheng Cc: Andy Lutomirski, Miklos Szeredi, linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu On Mon, Jan 12, 2015 at 04:24:00PM +0800, Fam Zheng wrote: > On Thu, 01/08 21:21, Josh Triplett wrote: > > On Fri, Jan 09, 2015 at 12:49:08PM +0800, Fam Zheng wrote: > > > On Thu, 01/08 18:24, Andy Lutomirski wrote: > > > > On Thu, Jan 8, 2015 at 5:52 PM, Fam Zheng <famz@redhat.com> wrote: > > > > > On Thu, 01/08 17:28, Andy Lutomirski wrote: > > > > >> On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz@redhat.com> wrote: > > > > >> > On Thu, 01/08 09:57, Andy Lutomirski wrote: > > > > >> >> I'd like to see a more ambitious change, since the timer isn't the > > > > >> >> only problem like this. Specifically, I'd like a syscall that does a > > > > >> >> list of epoll-related things and then waits. The list of things could > > > > >> >> include, at least: > > > > >> >> > > > > >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > > > > >> >> want to turn on and off their requests for events on a somewhat > > > > >> >> regular basis. > > > > >> > > > > > >> > This sounds good to me. > > > > >> > > > > > >> >> > > > > >> >> - timerfd_settime actions: this allows a single syscall to wait and > > > > >> >> adjust *both* monotonic and real-time wakeups. > > > > >> > > > > > >> > I'm not sure, doesn't this break orthogonality between epoll and timerfd? > > > > >> > > > > >> Yes. It's not very elegant, and more elegant ideas are welcome. > > > > > > > > > > What is the purpose of embedding timerfd operation here? Modifying timerfd > > > > > for each poll doesn't sound a common pattern to me. > > > > > > > > Setting a timeout is definitely a common pattern, hence this thread. > > > > But the current timeout interface sucks, and people should really use > > > > absolute time. (My epoll software uses absolute time.) But then > > > > users need to decide whether to have their timeout based on the > > > > monotonic clock or the realtime clock (or something else entirely). > > > > Some bigger programs may want both -- they may have internal events > > > > queued for certain times and for certain timeouts, and those should > > > > use realtime and monotonic respectively. Heck, users may also want > > > > separate slack values on those. > > > > > > > > Timerfd is the only thing we have right now that is anywhere near > > > > flexible enough. Obviously if epoll became fancy enough, then we > > > > could do away with the timerfd entirely here. > > > > > > > > > > > > > >> > > > > >> > > > > > >> >> > > > > >> >> Would this make sense? It could look like: > > > > >> >> > > > > >> >> int epoll_mod_and_pwait(int epfd, > > > > >> >> struct epoll_event *events, int maxevents, > > > > >> >> struct epoll_command *commands, int ncommands, > > > > >> >> const sigset_t *sigmask); > > > > >> > > > > > >> > What about flags? > > > > >> > > > > > >> > > > > >> No room. Maybe it should just be a struct for everything instead of > > > > >> separate args. > > > > > > > > > > Also no room for timeout. A single struct sounds the only way to go. > > > > > > > > That's what timerfd is for. I think it would be a bit weird to > > > > support "timeout" and detailed timerfd control. > > > > > > I see what you mean. Thanks. > > > > > > I still don't like hooking timerfd in the interface. Besides the unclean > > > interface, it also feels cubersome and overkill to let users setup and add a > > > dedicated timerfd to implement timeout. > > > > > > How about this: > > > > > > int epoll_mod_wait(int epfd, struct epoll_mod_wait_data *data); > > > > > > struct epoll_mod_wait_data { > > > struct epoll_event *events; > > > int maxevents; > > > struct epoll_mod_cmd { > > > int op, > > > int fd; > > > void *data; > > > } *cmds; > > > int ncmds; > > > int flags; > > > sigset_t *sigmask; > > > }; > > > > > > Commands ops are: > > > > > > EPOLL_CTL_ADD > > > @fd is the fd to modify; @data is epoll_event. > > > EPOLL_CTL_MOD > > > @fd is the fd to modify; @data is epoll_event. > > > EPOLL_CTL_DEL > > > @fd is the fd to modify; @data is epoll_event. > > > > > > EPOLL_CTL_SET_TIMEOUT > > > @fd is ignored, @data is timespec. > > > Clock type and relative/absolute are selected by flags as below. > > > > > > Flags are given to override timeout defaults: > > > EPOLL_FL_MONOTONIC_CLOCK > > > If set, don't use realtime clock, use monotonic clock. > > > EPOLL_FL_ABSOLUTE_TIMEOUT > > > If set, don't use relative timeout, use absolute timeout. > > > > I'd suggest using an "int clockid" field instead, like timerfd_settime; > > even if it only accepts CLOCK_REALTIME and CLOCK_MONOTONIC, if it needs > > extending in the future, it'd be painful to have to remap new CLOCK_* > > constants into the EPOLL_FL_* namespace. (I do think dropping timeouts > > in favor of timerfds makes things more nicely orthogonal, but epoll_wait > > already has a timeout parameter, so *shrug*.) > > > > Also, I think that structure has too many levels of indirection; it'd > > produce many unnecessary cache misses; considering you're trying to > > eliminate the overhead of one or two extra syscalls, you don't want to > > introduce a pile of unnecessary cache misses in the processes. I'd > > suggest inlining cmds as an array at the end of the structure, and > > turning "void *data" into an inline epoll_event. (Or, you could use > > "events" as an in/out parameter.) > > > > You could drop EPOLL_CTL_SET_TIMEOUT, and just include a clockid and > > timespec directly in the top-level structure. > > > > And I'd suggest either making flags a top-level parameter or putting it > > at the start of the structure, to make future extension easier. > > Makes sense to me, thanks. > > Also the number of cmds are undecided until we do a copy_from_user for the > header fields before another one for specified number of cmds. So I think it's > better to move ncmds and cmds to top level parameter. That seems like an even better idea, yeah. - Josh Triplett ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall 2015-01-12 10:08 ` Josh Triplett @ 2015-01-12 13:23 ` Fam Zheng 0 siblings, 0 replies; 15+ messages in thread From: Fam Zheng @ 2015-01-12 13:23 UTC (permalink / raw) To: Josh Triplett Cc: Andy Lutomirski, Miklos Szeredi, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov, Mathieu On Mon, 01/12 02:08, Josh Triplett wrote: > On Mon, Jan 12, 2015 at 04:24:00PM +0800, Fam Zheng wrote: > > On Thu, 01/08 21:21, Josh Triplett wrote: > > > On Fri, Jan 09, 2015 at 12:49:08PM +0800, Fam Zheng wrote: > > > > On Thu, 01/08 18:24, Andy Lutomirski wrote: > > > > > On Thu, Jan 8, 2015 at 5:52 PM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > On Thu, 01/08 17:28, Andy Lutomirski wrote: > > > > > >> On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > > > >> > On Thu, 01/08 09:57, Andy Lutomirski wrote: > > > > > >> >> I'd like to see a more ambitious change, since the timer isn't the > > > > > >> >> only problem like this. Specifically, I'd like a syscall that does a > > > > > >> >> list of epoll-related things and then waits. The list of things could > > > > > >> >> include, at least: > > > > > >> >> > > > > > >> >> - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to > > > > > >> >> want to turn on and off their requests for events on a somewhat > > > > > >> >> regular basis. > > > > > >> > > > > > > >> > This sounds good to me. > > > > > >> > > > > > > >> >> > > > > > >> >> - timerfd_settime actions: this allows a single syscall to wait and > > > > > >> >> adjust *both* monotonic and real-time wakeups. > > > > > >> > > > > > > >> > I'm not sure, doesn't this break orthogonality between epoll and timerfd? > > > > > >> > > > > > >> Yes. It's not very elegant, and more elegant ideas are welcome. > > > > > > > > > > > > What is the purpose of embedding timerfd operation here? Modifying timerfd > > > > > > for each poll doesn't sound a common pattern to me. > > > > > > > > > > Setting a timeout is definitely a common pattern, hence this thread. > > > > > But the current timeout interface sucks, and people should really use > > > > > absolute time. (My epoll software uses absolute time.) But then > > > > > users need to decide whether to have their timeout based on the > > > > > monotonic clock or the realtime clock (or something else entirely). > > > > > Some bigger programs may want both -- they may have internal events > > > > > queued for certain times and for certain timeouts, and those should > > > > > use realtime and monotonic respectively. Heck, users may also want > > > > > separate slack values on those. > > > > > > > > > > Timerfd is the only thing we have right now that is anywhere near > > > > > flexible enough. Obviously if epoll became fancy enough, then we > > > > > could do away with the timerfd entirely here. > > > > > > > > > > > > > > > > >> > > > > > >> > > > > > > >> >> > > > > > >> >> Would this make sense? It could look like: > > > > > >> >> > > > > > >> >> int epoll_mod_and_pwait(int epfd, > > > > > >> >> struct epoll_event *events, int maxevents, > > > > > >> >> struct epoll_command *commands, int ncommands, > > > > > >> >> const sigset_t *sigmask); > > > > > >> > > > > > > >> > What about flags? > > > > > >> > > > > > > >> > > > > > >> No room. Maybe it should just be a struct for everything instead of > > > > > >> separate args. > > > > > > > > > > > > Also no room for timeout. A single struct sounds the only way to go. > > > > > > > > > > That's what timerfd is for. I think it would be a bit weird to > > > > > support "timeout" and detailed timerfd control. > > > > > > > > I see what you mean. Thanks. > > > > > > > > I still don't like hooking timerfd in the interface. Besides the unclean > > > > interface, it also feels cubersome and overkill to let users setup and add a > > > > dedicated timerfd to implement timeout. > > > > > > > > How about this: > > > > > > > > int epoll_mod_wait(int epfd, struct epoll_mod_wait_data *data); > > > > > > > > struct epoll_mod_wait_data { > > > > struct epoll_event *events; > > > > int maxevents; > > > > struct epoll_mod_cmd { > > > > int op, > > > > int fd; > > > > void *data; > > > > } *cmds; > > > > int ncmds; > > > > int flags; > > > > sigset_t *sigmask; > > > > }; > > > > > > > > Commands ops are: > > > > > > > > EPOLL_CTL_ADD > > > > @fd is the fd to modify; @data is epoll_event. > > > > EPOLL_CTL_MOD > > > > @fd is the fd to modify; @data is epoll_event. > > > > EPOLL_CTL_DEL > > > > @fd is the fd to modify; @data is epoll_event. > > > > > > > > EPOLL_CTL_SET_TIMEOUT > > > > @fd is ignored, @data is timespec. > > > > Clock type and relative/absolute are selected by flags as below. > > > > > > > > Flags are given to override timeout defaults: > > > > EPOLL_FL_MONOTONIC_CLOCK > > > > If set, don't use realtime clock, use monotonic clock. > > > > EPOLL_FL_ABSOLUTE_TIMEOUT > > > > If set, don't use relative timeout, use absolute timeout. > > > > > > I'd suggest using an "int clockid" field instead, like timerfd_settime; > > > even if it only accepts CLOCK_REALTIME and CLOCK_MONOTONIC, if it needs > > > extending in the future, it'd be painful to have to remap new CLOCK_* > > > constants into the EPOLL_FL_* namespace. (I do think dropping timeouts > > > in favor of timerfds makes things more nicely orthogonal, but epoll_wait > > > already has a timeout parameter, so *shrug*.) > > > > > > Also, I think that structure has too many levels of indirection; it'd > > > produce many unnecessary cache misses; considering you're trying to > > > eliminate the overhead of one or two extra syscalls, you don't want to > > > introduce a pile of unnecessary cache misses in the processes. I'd > > > suggest inlining cmds as an array at the end of the structure, and > > > turning "void *data" into an inline epoll_event. (Or, you could use > > > "events" as an in/out parameter.) > > > > > > You could drop EPOLL_CTL_SET_TIMEOUT, and just include a clockid and > > > timespec directly in the top-level structure. > > > > > > And I'd suggest either making flags a top-level parameter or putting it > > > at the start of the structure, to make future extension easier. > > > > Makes sense to me, thanks. > > > > Also the number of cmds are undecided until we do a copy_from_user for the > > header fields before another one for specified number of cmds. So I think it's > > better to move ncmds and cmds to top level parameter. > > That seems like an even better idea, yeah. > One more question I'm not sure regarding the semantics: should we make the syscall atomic? I.e if one of the cmds failed even before wait, or if all the cmds are executed, but the eventual wait failed, should we revert the commands' effect? Or return overall result and result for each cmd if failed? Or just claim that it's possible for a first few cmds to be effective even on error? It will be way more complicated to make it atomic, so I'd like to be clear what we should do. Ideas? Thanks, Fam ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2015-01-12 13:23 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <1420705550-24245-1-git-send-email-famz@redhat.com> 2015-01-08 9:12 ` [PATCH 0/3] epoll: Add epoll_pwait1 syscall Miklos Szeredi [not found] ` <1420708372.18399.15.camel-AlSwsSmVLrQ@public.gmane.org> 2015-01-08 11:07 ` Michael Kerrisk (man-pages) 2015-01-08 17:57 ` Andy Lutomirski [not found] ` <CALCETrVyPij1Zxwmw7p06UrZjoyYDXqEjmxyQ-KJ8Y7dx7mL3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-01-08 18:42 ` josh-iaAMLnmF4UmaiuxdJuQwMA 2015-01-08 19:31 ` Alexei Starovoitov 2015-01-08 19:42 ` Andy Lutomirski 2015-01-09 1:25 ` Fam Zheng [not found] ` <20150109011608.GA2924-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org> 2015-01-09 1:28 ` Andy Lutomirski 2015-01-09 1:52 ` Fam Zheng [not found] ` <20150109015248.GA5034-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org> 2015-01-09 2:24 ` Andy Lutomirski 2015-01-09 4:49 ` Fam Zheng 2015-01-09 5:21 ` Josh Triplett 2015-01-12 8:24 ` Fam Zheng 2015-01-12 10:08 ` Josh Triplett 2015-01-12 13:23 ` Fam Zheng
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).