linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RESEND PATCH 0/3] epoll: Add epoll_pwait1 syscall
@ 2015-01-08  9:16 Fam Zheng
  2015-01-08  9:16 ` [RESEND PATCH 1/3] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Fam Zheng @ 2015-01-08  9:16 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro, Andrew Morton,
	Miklos Szeredi, Juri Lelli, Zach Brown, David Drysdale, Fam Zheng,
	Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli,
	Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger,
	Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers, Fabian Frederick, Josh

[Resend because my script screwed the recipient format, sorry for the noise.]

Applications could use epoll interface when then need to poll a big number of
files in their main loops, to achieve better performance than ppoll(2). Except
for one concern: epoll only takes timeout parameters in microseconds, rather
than nanoseconds.

That is a drawback we should address. For a real case in QEMU, we run into a
scalability issue with ppoll(2) when many devices are attached to guest, in
which case many host fds, such as virtual disk images and sockets, need to be
polled by the main loop. As a result we are looking at switching to epoll, but
the coarse timeout precision is a trouble, as explained below. 

We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
timers in the main loop; and we call ppoll(2) with the next firing timer as
timeout, so when ppoll(2) returns, we know that we have more work to do (either
handling IO events, or fire a timer callback). This is natual and efficient,
except that ppoll(2) itself is slow.

Now that we want to switch to epoll, to speed up the polling. However the timer
slack setting will be effectively undone, because that way we will have to
round up the timeout to microseconds honoring timer contract. But consequently,
this hurts the general responsiveness.

Note: there are two alternatives, without changing kernel:

1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
be slow as one fd is polled. No more scalability issue. And if there are
events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
timeout=0; otherwise, there can't be events for the epoll, skip the following
epoll_wait and just continue with other work.

2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
The timerfd will hopefully force epoll_wait to return when it timeouts, even if
no other events have arrived. This will inheritly give us timerfd's precision.
Note that for each poll, the desired timeout is different because the next
timer is different, so that, before each epoll_wait(2), there will be a
timerfd_settime syscall to set it to a proper value.

Unfortunately, both approaches require one more syscall per iteration, compared
to the original single ppoll(2), cost of which is unneglectable when we talk
about nanosecond granularity.

Fam


Fam Zheng (3):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Add implementation for epoll_pwait1
  x86: hook up epoll_pwait1 syscall

 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/eventpoll.c                   | 160 +++++++++++++++++++++++----------------
 include/linux/syscalls.h         |   4 +
 kernel/sys_ni.c                  |   3 +
 5 files changed, 103 insertions(+), 66 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [RESEND PATCH 1/3] epoll: Extract epoll_wait_do and epoll_pwait_do
@ 2015-01-08 17:51 Alexei Starovoitov
  0 siblings, 0 replies; 7+ messages in thread
From: Alexei Starovoitov @ 2015-01-08 17:51 UTC (permalink / raw)
  To: Fam Zheng
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Miklos Szeredi, Juri Lelli,
	Zach Brown, David Drysdale, Kees Cook, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers, Fabian Frederick, Josh Triplett

On Thu, Jan 8, 2015 at 1:16 AM, Fam Zheng <famz@redhat.com> wrote:
> +       if (!timeout || (timeout->tv_nsec == 0 && timeout->tv_sec == 0)) {
..
> +       } else if (timeout->tv_nsec >= 0 && timeout->tv_sec >= 0) {

the check for tv_nsec is not enough, which points
to the fragility of passing user timespec around.
I think it would be safer and cleaner to do it futex style:
  if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
      return -EFAULT;
  if (!timespec_valid(&ts))
      return -EINVAL;
  t = timespec_to_ktime(ts);
and then only pass ktime_t around.

> +               struct timespec end_time = ep_set_mstimeout(timeout);

the name is now wrong, since it's no longer MStimeout.

I think handling of compat is missing as well.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-01-08 17:51 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-01-08  9:16 [RESEND PATCH 0/3] epoll: Add epoll_pwait1 syscall Fam Zheng
2015-01-08  9:16 ` [RESEND PATCH 1/3] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
2015-01-08  9:16 ` [RESEND PATCH 2/3] epoll: Add implementation for epoll_pwait1 Fam Zheng
2015-01-08 11:10   ` Paolo Bonzini
     [not found]     ` <54AE65BB.1020707-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-08 11:48       ` Michael Kerrisk (man-pages)
2015-01-08  9:16 ` [RESEND PATCH 3/3] x86: hook up epoll_pwait1 syscall Fam Zheng
  -- strict thread matches above, loose matches on Subject: below --
2015-01-08 17:51 [RESEND PATCH 1/3] epoll: Extract epoll_wait_do and epoll_pwait_do Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).