From: Miklos Szeredi <mszeredi@suse.cz>
To: Fam Zheng <famz@redhat.com>
Cc: linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Juri Lelli <juri.lelli@gmail.com>, Zach Brown <zab@zabbo.net>,
	David Drysdale <drysdale@google.com>,
	Kees Cook <keescook@chromium.org>,
	Alexei Starovoitov <ast@plumgrid.com>,
	David Herrmann <dh.herrmann@gmail.com>,
	Dario Faggioli <raistlin@linux.it>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Peter Zijlstra <peterz@infradead.org>,
	Vivek Goyal <vgoyal@redhat.com>,
	Mike Frysinger <vapier@gentoo.org>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Oleg Nesterov <oleg@redhat.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Fabian Frederick <fabf@skynet.be>,
	Josh Triplett <josh@joshtriplett.org>,
	"David S. Miller" <davem@davemloft.net>,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Subject: Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
Date: Thu, 08 Jan 2015 10:12:52 +0100
Message-ID: <1420708372.18399.15.camel@suse.cz>
In-Reply-To: <1420705550-24245-1-git-send-email-famz@redhat.com>

On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
> Applications could use the epoll interface when they need to poll a large
> number of files in their main loops, to achieve better performance than
> ppoll(2). There is one concern, though: epoll only takes timeout parameters in
> milliseconds, rather than nanoseconds.
> 
> That is a drawback we should address. For a real case in QEMU, we run into a
> scalability issue with ppoll(2) when many devices are attached to the guest, in
> which case many host fds, such as virtual disk images and sockets, need to be
> polled by the main loop. As a result we are looking at switching to epoll, but
> the coarse timeout precision is a problem, as explained below.
> 
> We're already using prctl(PR_SET_TIMERSLACK, 1), which is necessary to
> implement timers in the main loop, and we call ppoll(2) with the next firing
> timer as the timeout, so when ppoll(2) returns, we know that we have more work
> to do (either handling IO events or firing a timer callback). This is natural
> and efficient, except that ppoll(2) itself is slow.
> 
> Now we want to switch to epoll to speed up the polling. However, the timer
> slack setting will be effectively undone, because we will have to round the
> timeout up to milliseconds to honor the timer contract. Consequently, this
> hurts the general responsiveness.
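
To make the rounding concrete, here is a minimal sketch of the conversion an
epoll_wait(2) caller would need, assuming the deadline is held as a struct
timespec. epoll_wait(2) takes an int timeout in milliseconds, so anything finer
has to be rounded up to avoid returning before the timer is due. The helper
name timespec_to_epoll_ms is made up for this illustration and is not part of
the patch or of QEMU.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: convert a relative timespec into the millisecond
 * timeout accepted by epoll_wait(2), rounding up so the wait never ends
 * before the requested time has elapsed. */
int timespec_to_epoll_ms(const struct timespec *ts)
{
    if (ts == NULL)
        return -1;                  /* -1 makes epoll_wait block indefinitely */
    assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);
    if (ts->tv_sec >= INT_MAX / 1000)
        return INT_MAX;             /* clamp instead of overflowing int */
    /* Whole seconds plus the nanosecond part rounded up to a full ms. */
    long ms = (long)ts->tv_sec * 1000 + (ts->tv_nsec + 999999) / 1000000;
    return (int)ms;
}
```

With this conversion a 1 ns timeout becomes a 1 ms wait, which is the loss of
responsiveness described in the quoted paragraph.
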
> 
> Note: there are two alternatives that don't require changing the kernel:
> 
> 1) A leading ppoll(2) on the epollfd alone, with a nanosecond timeout. It won't
> be slow, as only one fd is polled, so there is no more scalability issue. If
> ppoll(2)'s return value says there are events, we then call epoll_wait(2) with
> timeout=0; otherwise there can't be events for the epoll, so we skip the
> epoll_wait and just continue with other work.
> 
> 2) Set up a timerfd and add it to the epoll, then call
> epoll_wait(..., timeout=-1). The timerfd should force epoll_wait to return when
> it expires, even if no other events have arrived. This inherently gives us the
> timerfd's precision. Note that the desired timeout is different for each poll
> because the next timer is different, so before each epoll_wait(2) there will
> be a timerfd_settime syscall to set it to a proper value.
> 
> Unfortunately, both approaches require one more syscall per iteration compared
> to the original single ppoll(2), the cost of which is not negligible when we
> talk about nanosecond granularity.
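
For readers who want to see the two userspace workarounds side by side, the
following is a rough sketch of both, under the assumption that the epoll
instance already exists and, for the second variant, that a timerfd has already
been created and added to it with EPOLLIN. The function names are invented for
this example. Each wrapper issues two syscalls per iteration, which is the
overhead the quoted text objects to.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Alternative 1 (hypothetical wrapper): ppoll(2) on the epoll fd alone with
 * a nanosecond timeout, then a non-blocking epoll_wait(2) to fetch events. */
int wait_ppoll_then_epoll(int epfd, struct epoll_event *evs, int maxevs,
                          const struct timespec *timeout)
{
    struct pollfd pfd = { .fd = epfd, .events = POLLIN };
    int n = ppoll(&pfd, 1, timeout, NULL);       /* syscall 1 */
    if (n <= 0)
        return n;                                /* 0: timed out, -1: error */
    return epoll_wait(epfd, evs, maxevs, 0);     /* syscall 2 */
}

/* Alternative 2 (hypothetical wrapper): re-arm a timerfd that is already
 * registered in epfd, then block with no epoll timeout. The caller treats
 * an event on tfd as "the timeout expired". */
int wait_timerfd_then_epoll(int epfd, int tfd, struct epoll_event *evs,
                            int maxevs, const struct timespec *timeout)
{
    /* An all-zero it_value would disarm the timer, so poll instead. */
    if (timeout->tv_sec == 0 && timeout->tv_nsec == 0)
        return epoll_wait(epfd, evs, maxevs, 0);

    struct itimerspec its = { .it_value = *timeout };  /* one-shot, relative */
    if (timerfd_settime(tfd, 0, &its, NULL) < 0)       /* syscall 1 */
        return -1;
    return epoll_wait(epfd, evs, maxevs, -1);          /* syscall 2 */
}
```

A single syscall that accepts a struct timespec directly would collapse either
wrapper back into one call.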

Please consider adding a "flags" argument to the new syscall (and
returning EINVAL if it is non-zero).  See this article, which shows that
extended syscalls almost always want flags, and that they often get them
only on the second try:

http://lwn.net/Articles/585415/
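
As an illustration of the pattern being suggested, a flags check could look
roughly like the sketch below. This is plain userspace C written for this
note, not code from the patch series: the macro and function names are
hypothetical, and the set of valid flags is assumed to be empty for now.

```c
#include <errno.h>

/* Hypothetical: no flag bits are defined yet, so the valid mask is empty.
 * Future extensions would add bits here without needing a new syscall. */
#define EPOLL_PWAIT1_VALID_FLAGS 0u

/* Reject any flag bit the implementation does not know about, so that
 * userspace cannot come to depend on unknown bits being ignored. */
int epoll_pwait1_check_flags(unsigned int flags)
{
    if (flags & ~EPOLL_PWAIT1_VALID_FLAGS)
        return -EINVAL;
    return 0;
}
```

Because unknown bits fail with EINVAL from the start, a later kernel can give
one of them a meaning, and callers can probe for support by trying the flag.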

Thanks,
Miklos

P.S. stray apostrophes in To: and Cc: lines seem to be causing trouble.

