linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Madars Vitolins <m@silodev.com>
To: Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
	viro-rfM+Q5joDG/XmaaqVzeoHQ@public.gmane.org,
	normalperson-rMlxZR9MS24@public.gmane.org,
	corbet-T1hC0tSOHrs@public.gmane.org,
	luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org,
	hagen-GvnIQ6b/HdU@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
Date: Thu, 04 Feb 2016 23:44:05 +0200	[thread overview]
Message-ID: <00ac7df620d4d0500b18af4a30034815@silodev.com> (raw)
In-Reply-To: <1454600344-23655-1-git-send-email-jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>

Hi Jason,


Just run off the original tests with this patch (eventpoll.c from 
4.5-rc2 + patch bellow). Got the same good results, no regression.

$ time ./bankcl
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 101528948.40 USD

real	0m41.826s
user	0m29.513s
sys	0m6.490s


Test case: 
https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

PS,

Original cases 0m24.953s vs 0m41.826s now probably is related with my pc 
setup. As I just now re-run test with original patch, got the same ~41 
sec.

So I am fine with this patch!

Thanks,
Madars


Jason Baron @ 2016-02-04 17:39 rakstīja:
> In the current implementation of the EPOLLEXCLUSIVE flag (added for 
> 4.5-rc1),
> if epoll waiters create different POLL* sets and register them as 
> exclusive
> against the same target fd, the current implementation will stop waking 
> any
> further waiters once it finds the first idle waiter. This means that 
> waiters
> could miss wakeups in certain cases.
> 
> For example, when we wake up a pipe for reading we do:
> wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);
> So if one epoll set or epfd is added to pipe p with POLLIN and a second 
> set
> epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the 
> wakeup
> since the current implementation will stop after it finds any 
> intersection
> of events with a waiter that is blocked in epoll_wait().
> 
> We could potentially address this by requiring all epoll waiters that 
> are
> added to p be required to pass the same set of POLL* events. IE the 
> first
> EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* 
> flags to
> be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I 
> think
> it might be somewhat confusing interface as we would have to reference 
> count
> the number of users for that set, and so userspace would have to keep 
> track
> of that count, or we would need a more involved interface. It also adds 
> some
> shared state that we'd have store somewhere. I don't think anybody will 
> want
> to bloat __wait_queue_head for this.
> 
> I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE 
> such
> that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way 
> if
> the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit 
> the
> first idle waiter that specifies the EPOLLIN bit, since any remaining 
> waiters
> that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we 
> can do
> the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. 
> If both
> 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least 
> one
> example of this I saw in fs/pipe.c), then we just wake the entire 
> exclusive
> list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any
> performance critical path, so I think that's ok (in fs/pipe.c its in
> pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by 
> default
> in any exclusive set. Thus, the user can specify EPOLLERR and/or 
> EPOLLHUP but
> is not required to do so.
> 
> Since epoll waiters may be interested in other events as well besides 
> EPOLLIN,
> EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 
> 'dup' call
> on the target fd and adding that as one normally would with 
> EPOLL_CTL_ADD. Since
> I think that the POLLIN and POLLOUT events are what we are interest in
> balancing,
> I think that the 'dup' thing could perhaps be added to only one of the 
> waiter
> threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP 
> should
> be sufficient for the majority of use-cases.
> 
> Since EPOLLEXCLUSIVE is intended to be used with a target fd shared 
> among
> multiple epfds, where between 1 and n of the epfds may receive an 
> event, it
> does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would 
> get an
> event. Thus, it is not allowed to be specified in conjunction with
> EPOLLEXCLUSIVE.
> 
> EPOLL_CTL_MOD is also not allowed if the fd was previously added as
> EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
> interesting, but this could be relaxed at some further point.
> 
> Signed-off-by: Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index ae1dbcf..cde6074 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -94,6 +94,11 @@
>  /* Epoll private bits inside the event mask */
>  #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | 
> EPOLLEXCLUSIVE)
> 
> +#define EPOLLINOUT_BITS (POLLIN | POLLOUT)
> +
> +#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | 
> \
> +				EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE)
> +
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> 
> @@ -1068,7 +1073,22 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	 * wait list.
>  	 */
>  	if (waitqueue_active(&ep->wq)) {
> -		ewake = 1;
> +		if ((epi->event.events & EPOLLEXCLUSIVE) &&
> +					!((unsigned long)key & POLLFREE)) {
> +			switch ((unsigned long)key & EPOLLINOUT_BITS) {
> +			case POLLIN:
> +				if (epi->event.events & POLLIN)
> +					ewake = 1;
> +				break;
> +			case POLLOUT:
> +				if (epi->event.events & POLLOUT)
> +					ewake = 1;
> +				break;
> +			case 0:
> +				ewake = 1;
> +				break;
> +			}
> +		}
>  		wake_up_locked(&ep->wq);
>  	}
>  	if (waitqueue_active(&ep->poll_wait))
> @@ -1875,9 +1895,13 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
>  	 * Also, we do not currently supported nested exclusive wakeups.
>  	 */
> -	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
> -		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
> -		goto error_tgt_fput;
> +	if (epds.events & EPOLLEXCLUSIVE) {
> +		if (op == EPOLL_CTL_MOD)
> +			goto error_tgt_fput;
> +		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
> +				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
> +			goto error_tgt_fput;
> +	}
> 
>  	/*
>  	 * At this point it is safe to assume that the "private_data" 
> contains
> @@ -1950,8 +1974,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  		break;
>  	case EPOLL_CTL_MOD:
>  		if (epi) {
> -			epds.events |= POLLERR | POLLHUP;
> -			error = ep_modify(ep, epi, &epds);
> +			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
> +				epds.events |= POLLERR | POLLHUP;
> +				error = ep_modify(ep, epi, &epds);
> +			}
>  		} else
>  			error = -ENOENT;
>  		break;

  parent reply	other threads:[~2016-02-04 21:44 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-02-04 15:39 [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT Jason Baron
     [not found] ` <1454600344-23655-1-git-send-email-jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>
2016-02-04 21:44   ` Madars Vitolins [this message]
2016-02-04 22:59     ` Andrew Morton
     [not found]       ` <20160204145952.53f536f5d4cdef96844104b7-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2016-02-05  9:49         ` Madars Vitolins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=00ac7df620d4d0500b18af4a30034815@silodev.com \
    --to=m@silodev.com \
    --cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
    --cc=corbet-T1hC0tSOHrs@public.gmane.org \
    --cc=hagen-GvnIQ6b/HdU@public.gmane.org \
    --cc=jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org \
    --cc=linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org \
    --cc=mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    --cc=mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    --cc=normalperson-rMlxZR9MS24@public.gmane.org \
    --cc=peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org \
    --cc=torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
    --cc=viro-rfM+Q5joDG/XmaaqVzeoHQ@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).