All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Jason Baron <jbaron@akamai.com>
Cc: akpm@linux-foundation.org, normalperson@yhbt.net,
	nzimmer@sgi.com, viro@zeniv.linux.org.uk, nelhage@nelhage.com,
	davidel@xmailserver.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu
Date: Thu, 24 Oct 2013 01:52:39 -0700	[thread overview]
Message-ID: <20131024085238.GA4834@linux.vnet.ibm.com> (raw)
In-Reply-To: <788e40eac585039ded51e6d70856c8d432b6e97b.1380645717.git.jbaron@akamai.com>

On Tue, Oct 01, 2013 at 05:08:10PM +0000, Jason Baron wrote:
> Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
> converting the file->f_ep_links list into an rcu one.  In this way, we can
> traverse the epoll network on the add path in parallel with deletes. 
> Since deletes can't create loops or worse wakeup paths, this is safe.
> 
> This patch in combination with the patch "epoll: Do not take global 'epmutex'
> for simple topologies", shows a dramatic performance improvement in
> scalability for SPECjbb.

A few questions and comments below.

							Thanx, Paul

> Signed-off-by: Jason Baron <jbaron@akamai.com>
> Tested-by: Nathan Zimmer <nzimmer@sgi.com>
> Cc: Eric Wong <normalperson@yhbt.net>
> Cc: Nelson Elhage <nelhage@nelhage.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  fs/eventpoll.c |   65 +++++++++++++++++++++++++++++------------------
>  1 file changed, 41 insertions(+), 24 deletions(-)
> 
> diff -puN fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu fs/eventpoll.c
> --- a/fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu
> +++ a/fs/eventpoll.c
> @@ -42,6 +42,7 @@
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
>  #include <linux/compat.h>
> +#include <linux/rculist.h>
> 
>  /*
>   * LOCKING:
> @@ -134,8 +135,12 @@ struct nested_calls {
>   * of these on a server and we do not want this to take another cache line.
>   */
>  struct epitem {
> -	/* RB tree node used to link this structure to the eventpoll RB tree */
> -	struct rb_node rbn;
> +	union {
> +		/* RB tree node links this structure to the eventpoll RB tree */
> +		struct rb_node rbn;

And RCU readers never need to use rbn, right?  (That appears to be the case,
so good.)

> +		/* Used to free the struct epitem */
> +		struct rcu_head rcu;
> +	};
> 
>  	/* List header used to link this structure to the eventpoll ready list */
>  	struct list_head rdllink;
> @@ -166,6 +171,9 @@ struct epitem {
> 
>  	/* The structure that describe the interested events and the source fd */
>  	struct epoll_event event;
> +
> +	/* The fllink is in use. Since rcu can't do 'list_del_init()' */
> +	int on_list;
>  };
> 
>  /*
> @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eve
>  	return error;
>  }
> 
> +static void epi_rcu_free(struct rcu_head *head)
> +{
> +	struct epitem *epi = container_of(head, struct epitem, rcu);
> +	kmem_cache_free(epi_cache, epi);

Sigh.  I suppose that I need to create a kmem_cache_free_rcu() at some
point...

> +}
> +
>  /*
>   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
>   * all the associated resources. Must be called with "mtx" held.
> @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *e
> 
>  	/* Remove the current item from the list of epoll hooks */
>  	spin_lock(&file->f_lock);
> -	if (ep_is_linked(&epi->fllink))
> -		list_del_init(&epi->fllink);
> +	if (epi->on_list) {
> +		list_del_rcu(&epi->fllink);

OK, here is the list_del_rcu().  It does seem to precede the call_rcu()
below, as it must.  Of course, if !epi->onlist, you could just free
it without going through call_rcu(), but perhaps that optimization is
not worthwhile.

> +		epi->on_list = 0;
> +	}
>  	spin_unlock(&file->f_lock);
> 
>  	rb_erase(&epi->rbn, &ep->rbr);
> @@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *e
>  	spin_unlock_irqrestore(&ep->lock, flags);
> 
>  	wakeup_source_unregister(ep_wakeup_source(epi));
> -
> -	/* At this point it is safe to free the eventpoll item */
> -	kmem_cache_free(epi_cache, epi);
> +	/*
> +	 * At this point it is safe to free the eventpoll item. Use the union
> +	 * field epi->rcu, since we are trying to minimize the size of
> +	 * 'struct epitem'. The 'rbn' field is no longer in use. Protected by
> +	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
> +	 * use of the rbn field.
> +	 */
> +	call_rcu(&epi->rcu, epi_rcu_free);

And here is the call_rcu(), so good.  At least assuming there are no other
RCU-reader-accessible lists that this thing is on.  (If there are, then it
needs to be removed from these lists as well.)

> 
>  	atomic_long_dec(&ep->user->epoll_watches);
> 
> @@ -873,7 +894,6 @@ static const struct file_operations even
>   */
>  void eventpoll_release_file(struct file *file)
>  {
> -	struct list_head *lsthead = &file->f_ep_links;
>  	struct eventpoll *ep;
>  	struct epitem *epi;
> 
> @@ -891,17 +911,12 @@ void eventpoll_release_file(struct file
>  	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
>  	 */
>  	mutex_lock(&epmutex);
> -
> -	while (!list_empty(lsthead)) {
> -		epi = list_first_entry(lsthead, struct epitem, fllink);
> -
> +	list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
>  		ep = epi->ep;
> -		list_del_init(&epi->fllink);
>  		mutex_lock_nested(&ep->mtx, 0);
>  		ep_remove(ep, epi);
>  		mutex_unlock(&ep->mtx);
>  	}
> -
>  	mutex_unlock(&epmutex);
>  }
> 
> @@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void
>  	struct file *child_file;
>  	struct epitem *epi;
> 
> -	list_for_each_entry(epi, &file->f_ep_links, fllink) {
> +	/* CTL_DEL can remove links here, but that can't increase our count */
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
>  		child_file = epi->ep->file;
>  		if (is_file_epoll(child_file)) {
>  			if (list_empty(&child_file->f_ep_links)) {

Hmmm...  It looks like ep_call_nested() acquires a global spinlock.
Perhaps that is fixed in a later patch in this series?  Otherwise,
this will be a bottleneck on large systems.

> @@ -1161,6 +1178,7 @@ static int reverse_path_check_proc(void
>  				"file is not an ep!\n");
>  		}
>  	}
> +	rcu_read_unlock();
>  	return error;
>  }
> 
> @@ -1255,6 +1273,7 @@ static int ep_insert(struct eventpoll *e
>  	epi->event = *event;
>  	epi->nwait = 0;
>  	epi->next = EP_UNACTIVE_PTR;
> +	epi->on_list = 0;
>  	if (epi->event.events & EPOLLWAKEUP) {
>  		error = ep_create_wakeup_source(epi);
>  		if (error)
> @@ -1287,7 +1306,8 @@ static int ep_insert(struct eventpoll *e
> 
>  	/* Add the current item to the list of active epoll hook for this file */
>  	spin_lock(&tfile->f_lock);

Looks like ->f_lock protects updates to the RCU-protected structure.

> -	list_add_tail(&epi->fllink, &tfile->f_ep_links);
> +	list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
> +	epi->on_list = 1;
>  	spin_unlock(&tfile->f_lock);
> 
>  	/*
> @@ -1328,8 +1348,8 @@ static int ep_insert(struct eventpoll *e
> 
>  error_remove_epi:
>  	spin_lock(&tfile->f_lock);

More evidence in favor of ->f_lock protecting updates to the RCU-protected
data structure.

> -	if (ep_is_linked(&epi->fllink))
> -		list_del_init(&epi->fllink);
> +	if (epi->on_list)
> +		list_del_rcu(&epi->fllink);

OK, this list_del_rcu() call is for handling failed inserts.

But if we had to use list_del_rcu(), don't we also need to wait a grace
period before freeing the item?  Or does rb_erase() somehow do that?

Or has placing it ->on_list somehow avoided exposing it to readers?
(Of course, it this is the case, the list_del_rcu() could remain
a list_del_init()...)

>  	spin_unlock(&tfile->f_lock);
> 
>  	rb_erase(&epi->rbn, &ep->rbr);
> @@ -1846,15 +1866,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in
>  	 * and hang them on the tfile_check_list, so we can check that we
>  	 * haven't created too many possible wakeup paths.
>  	 *
> -	 * We need to hold the epmutex across both ep_insert and ep_remove
> -	 * b/c we want to make sure we are looking at a coherent view of
> -	 * epoll network.
> +	 * We need to hold the epmutex across ep_insert to prevent
> +	 * multple adds from creating loops in parallel.
>  	 */
> -	if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
> +	if (op == EPOLL_CTL_ADD) {
>  		mutex_lock(&epmutex);
>  		did_lock_epmutex = 1;
> -	}
> -	if (op == EPOLL_CTL_ADD) {
>  		if (is_file_epoll(tf.file)) {
>  			error = -ELOOP;
>  			if (ep_loop_check(ep, tf.file) != 0) {
> _
> 

  reply	other threads:[~2013-10-24  8:52 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-01 17:08 [PATCH 0/2 v2] epoll: reduce 'epmutex' lock contention Jason Baron
2013-10-01 17:08 ` [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu Jason Baron
2013-10-24  8:52   ` Paul E. McKenney [this message]
2013-10-24 14:56     ` Jason Baron
2013-10-01 17:08 ` [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies Jason Baron
2013-10-03 21:50   ` Andrew Morton
2013-10-04 15:16     ` Jason Baron
2013-10-04 17:54       ` Nathan Zimmer
2013-10-24 10:09   ` Paul E. McKenney
2013-10-24 16:00     ` Jason Baron

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131024085238.GA4834@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=davidel@xmailserver.org \
    --cc=jbaron@akamai.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nelhage@nelhage.com \
    --cc=normalperson@yhbt.net \
    --cc=nzimmer@sgi.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.