* [PATCH 0/2 v2] epoll: reduce 'epmutex' lock contention
From: Jason Baron @ 2013-10-01 17:08 UTC
To: akpm
Cc: normalperson, nzimmer, viro, nelhage, davidel, linux-kernel, linux-fsdevel

Hi,

Updated series, using a 'union' for the rcu callback.

Nathan Zimmer found that once we get over 10+ cpus, the scalability of SPECjbb falls over due to contention on the global 'epmutex', which is taken on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.

Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by using rcu to guard against any concurrent traversals.

Patch #2 removes the 'epmutex' lock from EPOLL_CTL_ADD operations for simple topologies, i.e. when adding a link from an epoll file descriptor to a wakeup source, where the epoll file descriptor is not nested.

Performance of SPECjbb improves considerably, from Nathan Zimmer's testing.

Thread: http://marc.info/?l=linux-kernel&m=137908766013329&w=2

"
On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark went from scaling well on 10 sockets to scaling well on just over 40 sockets.

I should also note the system responsiveness of various commands was quite improved when under the full load of the benchmark.

...

Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck.
"

Thanks,

-Jason

Changes in v2:
-use a union for rcu callback and drop patch #3

v1: http://marc.info/?l=linux-kernel&m=138057995316598&w=2

Jason Baron (2):
  epoll: optimize EPOLL_CTL_DEL using rcu
  epoll: Do not take global 'epmutex' for simple topologies

 fs/eventpoll.c | 153 +++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 106 insertions(+), 47 deletions(-)

--
1.8.2
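For readers who want to see the contention pattern without running SPECjbb, the sketch below is a minimal, assumed illustration (it is not from this series and not from SPECjbb): each thread drives EPOLL_CTL_ADD/EPOLL_CTL_DEL in a tight loop against its own, non-nested epoll instance, which is exactly the simple topology that no longer needs the global 'epmutex' once these patches are applied.

/*
 * Minimal userspace sketch (illustrative only): NTHREADS threads each hammer
 * EPOLL_CTL_ADD/EPOLL_CTL_DEL on a private, non-nested epoll instance.
 * Before these patches every such call serialized on the global 'epmutex';
 * afterwards this simple topology avoids the global lock entirely.
 *
 * Build: gcc -O2 -pthread epctl_loop.c -o epctl_loop
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NTHREADS 16
#define ITERS    100000

static void *worker(void *arg)
{
	int epfd = epoll_create1(0);
	int evfd = eventfd(0, EFD_NONBLOCK);
	struct epoll_event ev = { .events = EPOLLIN };
	int i;

	(void)arg;
	if (epfd < 0 || evfd < 0) {
		perror("setup");
		return NULL;
	}
	ev.data.fd = evfd;
	for (i = 0; i < ITERS; i++) {
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev) < 0)
			perror("EPOLL_CTL_ADD");
		if (epoll_ctl(epfd, EPOLL_CTL_DEL, evfd, NULL) < 0)
			perror("EPOLL_CTL_DEL");
	}
	close(evfd);
	close(epfd);
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	puts("done");
	return 0;
}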
* [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu 2013-10-01 17:08 [PATCH 0/2 v2] epoll: reduce 'epmutex' lock contention Jason Baron @ 2013-10-01 17:08 ` Jason Baron 2013-10-24 8:52 ` Paul E. McKenney 2013-10-01 17:08 ` [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies Jason Baron 1 sibling, 1 reply; 10+ messages in thread From: Jason Baron @ 2013-10-01 17:08 UTC (permalink / raw) To: akpm Cc: normalperson, nzimmer, viro, nelhage, davidel, linux-kernel, linux-fsdevel Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by converting the file->f_ep_links list into an rcu one. In this way, we can traverse the epoll network on the add path in parallel with deletes. Since deletes can't create loops or worse wakeup paths, this is safe. This patch in combination with the patch "epoll: Do not take global 'epmutex' for simple topologies", shows a dramatic performance improvement in scalability for SPECjbb. Tested-by: Nathan Zimmer <nzimmer@sgi.com> Signed-off-by: Jason Baron <jbaron@akamai.com> --- fs/eventpoll.c | 65 ++++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 41 insertions(+), 24 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 473e09d..dd9fae1 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -42,6 +42,7 @@ #include <linux/proc_fs.h> #include <linux/seq_file.h> #include <linux/compat.h> +#include <linux/rculist.h> /* * LOCKING: @@ -134,8 +135,12 @@ struct nested_calls { * of these on a server and we do not want this to take another cache line. */ struct epitem { - /* RB tree node used to link this structure to the eventpoll RB tree */ - struct rb_node rbn; + union { + /* RB tree node links this structure to the eventpoll RB tree */ + struct rb_node rbn; + /* Used to free the struct epitem */ + struct rcu_head rcu; + }; /* List header used to link this structure to the eventpoll ready list */ struct list_head rdllink; @@ -166,6 +171,9 @@ struct epitem { /* The structure that describe the interested events and the source fd */ struct epoll_event event; + + /* The fllink is in use. Since rcu can't do 'list_del_init()' */ + int on_list; }; /* @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eventpoll *ep, return error; } +static void epi_rcu_free(struct rcu_head *head) +{ + struct epitem *epi = container_of(head, struct epitem, rcu); + kmem_cache_free(epi_cache, epi); +} + /* * Removes a "struct epitem" from the eventpoll RB tree and deallocates * all the associated resources. Must be called with "mtx" held. @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) /* Remove the current item from the list of epoll hooks */ spin_lock(&file->f_lock); - if (ep_is_linked(&epi->fllink)) - list_del_init(&epi->fllink); + if (epi->on_list) { + list_del_rcu(&epi->fllink); + epi->on_list = 0; + } spin_unlock(&file->f_lock); rb_erase(&epi->rbn, &ep->rbr); @@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) spin_unlock_irqrestore(&ep->lock, flags); wakeup_source_unregister(ep_wakeup_source(epi)); - - /* At this point it is safe to free the eventpoll item */ - kmem_cache_free(epi_cache, epi); + /* + * At this point it is safe to free the eventpoll item. Use the union + * field epi->rcu, since we are trying to minimize the size of + * 'struct epitem'. The 'rbn' field is no longer in use. Protected by + * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make + * use of the rbn field. 
+ */ + call_rcu(&epi->rcu, epi_rcu_free); atomic_long_dec(&ep->user->epoll_watches); @@ -873,7 +894,6 @@ static const struct file_operations eventpoll_fops = { */ void eventpoll_release_file(struct file *file) { - struct list_head *lsthead = &file->f_ep_links; struct eventpoll *ep; struct epitem *epi; @@ -891,17 +911,12 @@ void eventpoll_release_file(struct file *file) * Besides, ep_remove() acquires the lock, so we can't hold it here. */ mutex_lock(&epmutex); - - while (!list_empty(lsthead)) { - epi = list_first_entry(lsthead, struct epitem, fllink); - + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { ep = epi->ep; - list_del_init(&epi->fllink); mutex_lock_nested(&ep->mtx, 0); ep_remove(ep, epi); mutex_unlock(&ep->mtx); } - mutex_unlock(&epmutex); } @@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests) struct file *child_file; struct epitem *epi; - list_for_each_entry(epi, &file->f_ep_links, fllink) { + /* CTL_DEL can remove links here, but that can't increase our count */ + rcu_read_lock(); + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { child_file = epi->ep->file; if (is_file_epoll(child_file)) { if (list_empty(&child_file->f_ep_links)) { @@ -1161,6 +1178,7 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests) "file is not an ep!\n"); } } + rcu_read_unlock(); return error; } @@ -1255,6 +1273,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, epi->event = *event; epi->nwait = 0; epi->next = EP_UNACTIVE_PTR; + epi->on_list = 0; if (epi->event.events & EPOLLWAKEUP) { error = ep_create_wakeup_source(epi); if (error) @@ -1287,7 +1306,8 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, /* Add the current item to the list of active epoll hook for this file */ spin_lock(&tfile->f_lock); - list_add_tail(&epi->fllink, &tfile->f_ep_links); + list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links); + epi->on_list = 1; spin_unlock(&tfile->f_lock); /* @@ -1328,8 +1348,8 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, error_remove_epi: spin_lock(&tfile->f_lock); - if (ep_is_linked(&epi->fllink)) - list_del_init(&epi->fllink); + if (epi->on_list) + list_del_rcu(&epi->fllink); spin_unlock(&tfile->f_lock); rb_erase(&epi->rbn, &ep->rbr); @@ -1846,15 +1866,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, * and hang them on the tfile_check_list, so we can check that we * haven't created too many possible wakeup paths. * - * We need to hold the epmutex across both ep_insert and ep_remove - * b/c we want to make sure we are looking at a coherent view of - * epoll network. + * We need to hold the epmutex across ep_insert to prevent + * multple adds from creating loops in parallel. */ - if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) { + if (op == EPOLL_CTL_ADD) { mutex_lock(&epmutex); did_lock_epmutex = 1; - } - if (op == EPOLL_CTL_ADD) { if (is_file_epoll(tf.file)) { error = -ELOOP; if (ep_loop_check(ep, tf.file) != 0) { -- 1.8.2 ^ permalink raw reply related [flat|nested] 10+ messages in thread
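The lifetime rule this patch relies on can be condensed into a small out-of-tree sketch. The code below is an assumed illustration, not the eventpoll code: 'struct item', 'item_cache' and the 'demo' module are invented names. The point it demonstrates is the same one the patch makes: the rb_node is only ever touched under the owner's mutex, and once rb_erase() has run those bytes are dead, so the same storage can carry the rcu_head that defers the free past any concurrent RCU list walkers.

#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Invented for illustration -- not the eventpoll structures. */
struct item {
	union {
		struct rb_node rbn;	/* used only under tree_mtx   */
		struct rcu_head rcu;	/* used only after rb_erase() */
	};
	struct list_head link;		/* walked by RCU readers      */
};

static struct kmem_cache *item_cache;
static struct rb_root tree = RB_ROOT;
static DEFINE_MUTEX(tree_mtx);		/* plays the role of ep->mtx      */
static LIST_HEAD(items);
static DEFINE_SPINLOCK(items_lock);	/* plays the role of file->f_lock */

static void item_rcu_free(struct rcu_head *head)
{
	kmem_cache_free(item_cache, container_of(head, struct item, rcu));
}

/* Delete path: no global lock; readers may be walking 'items' right now. */
static void item_remove(struct item *it)
{
	mutex_lock(&tree_mtx);
	spin_lock(&items_lock);
	list_del_rcu(&it->link);
	spin_unlock(&items_lock);
	rb_erase(&it->rbn, &tree);		/* rbn is unused from here on */
	mutex_unlock(&tree_mtx);
	call_rcu(&it->rcu, item_rcu_free);	/* reuses the rbn bytes */
}

/* Reader path, analogous to reverse_path_check_proc(). */
static int item_count(void)
{
	struct item *it;
	int n = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(it, &items, link)
		n++;
	rcu_read_unlock();
	return n;
}

static int __init demo_init(void)
{
	struct item *it;

	item_cache = KMEM_CACHE(item, 0);
	if (!item_cache)
		return -ENOMEM;

	it = kmem_cache_zalloc(item_cache, GFP_KERNEL);
	if (!it) {
		kmem_cache_destroy(item_cache);
		return -ENOMEM;
	}

	mutex_lock(&tree_mtx);
	rb_link_node(&it->rbn, NULL, &tree.rb_node);
	rb_insert_color(&it->rbn, &tree);
	spin_lock(&items_lock);
	list_add_tail_rcu(&it->link, &items);
	spin_unlock(&items_lock);
	mutex_unlock(&tree_mtx);

	pr_info("demo: %d item(s) visible to readers\n", item_count());
	item_remove(it);
	return 0;
}

static void __exit demo_exit(void)
{
	rcu_barrier();			/* wait for pending call_rcu() frees */
	kmem_cache_destroy(item_cache);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");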
* Re: [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu 2013-10-01 17:08 ` [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu Jason Baron @ 2013-10-24 8:52 ` Paul E. McKenney 2013-10-24 14:56 ` Jason Baron 0 siblings, 1 reply; 10+ messages in thread From: Paul E. McKenney @ 2013-10-24 8:52 UTC (permalink / raw) To: Jason Baron Cc: akpm, normalperson, nzimmer, viro, nelhage, davidel, linux-kernel, linux-fsdevel On Tue, Oct 01, 2013 at 05:08:10PM +0000, Jason Baron wrote: > Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by > converting the file->f_ep_links list into an rcu one. In this way, we can > traverse the epoll network on the add path in parallel with deletes. > Since deletes can't create loops or worse wakeup paths, this is safe. > > This patch in combination with the patch "epoll: Do not take global 'epmutex' > for simple topologies", shows a dramatic performance improvement in > scalability for SPECjbb. A few questions and comments below. Thanx, Paul > Signed-off-by: Jason Baron <jbaron@akamai.com> > Tested-by: Nathan Zimmer <nzimmer@sgi.com> > Cc: Eric Wong <normalperson@yhbt.net> > Cc: Nelson Elhage <nelhage@nelhage.com> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Davide Libenzi <davidel@xmailserver.org> > Cc: "Paul E. McKenney" <paulmck@us.ibm.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > fs/eventpoll.c | 65 +++++++++++++++++++++++++++++------------------ > 1 file changed, 41 insertions(+), 24 deletions(-) > > diff -puN fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu fs/eventpoll.c > --- a/fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu > +++ a/fs/eventpoll.c > @@ -42,6 +42,7 @@ > #include <linux/proc_fs.h> > #include <linux/seq_file.h> > #include <linux/compat.h> > +#include <linux/rculist.h> > > /* > * LOCKING: > @@ -134,8 +135,12 @@ struct nested_calls { > * of these on a server and we do not want this to take another cache line. > */ > struct epitem { > - /* RB tree node used to link this structure to the eventpoll RB tree */ > - struct rb_node rbn; > + union { > + /* RB tree node links this structure to the eventpoll RB tree */ > + struct rb_node rbn; And RCU readers never need to use rbn, right? (That appears to be the case, so good.) > + /* Used to free the struct epitem */ > + struct rcu_head rcu; > + }; > > /* List header used to link this structure to the eventpoll ready list */ > struct list_head rdllink; > @@ -166,6 +171,9 @@ struct epitem { > > /* The structure that describe the interested events and the source fd */ > struct epoll_event event; > + > + /* The fllink is in use. Since rcu can't do 'list_del_init()' */ > + int on_list; > }; > > /* > @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eve > return error; > } > > +static void epi_rcu_free(struct rcu_head *head) > +{ > + struct epitem *epi = container_of(head, struct epitem, rcu); > + kmem_cache_free(epi_cache, epi); Sigh. I suppose that I need to create a kmem_cache_free_rcu() at some point... > +} > + > /* > * Removes a "struct epitem" from the eventpoll RB tree and deallocates > * all the associated resources. Must be called with "mtx" held. > @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *e > > /* Remove the current item from the list of epoll hooks */ > spin_lock(&file->f_lock); > - if (ep_is_linked(&epi->fllink)) > - list_del_init(&epi->fllink); > + if (epi->on_list) { > + list_del_rcu(&epi->fllink); OK, here is the list_del_rcu(). It does seem to precede the call_rcu() below, as it must. 
Of course, if !epi->onlist, you could just free it without going through call_rcu(), but perhaps that optimization is not worthwhile. > + epi->on_list = 0; > + } > spin_unlock(&file->f_lock); > > rb_erase(&epi->rbn, &ep->rbr); > @@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *e > spin_unlock_irqrestore(&ep->lock, flags); > > wakeup_source_unregister(ep_wakeup_source(epi)); > - > - /* At this point it is safe to free the eventpoll item */ > - kmem_cache_free(epi_cache, epi); > + /* > + * At this point it is safe to free the eventpoll item. Use the union > + * field epi->rcu, since we are trying to minimize the size of > + * 'struct epitem'. The 'rbn' field is no longer in use. Protected by > + * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make > + * use of the rbn field. > + */ > + call_rcu(&epi->rcu, epi_rcu_free); And here is the call_rcu(), so good. At least assuming there are no other RCU-reader-accessible lists that this thing is on. (If there are, then it needs to be removed from these lists as well.) > > atomic_long_dec(&ep->user->epoll_watches); > > @@ -873,7 +894,6 @@ static const struct file_operations even > */ > void eventpoll_release_file(struct file *file) > { > - struct list_head *lsthead = &file->f_ep_links; > struct eventpoll *ep; > struct epitem *epi; > > @@ -891,17 +911,12 @@ void eventpoll_release_file(struct file > * Besides, ep_remove() acquires the lock, so we can't hold it here. > */ > mutex_lock(&epmutex); > - > - while (!list_empty(lsthead)) { > - epi = list_first_entry(lsthead, struct epitem, fllink); > - > + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { > ep = epi->ep; > - list_del_init(&epi->fllink); > mutex_lock_nested(&ep->mtx, 0); > ep_remove(ep, epi); > mutex_unlock(&ep->mtx); > } > - > mutex_unlock(&epmutex); > } > > @@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void > struct file *child_file; > struct epitem *epi; > > - list_for_each_entry(epi, &file->f_ep_links, fllink) { > + /* CTL_DEL can remove links here, but that can't increase our count */ > + rcu_read_lock(); > + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { > child_file = epi->ep->file; > if (is_file_epoll(child_file)) { > if (list_empty(&child_file->f_ep_links)) { Hmmm... It looks like ep_call_nested() acquires a global spinlock. Perhaps that is fixed in a later patch in this series? Otherwise, this will be a bottleneck on large systems. > @@ -1161,6 +1178,7 @@ static int reverse_path_check_proc(void > "file is not an ep!\n"); > } > } > + rcu_read_unlock(); > return error; > } > > @@ -1255,6 +1273,7 @@ static int ep_insert(struct eventpoll *e > epi->event = *event; > epi->nwait = 0; > epi->next = EP_UNACTIVE_PTR; > + epi->on_list = 0; > if (epi->event.events & EPOLLWAKEUP) { > error = ep_create_wakeup_source(epi); > if (error) > @@ -1287,7 +1306,8 @@ static int ep_insert(struct eventpoll *e > > /* Add the current item to the list of active epoll hook for this file */ > spin_lock(&tfile->f_lock); Looks like ->f_lock protects updates to the RCU-protected structure. > - list_add_tail(&epi->fllink, &tfile->f_ep_links); > + list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links); > + epi->on_list = 1; > spin_unlock(&tfile->f_lock); > > /* > @@ -1328,8 +1348,8 @@ static int ep_insert(struct eventpoll *e > > error_remove_epi: > spin_lock(&tfile->f_lock); More evidence in favor of ->f_lock protecting updates to the RCU-protected data structure. 
> - if (ep_is_linked(&epi->fllink)) > - list_del_init(&epi->fllink); > + if (epi->on_list) > + list_del_rcu(&epi->fllink); OK, this list_del_rcu() call is for handling failed inserts. But if we had to use list_del_rcu(), don't we also need to wait a grace period before freeing the item? Or does rb_erase() somehow do that? Or has placing it ->on_list somehow avoided exposing it to readers? (Of course, it this is the case, the list_del_rcu() could remain a list_del_init()...) > spin_unlock(&tfile->f_lock); > > rb_erase(&epi->rbn, &ep->rbr); > @@ -1846,15 +1866,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in > * and hang them on the tfile_check_list, so we can check that we > * haven't created too many possible wakeup paths. > * > - * We need to hold the epmutex across both ep_insert and ep_remove > - * b/c we want to make sure we are looking at a coherent view of > - * epoll network. > + * We need to hold the epmutex across ep_insert to prevent > + * multple adds from creating loops in parallel. > */ > - if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) { > + if (op == EPOLL_CTL_ADD) { > mutex_lock(&epmutex); > did_lock_epmutex = 1; > - } > - if (op == EPOLL_CTL_ADD) { > if (is_file_epoll(tf.file)) { > error = -ELOOP; > if (ep_loop_check(ep, tf.file) != 0) { > _ > ^ permalink raw reply [flat|nested] 10+ messages in thread
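On the 'kmem_cache_free_rcu()' aside: for kmalloc()ed objects the kernel already provides kfree_rcu(), which supplies the grace period with no hand-written callback; slab-cache objects such as 'struct epitem' still need an explicit call_rcu() callback like epi_rcu_free() in the patch. The sketch below is an assumed, standalone illustration ('struct foo' and the module names are invented) of the rule Paul's grace-period question is probing: once list_del_rcu() has run, the object may still be in the hands of an RCU reader, so it must not be freed synchronously.

#include <linux/module.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Invented example type -- kmalloc()-backed, unlike the slab-cached epitem. */
struct foo {
	struct list_head link;
	struct rcu_head rcu;
};

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

static void foo_remove(struct foo *f)
{
	spin_lock(&foo_lock);
	list_del_rcu(&f->link);
	spin_unlock(&foo_lock);
	/*
	 * Readers inside list_for_each_entry_rcu() may still hold 'f', so a
	 * plain kfree() here would be the bug being probed for above.
	 * kfree_rcu() defers the free by one grace period.
	 */
	kfree_rcu(f, rcu);
}

static int __init foo_demo_init(void)
{
	struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return -ENOMEM;
	spin_lock(&foo_lock);
	list_add_rcu(&f->link, &foo_list);
	spin_unlock(&foo_lock);
	foo_remove(f);
	return 0;
}

static void __exit foo_demo_exit(void)
{
	rcu_barrier();		/* flush the deferred free before unload */
}

module_init(foo_demo_init);
module_exit(foo_demo_exit);
MODULE_LICENSE("GPL");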
* Re: [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu 2013-10-24 8:52 ` Paul E. McKenney @ 2013-10-24 14:56 ` Jason Baron 0 siblings, 0 replies; 10+ messages in thread From: Jason Baron @ 2013-10-24 14:56 UTC (permalink / raw) To: paulmck@linux.vnet.ibm.com Cc: akpm@linux-foundation.org, normalperson@yhbt.net, nzimmer@sgi.com, viro@zeniv.linux.org.uk, nelhage@nelhage.com, davidel@xmailserver.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Thanks for looking at this. Comments below. -Jason On 10/24/2013 04:52 AM, Paul E. McKenney wrote: > On Tue, Oct 01, 2013 at 05:08:10PM +0000, Jason Baron wrote: >> Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by >> converting the file->f_ep_links list into an rcu one. In this way, we can >> traverse the epoll network on the add path in parallel with deletes. >> Since deletes can't create loops or worse wakeup paths, this is safe. >> >> This patch in combination with the patch "epoll: Do not take global 'epmutex' >> for simple topologies", shows a dramatic performance improvement in >> scalability for SPECjbb. > > A few questions and comments below. > > Thanx, Paul > >> Signed-off-by: Jason Baron <jbaron@akamai.com> >> Tested-by: Nathan Zimmer <nzimmer@sgi.com> >> Cc: Eric Wong <normalperson@yhbt.net> >> Cc: Nelson Elhage <nelhage@nelhage.com> >> Cc: Al Viro <viro@zeniv.linux.org.uk> >> Cc: Davide Libenzi <davidel@xmailserver.org> >> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> >> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> >> --- >> >> fs/eventpoll.c | 65 +++++++++++++++++++++++++++++------------------ >> 1 file changed, 41 insertions(+), 24 deletions(-) >> >> diff -puN fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu fs/eventpoll.c >> --- a/fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu >> +++ a/fs/eventpoll.c >> @@ -42,6 +42,7 @@ >> #include <linux/proc_fs.h> >> #include <linux/seq_file.h> >> #include <linux/compat.h> >> +#include <linux/rculist.h> >> >> /* >> * LOCKING: >> @@ -134,8 +135,12 @@ struct nested_calls { >> * of these on a server and we do not want this to take another cache line. >> */ >> struct epitem { >> - /* RB tree node used to link this structure to the eventpoll RB tree */ >> - struct rb_node rbn; >> + union { >> + /* RB tree node links this structure to the eventpoll RB tree */ >> + struct rb_node rbn; > > And RCU readers never need to use rbn, right? (That appears to be the case, > so good.) Right RCU readers, specifically sections under rcu_read_lock() are not touching rbn. The idea here is to remove the usage of the rbn, via the call to rb_erase() in ep_remove() before we make use of the struct rcu_head rcu. The concern I have is that there could be a race where somebody accesses rbn, while we are using it as rcu. However, I don't think its possible. The idea is that all access to rbn is guarded by ep->mtx. The only exception I found was in the top of 'ep_free()', but in that case we are in an 'fput' and I don't believe anybody else could get a reference to the ep and hence the rbn. But please double check. > >> + /* Used to free the struct epitem */ >> + struct rcu_head rcu; >> + }; >> >> /* List header used to link this structure to the eventpoll ready list */ >> struct list_head rdllink; >> @@ -166,6 +171,9 @@ struct epitem { >> >> /* The structure that describe the interested events and the source fd */ >> struct epoll_event event; >> + >> + /* The fllink is in use. 
Since rcu can't do 'list_del_init()' */ >> + int on_list; >> }; >> >> /* >> @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eve >> return error; >> } >> >> +static void epi_rcu_free(struct rcu_head *head) >> +{ >> + struct epitem *epi = container_of(head, struct epitem, rcu); >> + kmem_cache_free(epi_cache, epi); > > Sigh. I suppose that I need to create a kmem_cache_free_rcu() at some > point... > >> +} >> + >> /* >> * Removes a "struct epitem" from the eventpoll RB tree and deallocates >> * all the associated resources. Must be called with "mtx" held. >> @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *e >> >> /* Remove the current item from the list of epoll hooks */ >> spin_lock(&file->f_lock); >> - if (ep_is_linked(&epi->fllink)) >> - list_del_init(&epi->fllink); >> + if (epi->on_list) { >> + list_del_rcu(&epi->fllink); > > OK, here is the list_del_rcu(). It does seem to precede the call_rcu() > below, as it must. Of course, if !epi->onlist, you could just free > it without going through call_rcu(), but perhaps that optimization is > not worthwhile. > hmmm...yeah I think we're ok without it. >> + epi->on_list = 0; >> + } >> spin_unlock(&file->f_lock); >> >> rb_erase(&epi->rbn, &ep->rbr); >> @@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *e >> spin_unlock_irqrestore(&ep->lock, flags); >> >> wakeup_source_unregister(ep_wakeup_source(epi)); >> - >> - /* At this point it is safe to free the eventpoll item */ >> - kmem_cache_free(epi_cache, epi); >> + /* >> + * At this point it is safe to free the eventpoll item. Use the union >> + * field epi->rcu, since we are trying to minimize the size of >> + * 'struct epitem'. The 'rbn' field is no longer in use. Protected by >> + * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make >> + * use of the rbn field. >> + */ >> + call_rcu(&epi->rcu, epi_rcu_free); > > And here is the call_rcu(), so good. At least assuming there are no other > RCU-reader-accessible lists that this thing is on. (If there are, then it > needs to be removed from these lists as well.) > It is removed from the epi->fdllink before here as well, but my patch doesn't alter any ordering with respect to that removal. >> >> atomic_long_dec(&ep->user->epoll_watches); >> >> @@ -873,7 +894,6 @@ static const struct file_operations even >> */ >> void eventpoll_release_file(struct file *file) >> { >> - struct list_head *lsthead = &file->f_ep_links; >> struct eventpoll *ep; >> struct epitem *epi; >> >> @@ -891,17 +911,12 @@ void eventpoll_release_file(struct file >> * Besides, ep_remove() acquires the lock, so we can't hold it here. >> */ >> mutex_lock(&epmutex); >> - >> - while (!list_empty(lsthead)) { >> - epi = list_first_entry(lsthead, struct epitem, fllink); >> - >> + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { >> ep = epi->ep; >> - list_del_init(&epi->fllink); >> mutex_lock_nested(&ep->mtx, 0); >> ep_remove(ep, epi); >> mutex_unlock(&ep->mtx); >> } >> - >> mutex_unlock(&epmutex); >> } >> >> @@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void >> struct file *child_file; >> struct epitem *epi; >> >> - list_for_each_entry(epi, &file->f_ep_links, fllink) { >> + /* CTL_DEL can remove links here, but that can't increase our count */ >> + rcu_read_lock(); >> + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { >> child_file = epi->ep->file; >> if (is_file_epoll(child_file)) { >> if (list_empty(&child_file->f_ep_links)) { > > Hmmm... It looks like ep_call_nested() acquires a global spinlock. 
> Perhaps that is fixed in a later patch in this series? Otherwise, > this will be a bottleneck on large systems. > Yes - the spinlock is pointed to by 'poll_loop_ncalls' which is is used by the both the loop checking logic and the this 'reverse_patch_check_proc' which detects if we have added to many wakeup paths to a wakeup source. The point of this patch, is to only do these more expensive checks when the epoll graph or nesting is 'complex'. So, in theory, the the fast paths wouldn't even take this lock. Regardless, the current logic still requires a global lock for these checks - the 'epmutex', so that would be the higher level lock to fix, if there were still scaling issues here. Note that with these patches the 'epmutex' is now only taken on ep_insert, in certain cases, and not at all on ep_remove (due to the new rcu usage). >> @@ -1161,6 +1178,7 @@ static int reverse_path_check_proc(void >> "file is not an ep!\n"); >> } >> } >> + rcu_read_unlock(); >> return error; >> } >> >> @@ -1255,6 +1273,7 @@ static int ep_insert(struct eventpoll *e >> epi->event = *event; >> epi->nwait = 0; >> epi->next = EP_UNACTIVE_PTR; >> + epi->on_list = 0; >> if (epi->event.events & EPOLLWAKEUP) { >> error = ep_create_wakeup_source(epi); >> if (error) >> @@ -1287,7 +1306,8 @@ static int ep_insert(struct eventpoll *e >> >> /* Add the current item to the list of active epoll hook for this file */ >> spin_lock(&tfile->f_lock); > > Looks like ->f_lock protects updates to the RCU-protected structure. > yes. >> - list_add_tail(&epi->fllink, &tfile->f_ep_links); >> + list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links); >> + epi->on_list = 1; >> spin_unlock(&tfile->f_lock); >> >> /* >> @@ -1328,8 +1348,8 @@ static int ep_insert(struct eventpoll *e >> >> error_remove_epi: >> spin_lock(&tfile->f_lock); > > More evidence in favor of ->f_lock protecting updates to the RCU-protected > data structure. > >> - if (ep_is_linked(&epi->fllink)) >> - list_del_init(&epi->fllink); >> + if (epi->on_list) >> + list_del_rcu(&epi->fllink); > > OK, this list_del_rcu() call is for handling failed inserts. > > But if we had to use list_del_rcu(), don't we also need to wait a grace > period before freeing the item? Or does rb_erase() somehow do that? > > Or has placing it ->on_list somehow avoided exposing it to readers? > (Of course, it this is the case, the list_del_rcu() could remain > a list_del_init()...) > Right so the only way we can get a failed insert is if we get to the 'error_remove_epi:' label. Which is only possible if 'full_check' is set. That implies that we hold the global 'epmutex' which means that 'reverse_path_check_proc()' can not be running. Thus, we could use either list_del_rcu or list_del_init here. I chose list_del_rcu() to be more consistent. Perhaps, I should add a comment above to explain this? >> spin_unlock(&tfile->f_lock); >> >> rb_erase(&epi->rbn, &ep->rbr); >> @@ -1846,15 +1866,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in >> * and hang them on the tfile_check_list, so we can check that we >> * haven't created too many possible wakeup paths. >> * >> - * We need to hold the epmutex across both ep_insert and ep_remove >> - * b/c we want to make sure we are looking at a coherent view of >> - * epoll network. >> + * We need to hold the epmutex across ep_insert to prevent >> + * multple adds from creating loops in parallel. 
>> */ >> - if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) { >> + if (op == EPOLL_CTL_ADD) { >> mutex_lock(&epmutex); >> did_lock_epmutex = 1; >> - } >> - if (op == EPOLL_CTL_ADD) { >> if (is_file_epoll(tf.file)) { >> error = -ELOOP; >> if (ep_loop_check(ep, tf.file) != 0) { >> _ >> ^ permalink raw reply [flat|nested] 10+ messages in thread
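The invariant Jason is defending above, that 'rbn' is only ever touched while ep->mtx is held, is the kind of rule that can be made machine-checkable rather than comment-only. The fragment below is a hedged sketch of how such a rule is commonly encoded with lockdep; the helper and structure names are invented and nothing like it exists in eventpoll today.

#include <linux/lockdep.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/rcupdate.h>

/* Illustrative stand-ins for struct eventpoll / struct epitem. */
struct owner {
	struct mutex mtx;
	struct rb_root rbr;
};

struct node {
	union {
		struct rb_node rbn;	/* valid only under owner->mtx */
		struct rcu_head rcu;	/* valid only after unlinking  */
	};
};

/*
 * Funnel all rbn access through a helper so that, with CONFIG_PROVE_LOCKING,
 * lockdep warns if anyone touches the tree linkage without holding
 * owner->mtx -- the invariant that makes reusing rbn's bytes for the
 * rcu_head safe.
 */
static inline struct rb_node *node_rbn(struct owner *o, struct node *n)
{
	lockdep_assert_held(&o->mtx);
	return &n->rbn;
}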
* [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies 2013-10-01 17:08 [PATCH 0/2 v2] epoll: reduce 'epmutex' lock contention Jason Baron 2013-10-01 17:08 ` [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu Jason Baron @ 2013-10-01 17:08 ` Jason Baron 2013-10-03 21:50 ` Andrew Morton 2013-10-24 10:09 ` Paul E. McKenney 1 sibling, 2 replies; 10+ messages in thread From: Jason Baron @ 2013-10-01 17:08 UTC (permalink / raw) To: akpm Cc: normalperson, nzimmer, viro, nelhage, davidel, linux-kernel, linux-fsdevel When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached directly to a wakeup source, we do not need to take the global 'epmutex', unless the epoll file descriptor is nested. The purpose of taking the 'epmutex' on add is to prevent complex topologies such as loops and deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD operations. However, for the simple case of an epoll file descriptor attached directly to a wakeup source (with no nesting), we do not need to hold the 'epmutex'. This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb performance: " On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck. " Tested-by: Nathan Zimmer <nzimmer@sgi.com> Signed-off-by: Jason Baron <jbaron@akamai.com> --- fs/eventpoll.c | 94 ++++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 68 insertions(+), 26 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index dd9fae1..0f25162 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi) static int ep_scan_ready_list(struct eventpoll *ep, int (*sproc)(struct eventpoll *, struct list_head *, void *), - void *priv, - int depth) + void *priv, int depth, int ep_locked) { int error, pwake = 0; unsigned long flags; @@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep, * We need to lock this because we could be hit by * eventpoll_release_file() and epoll_ctl(). 
*/ - mutex_lock_nested(&ep->mtx, depth); + + if (!ep_locked) + mutex_lock_nested(&ep->mtx, depth); /* * Steal the ready list, and re-init the original one to the @@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep, } spin_unlock_irqrestore(&ep->lock, flags); - mutex_unlock(&ep->mtx); + if (!ep_locked) + mutex_unlock(&ep->mtx); /* We have to call this outside the lock */ if (pwake) @@ -829,15 +831,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, return 0; } +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, + poll_table *pt); + +struct readyevents_arg { + struct eventpoll *ep; + int locked; +}; + static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests) { - return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 1); + struct readyevents_arg *arg = (struct readyevents_arg *)priv; + + return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL, + call_nests + 1, arg->locked); } static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) { int pollflags; struct eventpoll *ep = file->private_data; + struct readyevents_arg arg; + + /* + * During ep_insert() we already hold the ep->mtx for the tfile. + * Prevent re-aquisition. + */ + arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0); + arg.ep = ep; /* Insert inside our poll wait queue */ poll_wait(file, &ep->poll_wait, wait); @@ -849,7 +870,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) * could re-enter here. */ pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS, - ep_poll_readyevents_proc, ep, ep, current); + ep_poll_readyevents_proc, &arg, ep, current); return pollflags != -1 ? pollflags : 0; } @@ -1250,7 +1271,7 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi) * Must be called with "mtx" held. */ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, - struct file *tfile, int fd) + struct file *tfile, int fd, int full_check) { int error, revents, pwake = 0; unsigned long flags; @@ -1318,7 +1339,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, /* now check if we've created too many backpaths */ error = -EINVAL; - if (reverse_path_check()) + if (full_check && reverse_path_check()) goto error_remove_epi; /* We have to drop the new item inside our item list to keep track of it */ @@ -1542,7 +1563,7 @@ static int ep_send_events(struct eventpoll *ep, esed.maxevents = maxevents; esed.events = events; - return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0); + return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, 0); } static inline struct timespec ep_set_mstimeout(long ms) @@ -1813,11 +1834,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event) { int error; - int did_lock_epmutex = 0; + int full_check = 0; struct fd f, tf; struct eventpoll *ep; struct epitem *epi; struct epoll_event epds; + struct eventpoll *tep = NULL; error = -EFAULT; if (ep_op_has_event(op) && @@ -1866,23 +1888,40 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, * and hang them on the tfile_check_list, so we can check that we * haven't created too many possible wakeup paths. * - * We need to hold the epmutex across ep_insert to prevent - * multple adds from creating loops in parallel. 
+ * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when + * the epoll file descriptor is attaching directly to a wakeup source, + * unless the epoll file descriptor is nested. The purpose of taking the + * 'epmutex' on add is to prevent complex toplogies such as loops and + * deep wakeup paths from forming in parallel through multiple + * EPOLL_CTL_ADD operations. */ + mutex_lock_nested(&ep->mtx, 0); if (op == EPOLL_CTL_ADD) { - mutex_lock(&epmutex); - did_lock_epmutex = 1; - if (is_file_epoll(tf.file)) { - error = -ELOOP; - if (ep_loop_check(ep, tf.file) != 0) { - clear_tfile_check_list(); - goto error_tgt_fput; + if (!list_empty(&f.file->f_ep_links) || + is_file_epoll(tf.file)) { + full_check = 1; + mutex_unlock(&ep->mtx); + mutex_lock(&epmutex); + if (is_file_epoll(tf.file)) { + error = -ELOOP; + if (ep_loop_check(ep, tf.file) != 0) { + clear_tfile_check_list(); + goto error_tgt_fput; + } + } else + list_add(&tf.file->f_tfile_llink, + &tfile_check_list); + mutex_lock_nested(&ep->mtx, 0); + if (is_file_epoll(tf.file)) { + tep = tf.file->private_data; + mutex_lock_nested(&tep->mtx, 1); } - } else - list_add(&tf.file->f_tfile_llink, &tfile_check_list); + } + } + if (op == EPOLL_CTL_DEL && is_file_epoll(tf.file)) { + tep = tf.file->private_data; + mutex_lock_nested(&tep->mtx, 1); } - - mutex_lock_nested(&ep->mtx, 0); /* * Try to lookup the file inside our RB tree, Since we grabbed "mtx" @@ -1896,10 +1935,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, case EPOLL_CTL_ADD: if (!epi) { epds.events |= POLLERR | POLLHUP; - error = ep_insert(ep, &epds, tf.file, fd); + error = ep_insert(ep, &epds, tf.file, fd, full_check); } else error = -EEXIST; - clear_tfile_check_list(); + if (full_check) + clear_tfile_check_list(); break; case EPOLL_CTL_DEL: if (epi) @@ -1915,10 +1955,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, error = -ENOENT; break; } + if (tep != NULL) + mutex_unlock(&tep->mtx); mutex_unlock(&ep->mtx); error_tgt_fput: - if (did_lock_epmutex) + if (full_check) mutex_unlock(&epmutex); fdput(tf); -- 1.8.2 ^ permalink raw reply related [flat|nested] 10+ messages in thread
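To make the simple-versus-nested distinction concrete, the small userspace sketch below is an assumed illustration (it is not part of the patch): the first epoll_ctl() attaches a plain wakeup source and, with this patch, stays off 'epmutex'; the second attaches one epoll fd inside another, which is the nested case that still takes the global lock so the loop and reverse-path checks can run.

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	int outer = epoll_create1(0);
	int inner = epoll_create1(0);
	int evfd  = eventfd(0, 0);
	struct epoll_event ev = { .events = EPOLLIN };

	if (outer < 0 || inner < 0 || evfd < 0) {
		perror("setup");
		return 1;
	}

	/* Simple topology: epoll fd -> wakeup source. No 'epmutex' needed. */
	ev.data.fd = evfd;
	if (epoll_ctl(inner, EPOLL_CTL_ADD, evfd, &ev) < 0)
		perror("add eventfd");

	/*
	 * Nested topology: epoll fd -> epoll fd -> wakeup source. This is
	 * the full_check path: 'epmutex' is taken and ep_loop_check() plus
	 * reverse_path_check() run before the link is allowed.
	 */
	ev.data.fd = inner;
	if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0)
		perror("add nested epoll fd");

	close(evfd);
	close(inner);
	close(outer);
	return 0;
}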
* Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies 2013-10-01 17:08 ` [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies Jason Baron @ 2013-10-03 21:50 ` Andrew Morton 2013-10-04 15:16 ` Jason Baron 2013-10-24 10:09 ` Paul E. McKenney 1 sibling, 1 reply; 10+ messages in thread From: Andrew Morton @ 2013-10-03 21:50 UTC (permalink / raw) To: Jason Baron Cc: normalperson, nzimmer, viro, nelhage, davidel, linux-kernel, linux-fsdevel On Tue, 1 Oct 2013 17:08:14 +0000 (GMT) Jason Baron <jbaron@akamai.com> wrote: > When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached > directly to a wakeup source, we do not need to take the global 'epmutex', > unless the epoll file descriptor is nested. The purpose of taking > the 'epmutex' on add is to prevent complex topologies such as loops and > deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD > operations. However, for the simple case of an epoll file descriptor > attached directly to a wakeup source (with no nesting), we do not need > to hold the 'epmutex'. > > This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves > scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb > performance: > > " > On the 16 socket run the performance went from 35k jOPS to 125k jOPS. > In addition the benchmark when from scaling well on 10 sockets to scaling well > on just over 40 sockets. > > ... > > Currently the benchmark stops scaling at around 40-44 sockets but it seems like > I found a second unrelated bottleneck. > " I couldn't resist fiddling. Please review: From: Andrew Morton <akpm@linux-foundation.org> Subject: epoll-do-not-take-global-epmutex-for-simple-topologies-fix - use `bool' for boolean variables - remove unneeded/undesirable cast of void* - add missed ep_scan_ready_list() kerneldoc Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Nelson Elhage <nelhage@nelhage.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- fs/eventpoll.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff -puN fs/eventpoll.c~epoll-do-not-take-global-epmutex-for-simple-topologies-fix fs/eventpoll.c --- a/fs/eventpoll.c~epoll-do-not-take-global-epmutex-for-simple-topologies-fix +++ a/fs/eventpoll.c @@ -589,13 +589,14 @@ static inline void ep_pm_stay_awake_rcu( * @sproc: Pointer to the scan callback. * @priv: Private opaque data passed to the @sproc callback. * @depth: The current depth of recursive f_op->poll calls. + * @ep_locked: caller already holds ep->mtx * * Returns: The same integer error code returned by the @sproc callback. 
*/ static int ep_scan_ready_list(struct eventpoll *ep, int (*sproc)(struct eventpoll *, struct list_head *, void *), - void *priv, int depth, int ep_locked) + void *priv, int depth, bool ep_locked) { int error, pwake = 0; unsigned long flags; @@ -836,12 +837,12 @@ static void ep_ptable_queue_proc(struct struct readyevents_arg { struct eventpoll *ep; - int locked; + bool locked; }; static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests) { - struct readyevents_arg *arg = (struct readyevents_arg *)priv; + struct readyevents_arg *arg = priv; return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL, call_nests + 1, arg->locked); @@ -857,7 +858,7 @@ static unsigned int ep_eventpoll_poll(st * During ep_insert() we already hold the ep->mtx for the tfile. * Prevent re-aquisition. */ - arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0); + arg.locked = wait && (wait->_qproc == ep_ptable_queue_proc); arg.ep = ep; /* Insert inside our poll wait queue */ @@ -1563,7 +1564,7 @@ static int ep_send_events(struct eventpo esed.maxevents = maxevents; esed.events = events; - return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, 0); + return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false); } static inline struct timespec ep_set_mstimeout(long ms) _ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies 2013-10-03 21:50 ` Andrew Morton @ 2013-10-04 15:16 ` Jason Baron 2013-10-04 17:54 ` Nathan Zimmer 0 siblings, 1 reply; 10+ messages in thread From: Jason Baron @ 2013-10-04 15:16 UTC (permalink / raw) To: Andrew Morton Cc: normalperson@yhbt.net, nzimmer@sgi.com, viro@zeniv.linux.org.uk, nelhage@nelhage.com, davidel@xmailserver.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org On 10/03/2013 05:50 PM, Andrew Morton wrote: > On Tue, 1 Oct 2013 17:08:14 +0000 (GMT) Jason Baron <jbaron@akamai.com> wrote: > >> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached >> directly to a wakeup source, we do not need to take the global 'epmutex', >> unless the epoll file descriptor is nested. The purpose of taking >> the 'epmutex' on add is to prevent complex topologies such as loops and >> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD >> operations. However, for the simple case of an epoll file descriptor >> attached directly to a wakeup source (with no nesting), we do not need >> to hold the 'epmutex'. >> >> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves >> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb >> performance: >> >> " >> On the 16 socket run the performance went from 35k jOPS to 125k jOPS. >> In addition the benchmark when from scaling well on 10 sockets to scaling well >> on just over 40 sockets. >> >> ... >> >> Currently the benchmark stops scaling at around 40-44 sockets but it seems like >> I found a second unrelated bottleneck. >> " > I couldn't resist fiddling. Please review > > From: Andrew Morton <akpm@linux-foundation.org> > Subject: epoll-do-not-take-global-epmutex-for-simple-topologies-fix > > - use `bool' for boolean variables > - remove unneeded/undesirable cast of void* > - add missed ep_scan_ready_list() kerneldoc Hi Andrew, Looks good to me. Feel free to add: Reviewed-by: Jason Baron <jbaron@akamai.com> Thanks, -Jason ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies 2013-10-04 15:16 ` Jason Baron @ 2013-10-04 17:54 ` Nathan Zimmer 0 siblings, 0 replies; 10+ messages in thread From: Nathan Zimmer @ 2013-10-04 17:54 UTC (permalink / raw) To: Jason Baron Cc: Andrew Morton, normalperson@yhbt.net, nzimmer@sgi.com, viro@zeniv.linux.org.uk, nelhage@nelhage.com, davidel@xmailserver.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org On Fri, Oct 04, 2013 at 11:16:12AM -0400, Jason Baron wrote: > On 10/03/2013 05:50 PM, Andrew Morton wrote: > > On Tue, 1 Oct 2013 17:08:14 +0000 (GMT) Jason Baron <jbaron@akamai.com> wrote: > > > >> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached > >> directly to a wakeup source, we do not need to take the global 'epmutex', > >> unless the epoll file descriptor is nested. The purpose of taking > >> the 'epmutex' on add is to prevent complex topologies such as loops and > >> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD > >> operations. However, for the simple case of an epoll file descriptor > >> attached directly to a wakeup source (with no nesting), we do not need > >> to hold the 'epmutex'. > >> > >> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves > >> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb > >> performance: > >> > >> " > >> On the 16 socket run the performance went from 35k jOPS to 125k jOPS. > >> In addition the benchmark when from scaling well on 10 sockets to scaling well > >> on just over 40 sockets. > >> > >> ... > >> > >> Currently the benchmark stops scaling at around 40-44 sockets but it seems like > >> I found a second unrelated bottleneck. > >> " > > I couldn't resist fiddling. Please review > > > > From: Andrew Morton <akpm@linux-foundation.org> > > Subject: epoll-do-not-take-global-epmutex-for-simple-topologies-fix > > > > - use `bool' for boolean variables > > - remove unneeded/undesirable cast of void* > > - add missed ep_scan_ready_list() kerneldoc > > Hi Andrew, > > Looks good to me. Feel free to add: > > Reviewed-by: Jason Baron <jbaron@akamai.com> > > Thanks, > > -Jason Just finished rerunning the benchmarks with this latest patchset, including the fiddling, and it still looks great. Thanks, Nate ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies 2013-10-01 17:08 ` [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies Jason Baron 2013-10-03 21:50 ` Andrew Morton @ 2013-10-24 10:09 ` Paul E. McKenney 2013-10-24 16:00 ` Jason Baron 1 sibling, 1 reply; 10+ messages in thread From: Paul E. McKenney @ 2013-10-24 10:09 UTC (permalink / raw) To: Jason Baron Cc: akpm, normalperson, nzimmer, viro, nelhage, davidel, linux-kernel, linux-fsdevel On Tue, Oct 01, 2013 at 05:08:14PM +0000, Jason Baron wrote: > When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached > directly to a wakeup source, we do not need to take the global 'epmutex', > unless the epoll file descriptor is nested. The purpose of taking > the 'epmutex' on add is to prevent complex topologies such as loops and > deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD > operations. However, for the simple case of an epoll file descriptor > attached directly to a wakeup source (with no nesting), we do not need > to hold the 'epmutex'. > > This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves > scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb > performance: > > " > On the 16 socket run the performance went from 35k jOPS to 125k jOPS. > In addition the benchmark when from scaling well on 10 sockets to scaling well > on just over 40 sockets. > > ... > > Currently the benchmark stops scaling at around 40-44 sockets but it seems like > I found a second unrelated bottleneck. > " Some questions and comments below. > Tested-by: Nathan Zimmer <nzimmer@sgi.com> > Signed-off-by: Jason Baron <jbaron@akamai.com> > --- > fs/eventpoll.c | 94 ++++++++++++++++++++++++++++++++++++++++++---------------- > 1 file changed, 68 insertions(+), 26 deletions(-) > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > index dd9fae1..0f25162 100644 > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi) > static int ep_scan_ready_list(struct eventpoll *ep, > int (*sproc)(struct eventpoll *, > struct list_head *, void *), > - void *priv, > - int depth) > + void *priv, int depth, int ep_locked) > { > int error, pwake = 0; > unsigned long flags; > @@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep, > * We need to lock this because we could be hit by > * eventpoll_release_file() and epoll_ctl(). > */ > - mutex_lock_nested(&ep->mtx, depth); > + > + if (!ep_locked) > + mutex_lock_nested(&ep->mtx, depth); > > /* > * Steal the ready list, and re-init the original one to the > @@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep, > } > spin_unlock_irqrestore(&ep->lock, flags); > > - mutex_unlock(&ep->mtx); > + if (!ep_locked) > + mutex_unlock(&ep->mtx); > > /* We have to call this outside the lock */ > if (pwake) OK, allowing calls to ep_scan_ready_list() to be made under the lock. Usually you would use a __ep_scan_ready_list() instead of adding the flag, but you do have the wakeup that needs to be outside of the lock. But doesn't that mean that you need something like? if (pwake && !ep_locked) And then wouldn't you also need some way to inform the caller to do the wakeup after releasing ep->mtx? Or is it now safe to do the wakeup under ep->mtx? If so, why? 
> @@ -829,15 +831,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, > return 0; > } > > +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, > + poll_table *pt); > + > +struct readyevents_arg { > + struct eventpoll *ep; > + int locked; > +}; > + > static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests) > { OK, I will bite... What is constraining the arguments of the ep_poll_readyevents_proc()? I don't see any use of it except in this file. Nor do I see any use of ep_call_nested() except in this file. Might it be a bit cleaner to add a flag to the argument list? I don't feel strongly about this, just asking the question. > - return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 1); > + struct readyevents_arg *arg = (struct readyevents_arg *)priv; > + > + return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL, > + call_nests + 1, arg->locked); > } > > static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) > { > int pollflags; > struct eventpoll *ep = file->private_data; > + struct readyevents_arg arg; > + > + /* > + * During ep_insert() we already hold the ep->mtx for the tfile. > + * Prevent re-aquisition. > + */ > + arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0); > + arg.ep = ep; > > /* Insert inside our poll wait queue */ > poll_wait(file, &ep->poll_wait, wait); > @@ -849,7 +870,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) > * could re-enter here. > */ > pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS, > - ep_poll_readyevents_proc, ep, ep, current); > + ep_poll_readyevents_proc, &arg, ep, current); > > return pollflags != -1 ? pollflags : 0; > } > @@ -1250,7 +1271,7 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi) > * Must be called with "mtx" held. > */ > static int ep_insert(struct eventpoll *ep, struct epoll_event *event, > - struct file *tfile, int fd) > + struct file *tfile, int fd, int full_check) > { > int error, revents, pwake = 0; > unsigned long flags; > @@ -1318,7 +1339,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, > > /* now check if we've created too many backpaths */ > error = -EINVAL; > - if (reverse_path_check()) > + if (full_check && reverse_path_check()) > goto error_remove_epi; > > /* We have to drop the new item inside our item list to keep track of it */ > @@ -1542,7 +1563,7 @@ static int ep_send_events(struct eventpoll *ep, > esed.maxevents = maxevents; > esed.events = events; > > - return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0); > + return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, 0); > } > > static inline struct timespec ep_set_mstimeout(long ms) > @@ -1813,11 +1834,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, > struct epoll_event __user *, event) > { > int error; > - int did_lock_epmutex = 0; > + int full_check = 0; > struct fd f, tf; > struct eventpoll *ep; > struct epitem *epi; > struct epoll_event epds; > + struct eventpoll *tep = NULL; > > error = -EFAULT; > if (ep_op_has_event(op) && > @@ -1866,23 +1888,40 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, > * and hang them on the tfile_check_list, so we can check that we > * haven't created too many possible wakeup paths. > * > - * We need to hold the epmutex across ep_insert to prevent > - * multple adds from creating loops in parallel. 
> + * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when > + * the epoll file descriptor is attaching directly to a wakeup source, > + * unless the epoll file descriptor is nested. The purpose of taking the > + * 'epmutex' on add is to prevent complex toplogies such as loops and > + * deep wakeup paths from forming in parallel through multiple > + * EPOLL_CTL_ADD operations. > */ > + mutex_lock_nested(&ep->mtx, 0); > if (op == EPOLL_CTL_ADD) { > - mutex_lock(&epmutex); > - did_lock_epmutex = 1; > - if (is_file_epoll(tf.file)) { > - error = -ELOOP; > - if (ep_loop_check(ep, tf.file) != 0) { > - clear_tfile_check_list(); > - goto error_tgt_fput; > + if (!list_empty(&f.file->f_ep_links) || > + is_file_epoll(tf.file)) { > + full_check = 1; > + mutex_unlock(&ep->mtx); > + mutex_lock(&epmutex); > + if (is_file_epoll(tf.file)) { > + error = -ELOOP; > + if (ep_loop_check(ep, tf.file) != 0) { > + clear_tfile_check_list(); > + goto error_tgt_fput; > + } > + } else > + list_add(&tf.file->f_tfile_llink, > + &tfile_check_list); > + mutex_lock_nested(&ep->mtx, 0); > + if (is_file_epoll(tf.file)) { > + tep = tf.file->private_data; > + mutex_lock_nested(&tep->mtx, 1); > } > - } else > - list_add(&tf.file->f_tfile_llink, &tfile_check_list); > + } > + } > + if (op == EPOLL_CTL_DEL && is_file_epoll(tf.file)) { > + tep = tf.file->private_data; > + mutex_lock_nested(&tep->mtx, 1); > } > - > - mutex_lock_nested(&ep->mtx, 0); > > /* > * Try to lookup the file inside our RB tree, Since we grabbed "mtx" > @@ -1896,10 +1935,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, > case EPOLL_CTL_ADD: > if (!epi) { > epds.events |= POLLERR | POLLHUP; > - error = ep_insert(ep, &epds, tf.file, fd); > + error = ep_insert(ep, &epds, tf.file, fd, full_check); > } else > error = -EEXIST; > - clear_tfile_check_list(); > + if (full_check) > + clear_tfile_check_list(); > break; > case EPOLL_CTL_DEL: > if (epi) > @@ -1915,10 +1955,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, > error = -ENOENT; > break; > } > + if (tep != NULL) > + mutex_unlock(&tep->mtx); > mutex_unlock(&ep->mtx); > > error_tgt_fput: > - if (did_lock_epmutex) > + if (full_check) > mutex_unlock(&epmutex); > > fdput(tf); > -- > 1.8.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 10+ messages in thread
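Paul's aside about using a double-underscore variant instead of threading an 'ep_locked' flag refers to a common kernel refactoring pattern. The sketch below shows the shape that alternative would take if it sat in fs/eventpoll.c; it is illustrative only, not a proposed replacement, and it leaves aside Paul's separate question about where the deferred wakeup should happen.

/*
 * Hypothetical reshuffle: callers that already hold ep->mtx (the ep_insert()
 * -> ep_eventpoll_poll() path) would call __ep_scan_ready_list() directly,
 * everyone else keeps the locking wrapper, and no 'ep_locked' flag has to be
 * threaded through the call chain.
 */
static int __ep_scan_ready_list(struct eventpoll *ep,
				int (*sproc)(struct eventpoll *,
					     struct list_head *, void *),
				void *priv, int depth)
{
	/*
	 * Body of today's ep_scan_ready_list(), minus the
	 * mutex_lock_nested()/mutex_unlock() pair.
	 */
	return 0;
}

static int ep_scan_ready_list(struct eventpoll *ep,
			      int (*sproc)(struct eventpoll *,
					   struct list_head *, void *),
			      void *priv, int depth)
{
	int error;

	mutex_lock_nested(&ep->mtx, depth);
	error = __ep_scan_ready_list(ep, sproc, priv, depth);
	mutex_unlock(&ep->mtx);
	return error;
}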
* Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies 2013-10-24 10:09 ` Paul E. McKenney @ 2013-10-24 16:00 ` Jason Baron 0 siblings, 0 replies; 10+ messages in thread From: Jason Baron @ 2013-10-24 16:00 UTC (permalink / raw) To: paulmck@linux.vnet.ibm.com Cc: akpm@linux-foundation.org, normalperson@yhbt.net, nzimmer@sgi.com, viro@zeniv.linux.org.uk, nelhage@nelhage.com, davidel@xmailserver.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org On 10/24/2013 06:09 AM, Paul E. McKenney wrote: > On Tue, Oct 01, 2013 at 05:08:14PM +0000, Jason Baron wrote: >> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached >> directly to a wakeup source, we do not need to take the global 'epmutex', >> unless the epoll file descriptor is nested. The purpose of taking >> the 'epmutex' on add is to prevent complex topologies such as loops and >> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD >> operations. However, for the simple case of an epoll file descriptor >> attached directly to a wakeup source (with no nesting), we do not need >> to hold the 'epmutex'. >> >> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves >> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb >> performance: >> >> " >> On the 16 socket run the performance went from 35k jOPS to 125k jOPS. >> In addition the benchmark when from scaling well on 10 sockets to scaling well >> on just over 40 sockets. >> >> ... >> >> Currently the benchmark stops scaling at around 40-44 sockets but it seems like >> I found a second unrelated bottleneck. >> " > > Some questions and comments below. > >> Tested-by: Nathan Zimmer <nzimmer@sgi.com> >> Signed-off-by: Jason Baron <jbaron@akamai.com> >> --- >> fs/eventpoll.c | 94 ++++++++++++++++++++++++++++++++++++++++++---------------- >> 1 file changed, 68 insertions(+), 26 deletions(-) >> >> diff --git a/fs/eventpoll.c b/fs/eventpoll.c >> index dd9fae1..0f25162 100644 >> --- a/fs/eventpoll.c >> +++ b/fs/eventpoll.c >> @@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi) >> static int ep_scan_ready_list(struct eventpoll *ep, >> int (*sproc)(struct eventpoll *, >> struct list_head *, void *), >> - void *priv, >> - int depth) >> + void *priv, int depth, int ep_locked) >> { >> int error, pwake = 0; >> unsigned long flags; >> @@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep, >> * We need to lock this because we could be hit by >> * eventpoll_release_file() and epoll_ctl(). >> */ >> - mutex_lock_nested(&ep->mtx, depth); >> + >> + if (!ep_locked) >> + mutex_lock_nested(&ep->mtx, depth); >> >> /* >> * Steal the ready list, and re-init the original one to the >> @@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep, >> } >> spin_unlock_irqrestore(&ep->lock, flags); >> >> - mutex_unlock(&ep->mtx); >> + if (!ep_locked) >> + mutex_unlock(&ep->mtx); >> >> /* We have to call this outside the lock */ >> if (pwake) > > OK, allowing calls to ep_scan_ready_list() to be made under the lock. > > Usually you would use a __ep_scan_ready_list() instead of adding the > flag, but you do have the wakeup that needs to be outside of the lock. > > But doesn't that mean that you need something like? > > if (pwake && !ep_locked) > > And then wouldn't you also need some way to inform the caller to > do the wakeup after releasing ep->mtx? > > Or is it now safe to do the wakeup under ep->mtx? If so, why? 
> All I was trying to do here was to preserve the locking ep->mtx as it was prior to this patch series. Because the patches now take the 'destination' ep->mtx on an epoll_ctl() operation. It might be already held leading to double locking (i hit this case while testing this series). So the idea here is simply to realize when we've already grabbed this ep->mtx. >> @@ -829,15 +831,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, >> return 0; >> } >> >> +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, >> + poll_table *pt); >> + >> +struct readyevents_arg { >> + struct eventpoll *ep; >> + int locked; >> +}; >> + >> static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests) >> { > > OK, I will bite... What is constraining the arguments of the > ep_poll_readyevents_proc()? I don't see any use of it except in this > file. Nor do I see any use of ep_call_nested() except in this file. > Might it be a bit cleaner to add a flag to the argument list? > > I don't feel strongly about this, just asking the question. > I don't feel too strongly either, but other users of ep_call_nested don't need this either... Thanks, -Jason ^ permalink raw reply [flat|nested] 10+ messages in thread
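Taking the destination ep's mutex while already holding the caller's means holding two locks of the same class, which is why the patch passes explicit lockdep subclasses (ep->mtx at 0, tep->mtx at 1). The standalone sketch below is an assumed illustration of that annotation pattern, not the eventpoll code; the 'ctx' and 'demo' names are invented.

#include <linux/module.h>
#include <linux/mutex.h>

/* Two instances of the same lock class, always taken parent-then-child. */
struct ctx {
	struct mutex mtx;
};

static struct ctx parent, child;

/*
 * Without the subclass annotation lockdep would flag the second acquisition
 * as recursive locking of the same class. This mirrors epoll_ctl() taking
 * ep->mtx at subclass 0 and then the target tep->mtx at subclass 1.
 */
static void link_child(struct ctx *p, struct ctx *c)
{
	mutex_lock_nested(&p->mtx, 0);
	mutex_lock_nested(&c->mtx, 1);
	/* ... manipulate both instances ... */
	mutex_unlock(&c->mtx);
	mutex_unlock(&p->mtx);
}

static int __init nested_demo_init(void)
{
	mutex_init(&parent.mtx);
	mutex_init(&child.mtx);
	link_child(&parent, &child);
	return 0;
}

static void __exit nested_demo_exit(void)
{
}

module_init(nested_demo_init);
module_exit(nested_demo_exit);
MODULE_LICENSE("GPL");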