From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756156Ab1IBVth (ORCPT ); Fri, 2 Sep 2011 17:49:37 -0400 Received: from smtp1.linux-foundation.org ([140.211.169.13]:35999 "EHLO smtp1.linux-foundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756077Ab1IBVtg (ORCPT ); Fri, 2 Sep 2011 17:49:36 -0400 Date: Fri, 2 Sep 2011 14:49:26 -0700 From: Andrew Morton To: Jason Baron Cc: davidel@xmailserver.org, nelhage@ksplice.com, torvalds@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH RFC] epoll: limit paths Message-Id: <20110902144926.69a00396.akpm@linux-foundation.org> In-Reply-To: <20110902185922.GA8770@redhat.com> References: <20110902185922.GA8770@redhat.com> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2 Sep 2011 14:59:23 -0400 Jason Baron wrote: > The current epoll code can be tickled to run basically indefinitely in both > loop detection path check (on ep_insert()), and in the wakeup paths. The > programs that tickle this behavior set up deeply linked networks of epoll > file descriptors that cause the epoll algorithms to traverse them indefinitely. > A couple of these sample programs have been previously posted in this thread: > https://lkml.org/lkml/2011/2/25/297. > > To fix the loop detection path check algorithms, I simply keep track of the > epoll nodes that have been already visited. Thus, the loop detection > becomes proportional to the number of epoll file descriptor and links. This > dramatically decreases the run-time of the loop check algorithm. In one > diabolical case I tried it reduced the run-time from 15 mintues > (all in kernel time) to .3 seconds. > > Fixing the wakeup paths could be done at wakeup time in a similar manner by > keeping track of nodes that have already been visited, but the complexity is > harder, since there can be multiple wakeups on different cpus...Thus, I've > opted to limit the number of possible wakeup paths when the paths are created. > > This is accomplished, by noting that the end file descriptor points that are > found during the loop detection pass (from the newly added link), are actually > the sources for wakeup events. I keep a list of these file descriptors and > limit the number and length of these paths that emanate from these 'source file > descriptors'. In the current implemetation I allow 1000 paths of length 1, > 500 of length 2, 100 of length 3, 50 of length 4 and 10 of length 5. Note that > it is sufficient to check the 'source file descriptors' reachable from the newly > added link, since no other 'source file descriptors' will have newly added > links. This allows us to check only the wakeup paths that may have gotten too > long, and not re-check all possible wakeup paths on the system. > > In terms of the path limit selection, I think its first worth noting that the > most common case for epoll, is probably the model where you have 1 epoll file > descriptor that is monitoring n number of 'source file descriptors'. In this > case, each 'source file descriptor' has a 1 path of length 1. Thus, I believe > that the limits I'm proposing are quite reasonable and in fact may be too > generous. Thus, I'm hoping that the proposed limits will not prevent any > workloads that currently work to fail. > > In terms of locking, I have extended the use of the 'epmutex' to all epoll_ctl > add and remove operations. Currently its only used in a subset of the add paths. > I need to hold the epmutex, so that we can correctly traverse a coherent graph, > to check the number of paths. I believe that this additional locking is > probably ok, since its in the setup/teardown paths, and doesn't affect the > running paths, but it certainly is going to add some extra overhead. Also, > worth noting is that the epmuex was recently added to the ep_ctl add operations > in the initial path loop detection code using the argument that it was > not on a critical path. > > Another thing to note here, is the length of epoll chains that is allowed. > Currently, eventpoll.c defines: > > /* Maximum number of nesting allowed inside epoll sets */ > #define EP_MAX_NESTS 4 > > This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS + 1). > However, this limit is currently only enforced during the loop check detection > code, and only when the epoll file descriptors are added in a certain order. > Thus, this limit is currently easily bypassed. The newly added check for wakeup > paths, stricly limits the wakeup paths to a length of 5, regardless of the order > in which ep's are linked together. Thus, a side-effect of the new code is a more > consistent enforcement of the graph depth. > > Thus far, I've tested this, using the sample programs previously mentioned, > which now either return quickly or return -EINVAL. I've also testing using > the piptest.c epoll tester, which showed no difference in performance. I've > also created a number of different epoll networks and tested that they behave > as expectdd. > > I believe this solves the original diabolical test cases, while still preserving > the sane epoll nesting. > Cool, thanks for working on this - it is rather a stinker. I don't think we have any maintained public test code for epoll? And I trust you have some? It would be good if you could merge whatever you have into the main kernel. Then each time we fix bugs or add features, I can harrass people to update the test harness to track the changes. A number of minor things: > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -188,6 +188,12 @@ struct eventpoll { > > /* The user that created the eventpoll descriptor */ > struct user_struct *user; > + > + struct file *file; > + > + /* used to optimize loop detection check */ > + int visited; > + struct list_head visitedllink; Strange name. Can we improve this? Something like visited_loop_link? > }; > > /* Wait structure used by the poll hooks */ > @@ -246,6 +252,12 @@ static struct kmem_cache *epi_cache __read_mostly; > /* Slab cache used to allocate "struct eppoll_entry" */ > static struct kmem_cache *pwq_cache __read_mostly; > > +/* Visited nodes during ep_loop_check(), so we can unset them when we finish */ > +LIST_HEAD(visited_list); static > +/* Files with newly added links, which need a limit on emanating paths */ > +LIST_HEAD(tfile_check_list); static Add a comment describing the locking for this. That locking will need to be kernel-wide, which might have scalability issues? > > ... > > @@ -915,6 +927,96 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi) > rb_insert_color(&epi->rbn, &ep->rbr); > } > > + > + > +#define PATH_ARR_SIZE 5 > +/* These are the number paths of length 1 to 5, that we are allowing to emanate The conventional comment layout is /* * These are the number paths of length 1 to 5, that we are allowing to emanate > + * from a single file of interest. For example, we allow 1000 paths of length > + * 1, to emanate from each file of interest. This essentially represents the > + * potential wakeup paths, which need to be limited in order to avoid massive > + * uncontrolled wakeup storms. The common use case should be a single ep which > + * is connected to n file sources. In this case each file source has 1 path > + * of length 1. Thus, the numbers below should be more than sufficient. > + */ > +int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 }; static const > +int path_count[PATH_ARR_SIZE]; static Add a comment describing the locking which protects this. That lock will necessarily be kernel-wide. Seems nasty. What are the implications of this? > > ... > > @@ -1264,18 +1379,35 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests) > int error = 0; > struct file *file = priv; > struct eventpoll *ep = file->private_data; > + struct eventpoll *ep_tovisit; > struct rb_node *rbp; > struct epitem *epi; > > mutex_lock(&ep->mtx); > + ep->visited = 1; > + list_add(&ep->visitedllink, &visited_list); > for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) { > epi = rb_entry(rbp, struct epitem, rbn); > if (unlikely(is_file_epoll(epi->ffd.file))) { > + ep_tovisit = epi->ffd.file->private_data; > + if (ep_tovisit->visited) > + continue; > error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS, > ep_loop_check_proc, epi->ffd.file, > - epi->ffd.file->private_data, current); > + ep_tovisit, current); > if (error != 0) > break; > + } else { > + /* if we've reached a file that is not associated with /* * If > + * an ep, then then we need to check if the newly added s/then // > + * links are going to add too many wakeup paths. We do > + * this by adding it to the tfile_check_list, if it's > + * not already there, and calling reverse_path_check() > + * during ep_insert() > + */ > + if (list_empty(&epi->ffd.file->f_tfile_llink)) > + list_add(&epi->ffd.file->f_tfile_llink, > + &tfile_check_list); I guess that tfile_check_list is protected by the big epmutex lock. I assume you've verified that all paths that lead to manipulation of all these new globals reliably take that lock. > } > } > mutex_unlock(&ep->mtx); > > ... >