All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Jason Baron <jbaron@redhat.com>
Cc: davidel@xmailserver.org, nelhage@ksplice.com,
	torvalds@linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC] epoll: limit paths
Date: Fri, 2 Sep 2011 14:49:26 -0700	[thread overview]
Message-ID: <20110902144926.69a00396.akpm@linux-foundation.org> (raw)
In-Reply-To: <20110902185922.GA8770@redhat.com>

On Fri, 2 Sep 2011 14:59:23 -0400
Jason Baron <jbaron@redhat.com> wrote:

> The current epoll code can be tickled to run basically indefinitely in both
> loop detection path check (on ep_insert()), and in the wakeup paths. The
> programs that tickle this behavior set up deeply linked networks of epoll
> file descriptors that cause the epoll algorithms to traverse them indefinitely.
> A couple of these sample programs have been previously posted in this thread:
> https://lkml.org/lkml/2011/2/25/297.
> 
> To fix the loop detection path check algorithms, I simply keep track of the
> epoll nodes that have been already visited. Thus, the loop detection
> becomes proportional to the number of epoll file descriptor and links. This
> dramatically decreases the run-time of the loop check algorithm. In one
> diabolical case I tried it reduced the run-time from 15 mintues
> (all in kernel time) to .3 seconds.
> 
> Fixing the wakeup paths could be done at wakeup time in a similar manner by
> keeping track of nodes that have already been visited, but the complexity is
> harder, since there can be multiple wakeups on different cpus...Thus, I've
> opted to limit the number of possible wakeup paths when the paths are created.
> 
> This is accomplished, by noting that the end file descriptor points that are
> found during the loop detection pass (from the newly added link), are actually
> the sources for wakeup events. I keep a list of these file descriptors and
> limit the number and length of these paths that emanate from these 'source file
> descriptors'. In the current implemetation I allow 1000 paths of length 1,
> 500 of length 2, 100 of length 3, 50 of length 4 and 10 of length 5. Note that
> it is sufficient to check the 'source file descriptors' reachable from the newly
> added link, since no other 'source file descriptors' will have newly added
> links. This allows us to check only the wakeup paths that may have gotten too
> long, and not re-check all possible wakeup paths on the system.
> 
> In terms of the path limit selection, I think its first worth noting that the
> most common case for epoll, is probably the model where you have 1 epoll file
> descriptor that is monitoring n number of 'source file descriptors'. In this
> case, each 'source file descriptor' has a 1 path of length 1. Thus, I believe
> that the limits I'm proposing are quite reasonable and in fact may be too
> generous. Thus, I'm hoping that the proposed limits will not prevent any
> workloads that currently work to fail.
> 
> In terms of locking, I have extended the use of the 'epmutex' to all epoll_ctl
> add and remove operations. Currently its only used in a subset of the add paths.
> I need to hold the epmutex, so that we can correctly traverse a coherent graph,
> to check the number of paths. I believe that this additional locking is
> probably ok, since its in the setup/teardown paths, and doesn't affect the
> running paths, but it certainly is going to add some extra overhead. Also,
> worth noting is that the epmuex was recently added to the ep_ctl add operations
> in the initial path loop detection code using the argument that it was
> not on a critical path.
> 
> Another thing to note here, is the length of epoll chains that is allowed.
> Currently, eventpoll.c defines:
> 
> /* Maximum number of nesting allowed inside epoll sets */
> #define EP_MAX_NESTS 4
> 
> This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS + 1).
> However, this limit is currently only enforced during the loop check detection
> code, and only when the epoll file descriptors are added in a certain order.
> Thus, this limit is currently easily bypassed. The newly added check for wakeup
> paths, stricly limits the wakeup paths to a length of 5, regardless of the order
> in which ep's are linked together. Thus, a side-effect of the new code is a more
> consistent enforcement of the graph depth.
> 
> Thus far, I've tested this, using the sample programs previously mentioned,
> which now either return quickly or return -EINVAL. I've also testing using
> the piptest.c epoll tester, which showed no difference in performance. I've
> also created a number of different epoll networks and tested that they behave
> as expectdd.
> 
> I believe this solves the original diabolical test cases, while still preserving
> the sane epoll nesting.
> 

Cool, thanks for working on this - it is rather a stinker.

I don't think we have any maintained public test code for epoll?  And I
trust you have some?  It would be good if you could merge whatever you
have into the main kernel.  Then each time we fix bugs or add features,
I can harrass people to update the test harness to track the changes.

A number of minor things:

> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -188,6 +188,12 @@ struct eventpoll {
>  
>  	/* The user that created the eventpoll descriptor */
>  	struct user_struct *user;
> +
> +	struct file *file;
> +
> +	/* used to optimize loop detection check */
> +	int visited;
> +	struct list_head visitedllink;

Strange name.  Can we improve this?  Something like visited_loop_link?  

>  };
>  
>  /* Wait structure used by the poll hooks */
> @@ -246,6 +252,12 @@ static struct kmem_cache *epi_cache __read_mostly;
>  /* Slab cache used to allocate "struct eppoll_entry" */
>  static struct kmem_cache *pwq_cache __read_mostly;
>  
> +/* Visited nodes during ep_loop_check(), so we can unset them when we finish */
> +LIST_HEAD(visited_list);

static

> +/* Files with newly added links, which need a limit on emanating paths */
> +LIST_HEAD(tfile_check_list);

static

Add a comment describing the locking for this.

That locking will need to be kernel-wide, which might have scalability
issues?

>
> ...
>
> @@ -915,6 +927,96 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
>  	rb_insert_color(&epi->rbn, &ep->rbr);
>  }
>  
> +
> +
> +#define PATH_ARR_SIZE 5
> +/* These are the number paths of length 1 to 5, that we are allowing to emanate

The conventional comment layout is

/*
 * These are the number paths of length 1 to 5, that we are allowing to emanate

> + * from a single file of interest. For example, we allow 1000 paths of length
> + * 1, to emanate from each file of interest. This essentially represents the
> + * potential wakeup paths, which need to be limited in order to avoid massive
> + * uncontrolled wakeup storms. The common use case should be a single ep which
> + * is connected to n file sources. In this case each file source has 1 path
> + * of length 1. Thus, the numbers below should be more than sufficient.
> + */
> +int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };

static const

> +int path_count[PATH_ARR_SIZE];

static

Add a comment describing the locking which protects this.

That lock will necessarily be kernel-wide.  Seems nasty.  What are the
implications of this?

>
> ...
>
> @@ -1264,18 +1379,35 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
>  	int error = 0;
>  	struct file *file = priv;
>  	struct eventpoll *ep = file->private_data;
> +	struct eventpoll *ep_tovisit;
>  	struct rb_node *rbp;
>  	struct epitem *epi;
>  
>  	mutex_lock(&ep->mtx);
> +	ep->visited = 1;
> +	list_add(&ep->visitedllink, &visited_list);
>  	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
>  		epi = rb_entry(rbp, struct epitem, rbn);
>  		if (unlikely(is_file_epoll(epi->ffd.file))) {
> +			ep_tovisit = epi->ffd.file->private_data;
> +			if (ep_tovisit->visited)
> +				continue;
>  			error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
>  					       ep_loop_check_proc, epi->ffd.file,
> -					       epi->ffd.file->private_data, current);
> +					       ep_tovisit, current);
>  			if (error != 0)
>  				break;
> +		} else {
> +			/* if we've reached a file that is not associated with

/*
 * If

> +			 * an ep, then then we need to check if the newly added

s/then //

> +			 * links are going to add too many wakeup paths. We do
> +			 * this by adding it to the tfile_check_list, if it's
> +			 * not already there, and calling reverse_path_check()
> +			 * during ep_insert()
> +			 */
> +			if (list_empty(&epi->ffd.file->f_tfile_llink))
> +				list_add(&epi->ffd.file->f_tfile_llink,
> +					 &tfile_check_list);

I guess that tfile_check_list is protected by the big epmutex lock.

I assume you've verified that all paths that lead to manipulation of
all these new globals reliably take that lock.

>  		}
>  	}
>  	mutex_unlock(&ep->mtx);
>
> ...
>


  reply	other threads:[~2011-09-02 21:49 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-09-02 18:59 [PATCH RFC] epoll: limit paths Jason Baron
2011-09-02 21:49 ` Andrew Morton [this message]
2011-09-06 13:36   ` Jason Baron
2011-09-06 19:45     ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110902144926.69a00396.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=davidel@xmailserver.org \
    --cc=jbaron@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nelhage@ksplice.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.