* [RFC PATCH 0/5] Remove global locks from epoll
From: Jason Baron @ 2015-01-15 21:02 UTC (permalink / raw)
To: akpm, famz
Cc: normalperson, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
Hi,
There are a number of 'global' locks taken in epoll. The first two patches
stop taking these in the poll() and wakeup paths, since we already prevent
loops and excessive wakeup paths during EPOLL_CTL_ADD. The final three
introduce the idea of breaking up the current global 'epmutex' by keeping
track of the files that make up a connected epoll set. In this way we can
limit locking to within a connected component, as opposed to the current
design, which serializes globally. This will mostly help workloads that are
doing deeper than one level of epoll nesting (which may not be that common),
since we currently don't take the 'epmutex' for EPOLL_CTL_ADD when there is
only one level of nesting. However, we do eliminate the global 'epmutex'
from the close() path. There is more detail on this in the patch
descriptions.
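To make the 'connected epoll set' idea concrete, here's a minimal userspace
sketch (illustration only, not part of the patches; fd names and the lack of
error handling are just for brevity). Once two epoll fds watch the same pipe,
all three files form one component, and topology checks or close() on any of
them would only need to serialize within that component rather than on the
global 'epmutex':

#include <sys/epoll.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	int pipefd[2];
	int ep_a, ep_b;
	struct epoll_event ev = { .events = EPOLLIN };

	if (pipe(pipefd) < 0)
		return 1;
	ep_a = epoll_create1(0);
	ep_b = epoll_create1(0);

	/* Both epoll instances watch the same pipe read end... */
	ev.data.fd = pipefd[0];
	epoll_ctl(ep_a, EPOLL_CTL_ADD, pipefd[0], &ev);
	epoll_ctl(ep_b, EPOLL_CTL_ADD, pipefd[0], &ev);

	/*
	 * ...so ep_a, ep_b and the pipe now belong to a single connected
	 * component and would share one per-component mutex.
	 */
	printf("component: { ep_a=%d, ep_b=%d, pipe=%d }\n",
	       ep_a, ep_b, pipefd[0]);
	return 0;
}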
One aspect that I'd like to improve is that these patches add a 'struct
list_head' and a pointer to 'struct file', so essentially three pointers. I
think there are ways to reduce this (for example, by using singly-linked
lists where appropriate), but I wanted to get general feedback on the
approach before attempting this.
Previous changes in this area showed really good improvements on SPECjbb;
see commit 67347fe4e6326338ee217d7eb826bedf30b2e155. However, I don't think
this patch series will improve things that much, as it's mostly going to
help deeper epoll nesting setups. The motivation for these patches was that
it was bothering me that we were still taking global locks in some fairly
hot paths (namely close()).
Patches are fairly lightly tested at this point (and need a lot more testing),
but I'm not aware of any outstanding issues.
Finally, I'd also like to potentially co-ordinate this series with the recent
syscall enhancements from Fam Zheng: http://lwn.net/Articles/628828/ since these
patches are somewhat invasive.
Thanks,
-Jason
Jason Baron (5):
epoll: Remove ep_call_nested() from ep_eventpoll_poll()
epoll: Remove ep_call_nested() from ep_poll_safewake()
epoll: Add ep_call_nested_nolock()
epoll: Allow topology checks to be parallelized
epoll: Introduce epoll connected components (remove the epmutex)
fs/eventpoll.c | 596 +++++++++++++++++++++++++++++++---------------
include/linux/eventpoll.h | 52 ++--
include/linux/fs.h | 3 +
3 files changed, 442 insertions(+), 209 deletions(-)
--
1.8.2.rc2
* [RFC PATCH 1/5] epoll: remove ep_call_nested() from ep_eventpoll_poll()
From: Jason Baron @ 2015-01-15 21:02 UTC (permalink / raw)
To: akpm, famz
Cc: normalperson, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
ep_call_nested() is used in ep_eventpoll_poll(), the .poll routine for an
epoll fd, to prevent excessively deep epoll nesting and circular paths.
However, we already prevent circular paths during EPOLL_CTL_ADD. As for
overly deep epoll chains: we do allow the epoll fds themselves to be nested
deeper than EP_MAX_NESTS, but we don't allow more than EP_MAX_NESTS levels
once an epoll file descriptor is actually connected to a wakeup source.
Thus, any chain nested deeper than EP_MAX_NESTS must terminate after only
one level of nesting during poll(), since poll() is called through
ep_scan_ready_list(), which only recurses into deeper levels if there are
events available, and with no wakeup sources attached that is not possible.

Removing ep_call_nested() reduces the function call stack depth for deep
epoll chains, and in addition avoids acquiring the global spinlock that
ep_call_nested() used.
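As a quick userspace illustration of the behavior relied on above (not part
of the patch; error handling is mostly omitted): readiness on a nested epoll
fd is only reported once an underlying wakeup source actually has events, so
with nothing pending there is nothing for the ready-list scan to recurse
into:

#include <sys/epoll.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	int pipefd[2];
	int inner = epoll_create1(0);
	int outer = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN };

	if (pipe(pipefd) < 0)
		return 1;
	ev.data.fd = pipefd[0];
	epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev);  /* pipe -> inner */
	ev.data.fd = inner;
	epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);      /* inner -> outer */

	/* Nothing written yet: polling the outer fd finds no ready events. */
	printf("before write: %d ready\n", epoll_wait(outer, &ev, 1, 0));

	/* Make the underlying wakeup source ready... */
	if (write(pipefd[1], "x", 1) != 1)
		return 1;

	/* ...and readiness now shows through the nested epoll fd. */
	printf("after write: %d ready\n", epoll_wait(outer, &ev, 1, 0));
	return 0;
}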
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
fs/eventpoll.c | 59 +++++++++++++++++++++++-----------------------------------
1 file changed, 23 insertions(+), 36 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..e7b0c9e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -268,9 +268,6 @@ static struct nested_calls poll_loop_ncalls;
/* Used for safe wake up implementation */
static struct nested_calls poll_safewake_ncalls;
-/* Used to call file's f_op->poll() under the nested calls boundaries */
-static struct nested_calls poll_readywalk_ncalls;
-
/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;
@@ -793,6 +790,9 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file)
return 0;
}
+static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
+ void *priv);
+
static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
pt->_key = epi->event.events;
@@ -800,16 +800,31 @@ static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}
+static inline unsigned int ep_item_poll_recurse(struct epitem *epi,
+ poll_table *pt, int depth)
+{
+ pt->_key = epi->event.events;
+
+ if (is_file_epoll(epi->ffd.file)) {
+ depth++;
+ return ep_scan_ready_list(epi->ffd.file->private_data,
+ ep_read_events_proc, &depth, depth,
+ false) & epi->event.events;
+ }
+ return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
+}
+
static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
void *priv)
{
struct epitem *epi, *tmp;
poll_table pt;
+ int depth = *(int *)priv;
init_poll_funcptr(&pt, NULL);
list_for_each_entry_safe(epi, tmp, head, rdllink) {
- if (ep_item_poll(epi, &pt))
+ if (ep_item_poll_recurse(epi, &pt, depth))
return POLLIN | POLLRDNORM;
else {
/*
@@ -828,45 +843,20 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
poll_table *pt);
-struct readyevents_arg {
- struct eventpoll *ep;
- bool locked;
-};
-
-static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
-{
- struct readyevents_arg *arg = priv;
-
- return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL,
- call_nests + 1, arg->locked);
-}
-
static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
{
- int pollflags;
struct eventpoll *ep = file->private_data;
- struct readyevents_arg arg;
-
- /*
- * During ep_insert() we already hold the ep->mtx for the tfile.
- * Prevent re-aquisition.
- */
- arg.locked = wait && (wait->_qproc == ep_ptable_queue_proc);
- arg.ep = ep;
+ int depth = 0;
/* Insert inside our poll wait queue */
poll_wait(file, &ep->poll_wait, wait);
/*
* Proceed to find out if wanted events are really available inside
- * the ready list. This need to be done under ep_call_nested()
- * supervision, since the call to f_op->poll() done on listed files
- * could re-enter here.
+ * the ready list.
*/
- pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS,
- ep_poll_readyevents_proc, &arg, ep, current);
-
- return pollflags != -1 ? pollflags : 0;
+ return ep_scan_ready_list(ep, ep_read_events_proc, &depth, depth,
+ wait && (wait->_qproc == ep_ptable_queue_proc));
}
#ifdef CONFIG_PROC_FS
@@ -2111,9 +2101,6 @@ static int __init eventpoll_init(void)
/* Initialize the structure used to perform safe poll wait head wake ups */
ep_nested_calls_init(&poll_safewake_ncalls);
- /* Initialize the structure used to perform file's f_op->poll() calls */
- ep_nested_calls_init(&poll_readywalk_ncalls);
-
/*
* We can have many thousands of epitems, so prevent this from
* using an extra cache line on 64-bit (and smaller) CPUs
--
1.8.2.rc2
* [RFC PATCH 2/5] epoll: remove ep_call_nested() from ep_poll_safewake()
From: Jason Baron @ 2015-01-15 21:02 UTC (permalink / raw)
To: akpm, famz
Cc: normalperson, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
The ep_poll_safewake() function is used to wake up potentially nested epoll
file descriptors. It uses ep_call_nested() to prevent entering the same
wait queue more than once, and to prevent excessively deep wakeup paths
(deeper than EP_MAX_NESTS). However, this is not necessary since we are
already preventing these conditions during EPOLL_CTL_ADD. Dropping it saves
extra function calls and avoids taking the global lock inside
ep_call_nested().

I have, however, kept ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC case,
since ep_call_nested() keeps track of the nesting level, and this is required
by the call to spin_lock_irqsave_nested(). It would be nice to remove the
ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as well, but
it's not clear how to simply pass the nesting level through multiple
wake_up() levels. In any case, I don't think CONFIG_DEBUG_LOCK_ALLOC is
generally used in production, so this should provide a decent speedup for
the cases that we care about.
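For reference, this is the wakeup path in question as seen from userspace (a
hedged sketch, not part of the patch): a waiter blocked on the outer epoll fd
must be woken when an event arrives on a source watched only by the inner,
nested epoll fd, and that propagation is exactly what ep_poll_safewake()
performs:

#include <sys/epoll.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

static int pipefd[2];

static void *writer(void *arg)
{
	(void)arg;
	sleep(1);				/* let the main thread block first */
	if (write(pipefd[1], "x", 1) != 1)	/* wakes inner, which wakes outer */
		return NULL;
	return NULL;
}

int main(void)
{
	int inner = epoll_create1(0);
	int outer = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN };
	pthread_t t;

	if (pipe(pipefd) < 0)
		return 1;
	ev.data.fd = pipefd[0];
	epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev);
	ev.data.fd = inner;
	epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);

	pthread_create(&t, NULL, writer, NULL);

	/* Blocks on the outer fd; the wakeup from the pipe must propagate up. */
	printf("woken with %d event(s)\n", epoll_wait(outer, &ev, 1, -1));
	pthread_join(t, NULL);
	return 0;
}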
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
1 file changed, 18 insertions(+), 29 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e7b0c9e..2864d67 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -265,9 +265,6 @@ static DEFINE_MUTEX(epmutex);
/* Used to check for epoll file descriptor inclusion loops */
static struct nested_calls poll_loop_ncalls;
-/* Used for safe wake up implementation */
-static struct nested_calls poll_safewake_ncalls;
-
/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;
@@ -466,40 +463,21 @@ out_unlock:
* this special case of epoll.
*/
#ifdef CONFIG_DEBUG_LOCK_ALLOC
-static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
- unsigned long events, int subclass)
+
+static struct nested_calls poll_safewake_ncalls;
+
+static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
{
unsigned long flags;
+ wait_queue_head_t *wqueue = (wait_queue_head_t *)cookie;
- spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
- wake_up_locked_poll(wqueue, events);
+ spin_lock_irqsave_nested(&wqueue->lock, flags, call_nests + 1);
+ wake_up_locked_poll(wqueue, POLLIN);
spin_unlock_irqrestore(&wqueue->lock, flags);
-}
-#else
-static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
- unsigned long events, int subclass)
-{
- wake_up_poll(wqueue, events);
-}
-#endif
-static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
-{
- ep_wake_up_nested((wait_queue_head_t *) cookie, POLLIN,
- 1 + call_nests);
return 0;
}
-/*
- * Perform a safe wake up of the poll wait list. The problem is that
- * with the new callback'd wake up system, it is possible that the
- * poll callback is reentered from inside the call to wake_up() done
- * on the poll wait queue head. The rule is that we cannot reenter the
- * wake up code from the same task more than EP_MAX_NESTS times,
- * and we cannot reenter the same wait queue head at all. This will
- * enable to have a hierarchy of epoll file descriptor of no more than
- * EP_MAX_NESTS deep.
- */
static void ep_poll_safewake(wait_queue_head_t *wq)
{
int this_cpu = get_cpu();
@@ -510,6 +488,15 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
put_cpu();
}
+#else
+
+static void ep_poll_safewake(wait_queue_head_t *wq)
+{
+ wake_up_poll(wq, POLLIN);
+}
+
+#endif
+
static void ep_remove_wait_queue(struct eppoll_entry *pwq)
{
wait_queue_head_t *whead;
@@ -2098,8 +2085,10 @@ static int __init eventpoll_init(void)
*/
ep_nested_calls_init(&poll_loop_ncalls);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
/* Initialize the structure used to perform safe poll wait head wake ups */
ep_nested_calls_init(&poll_safewake_ncalls);
+#endif
/*
* We can have many thousands of epitems, so prevent this from
--
1.8.2.rc2
* [RFC PATCH 3/5] epoll: add ep_call_nested_nolock()
From: Jason Baron @ 2015-01-15 21:02 UTC (permalink / raw)
To: akpm, famz
Cc: normalperson, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
Add an ep_call_nested_nolock() variant which functions the same as
the current ep_call_nested(), except that it does not acquire any locks.
This call will be used by subsequent patches, which provide their own
'external' locking.
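For anyone unfamiliar with the helper, the pattern is roughly the following
userspace model (a simplified sketch, not the kernel code: a pthread mutex
stands in for ncalls->lock and the per-task cookie tracking is reduced to a
depth counter). The bookkeeping is identical in both variants; only the
locking around it becomes optional:

#include <pthread.h>
#include <stdbool.h>

struct nested_ctx {
	pthread_mutex_t lock;	/* stands in for ncalls->lock */
	int depth;		/* stands in for the list of active calls */
};

int _nested_call(struct nested_ctx *ctx, int max_nests,
		 int (*nproc)(void *, int), void *priv, bool lock)
{
	int error, call_nests;

	if (lock)
		pthread_mutex_lock(&ctx->lock);
	if (ctx->depth >= max_nests) {		/* recursion bound exceeded */
		if (lock)
			pthread_mutex_unlock(&ctx->lock);
		return -1;
	}
	call_nests = ctx->depth++;
	if (lock)
		pthread_mutex_unlock(&ctx->lock);

	error = nproc(priv, call_nests);	/* the possibly nested work */

	if (lock)
		pthread_mutex_lock(&ctx->lock);
	ctx->depth--;
	if (lock)
		pthread_mutex_unlock(&ctx->lock);
	return error;
}

/* The two entry points only differ in whether the internal lock is taken. */
#define nested_call(ctx, max, nproc, priv) \
	_nested_call(ctx, max, nproc, priv, true)
#define nested_call_nolock(ctx, max, nproc, priv) \
	_nested_call(ctx, max, nproc, priv, false)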
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
fs/eventpoll.c | 33 ++++++++++++++++++++++-----------
1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2864d67..d0a021a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -373,11 +373,17 @@ static inline int ep_events_available(struct eventpoll *ep)
return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
}
+#define ep_call_nested(ncalls, max_nests, nproc, priv, cookie, ctx) \
+ _ep_call_nested(ncalls, max_nests, nproc, priv, cookie, ctx, 1)
+
+#define ep_call_nested_nolock(ncalls, max_nests, nproc, priv, cookie, ctx) \
+ _ep_call_nested(ncalls, max_nests, nproc, priv, cookie, ctx, 0)
+
/**
- * ep_call_nested - Perform a bound (possibly) nested call, by checking
- * that the recursion limit is not exceeded, and that
- * the same nested call (by the meaning of same cookie) is
- * no re-entered.
+ * _ep_call_nested - Perform a bound (possibly) nested call, by checking
+ * that the recursion limit is not exceeded, and that
+ * the same nested call (by the meaning of same cookie) is
+ * not re-entered.
*
* @ncalls: Pointer to the nested_calls structure to be used for this call.
* @max_nests: Maximum number of allowed nesting calls.
@@ -385,21 +391,23 @@ static inline int ep_events_available(struct eventpoll *ep)
* @priv: Opaque data to be passed to the @nproc callback.
* @cookie: Cookie to be used to identify this nested call.
* @ctx: This instance context.
+ * @lock: whether to take the internal ncalls->lock
*
* Returns: Returns the code returned by the @nproc callback, or -1 if
* the maximum recursion limit has been exceeded.
*/
-static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
+static int _ep_call_nested(struct nested_calls *ncalls, int max_nests,
int (*nproc)(void *, void *, int), void *priv,
- void *cookie, void *ctx)
+ void *cookie, void *ctx, bool lock)
{
int error, call_nests = 0;
- unsigned long flags;
+ unsigned long flags = 0;
struct list_head *lsthead = &ncalls->tasks_call_list;
struct nested_call_node *tncur;
struct nested_call_node tnode;
- spin_lock_irqsave(&ncalls->lock, flags);
+ if (lock)
+ spin_lock_irqsave(&ncalls->lock, flags);
/*
* Try to see if the current task is already inside this wakeup call.
@@ -423,16 +431,19 @@ static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
tnode.cookie = cookie;
list_add(&tnode.llink, lsthead);
- spin_unlock_irqrestore(&ncalls->lock, flags);
+ if (lock)
+ spin_unlock_irqrestore(&ncalls->lock, flags);
/* Call the nested function */
error = (*nproc)(priv, cookie, call_nests);
/* Remove the current task from the list */
- spin_lock_irqsave(&ncalls->lock, flags);
+ if (lock)
+ spin_lock_irqsave(&ncalls->lock, flags);
list_del(&tnode.llink);
out_unlock:
- spin_unlock_irqrestore(&ncalls->lock, flags);
+ if (lock)
+ spin_unlock_irqrestore(&ncalls->lock, flags);
return error;
}
--
1.8.2.rc2
* [RFC PATCH 4/5] epoll: Allow topology checks to be parallelized
From: Jason Baron @ 2015-01-15 21:02 UTC (permalink / raw)
To: akpm, famz
Cc: normalperson, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
The ep_loop_check() (which checks for loops) and reverse_path_check()
(which prevents excessively deep wakeup paths) functions currently make
use of a number of global variables, since they are always called under the
global 'epmutex'. In preparation for removing the 'epmutex', we switch to
per-instance data structures, so that these topology checks can safely be
run in parallel. In addition, we use ep_call_nested_nolock(), which avoids
the nested_calls->lock. This is safe here, since we are already holding the
'epmutex' across all of these calls.
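The refactoring here follows the usual pattern of moving file-scope state
into a caller-owned context; as a hedged, generic sketch (illustrative
names, not the kernel code), the path-counting side looks roughly like this
once the state lives on the caller's stack and two CPUs can run the check at
the same time:

#include <string.h>

#define MAX_DEPTH 4

/*
 * All of the state that used to live in globals protected by 'epmutex'
 * (visited list, check list, path counters) moves into a per-call
 * context that each caller keeps on its own stack.
 */
struct check_ctx {
	int path_count[MAX_DEPTH];
};

static int path_count_inc(struct check_ctx *ctx, int nests, const int *limits)
{
	if (nests == 0)
		return 0;	/* depth-1 paths are unlimited */
	if (++ctx->path_count[nests - 1] > limits[nests - 1])
		return -1;
	return 0;
}

static int check_paths(const int *limits)
{
	struct check_ctx ctx;	/* stack-local: no sharing between callers */

	memset(&ctx, 0, sizeof(ctx));
	/* ...walk the topology here, calling path_count_inc(&ctx, ...)... */
	return path_count_inc(&ctx, 1, limits);
}

int main(void)
{
	static const int limits[MAX_DEPTH] = { 500, 100, 50, 10 };

	return check_paths(limits) ? 1 : 0;
}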
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
fs/eventpoll.c | 148 +++++++++++++++++++++++++++++----------------------------
1 file changed, 76 insertions(+), 72 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d0a021a..8fb23f4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -251,6 +251,14 @@ struct ep_send_events_data {
struct epoll_event __user *events;
};
+struct loop_check_arg {
+ struct file *file;
+ struct nested_calls *loop_check_ncalls;
+ struct list_head *visited_list;
+ struct list_head *loop_check_list;
+ u16 *path_count_arr;
+};
+
/*
* Configuration options available inside /proc/sys/fs/epoll/
*/
@@ -262,24 +270,12 @@ static long max_user_watches __read_mostly;
*/
static DEFINE_MUTEX(epmutex);
-/* Used to check for epoll file descriptor inclusion loops */
-static struct nested_calls poll_loop_ncalls;
-
/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;
/* Slab cache used to allocate "struct eppoll_entry" */
static struct kmem_cache *pwq_cache __read_mostly;
-/* Visited nodes during ep_loop_check(), so we can unset them when we finish */
-static LIST_HEAD(visited_list);
-
-/*
- * List of files with newly added links, where we may need to limit the number
- * of emanating paths. Protected by the epmutex.
- */
-static LIST_HEAD(tfile_check_list);
-
#ifdef CONFIG_SYSCTL
#include <linux/sysctl.h>
@@ -353,13 +349,6 @@ static inline int ep_op_has_event(int op)
return op != EPOLL_CTL_DEL;
}
-/* Initialize the poll safe wake up structure */
-static void ep_nested_calls_init(struct nested_calls *ncalls)
-{
- INIT_LIST_HEAD(&ncalls->tasks_call_list);
- spin_lock_init(&ncalls->lock);
-}
-
/**
* ep_events_available - Checks if ready events might be available.
*
@@ -477,6 +466,13 @@ out_unlock:
static struct nested_calls poll_safewake_ncalls;
+/* Initialize the poll safe wake up structure */
+static void ep_nested_calls_init(struct nested_calls *ncalls)
+{
+ INIT_LIST_HEAD(&ncalls->tasks_call_list);
+ spin_lock_init(&ncalls->lock);
+}
+
static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
{
unsigned long flags;
@@ -1111,9 +1107,6 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
rb_insert_color(&epi->rbn, &ep->rbr);
}
-
-
-#define PATH_ARR_SIZE 5
/*
* These are the number paths of length 1 to 5, that we are allowing to emanate
* from a single file of interest. For example, we allow 1000 paths of length
@@ -1125,32 +1118,26 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
* path limits are enforced during an EPOLL_CTL_ADD operation, since a modify
* and delete can't add additional paths. Protected by the epmutex.
*/
-static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
-static int path_count[PATH_ARR_SIZE];
+static const u16 path_limits[EP_MAX_NESTS] = { 500, 100, 50, 10 };
-static int path_count_inc(int nests)
+static int path_count_inc(int nests, u16 *path_count_arr)
{
+ BUG_ON(nests < 0 || nests >= EP_MAX_NESTS);
+
/* Allow an arbitrary number of depth 1 paths */
if (nests == 0)
return 0;
- if (++path_count[nests] > path_limits[nests])
+ if (++path_count_arr[nests - 1] > path_limits[nests - 1])
return -1;
return 0;
}
-static void path_count_init(void)
-{
- int i;
-
- for (i = 0; i < PATH_ARR_SIZE; i++)
- path_count[i] = 0;
-}
-
static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
{
int error = 0;
- struct file *file = priv;
+ struct loop_check_arg *args = priv;
+ struct file *file = args->file;
struct file *child_file;
struct epitem *epi;
@@ -1160,15 +1147,18 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
child_file = epi->ep->file;
if (is_file_epoll(child_file)) {
if (list_empty(&child_file->f_ep_links)) {
- if (path_count_inc(call_nests)) {
+ if (path_count_inc(call_nests,
+ args->path_count_arr)) {
error = -1;
break;
}
} else {
- error = ep_call_nested(&poll_loop_ncalls,
+ args->file = child_file;
+ error = ep_call_nested_nolock(
+ args->loop_check_ncalls,
EP_MAX_NESTS,
reverse_path_check_proc,
- child_file, child_file,
+ args, child_file,
current);
}
if (error != 0)
@@ -1192,17 +1182,24 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
* Returns: Returns zero if the proposed links don't create too many paths,
* -1 otherwise.
*/
-static int reverse_path_check(void)
+static int reverse_path_check(struct list_head *loop_check_list)
{
int error = 0;
struct file *current_file;
-
+ struct nested_calls reverse_loop_ncalls;
+ struct loop_check_arg args;
+ u16 path_count[EP_MAX_NESTS];
+
+ args.loop_check_ncalls = &reverse_loop_ncalls;
+ INIT_LIST_HEAD(&reverse_loop_ncalls.tasks_call_list);
+ memset(path_count, 0, sizeof(path_count));
+ args.path_count_arr = path_count;
/* let's call this for all tfiles */
- list_for_each_entry(current_file, &tfile_check_list, f_tfile_llink) {
- path_count_init();
- error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
- reverse_path_check_proc, current_file,
- current_file, current);
+ list_for_each_entry(current_file, loop_check_list, f_tfile_llink) {
+ args.file = current_file;
+ error = ep_call_nested_nolock(&reverse_loop_ncalls,
+ EP_MAX_NESTS, reverse_path_check_proc,
+ &args, current_file, current);
if (error)
break;
}
@@ -1250,7 +1247,8 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
* Must be called with "mtx" held.
*/
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
- struct file *tfile, int fd, int full_check)
+ struct file *tfile, int fd, int full_check,
+ struct list_head *loop_check_list)
{
int error, revents, pwake = 0;
unsigned long flags;
@@ -1316,7 +1314,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
/* now check if we've created too many backpaths */
error = -EINVAL;
- if (full_check && reverse_path_check())
+ if (full_check && reverse_path_check(loop_check_list))
goto error_remove_epi;
/* We have to drop the new item inside our item list to keep track of it */
@@ -1667,7 +1665,8 @@ check_events:
static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
{
int error = 0;
- struct file *file = priv;
+ struct loop_check_arg *args = priv;
+ struct file *file = args->file;
struct eventpoll *ep = file->private_data;
struct eventpoll *ep_tovisit;
struct rb_node *rbp;
@@ -1675,15 +1674,16 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
mutex_lock_nested(&ep->mtx, call_nests + 1);
ep->visited = 1;
- list_add(&ep->visited_list_link, &visited_list);
+ list_add(&ep->visited_list_link, args->visited_list);
for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
epi = rb_entry(rbp, struct epitem, rbn);
if (unlikely(is_file_epoll(epi->ffd.file))) {
ep_tovisit = epi->ffd.file->private_data;
if (ep_tovisit->visited)
continue;
- error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
- ep_loop_check_proc, epi->ffd.file,
+ args->file = epi->ffd.file;
+ error = ep_call_nested_nolock(args->loop_check_ncalls,
+ EP_MAX_NESTS, ep_loop_check_proc, args,
ep_tovisit, current);
if (error != 0)
break;
@@ -1698,7 +1698,7 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
*/
if (list_empty(&epi->ffd.file->f_tfile_llink))
list_add(&epi->ffd.file->f_tfile_llink,
- &tfile_check_list);
+ args->loop_check_list);
}
}
mutex_unlock(&ep->mtx);
@@ -1717,13 +1717,22 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
* Returns: Returns zero if adding the epoll @file inside current epoll
* structure @ep does not violate the constraints, or -1 otherwise.
*/
-static int ep_loop_check(struct eventpoll *ep, struct file *file)
+static int ep_loop_check(struct eventpoll *ep, struct file *file,
+ struct list_head *loop_check_list)
{
int ret;
struct eventpoll *ep_cur, *ep_next;
-
- ret = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
- ep_loop_check_proc, file, ep, current);
+ struct nested_calls loop_check_ncalls;
+ struct loop_check_arg args;
+ LIST_HEAD(visited_list);
+
+ args.file = file;
+ args.visited_list = &visited_list;
+ args.loop_check_list = loop_check_list;
+ args.loop_check_ncalls = &loop_check_ncalls;
+ INIT_LIST_HEAD(&loop_check_ncalls.tasks_call_list);
+ ret = ep_call_nested_nolock(&loop_check_ncalls, EP_MAX_NESTS,
+ ep_loop_check_proc, &args, ep, current);
/* clear visited list */
list_for_each_entry_safe(ep_cur, ep_next, &visited_list,
visited_list_link) {
@@ -1733,17 +1742,15 @@ static int ep_loop_check(struct eventpoll *ep, struct file *file)
return ret;
}
-static void clear_tfile_check_list(void)
+static void clear_tfile_list(struct list_head *loop_check_list)
{
struct file *file;
- /* first clear the tfile_check_list */
- while (!list_empty(&tfile_check_list)) {
- file = list_first_entry(&tfile_check_list, struct file,
+ while (!list_empty(loop_check_list)) {
+ file = list_first_entry(loop_check_list, struct file,
f_tfile_llink);
list_del_init(&file->f_tfile_llink);
}
- INIT_LIST_HEAD(&tfile_check_list);
}
/*
@@ -1815,6 +1822,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
struct epitem *epi;
struct epoll_event epds;
struct eventpoll *tep = NULL;
+ LIST_HEAD(loop_check_list);
error = -EFAULT;
if (ep_op_has_event(op) &&
@@ -1879,13 +1887,14 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
mutex_lock(&epmutex);
if (is_file_epoll(tf.file)) {
error = -ELOOP;
- if (ep_loop_check(ep, tf.file) != 0) {
- clear_tfile_check_list();
+ if (ep_loop_check(ep, tf.file,
+ &loop_check_list) != 0) {
+ clear_tfile_list(&loop_check_list);
goto error_tgt_fput;
}
} else
list_add(&tf.file->f_tfile_llink,
- &tfile_check_list);
+ &loop_check_list);
mutex_lock_nested(&ep->mtx, 0);
if (is_file_epoll(tf.file)) {
tep = tf.file->private_data;
@@ -1906,11 +1915,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
case EPOLL_CTL_ADD:
if (!epi) {
epds.events |= POLLERR | POLLHUP;
- error = ep_insert(ep, &epds, tf.file, fd, full_check);
+ error = ep_insert(ep, &epds, tf.file, fd, full_check,
+ &loop_check_list);
} else
error = -EEXIST;
if (full_check)
- clear_tfile_check_list();
+ clear_tfile_list(&loop_check_list);
break;
case EPOLL_CTL_DEL:
if (epi)
@@ -2090,12 +2100,6 @@ static int __init eventpoll_init(void)
EP_ITEM_COST;
BUG_ON(max_user_watches < 0);
- /*
- * Initialize the structure used to perform epoll file descriptor
- * inclusion loops checks.
- */
- ep_nested_calls_init(&poll_loop_ncalls);
-
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/* Initialize the structure used to perform safe poll wait head wake ups */
ep_nested_calls_init(&poll_safewake_ncalls);
--
1.8.2.rc2
* [RFC PATCH 5/5] epoll: introduce epoll connected components (remove the epmutex)
From: Jason Baron @ 2015-01-15 21:02 UTC (permalink / raw)
To: akpm, famz
Cc: normalperson, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
We currently use the global 'epmutex' in the epoll code to serialize when
files are being closed (and we need to remove them from epoll) and
in certain cases when we are doing epoll_ctl(EPOLL_CTL_ADD). The
requirements for epoll_ctl(EPOLL_CTL_ADD) are that we don't create an
epoll topology with loops or with too many wakeup paths. We need
the global 'epmutex' on the add path both to prevent loops and
excessive wakeup paths from forming in parallel, and to prevent
the underlying 'struct file' from going away while we are traversing
the epoll topologies.

The idea of this patch is to eliminate the global 'epmutex' by keeping
track of connected epoll components and only serializing operations
within the connected component when necessary. Thus, we introduce a
reference-counted data structure, 'struct ep_cc', which is pointed
to by each file in an epoll connected component via the file->f_ep_cc
pointer. As part of epoll_create(), we allocate a new 'struct ep_cc'
and point the epoll file at it. Regular files, or non-epoll files,
do not allocate a 'struct ep_cc' when they are created, since if they
are going to be used with epoll they are going to be merged with
at least one epoll file which already has an associated 'struct ep_cc'.
Thus, over time we merge connected components together as they are
joined via epoll_ctl(EPOLL_CTL_ADD). We do not split components apart
during epoll_ctl(EPOLL_CTL_DEL), in part due to the extra complexity
of determining whether the graph is truly separated, and because we do
not think this would add much extra value in practice. That is, it's
unlikely for files to be added to and removed from unrelated epoll sets.
In any case, we can in theory do no worse than the current scheme, by
ending up having merged all epoll files together (thus essentially
reverting to a single global lock). We do, however, remove files from a
connected component when they are closed, even though we do not try to
split the component up.
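The merge itself is the usual weighted-union idea: when two components meet
on an add, every member of the smaller one is re-pointed at the larger one,
which keeps repeated merges cheap. A rough, self-contained sketch of that
step (not the kernel code; the structs are simplified stand-ins for 'struct
ep_cc' and 'struct file', and locking/refcounting is omitted):

#include <stdio.h>

struct component {
	int length;			/* number of files in the component */
};

struct member_file {
	struct component *cc;		/* stand-in for file->f_ep_cc */
};

/* Assumes the caller already holds the locks of both components. */
static void merge(struct member_file *files, int nfiles,
		  struct component *a, struct component *b)
{
	struct component *smaller = b, *larger = a;
	int i;

	if (a == b)
		return;			/* already a single component */
	if (smaller->length > larger->length) {
		smaller = a;
		larger = b;
	}
	/* Re-point every file of the smaller component at the larger one. */
	for (i = 0; i < nfiles; i++)
		if (files[i].cc == smaller)
			files[i].cc = larger;
	larger->length += smaller->length;
	smaller->length = 0;
}

int main(void)
{
	struct component x = { .length = 2 }, y = { .length = 1 };
	struct member_file files[3] = { { &x }, { &x }, { &y } };

	merge(files, 3, &x, &y);		/* y's lone file joins x */
	printf("merged length: %d\n", files[2].cc->length);	/* prints 3 */
	return 0;
}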
I've done a bit of performance evaluation on a dual socket, 10 core, hyper
threading enabled box: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz. For the
simple epfdN->epfdN->pipefdN topology case where each thread has its
own unique files and is doing EPOLL_CTL_ADD and EPOLL_CTL_DEL on the pipefd,
I see an almost 300% improvement. This is obviously a very contrived case,
but shows the motivation for this patch.
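For reference, the per-thread loop in that contrived test looks roughly like
the following (an illustrative sketch rather than the exact harness; the
iteration count and thread count are arbitrary):

#include <sys/epoll.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>

#define ITERS 100000

/* Each thread builds its own private epfd -> epfd -> pipefd chain... */
static void *worker(void *arg)
{
	int pipefd[2];
	int inner, outer, i;
	struct epoll_event ev = { .events = EPOLLIN };

	(void)arg;
	if (pipe(pipefd) < 0)
		return NULL;
	inner = epoll_create1(0);
	outer = epoll_create1(0);
	ev.data.fd = inner;
	epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);

	/* ...and hammers ADD/DEL of its pipe on the nested epoll fd. */
	for (i = 0; i < ITERS; i++) {
		ev.data.fd = pipefd[0];
		epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev);
		epoll_ctl(inner, EPOLL_CTL_DEL, pipefd[0], NULL);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 4;
	pthread_t *threads = calloc(nthreads, sizeof(*threads));
	int i;

	for (i = 0; i < nthreads; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(threads[i], NULL);
	free(threads);
	return 0;
}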
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
fs/eventpoll.c | 325 ++++++++++++++++++++++++++++++++++++++--------
include/linux/eventpoll.h | 52 +++++---
include/linux/fs.h | 3 +
3 files changed, 311 insertions(+), 69 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8fb23f4..8db5d96 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -47,7 +47,7 @@
* LOCKING:
* There are three level of locking required by epoll :
*
- * 1) epmutex (mutex)
+ * 1) ep_cc->cc_mtx (mutex)
* 2) ep->mtx (mutex)
* 3) ep->lock (spinlock)
*
@@ -60,17 +60,22 @@
* user space) we could end up sleeping due a copy_to_user(), so
* we need a lock that will allow us to sleep. This lock is a
* mutex (ep->mtx). It is acquired during the event transfer loop,
- * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
- * Then we also need a global mutex to serialize eventpoll_release_file()
- * and ep_free().
- * This mutex is acquired by ep_free() during the epoll file
- * cleanup path and it is also acquired by eventpoll_release_file()
- * if a file has been pushed inside an epoll set and it is then
- * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
- * It is also acquired when inserting an epoll fd onto another epoll
+ * during epoll_ctl(EPOLL_CTL_DEL, EPOLL_CTL_ADD) and during
+ * eventpoll_release_file(). Then we also have a mutex that covers
+ * 'connected components'. That is, every file descriptor that is part
+ * of epoll, both 'source' files and epoll files, has a file->f_ep_cc
+ * that points to a connected component, which serializes operations
+ * for those files. This obviates the need for a global mutex since
+ * we know how the epoll graph is connected. The connected component
+ * mutex is required to serialize eventpoll_release_file(), ep_free(),
+ * and is also taken during EPOLL_CTL_ADD to prevent loops and overly
+ * complex wakeup paths. This mutex is acquired by ep_free() during
+ * the epoll file cleanup path and it is also acquired by
+ * eventpoll_release_file() if a file has been pushed inside an epoll
+ * set. It is also acquired when inserting an epoll fd onto another epoll
* fd. We do this so that we walk the epoll tree and ensure that this
* insertion does not create a cycle of epoll file descriptors, which
- * could lead to deadlock. We need a global mutex to prevent two
+ * could lead to deadlock. We need a per-component mutex to prevent two
* simultaneous inserts (A into B and B into A) from racing and
* constructing a cycle without either insert observing that it is
* going to.
@@ -83,12 +88,9 @@
* order to communicate this nesting to lockdep, when walking a tree
* of epoll file descriptors, we use the current recursion depth as
* the lockdep subkey.
- * It is possible to drop the "ep->mtx" and to use the global
- * mutex "epmutex" (together with "ep->lock") to have it working,
+ * It is possible to drop the "ep->mtx" and to use the per-component
+ * mutex (together with "ep->lock") to have it working,
* but having "ep->mtx" will make the interface more scalable.
- * Events that require holding "epmutex" are very rare, while for
- * normal operations the epoll private "ep->mtx" will guarantee
- * a better scalability.
*/
/* Epoll private bits inside the event mask */
@@ -265,11 +267,6 @@ struct loop_check_arg {
/* Maximum number of epoll watched descriptors, per user */
static long max_user_watches __read_mostly;
-/*
- * This mutex is used to serialize ep_free() and eventpoll_release_file().
- */
-static DEFINE_MUTEX(epmutex);
-
/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;
@@ -299,7 +296,7 @@ struct ctl_table epoll_table[] = {
static const struct file_operations eventpoll_fops;
-static inline int is_file_epoll(struct file *f)
+int is_file_epoll(struct file *f)
{
return f->f_op == &eventpoll_fops;
}
@@ -362,6 +359,175 @@ static inline int ep_events_available(struct eventpoll *ep)
return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
}
+static void cc_rcu_free(struct rcu_head *head)
+{
+ struct ep_cc *cc = container_of(head, struct ep_cc, rcu);
+
+ kfree(cc);
+}
+
+static void put_cc(struct ep_cc *cc)
+{
+ if (atomic_dec_and_test(&cc->refcount))
+ call_rcu(&cc->rcu, cc_rcu_free);
+}
+
+/**
+ * lock_and_get_cc - Obtains a reference and lock on the 'connected components'
+ * for the file(s) supplied. When called by CTL_ADD we modify
+ * the file->f_ep_cc (the pointer to the connected component)
+ * if its file->f_ep_cc is NULL. This applies to 'event
+ * sources', not epoll files. In effect, this makes the 'event
+ * source' file part of the epoll file's connected component that
+ * it is being added to. Note that we do not 'undo' this
+ * operation, since it's assumed to be rare that the add
+ * fails, and if the connected component ends up larger than is
+ * strictly necessary that's ok.
+ *
+ * @sfile: 'source' file
+ * @tfile: 'target' file, although NULL if called on file put paths
+ * @res: updates the struct ep_cc_lock_res in the caller's frame, which needs
+ * to be supplied to unlock_and_put_cc() and merge_cc()
+ */
+static void lock_and_get_cc(struct file *sfile, struct file *tfile,
+ struct ep_cc_lock_res *res)
+{
+ struct ep_cc *cc_a, *cc_b;
+ bool init;
+
+ memset(res, 0, sizeof(struct ep_cc_lock_res));
+retry:
+ init = false;
+ if (!tfile) {
+ rcu_read_lock();
+ cc_a = rcu_dereference(sfile->f_ep_cc);
+ if (!atomic_inc_not_zero(&cc_a->refcount)) {
+ rcu_read_unlock();
+ cpu_relax();
+ goto retry;
+ }
+ rcu_read_unlock();
+ mutex_lock(&cc_a->cc_mtx);
+ if (cc_a != ACCESS_ONCE(sfile->f_ep_cc)) {
+ mutex_unlock(&cc_a->cc_mtx);
+ put_cc(cc_a);
+ cpu_relax();
+ goto retry;
+ }
+ res->cc_a = cc_a;
+ res->init = init;
+ return;
+ }
+ rcu_read_lock();
+ cc_a = rcu_dereference(sfile->f_ep_cc);
+ if (!atomic_inc_not_zero(&cc_a->refcount)) {
+ rcu_read_unlock();
+ cpu_relax();
+ goto retry;
+ }
+ cc_b = rcu_dereference(tfile->f_ep_cc);
+ if (cc_b == NULL) {
+ cc_b = cc_a;
+ init = true;
+ }
+ if ((cc_a != cc_b) && !atomic_inc_not_zero(&cc_b->refcount)) {
+ rcu_read_unlock();
+ put_cc(cc_a);
+ cpu_relax();
+ goto retry;
+ }
+ rcu_read_unlock();
+ if (cc_a == cc_b) {
+ mutex_lock(&cc_a->cc_mtx);
+ if (init) {
+ if (!(cmpxchg(&tfile->f_ep_cc, NULL, cc_a) == NULL)) {
+ mutex_unlock(&cc_a->cc_mtx);
+ put_cc(cc_a);
+ cpu_relax();
+ goto retry;
+ }
+ }
+ } else {
+ if (cc_a < cc_b) {
+ mutex_lock_nested(&cc_a->cc_mtx, 0);
+ mutex_lock_nested(&cc_b->cc_mtx, 1);
+ } else {
+ mutex_lock_nested(&cc_b->cc_mtx, 0);
+ mutex_lock_nested(&cc_a->cc_mtx, 1);
+ }
+ }
+ if ((cc_a != ACCESS_ONCE(sfile->f_ep_cc)) || ((cc_a != cc_b) &&
+ (cc_b != ACCESS_ONCE(tfile->f_ep_cc)))) {
+ if (init)
+ rcu_assign_pointer(tfile->f_ep_cc, NULL);
+ mutex_unlock(&cc_a->cc_mtx);
+ put_cc(cc_a);
+ if (cc_a != cc_b) {
+ mutex_unlock(&cc_b->cc_mtx);
+ put_cc(cc_b);
+ }
+ cpu_relax();
+ goto retry;
+ }
+ res->cc_a = cc_a;
+ res->cc_b = (cc_a != cc_b) ? cc_b : NULL;
+ res->init = init;
+}
+
+/**
+ * unlock_and_put_cc - undo a previous lock_and_get_cc() operation
+ *
+ * @lock_res: result from previous lock_and_get_cc()
+ */
+static void unlock_and_put_cc(struct ep_cc_lock_res *lock_res)
+{
+ mutex_unlock(&lock_res->cc_a->cc_mtx);
+ put_cc(lock_res->cc_a);
+ if (lock_res->cc_b) {
+ mutex_unlock(&lock_res->cc_b->cc_mtx);
+ put_cc(lock_res->cc_b);
+ }
+}
+
+/**
+ * merge_cc - merge 2 components in the add path. Must be preceded by a call
+ * to lock_and_get_cc()
+ *
+ * @lock_res: result from previous lock_and_get_cc()
+ * @tfile: target file descriptor
+ */
+static void merge_cc(struct ep_cc_lock_res *lock_res, struct file *tfile)
+{
+ struct ep_cc *smaller_cc, *larger_cc;
+ struct file *elem, *tmp;
+ int smaller_length;
+
+ if (lock_res->init) {
+ lock_res->cc_a->length += 1;
+ atomic_inc(&lock_res->cc_a->refcount);
+ list_add(&tfile->f_ep_cc_link, &lock_res->cc_a->cc_list);
+ return;
+ }
+ /*
+ * If cc_b is NULL then cc_a == cc_b and nothing to update as they are
+ * already merged
+ */
+ if (!lock_res->cc_b)
+ return;
+ larger_cc = lock_res->cc_a;
+ smaller_cc = lock_res->cc_b;
+ if (smaller_cc->length > larger_cc->length)
+ swap(smaller_cc, larger_cc);
+ list_for_each_entry_safe(elem, tmp, &smaller_cc->cc_list, f_ep_cc_link)
+ rcu_assign_pointer(elem->f_ep_cc, larger_cc);
+ list_splice_tail(&smaller_cc->cc_list, &larger_cc->cc_list);
+ smaller_length = smaller_cc->length;
+ atomic_add(smaller_length, &larger_cc->refcount);
+ atomic_sub(smaller_length, &smaller_cc->refcount);
+ larger_cc->length += smaller_length;
+ smaller_cc->length -= smaller_length;
+}
+
#define ep_call_nested(ncalls, max_nests, nproc, priv, cookie, ctx) \
_ep_call_nested(ncalls, max_nests, nproc, priv, cookie, ctx, 1)
@@ -726,6 +892,7 @@ static void ep_free(struct eventpoll *ep)
{
struct rb_node *rbp;
struct epitem *epi;
+ struct ep_cc_lock_res res;
/* We need to release all tasks waiting for these file */
if (waitqueue_active(&ep->poll_wait))
@@ -739,7 +906,9 @@ static void ep_free(struct eventpoll *ep)
* anymore. The only hit might come from eventpoll_release_file() but
* holding "epmutex" is sufficient here.
*/
- mutex_lock(&epmutex);
+ /* only NULL if fail during ep_create */
+ if (ep->file)
+ lock_and_get_cc(ep->file, NULL, &res);
/*
* Walks through the whole tree by unregistering poll callbacks.
@@ -767,7 +936,12 @@ static void ep_free(struct eventpoll *ep)
}
mutex_unlock(&ep->mtx);
- mutex_unlock(&epmutex);
+ if (ep->file) {
+ list_del_init(&ep->file->f_ep_cc_link);
+ ep->file->f_ep_cc->length--;
+ unlock_and_put_cc(&res);
+ put_cc(ep->file->f_ep_cc);
+ }
mutex_destroy(&ep->mtx);
free_uid(ep->user);
wakeup_source_unregister(ep->ws);
@@ -892,6 +1066,7 @@ void eventpoll_release_file(struct file *file)
{
struct eventpoll *ep;
struct epitem *epi, *next;
+ struct ep_cc_lock_res res;
/*
* We don't want to get "file->f_lock" because it is not
@@ -906,14 +1081,36 @@ void eventpoll_release_file(struct file *file)
*
* Besides, ep_remove() acquires the lock, so we can't hold it here.
*/
- mutex_lock(&epmutex);
+ lock_and_get_cc(file, NULL, &res);
list_for_each_entry_safe(epi, next, &file->f_ep_links, fllink) {
ep = epi->ep;
mutex_lock_nested(&ep->mtx, 0);
ep_remove(ep, epi);
mutex_unlock(&ep->mtx);
}
- mutex_unlock(&epmutex);
+ if (!is_file_epoll(file)) {
+ list_del_init(&file->f_ep_cc_link);
+ file->f_ep_cc->length--;
+ }
+ unlock_and_put_cc(&res);
+ if (!is_file_epoll(file))
+ put_cc(file->f_ep_cc);
+}
+
+static int cc_alloc(struct ep_cc **ccp)
+{
+ struct ep_cc *cc;
+
+ cc = kzalloc(sizeof(*cc), GFP_KERNEL);
+ if (unlikely(!cc))
+ return -ENOMEM;
+ mutex_init(&cc->cc_mtx);
+ INIT_LIST_HEAD(&cc->cc_list);
+ cc->length = 1;
+ atomic_set(&cc->refcount, 1);
+ *ccp = cc;
+
+ return 0;
}
static int ep_alloc(struct eventpoll **pep)
@@ -1248,7 +1445,8 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
*/
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
struct file *tfile, int fd, int full_check,
- struct list_head *loop_check_list)
+ struct list_head *loop_check_list,
+ struct ep_cc_lock_res *lock_res)
{
int error, revents, pwake = 0;
unsigned long flags;
@@ -1290,6 +1488,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
* this operation completes, the poll callback can start hitting
* the new item.
*/
+
revents = ep_item_poll(epi, &epq.pt);
/*
@@ -1760,6 +1959,7 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
{
int error, fd;
struct eventpoll *ep = NULL;
+ struct ep_cc *cc = NULL;
struct file *file;
/* Check the EPOLL_* constant for consistency. */
@@ -1773,6 +1973,11 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
error = ep_alloc(&ep);
if (error < 0)
return error;
+
+ error = cc_alloc(&cc);
+ if (error < 0)
+ goto out_free_ep;
+
/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
@@ -1780,7 +1985,7 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
error = fd;
- goto out_free_ep;
+ goto out_free_cc;
}
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
O_RDWR | (flags & O_CLOEXEC));
@@ -1789,11 +1994,15 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
goto out_free_fd;
}
ep->file = file;
+ ep->file->f_ep_cc = cc;
+ list_add(&ep->file->f_ep_cc_link, &cc->cc_list);
fd_install(fd, file);
return fd;
out_free_fd:
put_unused_fd(fd);
+out_free_cc:
+ kfree(cc);
out_free_ep:
ep_free(ep);
return error;
@@ -1822,6 +2031,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
struct epitem *epi;
struct epoll_event epds;
struct eventpoll *tep = NULL;
+ struct ep_cc_lock_res lock_res;
LIST_HEAD(loop_check_list);
error = -EFAULT;
@@ -1878,30 +2088,38 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
* deep wakeup paths from forming in parallel through multiple
* EPOLL_CTL_ADD operations.
*/
- mutex_lock_nested(&ep->mtx, 0);
if (op == EPOLL_CTL_ADD) {
- if (!list_empty(&f.file->f_ep_links) ||
- is_file_epoll(tf.file)) {
- full_check = 1;
- mutex_unlock(&ep->mtx);
- mutex_lock(&epmutex);
- if (is_file_epoll(tf.file)) {
- error = -ELOOP;
- if (ep_loop_check(ep, tf.file,
- &loop_check_list) != 0) {
- clear_tfile_list(&loop_check_list);
- goto error_tgt_fput;
- }
- } else
- list_add(&tf.file->f_tfile_llink,
- &loop_check_list);
- mutex_lock_nested(&ep->mtx, 0);
- if (is_file_epoll(tf.file)) {
- tep = tf.file->private_data;
- mutex_lock_nested(&tep->mtx, 1);
+ lock_and_get_cc(f.file, tf.file, &lock_res);
+ if (is_file_epoll(tf.file)) {
+ error = -ELOOP;
+ if (ep_loop_check(ep, tf.file, &loop_check_list) != 0) {
+ clear_tfile_list(&loop_check_list);
+ unlock_and_put_cc(&lock_res);
+ goto error_tgt_fput;
}
+ full_check = 1;
+ } else if (!list_empty(&f.file->f_ep_links)) {
+ full_check = 1;
+ list_add(&tf.file->f_tfile_llink, &loop_check_list);
}
- }
+ mutex_lock_nested(&ep->mtx, 0);
+ if (is_file_epoll(tf.file)) {
+ tep = tf.file->private_data;
+ mutex_lock_nested(&tep->mtx, 1);
+ }
+ merge_cc(&lock_res, tf.file);
+ if (!full_check) {
+ /*
+ * Hold the component mutex for as short a time as possible.
+ * Even if we error past this point and don't actually
+ * add the link, it's ok since it just means we'll have
+ * a larger connected component than is strictly
+ * necessary, and that's not the common path.
+ */
+ unlock_and_put_cc(&lock_res);
+ }
+ } else
+ mutex_lock_nested(&ep->mtx, 0);
/*
* Try to lookup the file inside our RB tree, Since we grabbed "mtx"
@@ -1916,11 +2134,9 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
if (!epi) {
epds.events |= POLLERR | POLLHUP;
error = ep_insert(ep, &epds, tf.file, fd, full_check,
- &loop_check_list);
+ &loop_check_list, &lock_res);
} else
error = -EEXIST;
- if (full_check)
- clear_tfile_list(&loop_check_list);
break;
case EPOLL_CTL_DEL:
if (epi)
@@ -1939,11 +2155,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
if (tep != NULL)
mutex_unlock(&tep->mtx);
mutex_unlock(&ep->mtx);
+ if ((op == EPOLL_CTL_ADD) && full_check) {
+ clear_tfile_list(&loop_check_list);
+ unlock_and_put_cc(&lock_res);
+ }
error_tgt_fput:
- if (full_check)
- mutex_unlock(&epmutex);
-
fdput(tf);
error_fput:
fdput(f);
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 6daf6d4..73225f7 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -22,16 +22,40 @@ struct file;
#ifdef CONFIG_EPOLL
+struct ep_cc {
+ /* guards the component - replaces the old global 'epmutex' */
+ struct mutex cc_mtx;
+
+ /* list of ep's that are part of this component */
+ struct list_head cc_list;
+
+ /* list length */
+ unsigned int length;
+
+ /* refcount */
+ atomic_t refcount;
+
+ /* rcu call back point */
+ struct rcu_head rcu;
+};
+
+struct ep_cc_lock_res {
+ struct ep_cc *cc_a, *cc_b;
+ bool init;
+};
+
/* Used to initialize the epoll bits inside the "struct file" */
static inline void eventpoll_init_file(struct file *file)
{
INIT_LIST_HEAD(&file->f_ep_links);
INIT_LIST_HEAD(&file->f_tfile_llink);
+ INIT_LIST_HEAD(&file->f_ep_cc_link);
+ file->f_ep_cc = NULL;
}
-
/* Used to release the epoll bits inside the "struct file" */
void eventpoll_release_file(struct file *file);
+int is_file_epoll(struct file *f);
/*
* This is called from inside fs/file_table.c:__fput() to unlink files
@@ -41,23 +65,21 @@ void eventpoll_release_file(struct file *file);
*/
static inline void eventpoll_release(struct file *file)
{
-
/*
- * Fast check to avoid the get/release of the semaphore. Since
- * we're doing this outside the semaphore lock, it might return
- * false negatives, but we don't care. It'll help in 99.99% of cases
- * to avoid the semaphore lock. False positives simply cannot happen
- * because the file in on the way to be removed and nobody ( but
- * eventpoll ) has still a reference to this file.
+ * If a 'regular' file (non-epoll) is part of a connected component
+ * then we have to make sure to drop its reference count on the
+ * connected component via eventpoll_release_file(). For epoll files
+ * we will drop the reference in ep_free, so we only need to call
+ * eventpoll_release_file() if the epoll file has back links. The
+ * smp_mb__after_atomic() ensures that, even though we are the last
+ * reference and thus ordered against the last operation on the file,
+ * the reads below cannot move up before the atomic_dec() done by
+ * the fput().
*/
- if (likely(list_empty(&file->f_ep_links)))
+ smp_mb__after_atomic();
+ if (!file->f_ep_cc || (is_file_epoll(file) &&
+ list_empty(&file->f_ep_links)))
return;
-
- /*
- * The file is being closed while it is still linked to an epoll
- * descriptor. We need to handle this by correctly unlinking it
- * from its containers.
- */
eventpoll_release_file(file);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f90c028..94e292a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -835,6 +835,9 @@ struct file {
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
struct list_head f_tfile_llink;
+ /* connected component */
+ struct list_head f_ep_cc_link;
+ struct ep_cc __rcu *f_ep_cc;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
} __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
--
1.8.2.rc2
* Re: [RFC PATCH 5/5] epoll: introduce epoll connected components (remove the epmutex)
From: Eric Wong @ 2015-01-15 23:10 UTC (permalink / raw)
To: Jason Baron
Cc: akpm, famz, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
Jason Baron <jbaron@akamai.com> wrote:
> I've done a bit of performance evaluation on a dual socket, 10 core, hyper
> threading enabled box: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz. For the
> simple epfdN->epfdN->pipefdN topology case where each thread has its
> own unique files and is doing EPOLL_CTL_ADD and EPOLL_CTL_DEL on the pipefd,
> I see an almost 300% improvement. This is obviously a very contrived case,
> but shows the motivation for this patch.
Any improvements for non-contrived cases? :)
> +++ b/include/linux/fs.h
> @@ -835,6 +835,9 @@ struct file {
> /* Used by fs/eventpoll.c to link all the hooks to this file */
> struct list_head f_ep_links;
> struct list_head f_tfile_llink;
> + /* connected component */
> + struct list_head f_ep_cc_link;
> + struct ep_cc __rcu *f_ep_cc;
> #endif /* #ifdef CONFIG_EPOLL */
This size increase worries me. Perhaps this can be a separately
allocated struct to avoid penalizing non-epoll users?
struct file_eventpoll {
struct list_head f_ep_links;
struct list_head f_tfile_llink;
/* connected component */
struct list_head f_ep_cc_link;
struct ep_cc __rcu *f_ep_cc;
};
But I wish Linux never allowed nesting epoll in the first place :/
* Re: [RFC PATCH 0/5] Remove global locks from epoll
From: Fam Zheng @ 2015-01-16 1:36 UTC (permalink / raw)
To: Jason Baron
Cc: akpm, normalperson, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
On Thu, 01/15 21:02, Jason Baron wrote:
> Finally, I'd also like to potentially co-ordinate this series with the recent
> syscall enhancements from Fam Zheng: http://lwn.net/Articles/628828/ since these
> patches are somewhat invasive.
What I am working on is a new call that can be seen as a list of epoll_ctl
operations followed by a wait, with enriched clockid and timespec
flexibility. It will reuse the existing calls' implementation, but ideally
it should be independent of most of the implementation internals, including
the locking changes here.
Thanks for letting me know!
Fam
* Re: [RFC PATCH 5/5] epoll: introduce epoll connected components (remove the epmutex)
From: Jason Baron @ 2015-01-16 17:03 UTC (permalink / raw)
To: Eric Wong, Jason Baron
Cc: akpm, famz, nzimmer, viro, davidel, rostedt, linux-kernel,
linux-fsdevel
On 01/15/2015 06:10 PM, Eric Wong wrote:
> Jason Baron <jbaron@akamai.com> wrote:
>> I've done a bit of performance evaluation on a dual socket, 10 core, hyper
>> threading enabled box: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz. For the
>> simple epfdN->epfdN->pipefdN topology case where each thread has its
>> own unique files and is doing EPOLL_CTL_ADD and EPOLL_CTL_DEL on the pipefd,
>> I see an almost 300% improvement. This is obviously a very contrived case,
>> but shows the motivation for this patch.
> Any improvements for non-contrived cases? :)
I plan to do some more testing and will post performance findings...
>> +++ b/include/linux/fs.h
>> @@ -835,6 +835,9 @@ struct file {
>> /* Used by fs/eventpoll.c to link all the hooks to this file */
>> struct list_head f_ep_links;
>> struct list_head f_tfile_llink;
>> + /* connected component */
>> + struct list_head f_ep_cc_link;
>> + struct ep_cc __rcu *f_ep_cc;
>> #endif /* #ifdef CONFIG_EPOLL */
> This size increase worries me. Perhaps this can be a separately
> allocated struct to avoid penalizing non-epoll users?
Agreed. That was why I marked this RFC. I was planning to try to shrink
some of the lists to singly-linked lists, but I think breaking it out as
'struct file_eventpoll' is a good suggestion. We could simply always
allocate it for an epfd, and then allocate it for a 'regular' struct file
on the first EPOLL_CTL_ADD. That would actually result in a net shrinkage
of 'struct file' by three pointers from where we are today, so I think
that would be nice.
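Something along these lines, just to sketch the idea (the field and struct
names here are hypothetical, and this isn't a worked-out patch):

/* All of the epoll bookkeeping moves out of 'struct file'... */
struct file_eventpoll {
	struct list_head f_ep_links;
	struct list_head f_tfile_llink;
	/* connected component */
	struct list_head f_ep_cc_link;
	struct ep_cc __rcu *f_ep_cc;
};

struct file {
	/* ... */
	/*
	 * ...leaving a single pointer here: allocated in epoll_create()
	 * for epoll files, on the first EPOLL_CTL_ADD for regular files,
	 * and NULL for files never used with epoll.
	 */
	struct file_eventpoll *f_ep;
	/* ... */
};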
Thanks,
-Jason