All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: chiluk@canonical.com, Rafael Tinoco <rafael.tinoco@canonical.com>,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	Christopher Arges <chris.j.arges@canonical.com>,
	Jay Vosburgh <jay.vosburgh@canonical.com>
Subject: Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
Date: Wed, 11 Jun 2014 15:52:28 -0700	[thread overview]
Message-ID: <20140611225228.GO4581@linux.vnet.ibm.com> (raw)
In-Reply-To: <874mzr41kf.fsf@x220.int.ebiederm.org>

On Wed, Jun 11, 2014 at 01:46:08PM -0700, Eric W. Biederman wrote:
> Dave Chiluk <chiluk@canonical.com> writes:
> 
> > On 06/11/2014 11:18 AM, Paul E. McKenney wrote:
> >> On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
> >>> Now think about what happens when a gateway goes down, the namespaces
> >>> need to be migrated, or a new machine needs to be brought up to replace
> >>> it.  When we're talking about 3000 namespaces, the amount of time it
> >>> takes simply to recreate the namespaces becomes very significant.
> >>>
> >>> The script is a stripped down example of what exactly is being done on
> >>> the neutron gateway in order to create namespaces.
> >> 
> >> Are the namespaces torn down and recreated one at a time, or is there some
> >> syscall, ioctl(), or whatever that allows bulk tear down and recreating?
> >> 
> >> 							Thanx, Paul
> >
> > In the normal running case, the namespaces are created one at a time, as
> > new customers create a new set of VMs on the cloud.
> >
> > However, in the case of failover to a new neutron gateway the namespaces
> > are created all at once using the ip command (more or less serially).
> >
> > As far as I know there is no syscall or ioctl that allows bulk tear down
> > and recreation.  if such a beast exists that might be helpful.
> 
> Bulk teardown exists for network namespaces, and it happens
> automatically.  Bulk creation does not exist.  But then until now rcu
> was not know to even exist on the namespace creation path.
> 
> Which is what puzzles me.
> 
> Looking a little closer switch_task_namespaces which calls
> synchronize_rcu when the old nsproxy is dead may exists in both
> unshare/clone and in setns.  So that may be the culprit.
> 
> Certainly it is the only thing going on in the ip netns exec case.
> 
> ip netns add also performs a bind mount so we get into all of the vfs
> level locking as well.
> 
> On the chance it is dropping the old nsproxy which calls syncrhonize_rcu
> in switch_task_namespaces that is causing you problems I have attached
> a patch that changes from rcu_read_lock to task_lock for code that
> calls task_nsproxy from a different task.  The code should be safe
> and it should be an unquestions performance improvement but I have only
> compile tested it.
> 
> If you can try the patch it will tell is if the problem is the rcu
> access in switch_task_namespaces (the only one I am aware of network
> namespace creation) or if the problem rcu case is somewhere else.
> 
> If nothing else knowing which rcu accesses are causing the slow down
> seem important at the end of the day.
> 
> Eric
> 
> 
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> Date: Wed, 11 Jun 2014 13:33:47 -0700
> Subject: [PATCH] nsproxy: Protect remote reads of nsproxy with task_lock not rcu_read_lock.
> 
> Remote reads are rare and setns/clone can be slow because we are using
> syncrhonize_rcu.  Let's speed things up by using locking that does not
> optimize for a case that does not exist.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/namespace.c           |  4 ++--
>  fs/proc/proc_net.c       |  2 ++
>  fs/proc_namespace.c      |  6 ++----
>  include/linux/nsproxy.h  |  6 +++---
>  ipc/namespace.c          |  4 ++--
>  kernel/nsproxy.c         | 12 +++---------
>  kernel/utsname.c         |  4 ++--
>  net/core/net_namespace.c |  6 ++++--
>  8 files changed, 20 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 182bc41cd887..2d52c1676bbb 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2972,13 +2972,13 @@ static void *mntns_get(struct task_struct *task)
>  	struct mnt_namespace *ns = NULL;
>  	struct nsproxy *nsproxy;
> 
> -	rcu_read_lock();
> +	task_lock(task);
>  	nsproxy = task_nsproxy(task);
>  	if (nsproxy) {
>  		ns = nsproxy->mnt_ns;
>  		get_mnt_ns(ns);
>  	}
> -	rcu_read_unlock();
> +	task_unlock(task);
> 
>  	return ns;
>  }
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 4677bb7dc7c2..a5e2d5576645 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -113,9 +113,11 @@ static struct net *get_proc_task_net(struct inode *dir)
>  	rcu_read_lock();
>  	task = pid_task(proc_pid(dir), PIDTYPE_PID);
>  	if (task != NULL) {
> +		task_lock(task);
>  		ns = task_nsproxy(task);
>  		if (ns != NULL)
>  			net = get_net(ns->net_ns);
> +		task_unlock(task);
>  	}
>  	rcu_read_unlock();
> 
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 1a81373947f3..2b0f6455af54 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -232,17 +232,15 @@ static int mounts_open_common(struct inode *inode, struct file *file,
>  	if (!task)
>  		goto err;
> 
> -	rcu_read_lock();
> +	task_lock(task);
>  	nsp = task_nsproxy(task);
>  	if (!nsp || !nsp->mnt_ns) {
> -		rcu_read_unlock();
> +		task_unlock(task);
>  		put_task_struct(task);
>  		goto err;
>  	}
>  	ns = nsp->mnt_ns;
>  	get_mnt_ns(ns);
> -	rcu_read_unlock();
> -	task_lock(task);
>  	if (!task->fs) {
>  		task_unlock(task);
>  		put_task_struct(task);
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index b4ec59d159ac..229aeb8ade5b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -46,7 +46,7 @@ extern struct nsproxy init_nsproxy;
>   *     precautions should be taken - just dereference the pointers
>   *
>   *  3. the access to other task namespaces is performed like this
> - *     rcu_read_lock();
> + *     task_lock(tsk);
>   *     nsproxy = task_nsproxy(tsk);
>   *     if (nsproxy != NULL) {
>   *             / *
> @@ -57,13 +57,13 @@ extern struct nsproxy init_nsproxy;
>   *         * NULL task_nsproxy() means that this task is
>   *         * almost dead (zombie)
>   *         * /
> - *     rcu_read_unlock();
> + *     task_unlock(tsk);
>   *
>   */
> 
>  static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
>  {
> -	return rcu_dereference(tsk->nsproxy);
> +	return tsk->nsproxy;
>  }
> 
>  int copy_namespaces(unsigned long flags, struct task_struct *tsk);
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 59451c1e214d..15b2ee95c3a9 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -154,11 +154,11 @@ static void *ipcns_get(struct task_struct *task)
>  	struct ipc_namespace *ns = NULL;
>  	struct nsproxy *nsproxy;
> 
> -	rcu_read_lock();
> +	task_lock(task);
>  	nsproxy = task_nsproxy(task);
>  	if (nsproxy)
>  		ns = get_ipc_ns(nsproxy->ipc_ns);
> -	rcu_read_unlock();
> +	task_unlock(task);
> 
>  	return ns;
>  }
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index 8e7811086b82..20a9929ce342 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -204,18 +204,12 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
> 
>  	might_sleep();
> 
> +	task_lock(p);
>  	ns = p->nsproxy;
> -
> -	rcu_assign_pointer(p->nsproxy, new);
> +	p->nsproxy = new;
> +	task_unlock(p);
> 
>  	if (ns && atomic_dec_and_test(&ns->count)) {
> -		/*
> -		 * wait for others to get what they want from this nsproxy.
> -		 *
> -		 * cannot release this nsproxy via the call_rcu() since
> -		 * put_mnt_ns() will want to sleep
> -		 */
> -		synchronize_rcu();
>  		free_nsproxy(ns);

If this is the culprit, another approach would be to use workqueues from
RCU callbacks.  The following (untested, probably does not even build)
patch illustrates one such approach.

							Thanx, Paul

------------------------------------------------------------------------

nsproxy: Substitute call_rcu/queue_work for synchronize_rcu

This commit gets synchronize_rcu() out of the way by getting the work
done in workqueue context from an RCU callback function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d159ac..489bf4c7a3a0 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -33,6 +33,10 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	union cu {
+		struct rcu_head      rh;
+		struct work_struct   ws;
+	};
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e7811086b82..d9a4730ce386 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -27,6 +27,7 @@
 #include <linux/syscalls.h>
 
 static struct kmem_cache *nsproxy_cachep;
+static struct workqueue_struct *nsproxy_wq;
 
 struct nsproxy init_nsproxy = {
 	.count			= ATOMIC_INIT(1),
@@ -198,6 +199,21 @@ out:
 	return err;
 }
 
+static void free_nsproxy_wq(struct work_struct *work)
+{
+	struct nsproxy *ns = container_of(rhp, struct nsproxy, cu.ws);
+
+	free_nsproxy(ns);
+}
+
+static void free_nsproxy_rcu(struct rcu_head *rhp)
+{
+	struct nsproxy *ns = container_of(rhp, struct nsproxy, cu.rh);
+
+	INIT_WORK(&ns->cu.ws, free_nsproxy_wq);
+	queue_work(nsproxy_wq, &ns->cu.ws);
+}
+
 void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
 {
 	struct nsproxy *ns;
@@ -210,13 +226,10 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
 
 	if (ns && atomic_dec_and_test(&ns->count)) {
 		/*
-		 * wait for others to get what they want from this nsproxy.
-		 *
-		 * cannot release this nsproxy via the call_rcu() since
-		 * put_mnt_ns() will want to sleep
+		 * Invoke free_nsproxy() (from workqueue context) to clean
+		 * up after others to get what they want from this nsproxy.
 		 */
-		synchronize_rcu();
-		free_nsproxy(ns);
+		call_rcu(&ns->rh, free_nsproxy_rcu);
 	}
 }
 
@@ -264,5 +277,6 @@ out:
 int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_wq = alloc_workqueue("nsproxy_wq", WQ_MEM_RECLAIM, 0);
 	return 0;
 }


  parent reply	other threads:[~2014-06-11 22:52 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-11  5:52 Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus Rafael Tinoco
2014-06-11  7:07 ` Eric W. Biederman
2014-06-11 13:39 ` Paul E. McKenney
2014-06-11 15:17   ` Rafael Tinoco
2014-06-11 15:46     ` David Chiluk
2014-06-11 16:18       ` Paul E. McKenney
2014-06-11 18:27         ` Dave Chiluk
2014-06-11 19:48           ` Paul E. McKenney
2014-06-11 20:55             ` Eric W. Biederman
2014-06-11 21:03               ` Rafael Tinoco
2014-06-11 20:46           ` Eric W. Biederman
2014-06-11 21:14             ` Dave Chiluk
2014-06-11 22:52             ` Paul E. McKenney [this message]
2014-06-11 23:12               ` Eric W. Biederman
2014-06-11 23:49                 ` Paul E. McKenney
2014-06-12  0:14                   ` Eric W. Biederman
2014-06-12  0:25                     ` Rafael Tinoco
2014-06-12  1:09                       ` Eric W. Biederman
2014-06-12  1:14                         ` Rafael Tinoco
     [not found]                           ` <CAJE_dJzjcWP=e_CPM1M64URVHiEFFb+fP6g2YKZVdoFntkQMZg@mail.gmail.com>
2014-06-13 18:22                             ` Rafael Tinoco
2014-06-14  0:02                             ` Eric W. Biederman
2014-06-16 15:01                               ` Rafael Tinoco
2014-07-17 12:05                                 ` Rafael David Tinoco
2014-07-24  7:01                                   ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140611225228.GO4581@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=chiluk@canonical.com \
    --cc=chris.j.arges@canonical.com \
    --cc=davem@davemloft.net \
    --cc=ebiederm@xmission.com \
    --cc=jay.vosburgh@canonical.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rafael.tinoco@canonical.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.