All of lore.kernel.org
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ying Xue <ying.xue@windriver.com>,
	netdev@vger.kernel.org, cwang@twopensource.com, xemul@openvz.org,
	davem@davemloft.net, eric.dumazet@gmail.com,
	maxk@qti.qualcomm.com, stephen@networkplumber.org, tgraf@suug.ch,
	nicolas.dichtel@6wind.com, tom@herbertland.com,
	jchapman@katalix.com, erik.hugne@ericsson.com,
	jon.maloy@ericsson.com, horms@verge.net.au
Subject: Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
Date: Fri, 08 May 2015 12:36:56 -0500	[thread overview]
Message-ID: <87sib76kef.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <20150508140733.GA13325@gondor.apana.org.au> (Herbert Xu's message of "Fri, 8 May 2015 22:07:33 +0800")

Herbert Xu <herbert@gondor.apana.org.au> writes:

> On Thu, May 07, 2015 at 11:14:13AM -0500, Eric W. Biederman wrote:
>>
>> The following change shows how it is possible to always know that your
>> network namespace has a non-zero reference count in the network
>> namespace initialization methods.  My implementation of
>> lock_network_namespaces is problematic in that it does not sleep
>> while network namespaces are unregistering.  But it is enough to show
>> how the locking and reference counting can be fixed.
>
> If name spaces are created/deleted constantly this could take
> a long time.  How about forcibly cleaning up entries when we
> register a new op?

That certainly seems more sensible than waiting for an indefinite amount
of time, and the code looks reasonably sound.  Placing on things on
lists and then not removing them, and then later stomping the list
pointers gives me flashbacks of other bugs I have fought, but we already
do that in this code, and in this context it appears harmless.

So overall I think I like it.

At the same time while a network namespace remains discoverable we might
be using it in small ways.  So I am wondering if this is the right
approach.

It really is invalid for a network namespace init routine to grab the
reference count of it's network namespace (thus making the network
namespace unfreeable).  So I am wondering if perhaps all we need to do
is find a clean refactoring of the socket code so this case does not
come up at all.

Perhaps just a flag that says this is a kernel socket so don't get/put
the refcount on the network namespace.  

> ---8<---
> Subject: netns: Do not init dead nets when registering new op
>
> Currently when a new op is registered we will initialise all
> current namespaces through that op, even those that are zombies
> (zero ref count).  This is bad because it puts the onus on each
> op implementor to deal with this and they are bound to slip up.
>
> This patch fixes this by only initialising live namespaces.
>
> This however brings about a new problem as when those zombies
> are eventually cleaned up we will call op->exit even on them
> even though they have not been initialised.
>
> This is fixed by bringing forward parts of the work of the cleanup
> thread and performing them during registration.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index f733656..4931100 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -133,6 +133,7 @@ struct net {
>  #endif
>  	struct sock		*diag_nlsk;
>  	atomic_t		fnhe_genid;
> +	bool			dead;
>  };
>  
>  #include <linux/seq_file_net.h>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 78fc04a..1f888a7 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -339,41 +339,38 @@ struct net *copy_net_ns(unsigned long flags,
>  	return net;
>  }
>  
> -static DEFINE_SPINLOCK(cleanup_list_lock);
> -static LIST_HEAD(cleanup_list);  /* Must hold cleanup_list_lock to touch */
> -
> -static void cleanup_net(struct work_struct *work)
> +static void unlink_net(struct net *net, struct list_head *net_exit_list)
>  {
> -	const struct pernet_operations *ops;
> -	struct net *net, *tmp;
> -	struct list_head net_kill_list;
> -	LIST_HEAD(net_exit_list);
> +	struct net *tmp;
>  
> -	/* Atomically snapshot the list of namespaces to cleanup */
> -	spin_lock_irq(&cleanup_list_lock);
> -	list_replace_init(&cleanup_list, &net_kill_list);
> -	spin_unlock_irq(&cleanup_list_lock);
> +	if (net->dead)
> +		return;
>  
> -	mutex_lock(&net_mutex);
> +	net->dead = true;
> +
> +	list_add_tail(&net->exit_list, net_exit_list);
>  
>  	/* Don't let anyone else find us. */
>  	rtnl_lock();
> -	list_for_each_entry(net, &net_kill_list, cleanup_list) {
> -		list_del_rcu(&net->list);
> -		list_add_tail(&net->exit_list, &net_exit_list);
> -		for_each_net(tmp) {
> -			int id = __peernet2id(tmp, net, false);
> +	list_del_rcu(&net->list);
>  
> -			if (id >= 0) {
> -				rtnl_net_notifyid(tmp, net, RTM_DELNSID, id);
> -				idr_remove(&tmp->netns_ids, id);
> -			}
> -		}
> -		idr_destroy(&net->netns_ids);
> +	for_each_net(tmp) {
> +		int id = __peernet2id(tmp, net, false);
>  
> +		if (id >= 0) {
> +			rtnl_net_notifyid(tmp, net, RTM_DELNSID, id);
> +			idr_remove(&tmp->netns_ids, id);
> +		}
>  	}
>  	rtnl_unlock();
>  
> +	idr_destroy(&net->netns_ids);
> +}
> +
> +static void exit_all_ops(struct list_head *net_exit_list)
> +{
> +	const struct pernet_operations *ops;
> +
>  	/*
>  	 * Another CPU might be rcu-iterating the list, wait for it.
>  	 * This needs to be before calling the exit() notifiers, so
> @@ -383,11 +380,33 @@ static void cleanup_net(struct work_struct *work)
>  
>  	/* Run all of the network namespace exit methods */
>  	list_for_each_entry_reverse(ops, &pernet_list, list)
> -		ops_exit_list(ops, &net_exit_list);
> +		ops_exit_list(ops, net_exit_list);
>  
>  	/* Free the net generic variables */
>  	list_for_each_entry_reverse(ops, &pernet_list, list)
> -		ops_free_list(ops, &net_exit_list);
> +		ops_free_list(ops, net_exit_list);
> +}
> +
> +static DEFINE_SPINLOCK(cleanup_list_lock);
> +static LIST_HEAD(cleanup_list);  /* Must hold cleanup_list_lock to touch */
> +
> +static void cleanup_net(struct work_struct *work)
> +{
> +	struct list_head net_kill_list;
> +	struct net *net, *tmp;
> +	LIST_HEAD(net_exit_list);
> +
> +	/* Atomically snapshot the list of namespaces to cleanup */
> +	spin_lock_irq(&cleanup_list_lock);
> +	list_replace_init(&cleanup_list, &net_kill_list);
> +	spin_unlock_irq(&cleanup_list_lock);
> +
> +	mutex_lock(&net_mutex);
> +
> +	list_for_each_entry(net, &net_kill_list, cleanup_list)
> +		unlink_net(net, &net_exit_list);
> +
> +	exit_all_ops(&net_exit_list);
>  
>  	mutex_unlock(&net_mutex);
>  
> @@ -397,8 +416,7 @@ static void cleanup_net(struct work_struct *work)
>  	rcu_barrier();
>  
>  	/* Finally it is safe to free my network namespace structure */
> -	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
> -		list_del_init(&net->exit_list);
> +	list_for_each_entry_safe(net, tmp, &net_kill_list, cleanup_list) {
>  		put_user_ns(net->user_ns);
>  		net_drop_ns(net);
>  	}
> @@ -727,30 +745,59 @@ static int __init net_ns_init(void)
>  pure_initcall(net_ns_init);
>  
>  #ifdef CONFIG_NET_NS
> +static void grab_all_nets(struct list_head *all_net_list)
> +{
> +	struct net *net;
> +	LIST_HEAD(net_exit_list);
> +
> +	for_each_net(net) {
> +		if (maybe_get_net(net)) 
> +			list_add_tail(&net->cleanup_list, all_net_list);
> +		else
> +			unlink_net(net, &net_exit_list);
> +	}
> +
> +	exit_all_ops(&net_exit_list);
> +}
> +
> +static void drop_all_nets(struct list_head *all_net_list)
> +{
> +	struct net *net, *next;
> +
> +	list_for_each_entry_safe(net, next, all_net_list, cleanup_list)
> +		put_net(net);
> +}
> +
>  static int __register_pernet_operations(struct list_head *list,
>  					struct pernet_operations *ops)
>  {
>  	struct net *net;
> -	int error;
> +	int error = 0;
> +	LIST_HEAD(all_net_list);
>  	LIST_HEAD(net_exit_list);
>  
> +	grab_all_nets(&all_net_list);
> +
>  	list_add_tail(&ops->list, list);
>  	if (ops->init || (ops->id && ops->size)) {
> -		for_each_net(net) {
> +		list_for_each_entry(net, &all_net_list, cleanup_list) {
>  			error = ops_init(ops, net);
>  			if (error)
>  				goto out_undo;
>  			list_add_tail(&net->exit_list, &net_exit_list);
>  		}
>  	}
> -	return 0;
> +
> +out_drop_nets:
> +	drop_all_nets(&all_net_list);
> +	return error;
>  
>  out_undo:
>  	/* If I have an error cleanup all namespaces I initialized */
>  	list_del(&ops->list);
>  	ops_exit_list(ops, &net_exit_list);
>  	ops_free_list(ops, &net_exit_list);
> -	return error;
> +	goto out_drop_nets;
>  }
>  
>  static void __unregister_pernet_operations(struct pernet_operations *ops)

  reply	other threads:[~2015-05-08 17:41 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create() Ying Xue
2015-05-07  9:04   ` Herbert Xu
2015-05-07 17:19     ` Cong Wang
2015-05-07 17:28       ` Eric W. Biederman
2015-05-08 11:20       ` Eric W. Biederman
2015-05-08 11:20       ` Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 02/11] netlink: avoid unnecessary namespace switch when create netlink kernel sockets Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 03/11] tun: avoid unnecessary namespace switch during kernel socket creation Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 04/11] inet: " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 05/11] udp_tunnel: avoid to switch namespace for tunnel socket Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 06/11] ip6_udp_tunnel: " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 07/11] l2tp: avoid to switch namespace for l2tp " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 08/11] ipvs: avoid to switch namespace for ipvs kernel socket Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 09/11] tipc: fix net leak issue Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 10/11] tipc: remove sk_change_net interface Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 11/11] net: change behaviours of functions of creating and releasing kernel sockets Ying Xue
2015-05-07 16:14 ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Eric W. Biederman
2015-05-07 18:19   ` Cong Wang
2015-05-07 18:26     ` Eric W. Biederman
2015-05-07 18:53       ` Cong Wang
2015-05-07 18:58         ` Eric W. Biederman
2015-05-07 19:29           ` Cong Wang
2015-05-07 20:01             ` Eric W. Biederman
2015-05-08  9:10               ` Ying Xue
2015-05-08 11:15                 ` Eric W. Biederman
2015-05-08  8:50   ` Ying Xue
2015-05-08  9:25     ` Ying Xue
2015-05-08 11:07     ` Eric W. Biederman
2015-05-08 16:33       ` Cong Wang
2015-05-08 14:07   ` Herbert Xu
2015-05-08 17:36     ` Eric W. Biederman [this message]
2015-05-08 20:27       ` Cong Wang
2015-05-08 21:13         ` Cong Wang
2015-05-08 22:08           ` Eric W. Biederman
2015-05-09  1:13       ` Herbert Xu
2015-05-09  1:53         ` Eric W. Biederman
2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
2015-05-09  2:07           ` [PATCH 1/6] tun: Utilize the normal socket network namespace refcounting Eric W. Biederman
2015-05-09  2:08           ` [PATCH 2/6] net: Add a struct net parameter to sock_create_kern Eric W. Biederman
2015-05-12  8:24             ` David Laight
2015-05-12  8:55               ` Eric W. Biederman
2015-05-12 11:48                 ` David Laight
2015-05-12 12:28                   ` Nicolas Dichtel
2015-05-12 13:16                     ` David Laight
2015-05-12 14:15                       ` Nicolas Dichtel
2015-05-12 15:58                       ` Eric W. Biederman
2015-05-12 14:45               ` David Miller
2015-05-09  2:09           ` [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc Eric W. Biederman
2015-05-09 16:51             ` Eric Dumazet
2015-05-09 17:31               ` Eric W. Biederman
2015-05-09  2:10           ` [PATCH 4/6] net: Modify sk_alloc to not reference count the netns of kernel sockets Eric W. Biederman
2015-05-09  2:11           ` [PATCH 5/6] netlink: Create kernel netlink sockets in the proper network namespace Eric W. Biederman
2015-05-09  2:12           ` [PATCH 6/6] net: kill sk_change_net and sk_release_kernel Eric W. Biederman
2015-05-09  2:38           ` [PATCH 0/6] Cleanup the kernel sockets Herbert Xu
2015-05-11 14:53           ` David Miller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87sib76kef.fsf@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=cwang@twopensource.com \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=erik.hugne@ericsson.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=horms@verge.net.au \
    --cc=jchapman@katalix.com \
    --cc=jon.maloy@ericsson.com \
    --cc=maxk@qti.qualcomm.com \
    --cc=netdev@vger.kernel.org \
    --cc=nicolas.dichtel@6wind.com \
    --cc=stephen@networkplumber.org \
    --cc=tgraf@suug.ch \
    --cc=tom@herbertland.com \
    --cc=xemul@openvz.org \
    --cc=ying.xue@windriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.