From: ebiederm@xmission.com (Eric W. Biederman)
To: paulmck@linux.vnet.ibm.com
Cc: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>,
linux-kernel@vger.kernel.org, serge.hallyn@canonical.com,
akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
davem@davemloft.net, cmetcalf@tilera.com
Subject: Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
Date: Tue, 22 Oct 2013 11:47:44 -0700 [thread overview]
Message-ID: <8761spw3an.fsf@xmission.com> (raw)
In-Reply-To: <20131022165331.GA4118@linux.vnet.ibm.com> (Paul E. McKenney's message of "Tue, 22 Oct 2013 09:53:32 -0700")
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> On Tue, Oct 22, 2013 at 05:52:49PM +0200, Guillaume Gaudonville wrote:
>> On 10/19/2013 12:34 AM, Eric W. Biederman wrote:
>> >"Guillaume Gaudonville" <gaudonville@6wind.com> writes:
>> >
>> >>Currently, at each call of setns system call a new nsproxy is allocated,
>> >>the old nsproxy namespaces are copied into the new one and the old nsproxy
>> >>is freed if the task was the only one to use it.
>> >In principle this looks ok. However you are not using rcu properly.
>> >
>> >What you are doing is just far enough outside of normal rcu usage my
>> >brain refuses to think it through today.
>> Understood, since they are not dereferenced under rcu_read_lock()
>> and not freed under rcu protection.
>> >Paul can you give us a hand?
>
> Maybe...
>
> First let me see if I understand what you are trying to do...
>
> The pattern of code is consistent with a situation where you add data
> elements to a linked structure, but restrict removals in one of the
> following ways:
>
> 1. Never do removals.
>
> 2. Any data elements removed are leaked, so that they are never
> reused.
>
> 3. Any data elements removed are added elsewhere in the overall
> linked structure, and the possibility of readers being carried
> along with a given data element is correctly dealt with. This
> is actually useful in some cases, but it is harder to get right
> than it might initially appear.
>
> In this case, insertions can use rcu_assign_pointer() or one of
> the higher-level primitives based on rcu_assign_pointer(), such as
> list_add_rcu(). Traversals can use rcu_dereference().
>
> If for some reason rcu_assign_pointer() is considered bad, you would
> need to do something like the following:
>
> q = kmalloc(...);
> initialize(q);
> smp_wmb();
> ACCESS_ONCE(global_q) = q;
>
> Similarly, if for some reason rcu_dereference() is considered bad, you
> would need to do something like the following:
>
> q = ACCESS_ONCE(global_q);
> smp_read_barrier_depends();
> do_something_with(q);
>
> But I would recommend sticking with rcu_assign_pointer() and
> rcu_dereference(). They encapsulate the needed operations nicely
> and their action is well understood.
>
> Of course, if you don't have RCU read-side critical sections, then
> rcu_dereference() will complain given CONFIG_PROVE_RCU=y. One way
> to avoid this is to use rcu_dereference_raw() instead of
> rcu_dereference(), but in this case please add a comment saying
> what you are doing.
>
> Does this make sense, or am I confused about what you are trying to
> do?
Roughly I think it makes sense. I am still not certain if
rcu_assign_pointer without the barriers that come with a lock is safe.
My brain has processes this enough to say that we need to mark the
pointer in nsproxy as rcu pointers and use normal rcu discipline on them
regardless of the outcome of this patch.
Unfortunately there is a moderately deep problem with this approach.
Today we have normal refcounting and no rcu on the namespaces as seen by
nsproxy. And we get our rcu liveness guarantees by calling
synchronize_rcu before putting the nsproxy. Generally that is not a
problem.
There are cases where the synchronize_rcu makes the setns syscall take
more time than people would like to pay. Guillaume was attempting to
optimize that out.
Howevever fundamentally without chaning the namespace reference counting
we can not optimize out the synchronize_rcu, or else things like put_net
will execute too soon and we may point at a trashed data structure
inside of the rcu critical section.
This review has pointed out quite a bit of bit rot in the code that
needs to be cleaned up. Unfortunately the optimization is invalid.
The best I can see happening is adding a rcu head into nsproxy and
using call_rcu to delay the dropping of the reference counts, and I
don't think that is worth it.
>> >>There are still 2 suspicious functions, nfs_server_list_open() and
>> >>nfs_volume_list_open(). They are accessing directly to the net_ns
>> >>like below:
>> >>
>> >>struct net *net = pid_ns->child_reaper->nsproxy->net_ns;
Ick.
>> >>
>> >>It seems to me that currently they should access it under rcu_read_lock()
>> >>and using task_nsproxy(pid_ns->child_reaper). It looks like a bug, no?
>> Do you agree there's also an issue around there?
Yes. That pid_ns->child_reaper access is ugly. The child_reaper in
protected by the task_list_lock which is not held there.
Then once you have safely read child_reaper than you need task_nsproxy
and all of the rcu fun.
>> >>@@ -647,14 +647,15 @@ static void netns_put(void *ns)
>> >> static int netns_install(struct nsproxy *nsproxy, void *ns)
>> >> {
>> >>- struct net *net = ns;
>> >>+ struct net *old_net, *net = ns;
>> >> if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
>> >> !nsown_capable(CAP_SYS_ADMIN))
>> >> return -EPERM;
>> >>- put_net(nsproxy->net_ns);
>> >>- nsproxy->net_ns = get_net(net);
>> >>+ old_net = nsproxy->net_ns;
>> >>+ rcu_assign_pointer(nsproxy->net_ns, get_net(net));
>> >>+ put_net(old_net);
>> >The ordering of operations is correct. rcu_assign_pointer
>> >is not correct because net_ns is not rcu protected.
>> Agreed, I think we need a barrier between the pointer assignment and
>> the put, something like:
>>
>> nsproxy->net_ns = get_net(net);
>> smp_wmb();
>> put_net(old_net);
There is a deep problem here. From Paul's comments and thinking about it
this line seems fine.
rcu_assign_pointer(nsproxy->net_ns, get_net(net));
However to be safe with the current guarantees the code needs to be:
synchronize_rcu();
put_net(old_net);
I am pretty certain we want to apply a patch that does some rcu things
with nsproxy for documentation purposes. Plus the rcu_dereferences you
added. Something like the patch below.
I don't know if that will make using current->nsproxy->foo_ns warn but
shrug.
Eric
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d159ac..63c19cdfdbfd 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -28,11 +28,11 @@ struct fs_struct;
*/
struct nsproxy {
atomic_t count;
- struct uts_namespace *uts_ns;
- struct ipc_namespace *ipc_ns;
- struct mnt_namespace *mnt_ns;
- struct pid_namespace *pid_ns_for_children;
- struct net *net_ns;
+ struct uts_namespace __rcu *uts_ns;
+ struct ipc_namespace __rcu *ipc_ns;
+ struct mnt_namespace __rcu *mnt_ns;
+ struct pid_namespace __rcu *pid_ns_for_children;
+ struct net __rcu *net_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 373f3abac94e..70c88d96053f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1200,7 +1200,7 @@ struct task_struct {
/* open file information */
struct files_struct *files;
/* namespaces */
- struct nsproxy *nsproxy;
+ struct nsproxy __rcu *nsproxy;
/* signal handlers */
struct signal_struct *signal;
struct sighand_struct *sighand;
next prev parent reply other threads:[~2013-10-22 18:48 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-15 17:35 [RFC PATCH linux-next] ns: do not allocate a new nsproxy at each call Guillaume Gaudonville
2013-10-15 18:40 ` Eric W. Biederman
2013-10-16 8:48 ` Guillaume Gaudonville
2013-10-16 19:42 ` Eric W. Biederman
2013-10-18 14:46 ` [RFC PATCH linux-next v2] " Guillaume Gaudonville
2013-10-18 22:34 ` Eric W. Biederman
2013-10-22 15:52 ` Guillaume Gaudonville
2013-10-22 16:53 ` Paul E. McKenney
2013-10-22 18:47 ` Eric W. Biederman [this message]
2013-10-22 19:44 ` Eric W. Biederman
2013-10-22 21:42 ` Paul E. McKenney
2013-10-23 8:27 ` Guillaume Gaudonville
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8761spw3an.fsf@xmission.com \
--to=ebiederm@xmission.com \
--cc=akpm@linux-foundation.org \
--cc=cmetcalf@tilera.com \
--cc=davem@davemloft.net \
--cc=guillaume.gaudonville@6wind.com \
--cc=linux-kernel@vger.kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=serge.hallyn@canonical.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.