linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
@ 2011-06-04  0:15 Al Viro
  2011-06-04 21:22 ` Al Viro
  2011-06-06 19:03 ` Eric W. Biederman
  0 siblings, 2 replies; 13+ messages in thread
From: Al Viro @ 2011-06-04  0:15 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds

sysfs_readdir() starts with doing

        ns = sysfs_info(dentry->d_sb)->ns[type];

then proceeds to scan the directory with sysfs_dir_{,next_}pos(ns, ...)
even though we have no promise whatsoever that sysfs_exit_ns() has not
cleared that element of array in the meanwhile.

Granted, sysfs_dir_pos() doesn't dereference ns; however, it does compare
with it.  And ns might have been freed and reused by that point.

I don't like what's going on there; that code looks inherently racy.
We never set ->ns[...] non-NULL outside of mount().  So it looks like
the intended behaviour is to have all ns-specific entries of that type
disappear forever from that fs instance.  Having entries for _another_
namespace to show up for a (short) while, and that only in readdir()
(lookup runs completely under sysfs_mutex, so we don't have that race
there)...

I also don't like the s_instances abuse in sysfs_exit_ns() but I'd like
to understand what is that code *for* before starting to massage it into
something saner.  What are the possible locking conditions in callers of
that thing?  I've tried to look for callers, but the (too long) callchain
has led me to pernet_operations ->exit() and at that point I'd given up -
that over the top for grepping...

BTW, is kobj_ns_current() pinned down?  Can that ns disappear from under
sysfs_mount()?  It looks like it shouldn't, but then I haven't traced the
callers of sysfs_exit_ns()...

What I'm planning to do (for unrelated reasons - ubifs needs it) is to add
an analog of iterate_supers() that would go over the superblocks of given
type and call a function on them.  I would like to convert sysfs_exit_ns()
to it and kill the last abuser of s_instances (and one of the last sb_lock
ones), but that really depends on what kind of locking is needed for
readdir() and friends - as it is, the damn thing looks *wrong*.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-04  0:15 [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs Al Viro
@ 2011-06-04 21:22 ` Al Viro
  2011-06-04 21:55   ` Linus Torvalds
  2011-06-06 19:03 ` Eric W. Biederman
  1 sibling, 1 reply; 13+ messages in thread
From: Al Viro @ 2011-06-04 21:22 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds

On Sat, Jun 04, 2011 at 01:15:19AM +0100, Al Viro wrote:
> Granted, sysfs_dir_pos() doesn't dereference ns; however, it does compare
> with it.  And ns might have been freed and reused by that point.
> 
> I don't like what's going on there; that code looks inherently racy.
> We never set ->ns[...] non-NULL outside of mount().  So it looks like
> the intended behaviour is to have all ns-specific entries of that type
> disappear forever from that fs instance.  Having entries for _another_
> namespace to show up for a (short) while, and that only in readdir()
> (lookup runs completely under sysfs_mutex, so we don't have that race
> there)...

Eeep...  We do not have a *race* in lookup.  However, any lookup done
after that ->ns[...] = NULL is going to do the following:

        mutex_lock(&sysfs_mutex);

        type = sysfs_ns_type(parent_sd);
        ns = sysfs_info(dir->i_sb)->ns[type];
/* i.e. ns = NULL */
        sd = sysfs_find_dirent(parent_sd, ns, dentry->d_name.name);
which will do rather unpleasant things:
        for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling) {
                if (ns && sd->s_ns && (sd->s_ns != ns))
                        continue;
                if (!strcmp(sd->s_name, name))
                        return sd;
        }
i.e. with ns == NULL it will outright ignore sd->s_ns and look for name
match and nothing else.  Any sysfs node with that name will do, whatever
it might have in ->s_ns.  E.g. for lookups in /sys/class/net it will find
the first sysfs node of network interface with that name in _some_ namespace.
Back to sysfs_lookup():
        /* no such entry */
        if (!sd) {
                ret = ERR_PTR(-ENOENT);
                goto out_unlock;
        }

        /* attach dentry and inode */
        inode = sysfs_get_inode(dir->i_sb, sd);
        if (!inode) {
                ret = ERR_PTR(-ENOMEM);
                goto out_unlock;
        }

and if there was an entry from some surviving net_ns with that name,
sysfs_get_inode() will cheerfully allocate an inode for us and associate
it with that sd.  Then we complete the lookup and return a shiny positive
dentry to caller...

Just what is that code supposed to do?  Somehow I doubt that "after net_ns
is gone, lookups for class/net/name succeed when there is an interface called
name in at least one net_ns; resulting object refers to one of such
interfaces, with no promises regarding which net_ns it is about" is the
intended behaviour here, especially since readdir() called after that
will skip all sysfs nodes with non-NULL ->s_ns, i.e. show empty class/net.

Frankly, my preference would be to kill the void * nonsense, introduce a
structure we'll embed into struct net (and all future tags) containing
refcount and "it is doomed, skip it" flag.  Purely passive refs - i.e. all
they guarantee is that memory won't be freed under you.  Having *active* refs
(i.e. your current net->count being non-zero) contributes 1 to that counter,
kfree() is done when that counter reaches zero.  With info->ns[] contributing
to passive refcount and exit_ns logics tagging the sucker as doomed and leaving
it at that.  kill_sb() would drop references, lookup and readdir check if
ns is doomed and skip the entry in that case.

However, it's your code and I don't know in which direction do you plan to take
it.  IMO it's badly overdesigned...

Are you OK with the sketch above?  I can probably cook a patch along those
lines by tomorrow...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-04 21:22 ` Al Viro
@ 2011-06-04 21:55   ` Linus Torvalds
  2011-06-04 22:23     ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2011-06-04 21:55 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric W. Biederman, linux-fsdevel

On Sun, Jun 5, 2011 at 6:22 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Are you OK with the sketch above?  I can probably cook a patch along those
> lines by tomorrow...

I'm ok with the sketch, but might think otherwise when actually seeing
the patch. And I wish some namespace person would look at this issue.
And the networking people would seem to need to be aware of it too,
but you've only cc'd fsdevel, so they are entirely unaware of the
whole thread..

             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-04 21:55   ` Linus Torvalds
@ 2011-06-04 22:23     ` Al Viro
  0 siblings, 0 replies; 13+ messages in thread
From: Al Viro @ 2011-06-04 22:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric W. Biederman, linux-fsdevel, netdev

On Sun, Jun 05, 2011 at 06:55:12AM +0900, Linus Torvalds wrote:
> On Sun, Jun 5, 2011 at 6:22 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Are you OK with the sketch above? ?I can probably cook a patch along those
> > lines by tomorrow...
> 
> I'm ok with the sketch, but might think otherwise when actually seeing
> the patch. And I wish some namespace person would look at this issue.
> And the networking people would seem to need to be aware of it too,
> but you've only cc'd fsdevel, so they are entirely unaware of the
> whole thread..

Eh...  The question had been to Eric, unless that pile is actually yours...
netdev Cc'd, and I'll forward the previous posting there in a few.

FWIW, reproducing that is trivial: on a box with netns enabled:

dizzy:~# unshare -n -- sh -c "mount -t sysfs none /mnt; ls -l /mnt/class/net; ls
 -l /mnt/class/net/eth0"                                                        
[ 1301.429755] IPv4 FIB: Using LC-trie version 0.409                            
total 0                                                                         
lrwxrwxrwx 1 root root 0 Jun  4 18:16 lo -> ../../devices/virtual/net/lo        
lrwxrwxrwx 1 root root 0 Jun  4 18:16 sit0 -> ../../devices/virtual/net/sit0    
ls: cannot access /mnt/class/net/eth0: No such file or directory                
dizzy:~# ls -l /mnt/class/net/; ls -l /mnt/class/net/eth0                       
total 0                                                                         
lrwxrwxrwx 1 root root 0 Jun  4 18:17 /mnt/class/net/eth0 -> ../../devices/pci00
00:00/0000:00:09.0/net/eth0                                                     

IOW, while netns is alive we get sane behaviour - lookup for class/net/eth0
fails, since there's no such object in that netns.  Once it's gone, we get
class/net in that sysfs instance empty on readdir *and* lookup for
class/net/eth0 succeeds giving us an object from another netns.

Apologies for not adding netdev - it started as sysfs-internal race, so the
beginning of that thread went to fsdevel...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
@ 2011-06-04 22:25 Al Viro
  0 siblings, 0 replies; 13+ messages in thread
From: Al Viro @ 2011-06-04 22:25 UTC (permalink / raw)
  To: netdev; +Cc: Eric W. Biederman, linux-fsdevel, Linus Torvalds

[Forwarded to netdev]

On Sat, Jun 04, 2011 at 01:15:19AM +0100, Al Viro wrote:
> Granted, sysfs_dir_pos() doesn't dereference ns; however, it does compare
> with it.  And ns might have been freed and reused by that point.
> 
> I don't like what's going on there; that code looks inherently racy.
> We never set ->ns[...] non-NULL outside of mount().  So it looks like
> the intended behaviour is to have all ns-specific entries of that type
> disappear forever from that fs instance.  Having entries for _another_
> namespace to show up for a (short) while, and that only in readdir()
> (lookup runs completely under sysfs_mutex, so we don't have that race
> there)...

Eeep...  We do not have a *race* in lookup.  However, any lookup done
after that ->ns[...] = NULL is going to do the following:

        mutex_lock(&sysfs_mutex);

        type = sysfs_ns_type(parent_sd);
        ns = sysfs_info(dir->i_sb)->ns[type];
/* i.e. ns = NULL */
        sd = sysfs_find_dirent(parent_sd, ns, dentry->d_name.name);
which will do rather unpleasant things:
        for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling) {
                if (ns && sd->s_ns && (sd->s_ns != ns))
                        continue;
                if (!strcmp(sd->s_name, name))
                        return sd;
        }
i.e. with ns == NULL it will outright ignore sd->s_ns and look for name
match and nothing else.  Any sysfs node with that name will do, whatever
it might have in ->s_ns.  E.g. for lookups in /sys/class/net it will find
the first sysfs node of network interface with that name in _some_ namespace.
Back to sysfs_lookup():
        /* no such entry */
        if (!sd) {
                ret = ERR_PTR(-ENOENT);
                goto out_unlock;
        }

        /* attach dentry and inode */
        inode = sysfs_get_inode(dir->i_sb, sd);
        if (!inode) {
                ret = ERR_PTR(-ENOMEM);
                goto out_unlock;
        }

and if there was an entry from some surviving net_ns with that name,
sysfs_get_inode() will cheerfully allocate an inode for us and associate
it with that sd.  Then we complete the lookup and return a shiny positive
dentry to caller...

Just what is that code supposed to do?  Somehow I doubt that "after net_ns
is gone, lookups for class/net/name succeed when there is an interface called
name in at least one net_ns; resulting object refers to one of such
interfaces, with no promises regarding which net_ns it is about" is the
intended behaviour here, especially since readdir() called after that
will skip all sysfs nodes with non-NULL ->s_ns, i.e. show empty class/net.

Frankly, my preference would be to kill the void * nonsense, introduce a
structure we'll embed into struct net (and all future tags) containing
refcount and "it is doomed, skip it" flag.  Purely passive refs - i.e. all
they guarantee is that memory won't be freed under you.  Having *active* refs
(i.e. your current net->count being non-zero) contributes 1 to that counter,
kfree() is done when that counter reaches zero.  With info->ns[] contributing
to passive refcount and exit_ns logics tagging the sucker as doomed and leaving
it at that.  kill_sb() would drop references, lookup and readdir check if
ns is doomed and skip the entry in that case.

However, it's your code and I don't know in which direction do you plan to take
it.  IMO it's badly overdesigned...

Are you OK with the sketch above?  I can probably cook a patch along those
lines by tomorrow...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-04  0:15 [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs Al Viro
  2011-06-04 21:22 ` Al Viro
@ 2011-06-06 19:03 ` Eric W. Biederman
  2011-06-07 21:58   ` Al Viro
  1 sibling, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2011-06-06 19:03 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers


Al Viro <viro@ZenIV.linux.org.uk> writes:

> What I'm planning to do (for unrelated reasons - ubifs needs it) is to add
> an analog of iterate_supers() that would go over the superblocks of given
> type and call a function on them.  I would like to convert sysfs_exit_ns()
> to it and kill the last abuser of s_instances (and one of the last sb_lock
> ones), but that really depends on what kind of locking is needed for
> readdir() and friends - as it is, the damn thing looks *wrong*.

Answering your primary question first, what locking is needed in
sysfs_exit_ns().

Wrapping my head around this code again, to the best of my memory
the intent was.

For consistency "info->type[ns]" is an atomic value (void *) so
it can be safely read and written without relying on locks.  For
things like lookup 

The assumption is that is primarily an atomic value and so it can
be safely read by things like sysfs_readdir() and get a valid value
without relying on the locks. 

sysfs_mutex is needed in things like sysfs_lookup if we want the
value to not change.

There is indeed a small race in sysfs_readdir.

As for sysfs_lookup it looks like my code to handle untagged members in
directories where everything else is tagged, such as
"/sys/class/net/bonding_masters" introduced an overloading of what NULL
means in the context of sysfs_readdir and sysfs_lookup.   ns == NULL can
either mean that we have type == KOBJ_NS_TYPE_NONE, or ns == NULL can
mean that the namespace has gone away beneath us.  I looks like I need
to fix that.

To sysfs_exit_ns() we have the call chain:
cleanup_net()
  ops_exit_list()
     net_kobj_ns_exit()
        sysfs_exit_ns()

Which makes the locking order needed for that call path.
net_mutex()
   rtnl_lock()
   sysfs_mutex()

Now somewhere I was also careful that mount did not cause problems,
with sysfs_exit_ns() but I forget where.

You were asking about kobj_ns_current.
kboj_ns_current()
  net_current_ns()
     current->nsproxy->net_ns

And current has a reference on it's network namespace.

Other pieces of information that should be helpful to know.
- All sysfs directory entries for a network namespace should be
  removed from sysfs by the time sysfs_exit_ns is called.

Al hopefully that is enough to get you going and I will what I can
do with the rest of the sysfs ugliness.

Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-06 19:03 ` Eric W. Biederman
@ 2011-06-07 21:58   ` Al Viro
  2011-06-07 22:59     ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2011-06-07 21:58 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers

On Mon, Jun 06, 2011 at 12:03:52PM -0700, Eric W. Biederman wrote:

> Other pieces of information that should be helpful to know.
> - All sysfs directory entries for a network namespace should be
>   removed from sysfs by the time sysfs_exit_ns is called.

Then why do we need to do _anything_ with ->ns[...]?  Is there any problem
with postponing actual freeing of that sucker until after umount, so that
memory doesn't get reused?  Since everything stale should be gone by the
point when sysfs_exit_ns() would put NULL into ->ns[...], we can as well
keep the old pointer in there - it won't match anything else for as long
as the struct net is not freed...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-07 21:58   ` Al Viro
@ 2011-06-07 22:59     ` Al Viro
  2011-06-09  1:26       ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2011-06-07 22:59 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers

On Tue, Jun 07, 2011 at 10:58:34PM +0100, Al Viro wrote:
> On Mon, Jun 06, 2011 at 12:03:52PM -0700, Eric W. Biederman wrote:
> 
> > Other pieces of information that should be helpful to know.
> > - All sysfs directory entries for a network namespace should be
> >   removed from sysfs by the time sysfs_exit_ns is called.
> 
> Then why do we need to do _anything_ with ->ns[...]?  Is there any problem
> with postponing actual freeing of that sucker until after umount, so that
> memory doesn't get reused?  Since everything stale should be gone by the
> point when sysfs_exit_ns() would put NULL into ->ns[...], we can as well
> keep the old pointer in there - it won't match anything else for as long
> as the struct net is not freed...

OK, how about the following:
	* extra refcount in struct net, initialized to 1.
	* new method in kobj_ns_type_operations: put_ns.  For struct net
that'd be "decrement that refcount by 1, if it hits zero call net_free()".
Existing callers of net_free() replaced by calls of that function.
	* current_ns semantics change: for struct net it'd bump that refcount.
	* sysfs_kill_sb() does corresponding put_ns on everything it finds in
its ->ns[...]
	* so does sysfs_mount() if it decides to discard the sysfs_super_info
after having found a duplicate
	* sysfs_exit_ns() is gone, along with kobj_ns_exit(), net_kobj_ns_exit()
and kobj_net_ops.

Completely untested patch follows:

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 2668957..1e7a2ee 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -111,8 +111,11 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 		info->ns[type] = kobj_ns_current(type);
 
 	sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
-	if (IS_ERR(sb) || sb->s_fs_info != info)
+	if (IS_ERR(sb) || sb->s_fs_info != info) {
+		for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
+			kobj_ns_put(type, info->ns[type]);
 		kfree(info);
+	}
 	if (IS_ERR(sb))
 		return ERR_CAST(sb);
 	if (!sb->s_root) {
@@ -131,11 +134,14 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 static void sysfs_kill_sb(struct super_block *sb)
 {
 	struct sysfs_super_info *info = sysfs_info(sb);
+	int type;
 
 	/* Remove the superblock from fs_supers/s_instances
 	 * so we can't find it, before freeing sysfs_super_info.
 	 */
 	kill_anon_super(sb);
+	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
+		kobj_ns_put(type, info->ns[type]);
 	kfree(info);
 }
 
@@ -145,28 +151,6 @@ static struct file_system_type sysfs_fs_type = {
 	.kill_sb	= sysfs_kill_sb,
 };
 
-void sysfs_exit_ns(enum kobj_ns_type type, const void *ns)
-{
-	struct super_block *sb;
-
-	mutex_lock(&sysfs_mutex);
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
-		struct sysfs_super_info *info = sysfs_info(sb);
-		/*
-		 * If we see a superblock on the fs_supers/s_instances
-		 * list the unmount has not completed and sb->s_fs_info
-		 * points to a valid struct sysfs_super_info.
-		 */
-		/* Ignore superblocks with the wrong ns */
-		if (info->ns[type] != ns)
-			continue;
-		info->ns[type] = NULL;
-	}
-	spin_unlock(&sb_lock);
-	mutex_unlock(&sysfs_mutex);
-}
-
 int __init sysfs_init(void)
 {
 	int err = -ENOMEM;
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 3d28af3..2ed2404 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -136,7 +136,7 @@ struct sysfs_addrm_cxt {
  * instance).
  */
 struct sysfs_super_info {
-	const void *ns[KOBJ_NS_TYPES];
+	void *ns[KOBJ_NS_TYPES];
 };
 #define sysfs_info(SB) ((struct sysfs_super_info *)(SB->s_fs_info))
 extern struct sysfs_dirent sysfs_root;
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 82cb5bf..5fa481c 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -38,9 +38,10 @@ enum kobj_ns_type {
  */
 struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
-	const void *(*current_ns)(void);
+	void *(*current_ns)(void);
 	const void *(*netlink_ns)(struct sock *sk);
 	const void *(*initial_ns)(void);
+	void (*put_ns)(void *);
 };
 
 int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
@@ -48,9 +49,9 @@ int kobj_ns_type_registered(enum kobj_ns_type type);
 const struct kobj_ns_type_operations *kobj_child_ns_ops(struct kobject *parent);
 const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
 
-const void *kobj_ns_current(enum kobj_ns_type type);
+void *kobj_ns_current(enum kobj_ns_type type);
 const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
 const void *kobj_ns_initial(enum kobj_ns_type type);
-void kobj_ns_exit(enum kobj_ns_type type, const void *ns);
+void kobj_ns_put(enum kobj_ns_type type, void *ns);
 
 #endif /* _LINUX_KOBJECT_NS_H */
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c3acda6..e2696d7 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -177,9 +177,6 @@ struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
 struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd);
 void sysfs_put(struct sysfs_dirent *sd);
 
-/* Called to clear a ns tag when it is no longer valid */
-void sysfs_exit_ns(enum kobj_ns_type type, const void *tag);
-
 int __must_check sysfs_init(void);
 
 #else /* CONFIG_SYSFS */
@@ -338,10 +335,6 @@ static inline void sysfs_put(struct sysfs_dirent *sd)
 {
 }
 
-static inline void sysfs_exit_ns(int type, const void *tag)
-{
-}
-
 static inline int __must_check sysfs_init(void)
 {
 	return 0;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 2bf9ed9..415af64 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -35,8 +35,11 @@ struct netns_ipvs;
 #define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
 
 struct net {
+	atomic_t		passive;	/* To decided when the network
+						 * namespace should be freed.
+						 */
 	atomic_t		count;		/* To decided when the network
-						 *  namespace should be freed.
+						 *  namespace should be shut down.
 						 */
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_t		use_count;	/* To track references we
@@ -154,6 +157,9 @@ int net_eq(const struct net *net1, const struct net *net2)
 {
 	return net1 == net2;
 }
+
+extern void net_put_ns(void *);
+
 #else
 
 static inline struct net *get_net(struct net *net)
@@ -175,6 +181,8 @@ int net_eq(const struct net *net1, const struct net *net2)
 {
 	return 1;
 }
+
+#define net_put_ns NULL
 #endif
 
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 82dc34c..f584cd4 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -948,9 +948,9 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj)
 }
 
 
-const void *kobj_ns_current(enum kobj_ns_type type)
+void *kobj_ns_current(enum kobj_ns_type type)
 {
-	const void *ns = NULL;
+	void *ns = NULL;
 
 	spin_lock(&kobj_ns_type_lock);
 	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
@@ -987,23 +987,15 @@ const void *kobj_ns_initial(enum kobj_ns_type type)
 	return ns;
 }
 
-/*
- * kobj_ns_exit - invalidate a namespace tag
- *
- * @type: the namespace type (i.e. KOBJ_NS_TYPE_NET)
- * @ns: the actual namespace being invalidated
- *
- * This is called when a tag is no longer valid.  For instance,
- * when a network namespace exits, it uses this helper to
- * make sure no sb's sysfs_info points to the now-invalidated
- * netns.
- */
-void kobj_ns_exit(enum kobj_ns_type type, const void *ns)
+void kobj_ns_put(enum kobj_ns_type type, void *ns)
 {
-	sysfs_exit_ns(type, ns);
+	spin_lock(&kobj_ns_type_lock);
+	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
+	    kobj_ns_ops_tbl[type])
+		kobj_ns_ops_tbl[type]->put_ns(ns);
+	spin_unlock(&kobj_ns_type_lock);
 }
 
-
 EXPORT_SYMBOL(kobject_get);
 EXPORT_SYMBOL(kobject_put);
 EXPORT_SYMBOL(kobject_del);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 11b98bc..b002c71 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1179,9 +1179,12 @@ static void remove_queue_kobjects(struct net_device *net)
 #endif
 }
 
-static const void *net_current_ns(void)
+static void *net_current_ns(void)
 {
-	return current->nsproxy->net_ns;
+	struct net *ns = current->nsproxy->net_ns;
+	if (ns)
+		atomic_inc(&ns->passive);
+	return ns;
 }
 
 static const void *net_initial_ns(void)
@@ -1199,19 +1202,10 @@ struct kobj_ns_type_operations net_ns_type_operations = {
 	.current_ns = net_current_ns,
 	.netlink_ns = net_netlink_ns,
 	.initial_ns = net_initial_ns,
+	.put_ns = net_put_ns,
 };
 EXPORT_SYMBOL_GPL(net_ns_type_operations);
 
-static void net_kobj_ns_exit(struct net *net)
-{
-	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
-}
-
-static struct pernet_operations kobj_net_ops = {
-	.exit = net_kobj_ns_exit,
-};
-
-
 #ifdef CONFIG_HOTPLUG
 static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
 {
@@ -1339,6 +1333,5 @@ EXPORT_SYMBOL(netdev_class_remove_file);
 int netdev_kobject_init(void)
 {
 	kobj_ns_type_register(&net_ns_type_operations);
-	register_pernet_subsys(&kobj_net_ops);
 	return class_register(&net_class);
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6c6b86d..19feb20 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -128,6 +128,7 @@ static __net_init int setup_net(struct net *net)
 	LIST_HEAD(net_exit_list);
 
 	atomic_set(&net->count, 1);
+	atomic_set(&net->passive, 1);
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
@@ -210,6 +211,13 @@ static void net_free(struct net *net)
 	kmem_cache_free(net_cachep, net);
 }
 
+void net_put_ns(void *p)
+{
+	struct net *ns = p;
+	if (ns && atomic_dec_and_test(&ns->passive))
+		net_free(ns);
+}
+
 struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 {
 	struct net *net;
@@ -230,7 +238,7 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	}
 	mutex_unlock(&net_mutex);
 	if (rv < 0) {
-		net_free(net);
+		net_put_ns(net);
 		return ERR_PTR(rv);
 	}
 	return net;
@@ -286,7 +294,7 @@ static void cleanup_net(struct work_struct *work)
 	/* Finally it is safe to free my network namespace structure */
 	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
 		list_del_init(&net->exit_list);
-		net_free(net);
+		net_put_ns(net);
 	}
 }
 static DECLARE_WORK(net_cleanup_work, cleanup_net);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-07 22:59     ` Al Viro
@ 2011-06-09  1:26       ` Al Viro
  2011-06-12  7:15         ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2011-06-09  1:26 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers

On Tue, Jun 07, 2011 at 11:59:02PM +0100, Al Viro wrote:

> Completely untested patch follows:

FWIW, modulo a couple of brainos it seems to work here without blowing
everything up.  Actual freeing of struct net is delayed until after
umount of sysfs instances refering it, shutdown is *not* delayed, sysfs
entries are removed nicely as they ought to, no objects from other net_ns
are picked by readdir or lookup...

If somebody has objections - yell.  If I don't hear anything, the following
goes to -next:

commit 72d4e98002b45598bb88af74eeac20874f2789be
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed Jun 8 21:13:01 2011 -0400

    Delay struct net freeing while there's a sysfs instance refering to it
    
    	* new refcount in struct net, controlling actual freeing of the memory
    	* new method in kobj_ns_type_operations (->put_ns())
    	* ->current_ns() semantics change - it's supposed to be followed by
    corresponding ->put_ns().  For struct net in case of CONFIG_NET_NS it bumps
    the new refcount; net_put_ns() decrements it and calls net_free() if the
    last reference has been dropped.
    	* old net_free() callers call net_put_ns() instead.
    	* sysfs_exit_ns() is gone, along with a large part of callchain
    leading to it; now that the references stored in ->ns[...] stay valid we
    do not need to hunt them down and replace them with NULL.  That fixes
    problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
    of sb->s_instances abuse.
    
    	Note that struct net *shutdown* logics has not changed - net_cleanup()
    is called exactly when it used to be called.  The only thing postponed by
    having a sysfs instance refering to that struct net is actual freeing of
    memory occupied by struct net.
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 2668957..1e7a2ee 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -111,8 +111,11 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 		info->ns[type] = kobj_ns_current(type);
 
 	sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
-	if (IS_ERR(sb) || sb->s_fs_info != info)
+	if (IS_ERR(sb) || sb->s_fs_info != info) {
+		for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
+			kobj_ns_put(type, info->ns[type]);
 		kfree(info);
+	}
 	if (IS_ERR(sb))
 		return ERR_CAST(sb);
 	if (!sb->s_root) {
@@ -131,11 +134,14 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 static void sysfs_kill_sb(struct super_block *sb)
 {
 	struct sysfs_super_info *info = sysfs_info(sb);
+	int type;
 
 	/* Remove the superblock from fs_supers/s_instances
 	 * so we can't find it, before freeing sysfs_super_info.
 	 */
 	kill_anon_super(sb);
+	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
+		kobj_ns_put(type, info->ns[type]);
 	kfree(info);
 }
 
@@ -145,28 +151,6 @@ static struct file_system_type sysfs_fs_type = {
 	.kill_sb	= sysfs_kill_sb,
 };
 
-void sysfs_exit_ns(enum kobj_ns_type type, const void *ns)
-{
-	struct super_block *sb;
-
-	mutex_lock(&sysfs_mutex);
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
-		struct sysfs_super_info *info = sysfs_info(sb);
-		/*
-		 * If we see a superblock on the fs_supers/s_instances
-		 * list the unmount has not completed and sb->s_fs_info
-		 * points to a valid struct sysfs_super_info.
-		 */
-		/* Ignore superblocks with the wrong ns */
-		if (info->ns[type] != ns)
-			continue;
-		info->ns[type] = NULL;
-	}
-	spin_unlock(&sb_lock);
-	mutex_unlock(&sysfs_mutex);
-}
-
 int __init sysfs_init(void)
 {
 	int err = -ENOMEM;
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 3d28af3..2ed2404 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -136,7 +136,7 @@ struct sysfs_addrm_cxt {
  * instance).
  */
 struct sysfs_super_info {
-	const void *ns[KOBJ_NS_TYPES];
+	void *ns[KOBJ_NS_TYPES];
 };
 #define sysfs_info(SB) ((struct sysfs_super_info *)(SB->s_fs_info))
 extern struct sysfs_dirent sysfs_root;
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 82cb5bf..5fa481c 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -38,9 +38,10 @@ enum kobj_ns_type {
  */
 struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
-	const void *(*current_ns)(void);
+	void *(*current_ns)(void);
 	const void *(*netlink_ns)(struct sock *sk);
 	const void *(*initial_ns)(void);
+	void (*put_ns)(void *);
 };
 
 int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
@@ -48,9 +49,9 @@ int kobj_ns_type_registered(enum kobj_ns_type type);
 const struct kobj_ns_type_operations *kobj_child_ns_ops(struct kobject *parent);
 const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
 
-const void *kobj_ns_current(enum kobj_ns_type type);
+void *kobj_ns_current(enum kobj_ns_type type);
 const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
 const void *kobj_ns_initial(enum kobj_ns_type type);
-void kobj_ns_exit(enum kobj_ns_type type, const void *ns);
+void kobj_ns_put(enum kobj_ns_type type, void *ns);
 
 #endif /* _LINUX_KOBJECT_NS_H */
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c3acda6..e2696d7 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -177,9 +177,6 @@ struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
 struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd);
 void sysfs_put(struct sysfs_dirent *sd);
 
-/* Called to clear a ns tag when it is no longer valid */
-void sysfs_exit_ns(enum kobj_ns_type type, const void *tag);
-
 int __must_check sysfs_init(void);
 
 #else /* CONFIG_SYSFS */
@@ -338,10 +335,6 @@ static inline void sysfs_put(struct sysfs_dirent *sd)
 {
 }
 
-static inline void sysfs_exit_ns(int type, const void *tag)
-{
-}
-
 static inline int __must_check sysfs_init(void)
 {
 	return 0;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 2bf9ed9..415af64 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -35,8 +35,11 @@ struct netns_ipvs;
 #define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
 
 struct net {
+	atomic_t		passive;	/* To decided when the network
+						 * namespace should be freed.
+						 */
 	atomic_t		count;		/* To decided when the network
-						 *  namespace should be freed.
+						 *  namespace should be shut down.
 						 */
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_t		use_count;	/* To track references we
@@ -154,6 +157,9 @@ int net_eq(const struct net *net1, const struct net *net2)
 {
 	return net1 == net2;
 }
+
+extern void net_put_ns(void *);
+
 #else
 
 static inline struct net *get_net(struct net *net)
@@ -175,6 +181,8 @@ int net_eq(const struct net *net1, const struct net *net2)
 {
 	return 1;
 }
+
+#define net_put_ns NULL
 #endif
 
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 82dc34c..43116b2 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -948,9 +948,9 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj)
 }
 
 
-const void *kobj_ns_current(enum kobj_ns_type type)
+void *kobj_ns_current(enum kobj_ns_type type)
 {
-	const void *ns = NULL;
+	void *ns = NULL;
 
 	spin_lock(&kobj_ns_type_lock);
 	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
@@ -987,23 +987,15 @@ const void *kobj_ns_initial(enum kobj_ns_type type)
 	return ns;
 }
 
-/*
- * kobj_ns_exit - invalidate a namespace tag
- *
- * @type: the namespace type (i.e. KOBJ_NS_TYPE_NET)
- * @ns: the actual namespace being invalidated
- *
- * This is called when a tag is no longer valid.  For instance,
- * when a network namespace exits, it uses this helper to
- * make sure no sb's sysfs_info points to the now-invalidated
- * netns.
- */
-void kobj_ns_exit(enum kobj_ns_type type, const void *ns)
+void kobj_ns_put(enum kobj_ns_type type, void *ns)
 {
-	sysfs_exit_ns(type, ns);
+	spin_lock(&kobj_ns_type_lock);
+	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
+	    kobj_ns_ops_tbl[type] && kobj_ns_ops_tbl[type]->put_ns)
+		kobj_ns_ops_tbl[type]->put_ns(ns);
+	spin_unlock(&kobj_ns_type_lock);
 }
 
-
 EXPORT_SYMBOL(kobject_get);
 EXPORT_SYMBOL(kobject_put);
 EXPORT_SYMBOL(kobject_del);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 11b98bc..0c799f1 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1179,9 +1179,14 @@ static void remove_queue_kobjects(struct net_device *net)
 #endif
 }
 
-static const void *net_current_ns(void)
+static void *net_current_ns(void)
 {
-	return current->nsproxy->net_ns;
+	struct net *ns = current->nsproxy->net_ns;
+#ifdef CONFIG_NET_NS
+	if (ns)
+		atomic_inc(&ns->passive);
+#endif
+	return ns;
 }
 
 static const void *net_initial_ns(void)
@@ -1199,19 +1204,10 @@ struct kobj_ns_type_operations net_ns_type_operations = {
 	.current_ns = net_current_ns,
 	.netlink_ns = net_netlink_ns,
 	.initial_ns = net_initial_ns,
+	.put_ns = net_put_ns,
 };
 EXPORT_SYMBOL_GPL(net_ns_type_operations);
 
-static void net_kobj_ns_exit(struct net *net)
-{
-	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
-}
-
-static struct pernet_operations kobj_net_ops = {
-	.exit = net_kobj_ns_exit,
-};
-
-
 #ifdef CONFIG_HOTPLUG
 static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
 {
@@ -1339,6 +1335,5 @@ EXPORT_SYMBOL(netdev_class_remove_file);
 int netdev_kobject_init(void)
 {
 	kobj_ns_type_register(&net_ns_type_operations);
-	register_pernet_subsys(&kobj_net_ops);
 	return class_register(&net_class);
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6c6b86d..19feb20 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -128,6 +128,7 @@ static __net_init int setup_net(struct net *net)
 	LIST_HEAD(net_exit_list);
 
 	atomic_set(&net->count, 1);
+	atomic_set(&net->passive, 1);
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
@@ -210,6 +211,13 @@ static void net_free(struct net *net)
 	kmem_cache_free(net_cachep, net);
 }
 
+void net_put_ns(void *p)
+{
+	struct net *ns = p;
+	if (ns && atomic_dec_and_test(&ns->passive))
+		net_free(ns);
+}
+
 struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 {
 	struct net *net;
@@ -230,7 +238,7 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	}
 	mutex_unlock(&net_mutex);
 	if (rv < 0) {
-		net_free(net);
+		net_put_ns(net);
 		return ERR_PTR(rv);
 	}
 	return net;
@@ -286,7 +294,7 @@ static void cleanup_net(struct work_struct *work)
 	/* Finally it is safe to free my network namespace structure */
 	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
 		list_del_init(&net->exit_list);
-		net_free(net);
+		net_put_ns(net);
 	}
 }
 static DECLARE_WORK(net_cleanup_work, cleanup_net);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-09  1:26       ` Al Viro
@ 2011-06-12  7:15         ` Eric W. Biederman
  2011-06-12 17:59           ` Linus Torvalds
  2011-06-12 18:35           ` Al Viro
  0 siblings, 2 replies; 13+ messages in thread
From: Eric W. Biederman @ 2011-06-12  7:15 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Jun 07, 2011 at 11:59:02PM +0100, Al Viro wrote:
>
>> Completely untested patch follows:
>
> FWIW, modulo a couple of brainos it seems to work here without blowing
> everything up.  Actual freeing of struct net is delayed until after
> umount of sysfs instances refering it, shutdown is *not* delayed, sysfs
> entries are removed nicely as they ought to, no objects from other net_ns
> are picked by readdir or lookup...
>
> If somebody has objections - yell.  If I don't hear anything, the following
> goes to -next:

The change seems reasonable, and much more likely to keep working after
small changes to sysfs.  Sigh.

I honestly hate the pattern that is being used here.  Holding a
reference count because we can't be bothered to free things reliably
when we actually stop using them.  It does look less error prone than
what I am doing today, and 2.5KiB for struct net isn't that much memory
to pin.

Will pinning an extra 2.5KiB be a problem when we get to the point where
unprivileged mounts are safe?  I expect there are easier ways to pin more
memory so I doubt this is worth worrying about.

The naming of things after your patch stinks.

It isn't clear what is taking or putting what kind of refcount from the
names.  If we don't correct the bad naming your patch will be worse for
maintenance than what we already have.

> commit 72d4e98002b45598bb88af74eeac20874f2789be
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Wed Jun 8 21:13:01 2011 -0400
>
>     Delay struct net freeing while there's a sysfs instance refering to it
>     
>     	* new refcount in struct net, controlling actual freeing of the memory
>     	* new method in kobj_ns_type_operations (->put_ns())
>     	* ->current_ns() semantics change - it's supposed to be followed by
>     corresponding ->put_ns().  For struct net in case of CONFIG_NET_NS it bumps
>     the new refcount; net_put_ns() decrements it and calls net_free() if the
>     last reference has been dropped.

We need to rename kobj_ns_current so it is clear we get a ref count.

>     	* old net_free() callers call net_put_ns() instead.
>     	* sysfs_exit_ns() is gone, along with a large part of callchain
>     leading to it; now that the references stored in ->ns[...] stay valid we
>     do not need to hunt them down and replace them with NULL.  That fixes
>     problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
>     of sb->s_instances abuse.

Honestly I can fix at least the lookup problems by assigning 
((void *)-1UL) instead of NULL, in sysfs_exit_ns.

>     	Note that struct net *shutdown* logics has not changed - net_cleanup()
>     is called exactly when it used to be called.  The only thing postponed by
>     having a sysfs instance refering to that struct net is actual freeing of
>     memory occupied by struct net.
>     
>     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> index 2668957..1e7a2ee 100644
> --- a/fs/sysfs/mount.c
> +++ b/fs/sysfs/mount.c
> @@ -111,8 +111,11 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
>  		info->ns[type] = kobj_ns_current(type);
>  
>  	sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
> -	if (IS_ERR(sb) || sb->s_fs_info != info)
> +	if (IS_ERR(sb) || sb->s_fs_info != info) {
> +		for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
> +			kobj_ns_put(type, info->ns[type]);
>  		kfree(info);
> +	}
>  	if (IS_ERR(sb))
>  		return ERR_CAST(sb);
>  	if (!sb->s_root) {
> @@ -131,11 +134,14 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
>  static void sysfs_kill_sb(struct super_block *sb)
>  {
>  	struct sysfs_super_info *info = sysfs_info(sb);
> +	int type;
>  
>  	/* Remove the superblock from fs_supers/s_instances
>  	 * so we can't find it, before freeing sysfs_super_info.
>  	 */
>  	kill_anon_super(sb);
> +	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
> +		kobj_ns_put(type, info->ns[type]);

This loop and the kfree probably deserve a small function of their own.

>  	kfree(info);
>  }
>  
> @@ -145,28 +151,6 @@ static struct file_system_type sysfs_fs_type = {
>  	.kill_sb	= sysfs_kill_sb,
>  };
>  
> -void sysfs_exit_ns(enum kobj_ns_type type, const void *ns)
> -{
> -	struct super_block *sb;
> -
> -	mutex_lock(&sysfs_mutex);
> -	spin_lock(&sb_lock);
> -	list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
> -		struct sysfs_super_info *info = sysfs_info(sb);
> -		/*
> -		 * If we see a superblock on the fs_supers/s_instances
> -		 * list the unmount has not completed and sb->s_fs_info
> -		 * points to a valid struct sysfs_super_info.
> -		 */
> -		/* Ignore superblocks with the wrong ns */
> -		if (info->ns[type] != ns)
> -			continue;
> -		info->ns[type] = NULL;
> -	}
> -	spin_unlock(&sb_lock);
> -	mutex_unlock(&sysfs_mutex);
> -}
> -
>  int __init sysfs_init(void)
>  {
>  	int err = -ENOMEM;
> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
> index 3d28af3..2ed2404 100644
> --- a/fs/sysfs/sysfs.h
> +++ b/fs/sysfs/sysfs.h
> @@ -136,7 +136,7 @@ struct sysfs_addrm_cxt {
>   * instance).
>   */
>  struct sysfs_super_info {
> -	const void *ns[KOBJ_NS_TYPES];
> +	void *ns[KOBJ_NS_TYPES];
>  };
>  #define sysfs_info(SB) ((struct sysfs_super_info *)(SB->s_fs_info))
>  extern struct sysfs_dirent sysfs_root;
> diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
> index 82cb5bf..5fa481c 100644
> --- a/include/linux/kobject_ns.h
> +++ b/include/linux/kobject_ns.h
> @@ -38,9 +38,10 @@ enum kobj_ns_type {
>   */
>  struct kobj_ns_type_operations {
>  	enum kobj_ns_type type;
> -	const void *(*current_ns)(void);
> +	void *(*current_ns)(void);

I really don't like removing const here.  It made it very clear that
what we are messing with is a token and not something that we ever will
deference.

>  	const void *(*netlink_ns)(struct sock *sk);
>  	const void *(*initial_ns)(void);
> +	void (*put_ns)(void *);
>  };
>  
>  int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
> @@ -48,9 +49,9 @@ int kobj_ns_type_registered(enum kobj_ns_type type);
>  const struct kobj_ns_type_operations *kobj_child_ns_ops(struct kobject *parent);
>  const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
>  
> -const void *kobj_ns_current(enum kobj_ns_type type);
> +void *kobj_ns_current(enum kobj_ns_type type);
>  const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
>  const void *kobj_ns_initial(enum kobj_ns_type type);
> -void kobj_ns_exit(enum kobj_ns_type type, const void *ns);
> +void kobj_ns_put(enum kobj_ns_type type, void *ns);
>  
>  #endif /* _LINUX_KOBJECT_NS_H */
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index c3acda6..e2696d7 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -177,9 +177,6 @@ struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
>  struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd);
>  void sysfs_put(struct sysfs_dirent *sd);
>  
> -/* Called to clear a ns tag when it is no longer valid */
> -void sysfs_exit_ns(enum kobj_ns_type type, const void *tag);
> -
>  int __must_check sysfs_init(void);
>  
>  #else /* CONFIG_SYSFS */
> @@ -338,10 +335,6 @@ static inline void sysfs_put(struct sysfs_dirent *sd)
>  {
>  }
>  
> -static inline void sysfs_exit_ns(int type, const void *tag)
> -{
> -}
> -
>  static inline int __must_check sysfs_init(void)
>  {
>  	return 0;
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 2bf9ed9..415af64 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -35,8 +35,11 @@ struct netns_ipvs;
>  #define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
>  
>  struct net {
> +	atomic_t		passive;	/* To decided when the network
> +						 * namespace should be freed.
> +						 */
>  	atomic_t		count;		/* To decided when the network
> -						 *  namespace should be freed.
> +						 *  namespace should be shut down.
>  						 */
>  #ifdef NETNS_REFCNT_DEBUG
>  	atomic_t		use_count;	/* To track references we
> @@ -154,6 +157,9 @@ int net_eq(const struct net *net1, const struct net *net2)
>  {
>  	return net1 == net2;
>  }
> +
> +extern void net_put_ns(void *);
> +
>  #else
>  
>  static inline struct net *get_net(struct net *net)
> @@ -175,6 +181,8 @@ int net_eq(const struct net *net1, const struct net *net2)
>  {
>  	return 1;
>  }
> +
> +#define net_put_ns NULL
>  #endif
>  
>  
> diff --git a/lib/kobject.c b/lib/kobject.c
> index 82dc34c..43116b2 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -948,9 +948,9 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj)
>  }
>  
>  
> -const void *kobj_ns_current(enum kobj_ns_type type)
> +void *kobj_ns_current(enum kobj_ns_type type)
>  {
> -	const void *ns = NULL;
> +	void *ns = NULL;
>  
>  	spin_lock(&kobj_ns_type_lock);
>  	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
> @@ -987,23 +987,15 @@ const void *kobj_ns_initial(enum kobj_ns_type type)
>  	return ns;
>  }
>  
> -/*
> - * kobj_ns_exit - invalidate a namespace tag
> - *
> - * @type: the namespace type (i.e. KOBJ_NS_TYPE_NET)
> - * @ns: the actual namespace being invalidated
> - *
> - * This is called when a tag is no longer valid.  For instance,
> - * when a network namespace exits, it uses this helper to
> - * make sure no sb's sysfs_info points to the now-invalidated
> - * netns.
> - */
> -void kobj_ns_exit(enum kobj_ns_type type, const void *ns)
> +void kobj_ns_put(enum kobj_ns_type type, void *ns)
>  {
> -	sysfs_exit_ns(type, ns);
> +	spin_lock(&kobj_ns_type_lock);
> +	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
> +	    kobj_ns_ops_tbl[type] && kobj_ns_ops_tbl[type]->put_ns)
> +		kobj_ns_ops_tbl[type]->put_ns(ns);
> +	spin_unlock(&kobj_ns_type_lock);
>  }
>  
> -
>  EXPORT_SYMBOL(kobject_get);
>  EXPORT_SYMBOL(kobject_put);
>  EXPORT_SYMBOL(kobject_del);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 11b98bc..0c799f1 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1179,9 +1179,14 @@ static void remove_queue_kobjects(struct net_device *net)
>  #endif
>  }
>  
> -static const void *net_current_ns(void)
> +static void *net_current_ns(void)
>  {
> -	return current->nsproxy->net_ns;
> +	struct net *ns = current->nsproxy->net_ns;
> +#ifdef CONFIG_NET_NS
> +	if (ns)
> +		atomic_inc(&ns->passive);
> +#endif
This code  doesn't need to be #ifdef'd
> +	return ns;
>  }
>  
>  static const void *net_initial_ns(void)
> @@ -1199,19 +1204,10 @@ struct kobj_ns_type_operations net_ns_type_operations = {
>  	.current_ns = net_current_ns,
>  	.netlink_ns = net_netlink_ns,
>  	.initial_ns = net_initial_ns,
> +	.put_ns = net_put_ns,
>  };
>  EXPORT_SYMBOL_GPL(net_ns_type_operations);
>  
> -static void net_kobj_ns_exit(struct net *net)
> -{
> -	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
> -}
> -
> -static struct pernet_operations kobj_net_ops = {
> -	.exit = net_kobj_ns_exit,
> -};
> -
> -
>  #ifdef CONFIG_HOTPLUG
>  static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
>  {
> @@ -1339,6 +1335,5 @@ EXPORT_SYMBOL(netdev_class_remove_file);
>  int netdev_kobject_init(void)
>  {
>  	kobj_ns_type_register(&net_ns_type_operations);
> -	register_pernet_subsys(&kobj_net_ops);
>  	return class_register(&net_class);
>  }
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 6c6b86d..19feb20 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -128,6 +128,7 @@ static __net_init int setup_net(struct net *net)
>  	LIST_HEAD(net_exit_list);
>  
>  	atomic_set(&net->count, 1);
> +	atomic_set(&net->passive, 1);
>  
>  #ifdef NETNS_REFCNT_DEBUG
>  	atomic_set(&net->use_count, 0);
> @@ -210,6 +211,13 @@ static void net_free(struct net *net)
>  	kmem_cache_free(net_cachep, net);
>  }
>  
> +void net_put_ns(void *p)
> +{
There has got to be a better name.  We already have put and get net
methods.

> +	struct net *ns = p;
> +	if (ns && atomic_dec_and_test(&ns->passive))
> +		net_free(ns);
> +}
> +
>  struct net *copy_net_ns(unsigned long flags, struct net *old_net)
>  {
>  	struct net *net;
> @@ -230,7 +238,7 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
>  	}
>  	mutex_unlock(&net_mutex);
>  	if (rv < 0) {
> -		net_free(net);
> +		net_put_ns(net);
>  		return ERR_PTR(rv);
>  	}
>  	return net;
> @@ -286,7 +294,7 @@ static void cleanup_net(struct work_struct *work)
>  	/* Finally it is safe to free my network namespace structure */
>  	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
>  		list_del_init(&net->exit_list);
> -		net_free(net);
> +		net_put_ns(net);
>  	}
>  }
>  static DECLARE_WORK(net_cleanup_work, cleanup_net);

Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-12  7:15         ` Eric W. Biederman
@ 2011-06-12 17:59           ` Linus Torvalds
  2011-06-12 18:17             ` Al Viro
  2011-06-12 18:35           ` Al Viro
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2011-06-12 17:59 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Al Viro, linux-fsdevel, netdev, Linux Containers

On Sun, Jun 12, 2011 at 12:15 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I honestly hate the pattern that is being used here.  Holding a
> reference count because we can't be bothered to free things reliably
> when we actually stop using them.

WHAT?

That's what a reference count *is*. It's all about "free things
reliably when we actually stop using them".

Your comment makes zero sense.

EVERY SINGLE kernel data structure should be reference counted. Read
Documentation/CodingStyle, or look at any of the good code in the
kernel (ie core process or VFS code). A non-refcounted data structure
that is used by more than one entity IS A BUG!

Quite frankly, your objection sounds moronic. If there is more than
one user, then a reference count is _always_ the right thing. Nothing
else ever works, and trust me, people have tried. They've tried
locking, they've tried luck, they've tried crazy things. Nothing but
refcounts works.

                           Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-12 17:59           ` Linus Torvalds
@ 2011-06-12 18:17             ` Al Viro
  0 siblings, 0 replies; 13+ messages in thread
From: Al Viro @ 2011-06-12 18:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric W. Biederman, linux-fsdevel, netdev, Linux Containers

On Sun, Jun 12, 2011 at 10:59:42AM -0700, Linus Torvalds wrote:
> On Sun, Jun 12, 2011 at 12:15 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > I honestly hate the pattern that is being used here. ?Holding a
> > reference count because we can't be bothered to free things reliably
> > when we actually stop using them.
> 
> WHAT?
> 
> That's what a reference count *is*. It's all about "free things
> reliably when we actually stop using them".
> 
> Your comment makes zero sense.
> 
> EVERY SINGLE kernel data structure should be reference counted. Read
> Documentation/CodingStyle, or look at any of the good code in the
> kernel (ie core process or VFS code). A non-refcounted data structure
> that is used by more than one entity IS A BUG!
> 
> Quite frankly, your objection sounds moronic. If there is more than
> one user, then a reference count is _always_ the right thing. Nothing
> else ever works, and trust me, people have tried. They've tried
> locking, they've tried luck, they've tried crazy things. Nothing but
> refcounts works.

No, what the current code is trying to do is to have two kinds of references -
contributing to refcount (they do have one, all right) and non-contributing.
*AND* it attempts to hunt non-contributing ones down and replace them with
NULL when refcount hits zero.  And fscks up in dealing with the results.

What this patch does is pretty much the same thing we do for mm_struct and
superblocks - two refcounts, one controlling the shutdown of object and another
controlling the actual freeing of memory.  The second kind of references
contributes to the "memory" refcount and so does having non-zero "active"
refcount.  No games with replacing references with NULL, no races around those,
etc.

Eric's objection is that sysfs superblock would pin the memory occupied by
struct net down until it's unmounted.  Frankly, I think it's a BS -
aforementioned 2.5K are trivial to pin down *anyway*.  Just chdir deep
enough into that instance of sysfs tree and inodes/dentries you've pinned
down by that will easily eat this much.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
  2011-06-12  7:15         ` Eric W. Biederman
  2011-06-12 17:59           ` Linus Torvalds
@ 2011-06-12 18:35           ` Al Viro
  1 sibling, 0 replies; 13+ messages in thread
From: Al Viro @ 2011-06-12 18:35 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers

On Sun, Jun 12, 2011 at 12:15:40AM -0700, Eric W. Biederman wrote:

> I honestly hate the pattern that is being used here.  Holding a
> reference count because we can't be bothered to free things reliably
> when we actually stop using them.  It does look less error prone than
> what I am doing today, and 2.5KiB for struct net isn't that much memory
> to pin.

What pattern?  Dual refcounts, one for tearing the contents down and another
for keeping the structure allocated (and it address not reused)?  We are
using that kind of stuff for many things.  Sure, we have cases when there
are pointers to object not contributing to refcount, to be removed on
object destruction.  But if you look at those you'll see that normally
it's done for something like "this object can be found in cyclic list/hash
chain/etc.; those references are not affecting refcount and we simply remove
the object from those lists when refcount hits zero".  The thing is, those
references are trivially found starting at the object.  In case of non-counting
references from sysfs superblocks it's nowhere near that.  And in cases of
that kind the use of dual refcounts is normal.

BTW, speaking of refcounts - I have a pending patch fixing a pid_namespace
leak in proc_set_super(); will be in the next pull request.  anon_set_super()
can fail...

> Will pinning an extra 2.5KiB be a problem when we get to the point where
> unprivileged mounts are safe?  I expect there are easier ways to pin more
> memory so I doubt this is worth worrying about.

*snort*
chdir deep enough into sysfs tree and you've got it.

> It isn't clear what is taking or putting what kind of refcount from the
> names.  If we don't correct the bad naming your patch will be worse for
> maintenance than what we already have.

Agreed.  Suggestions of better names are welcome.  In particular, I considered
s/count/active/ and calling new refcount "count".

> We need to rename kobj_ns_current so it is clear we get a ref count.

Agreed.  Suggestions?

> > +	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
> > +		kobj_ns_put(type, info->ns[type]);
> 
> This loop and the kfree probably deserve a small function of their own.

OK.

> I really don't like removing const here.  It made it very clear that
> what we are messing with is a token and not something that we ever will
> deference.

But we will - when we drop the reference, refcount is going to change...

> > +static void *net_current_ns(void)
> >  {
> > -	return current->nsproxy->net_ns;
> > +	struct net *ns = current->nsproxy->net_ns;
> > +#ifdef CONFIG_NET_NS
> > +	if (ns)
> > +		atomic_inc(&ns->passive);
> > +#endif
> This code  doesn't need to be #ifdef'd
> > +	return ns;
> >  }

It does, unless we want to make net_put_ns() non-NULL in !NET_NS case...

> > +void net_put_ns(void *p)
> > +{
> There has got to be a better name.  We already have put and get net
> methods.

Um...  Suggestions?  hold_current_ns/drop_ns?

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-06-12 18:35 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-04  0:15 [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs Al Viro
2011-06-04 21:22 ` Al Viro
2011-06-04 21:55   ` Linus Torvalds
2011-06-04 22:23     ` Al Viro
2011-06-06 19:03 ` Eric W. Biederman
2011-06-07 21:58   ` Al Viro
2011-06-07 22:59     ` Al Viro
2011-06-09  1:26       ` Al Viro
2011-06-12  7:15         ` Eric W. Biederman
2011-06-12 17:59           ` Linus Torvalds
2011-06-12 18:17             ` Al Viro
2011-06-12 18:35           ` Al Viro
  -- strict thread matches above, loose matches on Subject: below --
2011-06-04 22:25 Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).