linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
@ 2011-06-04  0:15 Al Viro
  2011-06-04 21:22 ` Al Viro
  2011-06-06 19:03 ` Eric W. Biederman
  0 siblings, 2 replies; 13+ messages in thread
From: Al Viro @ 2011-06-04  0:15 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds

sysfs_readdir() starts with doing

        ns = sysfs_info(dentry->d_sb)->ns[type];

then proceeds to scan the directory with sysfs_dir_{,next_}pos(ns, ...)
even though we have no promise whatsoever that sysfs_exit_ns() has not
cleared that element of array in the meanwhile.

Granted, sysfs_dir_pos() doesn't dereference ns; however, it does compare
with it.  And ns might have been freed and reused by that point.

I don't like what's going on there; that code looks inherently racy.
We never set ->ns[...] non-NULL outside of mount().  So it looks like
the intended behaviour is to have all ns-specific entries of that type
disappear forever from that fs instance.  Having entries for _another_
namespace to show up for a (short) while, and that only in readdir()
(lookup runs completely under sysfs_mutex, so we don't have that race
there)...

I also don't like the s_instances abuse in sysfs_exit_ns() but I'd like
to understand what is that code *for* before starting to massage it into
something saner.  What are the possible locking conditions in callers of
that thing?  I've tried to look for callers, but the (too long) callchain
has led me to pernet_operations ->exit() and at that point I'd given up -
that over the top for grepping...

BTW, is kobj_ns_current() pinned down?  Can that ns disappear from under
sysfs_mount()?  It looks like it shouldn't, but then I haven't traced the
callers of sysfs_exit_ns()...

What I'm planning to do (for unrelated reasons - ubifs needs it) is to add
an analog of iterate_supers() that would go over the superblocks of given
type and call a function on them.  I would like to convert sysfs_exit_ns()
to it and kill the last abuser of s_instances (and one of the last sb_lock
ones), but that really depends on what kind of locking is needed for
readdir() and friends - as it is, the damn thing looks *wrong*.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
@ 2011-06-04 22:25 Al Viro
  0 siblings, 0 replies; 13+ messages in thread
From: Al Viro @ 2011-06-04 22:25 UTC (permalink / raw)
  To: netdev; +Cc: Eric W. Biederman, linux-fsdevel, Linus Torvalds

[Forwarded to netdev]

On Sat, Jun 04, 2011 at 01:15:19AM +0100, Al Viro wrote:
> Granted, sysfs_dir_pos() doesn't dereference ns; however, it does compare
> with it.  And ns might have been freed and reused by that point.
> 
> I don't like what's going on there; that code looks inherently racy.
> We never set ->ns[...] non-NULL outside of mount().  So it looks like
> the intended behaviour is to have all ns-specific entries of that type
> disappear forever from that fs instance.  Having entries for _another_
> namespace to show up for a (short) while, and that only in readdir()
> (lookup runs completely under sysfs_mutex, so we don't have that race
> there)...

Eeep...  We do not have a *race* in lookup.  However, any lookup done
after that ->ns[...] = NULL is going to do the following:

        mutex_lock(&sysfs_mutex);

        type = sysfs_ns_type(parent_sd);
        ns = sysfs_info(dir->i_sb)->ns[type];
/* i.e. ns = NULL */
        sd = sysfs_find_dirent(parent_sd, ns, dentry->d_name.name);
which will do rather unpleasant things:
        for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling) {
                if (ns && sd->s_ns && (sd->s_ns != ns))
                        continue;
                if (!strcmp(sd->s_name, name))
                        return sd;
        }
i.e. with ns == NULL it will outright ignore sd->s_ns and look for name
match and nothing else.  Any sysfs node with that name will do, whatever
it might have in ->s_ns.  E.g. for lookups in /sys/class/net it will find
the first sysfs node of network interface with that name in _some_ namespace.
Back to sysfs_lookup():
        /* no such entry */
        if (!sd) {
                ret = ERR_PTR(-ENOENT);
                goto out_unlock;
        }

        /* attach dentry and inode */
        inode = sysfs_get_inode(dir->i_sb, sd);
        if (!inode) {
                ret = ERR_PTR(-ENOMEM);
                goto out_unlock;
        }

and if there was an entry from some surviving net_ns with that name,
sysfs_get_inode() will cheerfully allocate an inode for us and associate
it with that sd.  Then we complete the lookup and return a shiny positive
dentry to caller...

Just what is that code supposed to do?  Somehow I doubt that "after net_ns
is gone, lookups for class/net/name succeed when there is an interface called
name in at least one net_ns; resulting object refers to one of such
interfaces, with no promises regarding which net_ns it is about" is the
intended behaviour here, especially since readdir() called after that
will skip all sysfs nodes with non-NULL ->s_ns, i.e. show empty class/net.

Frankly, my preference would be to kill the void * nonsense, introduce a
structure we'll embed into struct net (and all future tags) containing
refcount and "it is doomed, skip it" flag.  Purely passive refs - i.e. all
they guarantee is that memory won't be freed under you.  Having *active* refs
(i.e. your current net->count being non-zero) contributes 1 to that counter,
kfree() is done when that counter reaches zero.  With info->ns[] contributing
to passive refcount and exit_ns logics tagging the sucker as doomed and leaving
it at that.  kill_sb() would drop references, lookup and readdir check if
ns is doomed and skip the entry in that case.

However, it's your code and I don't know in which direction do you plan to take
it.  IMO it's badly overdesigned...

Are you OK with the sketch above?  I can probably cook a patch along those
lines by tomorrow...

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-06-12 18:35 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-04  0:15 [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs Al Viro
2011-06-04 21:22 ` Al Viro
2011-06-04 21:55   ` Linus Torvalds
2011-06-04 22:23     ` Al Viro
2011-06-06 19:03 ` Eric W. Biederman
2011-06-07 21:58   ` Al Viro
2011-06-07 22:59     ` Al Viro
2011-06-09  1:26       ` Al Viro
2011-06-12  7:15         ` Eric W. Biederman
2011-06-12 17:59           ` Linus Torvalds
2011-06-12 18:17             ` Al Viro
2011-06-12 18:35           ` Al Viro
  -- strict thread matches above, loose matches on Subject: below --
2011-06-04 22:25 Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).