netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
  • * Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
           [not found] <20110604001518.GT11521@ZenIV.linux.org.uk>
           [not found] ` <20110604212259.GY11521@ZenIV.linux.org.uk>
    @ 2011-06-06 19:03 ` Eric W. Biederman
      2011-06-07 21:58   ` Al Viro
      1 sibling, 1 reply; 10+ messages in thread
    From: Eric W. Biederman @ 2011-06-06 19:03 UTC (permalink / raw)
      To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers
    
    
    Al Viro <viro@ZenIV.linux.org.uk> writes:
    
    > What I'm planning to do (for unrelated reasons - ubifs needs it) is to add
    > an analog of iterate_supers() that would go over the superblocks of given
    > type and call a function on them.  I would like to convert sysfs_exit_ns()
    > to it and kill the last abuser of s_instances (and one of the last sb_lock
    > ones), but that really depends on what kind of locking is needed for
    > readdir() and friends - as it is, the damn thing looks *wrong*.
    
    Answering your primary question first, what locking is needed in
    sysfs_exit_ns().
    
    Wrapping my head around this code again, to the best of my memory
    the intent was.
    
    For consistency "info->type[ns]" is an atomic value (void *) so
    it can be safely read and written without relying on locks.  For
    things like lookup 
    
    The assumption is that is primarily an atomic value and so it can
    be safely read by things like sysfs_readdir() and get a valid value
    without relying on the locks. 
    
    sysfs_mutex is needed in things like sysfs_lookup if we want the
    value to not change.
    
    There is indeed a small race in sysfs_readdir.
    
    As for sysfs_lookup it looks like my code to handle untagged members in
    directories where everything else is tagged, such as
    "/sys/class/net/bonding_masters" introduced an overloading of what NULL
    means in the context of sysfs_readdir and sysfs_lookup.   ns == NULL can
    either mean that we have type == KOBJ_NS_TYPE_NONE, or ns == NULL can
    mean that the namespace has gone away beneath us.  I looks like I need
    to fix that.
    
    To sysfs_exit_ns() we have the call chain:
    cleanup_net()
      ops_exit_list()
         net_kobj_ns_exit()
            sysfs_exit_ns()
    
    Which makes the locking order needed for that call path.
    net_mutex()
       rtnl_lock()
       sysfs_mutex()
    
    Now somewhere I was also careful that mount did not cause problems,
    with sysfs_exit_ns() but I forget where.
    
    You were asking about kobj_ns_current.
    kboj_ns_current()
      net_current_ns()
         current->nsproxy->net_ns
    
    And current has a reference on it's network namespace.
    
    Other pieces of information that should be helpful to know.
    - All sysfs directory entries for a network namespace should be
      removed from sysfs by the time sysfs_exit_ns is called.
    
    Al hopefully that is enough to get you going and I will what I can
    do with the rest of the sysfs ugliness.
    
    Eric
    
    ^ permalink raw reply	[flat|nested] 10+ messages in thread
  • * Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
    @ 2011-06-04 22:25 Al Viro
      0 siblings, 0 replies; 10+ messages in thread
    From: Al Viro @ 2011-06-04 22:25 UTC (permalink / raw)
      To: netdev; +Cc: Eric W. Biederman, linux-fsdevel, Linus Torvalds
    
    [Forwarded to netdev]
    
    On Sat, Jun 04, 2011 at 01:15:19AM +0100, Al Viro wrote:
    > Granted, sysfs_dir_pos() doesn't dereference ns; however, it does compare
    > with it.  And ns might have been freed and reused by that point.
    > 
    > I don't like what's going on there; that code looks inherently racy.
    > We never set ->ns[...] non-NULL outside of mount().  So it looks like
    > the intended behaviour is to have all ns-specific entries of that type
    > disappear forever from that fs instance.  Having entries for _another_
    > namespace to show up for a (short) while, and that only in readdir()
    > (lookup runs completely under sysfs_mutex, so we don't have that race
    > there)...
    
    Eeep...  We do not have a *race* in lookup.  However, any lookup done
    after that ->ns[...] = NULL is going to do the following:
    
            mutex_lock(&sysfs_mutex);
    
            type = sysfs_ns_type(parent_sd);
            ns = sysfs_info(dir->i_sb)->ns[type];
    /* i.e. ns = NULL */
            sd = sysfs_find_dirent(parent_sd, ns, dentry->d_name.name);
    which will do rather unpleasant things:
            for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling) {
                    if (ns && sd->s_ns && (sd->s_ns != ns))
                            continue;
                    if (!strcmp(sd->s_name, name))
                            return sd;
            }
    i.e. with ns == NULL it will outright ignore sd->s_ns and look for name
    match and nothing else.  Any sysfs node with that name will do, whatever
    it might have in ->s_ns.  E.g. for lookups in /sys/class/net it will find
    the first sysfs node of network interface with that name in _some_ namespace.
    Back to sysfs_lookup():
            /* no such entry */
            if (!sd) {
                    ret = ERR_PTR(-ENOENT);
                    goto out_unlock;
            }
    
            /* attach dentry and inode */
            inode = sysfs_get_inode(dir->i_sb, sd);
            if (!inode) {
                    ret = ERR_PTR(-ENOMEM);
                    goto out_unlock;
            }
    
    and if there was an entry from some surviving net_ns with that name,
    sysfs_get_inode() will cheerfully allocate an inode for us and associate
    it with that sd.  Then we complete the lookup and return a shiny positive
    dentry to caller...
    
    Just what is that code supposed to do?  Somehow I doubt that "after net_ns
    is gone, lookups for class/net/name succeed when there is an interface called
    name in at least one net_ns; resulting object refers to one of such
    interfaces, with no promises regarding which net_ns it is about" is the
    intended behaviour here, especially since readdir() called after that
    will skip all sysfs nodes with non-NULL ->s_ns, i.e. show empty class/net.
    
    Frankly, my preference would be to kill the void * nonsense, introduce a
    structure we'll embed into struct net (and all future tags) containing
    refcount and "it is doomed, skip it" flag.  Purely passive refs - i.e. all
    they guarantee is that memory won't be freed under you.  Having *active* refs
    (i.e. your current net->count being non-zero) contributes 1 to that counter,
    kfree() is done when that counter reaches zero.  With info->ns[] contributing
    to passive refcount and exit_ns logics tagging the sucker as doomed and leaving
    it at that.  kill_sb() would drop references, lookup and readdir check if
    ns is doomed and skip the entry in that case.
    
    However, it's your code and I don't know in which direction do you plan to take
    it.  IMO it's badly overdesigned...
    
    Are you OK with the sketch above?  I can probably cook a patch along those
    lines by tomorrow...
    
    ^ permalink raw reply	[flat|nested] 10+ messages in thread

    end of thread, other threads:[~2011-06-12 18:35 UTC | newest]
    
    Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
    -- links below jump to the message on this page --
         [not found] <20110604001518.GT11521@ZenIV.linux.org.uk>
         [not found] ` <20110604212259.GY11521@ZenIV.linux.org.uk>
         [not found]   ` <BANLkTikChZuG4pOZ_ipDYaEv8VQ8omh-YQ@mail.gmail.com>
    2011-06-04 22:23     ` [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs Al Viro
    2011-06-06 19:03 ` Eric W. Biederman
    2011-06-07 21:58   ` Al Viro
    2011-06-07 22:59     ` Al Viro
    2011-06-09  1:26       ` Al Viro
    2011-06-12  7:15         ` Eric W. Biederman
    2011-06-12 17:59           ` Linus Torvalds
    2011-06-12 18:17             ` Al Viro
    2011-06-12 18:35           ` Al Viro
    2011-06-04 22:25 Al Viro
    

    This is a public inbox, see mirroring instructions
    for how to clone and mirror all data and code used for this inbox;
    as well as URLs for NNTP newsgroup(s).