All of lore.kernel.org
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Benjamin Thery <benjamin.thery@bull.net>,
	Greg KH <greg@kroah.com>,
	linux-kernel@vger.kernel.org,
	"Serge E. Hallyn" <serue@us.ibm.com>,
	Al Viro <viro@ftp.linux.org.uk>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>
Subject: Re: sysfs: tagged directories not merged completely yet
Date: Tue, 07 Oct 2008 01:08:08 -0700	[thread overview]
Message-ID: <m1hc7oddzr.fsf@frodo.ebiederm.org> (raw)
In-Reply-To: <20081003101331.GH28946@ZenIV.linux.org.uk> (Al Viro's message of "Fri, 3 Oct 2008 11:13:31 +0100")

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Sep 23, 2008 at 11:23:57AM -0700, Eric W. Biederman wrote:
>> Benjamin Thery <benjamin.thery@bull.net> writes:
>> >
>> > Oh.
>> > It's a pity Al couldn't re-review them before. We've already lost a lot
>> > of time with this patchset and it's blocking easier testing of network
>> > namespaces (right now, with a mainline kernel, we have to disable sysfs
>> > to build network namespaces).
>> 
>> I am confident that we have a good base with these patches and the rest of
>> the work can be done incrementally on top of them if any issues turn up.
>> 
>> Al recent rework of sysctl has a very similar structure.

Al thank you for taking a fresh look at sysfs. 

For the rest of the readers.  Most of the bugs and what seem to
me to be the most significant bugs that Al is seeing existing
without my sysfs tagged directory patchset.

> No, it does not.  My apologies for delay, but here are more printable parts
> of review.

The concept of a single directory tree in the basic data structure
where only some parts show up depending upon the namespace you are in
conceptually the same, and that is what I was referring to.


I'm going to attempt to step back a bit and see if I can convert this into 
a constructive exchange.  Picking on the nasty details without discussing
the reasons for those details won't get us anywhere.

>From the point of view of the VFS sysfs is a distributed filesystem.  

As a distributed filesystem we don't always go through the VFS sanity
checks when a filesystem change occurs, and only update the VFS caches
to the state of the change.

As a distributed filesystem sysfs internally needs to provide all of
the locking necessary to ensure the sysfs data structures are consistent.

Currently there are two basic paths into sysfs:
- Directly to the kernel internal state to the internal sysfs state.
  This is where all mutation happens.
- Read-only through the VFS.

In the case of directory structure mutation, the mutation has happened
and the VFS simply must be told about it.

Which means that in a sane and simple design of a global filesystem
the locking would work was follows:

o Obtain filesystem lock
o Modify filesystem internal state.
o For each mounted instance
o   Obtain VFS locks
o   Update the mounted instance.
o   Release VFS locks
o Release filesystem lock.

The above is simple and allows the same locking ordering for all accesses.

Unfortunately the VFS does not give us the opportunity to takes the locks
in the proper order.  And it insists on grabbing a VFS lock before we can
grab a global lock.

Which in my case leads to weirdness like:

> a) You do *not* share struct inode between the superblocks, for fsck sake!

I can't currently use separate inodes because of the current locking of sysfs,
because by the current VFS limitations.  In particular:

The currently locking order is:
    mutex_lock(&inode->i_mutex);
    mutex_lock(&sysfs_mutex);

If I use separate inodes that locking order can not be maintained.  Frankly I would
love to change this.

> d) You REALLY, REALLY do not unhash busy dentries of directories.

Quiet frankly if the VFS requires the dcache to be out of sync
with the filesystem that is a VFS bug.

If the filesystem deletes the directory then it should be unhashed
so it can not be looked up.

Currently I know unhashing a busy directory potentially causes
mount leaks.  Are you thinking of anything else?

> * may I politely suggest that
> again:
> 	mutex_lock(&A);
> 	if (!mutex_trylock(&B)) {
> 		mutex_unlock(&A);
> 		goto again;
> 	}
>
> is somewhat, er, deficient way to deal with buggered locking hierarchy?

Nah.  In this case just a silly oversight.  It is a trivial fix.

> Not to mention anything else, that's obviously FUBAR on UP box - if we
> have B contended, we've just killed the box dead since we'll never give
> the CPU back to whoever happens to hold B.  See sysfs_mv_dir() for a lovely
> example.

Good point about UP deadlocks.

It shouldn't be hard to do the equivalent of lock_rename() and unlock_rename().

> c) You do *not* change dentry tree topology without s_vfs_rename_mutex on
> affected superblock.  That, BTW, is broken in mainline sysfs as well.

Hardly.    sb->s_vfs_rename_mutex only serializes against renames in the VFS.
As such it isn't useful to protect the sysfs data structures.  That is
the job of sysfs_rename_mutex.  At which point sb->s_vfs_rename_mutex
is redundant.

The only places I can see where s_vfs_rename_mutex is taken is in
lock_rename(), unlock_rename() and d_materialise_unique().
d_materialise_unique() only gets called by NFS.  lock_rename() and
unlock_rename() only get called from renameat which always fails in the
case of sysfs and we aren't doing anything silly like hard links to
directories so I don't see a way to support that path.

> b) You do *not* grab i_mutex on ancestors after having grabbed it on
> file, as sysfs_chmod_file() does.

Ouch!  Good point.  That is simple enough to solve by reordering the
operations.

> In addition to that, there are interesting internal problems:
> * inumbers are released by final sysfs_put(); that can happen before the
> final iput() on corresponding inode.  Guess what happens if new entry is
> created in the meanwhile, reuses the same inumber and lookup gets to
> sysfs_get_inode() on it?

Cute.  Sounds like we need to put a reference to the sysfs_dirent into the inode.

> * sysfs_count_nlink() is called from sysfs_fill_super() without sysfs_mutex;
> now this sucker can get called at any moment.


> * just what is protecting ->s_tag?

sd->s_tag can only be changed by sysfs_mv_dir so it should share the same locking
as anything else that can be renamed.

> * __sysfs_remove_dir() appears to assume that subdirectories are possible;
> AFAICS, if we *do* get them, we get very screwed after remove_dir().

I haven't touched that one, and I am tired tonight.  Are you thinking
of something in addition to an early sysfs_put and a 

> * everything else aside, the internal locking is extremely heavy.  For
> fsck sake, guys, a single system-wide mutex that can be grabbed for the
> duration of readdir on any directory and block just about anything
> in the filesystem?  Just mmap() something over NFS on a slow link and
> do getdents() to such buffer.  Watch a *lot* of stuff getting buggered.
> Hell, you can't even do ifconfig up while that sucker is held...

I don't see where we come anywhere close to sysfs for running ifconfig up.

Certainly ifconfig up works fine with sysfs compiled out and I haven't
spotted the path from the networking code to sysfs in that instance.
sysfs really only shows up in adding and removing network interfaces
in my experience.

> Seriously, people, it's getting worse than devfs had ever been ;-/

I haven't looked at devfs so I wouldn't know.  I just know it feels
futile a lot of time to try and improve sysfs.

Eric

  parent reply	other threads:[~2008-10-07  8:16 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-09-22 14:31 sysfs: tagged directories not merged completely yet Benjamin Thery
2008-09-22 15:34 ` Greg KH
2008-09-22 20:24   ` Eric W. Biederman
2008-09-23 14:24   ` Benjamin Thery
2008-09-23 18:23     ` Eric W. Biederman
2008-10-03 10:13       ` Al Viro
2008-10-05  5:32         ` Greg KH
2008-10-07  8:27           ` Eric W. Biederman
2008-10-07 10:47             ` [PATCH 0/3] minor sysfs tagged directory fixes Eric W. Biederman
2008-10-07 10:49               ` [PATCH 1/3] sysfs: Remove lock ordering violation in sysfs_chmod_file Eric W. Biederman
2008-10-07 10:51                 ` [PATCH 2/3] sysfs: Fix and sysfs_mv_dir by using lock_rename Eric W. Biederman
2008-10-07 10:52                   ` [PATCH 3/3] sysfs: Take sysfs_mutex when fetching the root inode Eric W. Biederman
2008-10-07 21:21                   ` [PATCH 2/3] sysfs: Fix and sysfs_mv_dir by using lock_rename Dave Hansen
2008-10-07 21:19                 ` [PATCH 1/3] sysfs: Remove lock ordering violation in sysfs_chmod_file Dave Hansen
2008-10-07 22:31                   ` Eric W. Biederman
2008-10-07 22:27             ` sysfs: tagged directories not merged completely yet Greg KH
2008-10-07 22:54               ` Serge E. Hallyn
2008-10-07 23:39                 ` Greg KH
2008-10-08  0:12                   ` Serge E. Hallyn
2008-10-08  0:38                     ` Greg KH
2008-10-08 14:18                       ` Serge E. Hallyn
2008-10-07 23:34               ` Tejun Heo
2008-10-14  1:11                 ` Eric W. Biederman
2008-10-14  7:55                   ` Tejun Heo
2008-10-14 12:19                     ` Eric W. Biederman
2008-10-15 11:04                       ` Tejun Heo
2008-10-16 21:58                         ` Eric W. Biederman
2008-10-14 18:53                     ` Serge E. Hallyn
2008-10-15  0:48                       ` Eric W. Biederman
2008-10-15 13:42                         ` Serge E. Hallyn
2008-10-15 13:54                           ` Benjamin Thery
2008-10-08  0:39               ` Eric W. Biederman
2008-10-08  1:29               ` Eric W. Biederman
2008-10-07  8:08         ` Eric W. Biederman [this message]
2008-10-07  9:01         ` Daniel Lezcano
2008-10-07  9:12         ` Tejun Heo
2008-10-07 11:56           ` Eric W. Biederman
2008-10-07 12:19             ` Tejun Heo
2008-10-07 23:17               ` Tejun Heo
2008-10-08  0:04                 ` Eric W. Biederman
2008-10-08  0:20                   ` Tejun Heo
2008-10-08  0:58                     ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m1hc7oddzr.fsf@frodo.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=benjamin.thery@bull.net \
    --cc=greg@kroah.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=serue@us.ibm.com \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@ZenIV.linux.org.uk \
    --cc=viro@ftp.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.