Re: processes hung after sys_renameat, and 'missing' processes

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: processes hung after sys_renameat, and 'missing' processes
       [not found]                 ` <20120607012900.GE30000@ZenIV.linux.org.uk>
@ 2012-06-07 19:36                   ` Al Viro
  2012-06-07 20:43                     ` Sage Weil
                                       ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Al Viro @ 2012-06-07 19:36 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra
  Cc: linux-fsdevel, J. Bruce Fields, Eric W. Biederman, Sage Weil

[maintainers of assorted code involved Cc'd]

On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> > 
> >  > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> >  > the parents.
> > 
> > ok, I ended up with..
> > 
> >         WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> > 
> >         if (dentry->d_parent != NULL)
> 			       != dentry)
> 
> ->d_parent *never* should be NULL; when dentry is disconnected from the
> tree (or hadn't been connected to it yet), it points to that dentry itself.

BTW, I've started to put together some documentation on ->d_parent use.
And already have found idiocies.

0) *all* instances of struct dentry should come from __d_alloc(); never
embed it into another object or declare variables of type struct dentry.
AFAICS, that one is satisfied now; any out-of-tree code violating that
deserves to be broken and *will* be broken.

1) For all code outside of fs/dcache.c it's read-only.  Never change it
directly, period.  Again, satisfied in mainline, out-of-tree code can
go play in the traffic, for all I care.

2) Places that set ->d_parent: __d_alloc(), d_alloc(), __d_move(),
__d_materialize_dentry().  The first two are doing that to new instance
of struct dentry with no references to it anywhere in shared data
structures.  The other two have had dentry passed to dentry_lock_for_move().

3) d_alloc(NULL, name) should never happen (will instantly oops in that case).

4) ->d_parent is *never* NULL; the only way to stumble across a dentry with
that not being true is to crawl through the slab page directly and to run
into an object that has been allocated by not initialized yet.  Don't do
that (nothing in mainline does, thanks $DEITY).  "No parent" == "dentry
has ->d_parent pointing to dentry itself".  There's a helper checking that,
albeit with slightly misleading name - IS_ROOT(dentry).  Doesn't have to
be fs root - anything still not attached to the tree or already detached
from it will be that way.  Any code checking that ->d_parent is NULL is
almost certainly a bug.  And we have a bunch of such places in fs/ceph:
	ceph_init_dentry()
	ceph_d_prune()
	ceph_mdsc_build_path()
a few in fs/cifs:
	build_path_from_dentry()
complete idiocy in fs/debugfs (not only check for NULL ->d_parent, but
one for ->d_parent being a negative dentry; also can't happen).
	debugfs_remove()
	debugfs_remove_recursive()
one in fs/ocfs2:
	ocfs2_match_dentry()
and one in security/inode.c:
	securityfs_remove() (nicked from debugfs, at a guess).
Some of those guys can be simply dropped, but I really wonder what
ceph_init_dentry/ceph_d_prune intended there.

Incidentall, build_path_from_dentry() ought to be lifted into fs/dcache.c;
we have at least three copies and one of them is not an instant roothole
only because hppfs doesn't follow underlying procfs changes.  And I'm going
to export is_subdir() - at least gfs2 has grown itself a b0rken copy...

5) all assignments to dentry->d_parent after initialization are under
dentry->d_lock; dentry_lock_for_move() takes care of that.

6) all reassignments to dentry->d_parent have ->d_lock held on old and new
parent.  Again, dentry_lock_for_move() takes care of that and so does
d_alloc() (that one - for new parent only; old parent is dentry itself
here and nothing could have seen it yet).

7) attached dentry moving away from old parent must have ->i_mutex on
that old parent held.

8) dentry getting attached to a new parent must have ->i_mutex held on
that new parent.

9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex
held.

Now, (7--9) are interesting; the call graph looks so:
__d_move()
	<- d_move()
	<- __d_unalias()
		<- d_materialise_unique()
__d_materialise_dentry()
	<- d_materialise_unique()

rename-related callers of d_move() (vfs_rename_dir/vfs_rename_other/
nfs_rename/ocfs2_dentry_move from ocfs2_rename/ceph_rename/v9fs_vfs_rename)
share the same locking environment; we have both parents + ->s_vfs_rename_mutex
in case of cross-directory move held.  Pretty much by an accident we have
one difference between vfs_rename_dir and vfs_rename_other - holding ->i_mutex
on *victim* of overwriting rename() is not something d_move() cares about.
We can make that more uniform, but that's really unrelated to that.
Note that in all these cases both dentries are guaranteed to be attached
all along.

Other callers of d_move():
	vfat_lookup() does d_move() of an alias; the validity of that
strongly relies on the lack of anything resembling hardlinks on VFAT;
the parents will be the same (and parent of target is locked).
	sysfs_lookup() plays insane games with d_move(); I _really_
don't like the look of that code.  It tries to play catch-up after
sysfs_rename() done its thing.  AFAICS, the parent *can* be changed
by sysfs_rename().  No lock is held on said parent here; no
->s_vfs_rename_mutex either.  Moreover, none of the sysfs_rename() callers
(kobject_move(), etc.) appear to give a damn about checking if it would
create a loop.
	* debugfs_rename() - imitates what vfs_rename() is doing.  Same
locking environment.  BTW,
        trap = lock_rename(new_dir, old_dir);
        /* Source or destination directories don't exist? */
        if (!old_dir->d_inode || !new_dir->d_inode)
                goto exit;
is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake!
If this can be called with old_dir or new_dir negative, it's buggered.
	* ceph_fill_trace() in CEPH_MDS_OP_RENAME case.  Seems to be
done while ceph_rename() is being executed; i.e. the usual locking
environment holds.  No idea why it feels the need to grab references
to old_dentry and new_dentry in ceph_rename(), though - if that thing
can outlive ceph_rename(), we have a bad trouble.  And if it can't, there's
no reasons to pin them down that way.
	* d_splice_alias() - broken.  Called without any locking on the
old directory, nevermind ->s_vfs_rename_mutex.  I really believe that
this pair of commits needs to be reverted.  The earlier code used to
guarantee that alias would be detached.

d_materialise_unique(dentry, inode) must be called with dentry->d_parent
having its ->d_inode->i_mutex held.  AFAICS, that's satisfied for all
callers - they come either from ->lookup() or from ->readdir(), both
having the parent inode locked.  ceph is a possible exceptions; it's
damn hard to trace back to the places triggering ceph_fill_trace() in
general ;-/

__d_unalias(), in case we are asked for cross-directory move, grabs
->s_vfs_rename_mutex first (with trylock).  That makes ->d_parent of
alias stable; then it grabs ->i_mutex on that parent (again with trylock)
and we are all set.

__d_materialise_dentry(dentry, alias) is called only when alias is
unattached.  That looks like potential race, BTW - what's to prevent it
from becoming attached right under us?  We do have __d_materialise_dentry()
and __d_ualias serialized by alias->d_inode->i_lock, but... there's another
way for that unattached alias to get attached - d_splice_alias().  And
it's done without ->i_lock there.  OTOH, it's not a problem to extend
the protection of ->i_lock to that spot as well.

I'll continue this series in a couple of hours - there's more.  In the
meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above
for the reasons.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-07 19:36                   ` processes hung after sys_renameat, and 'missing' processes Al Viro
@ 2012-06-07 20:43                     ` Sage Weil
  2012-06-07 23:12                     ` Eric W. Biederman
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Sage Weil @ 2012-06-07 20:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Jones, Linus Torvalds, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Eric W. Biederman, elder

Hi Al,

On Thu, 7 Jun 2012, Al Viro wrote:
> [maintainers of assorted code involved Cc'd]
> 
> On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> > On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> > > 
> > >  > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> > >  > the parents.
> > > 
> > > ok, I ended up with..
> > > 
> > >         WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> > > 
> > >         if (dentry->d_parent != NULL)
> > 			       != dentry)
> > 
> > ->d_parent *never* should be NULL; when dentry is disconnected from the
> > tree (or hadn't been connected to it yet), it points to that dentry itself.
> 
> BTW, I've started to put together some documentation on ->d_parent use.
> And already have found idiocies.
> 
> 0) *all* instances of struct dentry should come from __d_alloc(); never
> embed it into another object or declare variables of type struct dentry.
> AFAICS, that one is satisfied now; any out-of-tree code violating that
> deserves to be broken and *will* be broken.
> 
> 1) For all code outside of fs/dcache.c it's read-only.  Never change it
> directly, period.  Again, satisfied in mainline, out-of-tree code can
> go play in the traffic, for all I care.
> 
> 2) Places that set ->d_parent: __d_alloc(), d_alloc(), __d_move(),
> __d_materialize_dentry().  The first two are doing that to new instance
> of struct dentry with no references to it anywhere in shared data
> structures.  The other two have had dentry passed to dentry_lock_for_move().
> 
> 3) d_alloc(NULL, name) should never happen (will instantly oops in that case).
> 
> 4) ->d_parent is *never* NULL; the only way to stumble across a dentry with
> that not being true is to crawl through the slab page directly and to run
> into an object that has been allocated by not initialized yet.  Don't do
> that (nothing in mainline does, thanks $DEITY).  "No parent" == "dentry
> has ->d_parent pointing to dentry itself".  There's a helper checking that,
> albeit with slightly misleading name - IS_ROOT(dentry).  Doesn't have to
> be fs root - anything still not attached to the tree or already detached
> from it will be that way.  Any code checking that ->d_parent is NULL is
> almost certainly a bug.  And we have a bunch of such places in fs/ceph:
> 	ceph_init_dentry()
> 	ceph_d_prune()
> 	ceph_mdsc_build_path()

The ceph_init_dentry check was added in 92cf7652 in response to getting a 
dentry with null d_parent out of d_obtain_alias().  I suspect it was 
papering over an old bug in that case.  Pushed a patch to the ceph tree to 
clean these checks out, see f3722944a8565edf434ee44bdfa37715ae0d91cd in
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git.  Let 
me know if you want to carry it, otherwise I'll push it in the next 
window.

> a few in fs/cifs:
> 	build_path_from_dentry()
> complete idiocy in fs/debugfs (not only check for NULL ->d_parent, but
> one for ->d_parent being a negative dentry; also can't happen).
> 	debugfs_remove()
> 	debugfs_remove_recursive()
> one in fs/ocfs2:
> 	ocfs2_match_dentry()
> and one in security/inode.c:
> 	securityfs_remove() (nicked from debugfs, at a guess).
> Some of those guys can be simply dropped, but I really wonder what
> ceph_init_dentry/ceph_d_prune intended there.
> 
> Incidentall, build_path_from_dentry() ought to be lifted into fs/dcache.c;
> we have at least three copies and one of them is not an instant roothole
> only because hppfs doesn't follow underlying procfs changes.  And I'm going
> to export is_subdir() - at least gfs2 has grown itself a b0rken copy...
> 
> 5) all assignments to dentry->d_parent after initialization are under
> dentry->d_lock; dentry_lock_for_move() takes care of that.
> 
> 6) all reassignments to dentry->d_parent have ->d_lock held on old and new
> parent.  Again, dentry_lock_for_move() takes care of that and so does
> d_alloc() (that one - for new parent only; old parent is dentry itself
> here and nothing could have seen it yet).
> 
> 7) attached dentry moving away from old parent must have ->i_mutex on
> that old parent held.
> 
> 8) dentry getting attached to a new parent must have ->i_mutex held on
> that new parent.
> 
> 9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex
> held.
> 
> Now, (7--9) are interesting; the call graph looks so:
> __d_move()
> 	<- d_move()
> 	<- __d_unalias()
> 		<- d_materialise_unique()
> __d_materialise_dentry()
> 	<- d_materialise_unique()
> 
> rename-related callers of d_move() (vfs_rename_dir/vfs_rename_other/
> nfs_rename/ocfs2_dentry_move from ocfs2_rename/ceph_rename/v9fs_vfs_rename)
> share the same locking environment; we have both parents + ->s_vfs_rename_mutex
> in case of cross-directory move held.  Pretty much by an accident we have
> one difference between vfs_rename_dir and vfs_rename_other - holding ->i_mutex
> on *victim* of overwriting rename() is not something d_move() cares about.
> We can make that more uniform, but that's really unrelated to that.
> Note that in all these cases both dentries are guaranteed to be attached
> all along.
> 
> Other callers of d_move():
> 	vfat_lookup() does d_move() of an alias; the validity of that
> strongly relies on the lack of anything resembling hardlinks on VFAT;
> the parents will be the same (and parent of target is locked).
> 	sysfs_lookup() plays insane games with d_move(); I _really_
> don't like the look of that code.  It tries to play catch-up after
> sysfs_rename() done its thing.  AFAICS, the parent *can* be changed
> by sysfs_rename().  No lock is held on said parent here; no
> ->s_vfs_rename_mutex either.  Moreover, none of the sysfs_rename() callers
> (kobject_move(), etc.) appear to give a damn about checking if it would
> create a loop.
> 	* debugfs_rename() - imitates what vfs_rename() is doing.  Same
> locking environment.  BTW,
>         trap = lock_rename(new_dir, old_dir);
>         /* Source or destination directories don't exist? */
>         if (!old_dir->d_inode || !new_dir->d_inode)
>                 goto exit;
> is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake!
> If this can be called with old_dir or new_dir negative, it's buggered.
> 	* ceph_fill_trace() in CEPH_MDS_OP_RENAME case.  Seems to be
> done while ceph_rename() is being executed; i.e. the usual locking
> environment holds.  No idea why it feels the need to grab references
> to old_dentry and new_dentry in ceph_rename(), though - if that thing
> can outlive ceph_rename(), we have a bad trouble.  And if it can't, there's
> no reasons to pin them down that way.

The references are owned by the request struct, which can outlive 
ceph_rename() in the case of ^C or whatever.  When that happens, the 
request is marked aborted and the ceph_fill_trace() code doesn't run 
(since the locks are no longer held).  It still needs the refs to maintain 
its own sanity, however.

> 	* d_splice_alias() - broken.  Called without any locking on the
> old directory, nevermind ->s_vfs_rename_mutex.  I really believe that
> this pair of commits needs to be reverted.  The earlier code used to
> guarantee that alias would be detached.
> 
> d_materialise_unique(dentry, inode) must be called with dentry->d_parent
> having its ->d_inode->i_mutex held.  AFAICS, that's satisfied for all
> callers - they come either from ->lookup() or from ->readdir(), both
> having the parent inode locked.  ceph is a possible exceptions; it's
> damn hard to trace back to the places triggering ceph_fill_trace() in
> general ;-/

I know, sorry about that.  :)

ceph_fill_trace() runs on the struct ceph_mds_request, which as an 
r_locked_dir member that is verified before calling splice_dentry() -> 
d_materialise_unique().  I just double-checked and it looks like all call 
sites are safe.

If ceph_rename() (or a similar method who is holding the dir i_mutex) 
aborts, ceph_mdsc_do_request() marks the aborted flag under r_fill_mutex 
to avoid races with ceph_fill_trace(), so we can rely on those locks being 
held for the duration.

Does that clear things up?
sage


> 
> __d_unalias(), in case we are asked for cross-directory move, grabs
> ->s_vfs_rename_mutex first (with trylock).  That makes ->d_parent of
> alias stable; then it grabs ->i_mutex on that parent (again with trylock)
> and we are all set.
> 
> __d_materialise_dentry(dentry, alias) is called only when alias is
> unattached.  That looks like potential race, BTW - what's to prevent it
> from becoming attached right under us?  We do have __d_materialise_dentry()
> and __d_ualias serialized by alias->d_inode->i_lock, but... there's another
> way for that unattached alias to get attached - d_splice_alias().  And
> it's done without ->i_lock there.  OTOH, it's not a problem to extend
> the protection of ->i_lock to that spot as well.
> 
> I'll continue this series in a couple of hours - there's more.  In the
> meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above
> for the reasons.
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-07 19:36                   ` processes hung after sys_renameat, and 'missing' processes Al Viro
  2012-06-07 20:43                     ` Sage Weil
@ 2012-06-07 23:12                     ` Eric W. Biederman
  2012-06-07 23:39                       ` Al Viro
  2012-06-07 23:57                       ` Linus Torvalds
  2012-06-08  2:18                     ` Al Viro
  2012-06-08 16:22                     ` J. Bruce Fields
  3 siblings, 2 replies; 18+ messages in thread
From: Eric W. Biederman @ 2012-06-07 23:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Jones, Linus Torvalds, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

Al Viro <viro@ZenIV.linux.org.uk> writes:

> [maintainers of assorted code involved Cc'd]

> 7) attached dentry moving away from old parent must have ->i_mutex on
> that old parent held.
> 
> 8) dentry getting attached to a new parent must have ->i_mutex held on
> that new parent.
> 
> 9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex
> held.
> 
> Now, (7--9) are interesting; the call graph looks so:
> __d_move()
> 	<- d_move()
> 	<- __d_unalias()
> 		<- d_materialise_unique()
> __d_materialise_dentry()
> 	<- d_materialise_unique()

So at least for filesystems that don't implement 
inode->i_op->rename like sysfs I don't see where you get your
rules 7-9 for d_move.

We take the approprate dentry locks in the approparite order so d_move
and the dcache should not care in the slightest about the inode
mutecies.

If we need the inode mutecies to make the dcache bits safe then
something really is insane.  There may be subtle insanities in the
vfs that require the inode muticies of the parents in d_move but I am
certainly not seeing them.  At least as I read it the code in __d_move
only touches and modifies dentry fields.

For any distributed filesystem and for sysfs in particular all of the
vfs inode mutex locking in the case of rename is absolutely useless as
the renames don't go through vfs, and the vfs simply does not have
enough information to use generic locks.

> 	sysfs_lookup() plays insane games with d_move(); I _really_
> don't like the look of that code.  It tries to play catch-up after
> sysfs_rename() done its thing.  AFAICS, the parent *can* be changed
> by sysfs_rename().  No lock is held on said parent here; no
> ->s_vfs_rename_mutex either.  Moreover, none of the sysfs_rename() callers
> (kobject_move(), etc.) appear to give a damn about checking if it would
> create a loop.

Can sysfs_lookup cause the weird renameat problems we are seeing? No.

We don't leave any locks hanging and locks are all take in the order
the vfs expects them to be taken and in the proper nesting orders.

Your complaints about the inode muticies and of the parent directories
and  s_vfs_rename_mutex seem to be completely irrelevant as those
locks mean nothing to sysfs.

All sysfs_lookup is doing is opportunistically updating the dcache
to reflect reality if we happen to come accross a sysfs_dirent that
has been renamed, and since it is just the dcache that is being
updated the locks that d_move takes appear completely sufficient
for that to be safe.

Can sysfs_rename change the sysfs_dirent parent?  yes

Do we have sufficient locking in sysfs_rename? yes we
take the sysfs_mutex, that it overkill but it serializes all
sysfs_dirent changes.

Is there loop prevention for sysfs_rename?  No.  However there
are only 5 callers of device_move and all of them are known
not to create loops.

It is probably worth it someday to get around to adding a test in
sysfs_rename to be double check and verify that loops are not added.

For purposes of analyzing sysfs_lookup we can assume that there are no
in renames, as that would imply a bug in sysfs_rename.

The interactions between the vfs and sysfs are indeed insane.  Because
of the way the vfs is designed the vfs must tell lies about deleted
files and directories when there are submounts involved.

Additionally lies also get told to the vfs about renames because the vfs
implements an on-demand consistency model with respect to remote
filesystems.  With the result that frequently we don't have full
knowledge about renames when we are revalidating directories so make a
rename look like unlink+link pair instead.

So I believe you are asking is sysfs_lookup safe? yes

What syfs_lookup is doing is if it happens to have sufficient knowledge
to detect a rename happened in the past is:

- Each sysfs_dirent has at most one struct dentry per super block.

- If d_find_alias finds a dirent then I know that the dirent for my
  inode is in the wrong place in the dcache.

- d_move performs all of the necessary dcache locking, to ensure moving
  a dentry is safe even if the parent is renamed, at least with respect
  to the dcache.

- I hold sysfs_mutex over the whole thing which guarantees that the
  sysfs directory structure does not change during this time.

> I'll continue this series in a couple of hours - there's more.  In the
> meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above
> for the reasons.

I hope my explanations help.

I'd like to hear why you happen to think s_vfs_rename_mutex and the
inode muticies of the parent directories are at all interesting in the
case of d_move and in the case of sysfs.

Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-07 23:12                     ` Eric W. Biederman
@ 2012-06-07 23:39                       ` Al Viro
  2012-06-07 23:57                       ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Al Viro @ 2012-06-07 23:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Jones, Linus Torvalds, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

On Thu, Jun 07, 2012 at 04:12:45PM -0700, Eric W. Biederman wrote:
> We take the approprate dentry locks in the approparite order so d_move
> and the dcache should not care in the slightest about the inode
> mutecies.
> 
> If we need the inode mutecies to make the dcache bits safe then
> something really is insane.  There may be subtle insanities in the
> vfs that require the inode muticies of the parents in d_move but I am
> certainly not seeing them.  At least as I read it the code in __d_move
> only touches and modifies dentry fields.

Yes.  Now, go take a look at e.g. the locking order on ->d_lock.  No,
I'm not saying that I like it.  Not at all.  But we do rely on the
non-local protections for tree topology, just to make sure that the
damn thing has the locking order consistent - not changing between
the moments you take locks you've ordered, for starters.

I realize that "serialize all operations on a single per-machine mutex" is a
solution.  It's just not something feasible when we are talking about all
directory tree modifications on a general-purpose filesystem.  So no,
sysfs approach to that kind of problems is not usable here.

I doubt that we have something sysfs-related in the deadlocks davej is seeing,
but I seriously suspect that I can cook one based on sysfs_rename() setting
the things up for silent topology changes on ->lookup().  I would suggest
using d_materialise_unique() there - that one *does* take care to take
locks needed.

BTW, looking at the code in sysfs_lookup()... why bother with d_set_d_op()
instead of just sb->s_d_op = &sysfs_dentry_ops; once during sysfs_fill_super()?
In the worst case you need to do that after you've allocated the root
dentry, depending on whether you are willing or not to make ->d_revalidate()
return 1 whenever it's called on the root dentry...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-07 23:12                     ` Eric W. Biederman
  2012-06-07 23:39                       ` Al Viro
@ 2012-06-07 23:57                       ` Linus Torvalds
  2012-06-08  0:36                         ` Al Viro
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2012-06-07 23:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Dave Jones, Linux Kernel, Miklos Szeredi, Jan Kara,
	Peter Zijlstra, linux-fsdevel, J. Bruce Fields, Sage Weil

On Thu, Jun 7, 2012 at 4:12 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> We take the approprate dentry locks in the approparite order so d_move
> and the dcache should not care in the slightest about the inode
> mutecies.

Part of the problem is that you can't even *determine* the appropriate
order without holding the rename mutex.

Now, it may turn out to be a non-issue for sysfs just because there
are no unconstrained directory renames there, but seriously: even the
d_ancestor() check itself (which is how we determine the dentry lock
order) needs that filesystem to be quiescent wrt directory renames in
order to work.

So it may not actually depend on the inode->i_mutex, but it does need
some serialization outside the dcache subsystem.

Any per-filesystem mutex should do, so if sysfs always holds the
sysfs_mutex - and never allows user-initiated renames - it should be
safe.

          Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-07 23:57                       ` Linus Torvalds
@ 2012-06-08  0:36                         ` Al Viro
  2012-06-08  0:42                           ` Linus Torvalds
                                             ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Al Viro @ 2012-06-08  0:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

On Thu, Jun 07, 2012 at 04:57:13PM -0700, Linus Torvalds wrote:

> Any per-filesystem mutex should do, so if sysfs always holds the
> sysfs_mutex - and never allows user-initiated renames - it should be
> safe.

Frankly, I would very much prefer to have the same locking rules wherever
possible.  The locking system is already overcomplicated and making its
analysis fs-dependent as well... <shudder>  Sure, we can do that, and that
might even work, until we find out that some piece of code that started
as a helper to some function never called on sysfs dentries had been
reused on the path that *is* reachable on sysfs.  At which point we are
suddenly in trouble.

I wouldn't be bothered so much if the overall picture had been simpler;
unfortunately, it isn't.

Eric, how about this - if nothing else, that makes code in there simpler
and less dependent on details of VFS guts:

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index e6bb9b2..5579826 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -363,7 +363,7 @@ static void sysfs_dentry_iput(struct dentry *dentry, struct inode *inode)
 	iput(inode);
 }
 
-static const struct dentry_operations sysfs_dentry_ops = {
+const struct dentry_operations sysfs_dentry_ops = {
 	.d_revalidate	= sysfs_dentry_revalidate,
 	.d_delete	= sysfs_dentry_delete,
 	.d_iput		= sysfs_dentry_iput,
@@ -795,16 +795,8 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,
 	}
 
 	/* instantiate and hash dentry */
-	ret = d_find_alias(inode);
-	if (!ret) {
-		d_set_d_op(dentry, &sysfs_dentry_ops);
-		dentry->d_fsdata = sysfs_get(sd);
-		d_add(dentry, inode);
-	} else {
-		d_move(ret, dentry);
-		iput(inode);
-	}
-
+	dentry->d_fsdata = sysfs_get(sd);
+	ret = d_materialise_unique(dentry, inode);
  out_unlock:
 	mutex_unlock(&sysfs_mutex);
 	return ret;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 52c3bdb..c15a7a3 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -68,6 +68,7 @@ static int sysfs_fill_super(struct super_block *sb, void *data, int silent)
 	}
 	root->d_fsdata = &sysfs_root;
 	sb->s_root = root;
+	sb->s_d_op = &sysfs_dentry_ops;
 	return 0;
 }
 
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 661a963..d73c093 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -157,6 +157,7 @@ extern struct kmem_cache *sysfs_dir_cachep;
  */
 extern struct mutex sysfs_mutex;
 extern spinlock_t sysfs_assoc_lock;
+extern const struct dentry_operations sysfs_dentry_ops;
 
 extern const struct file_operations sysfs_dir_operations;
 extern const struct inode_operations sysfs_dir_inode_operations;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  0:36                         ` Al Viro
@ 2012-06-08  0:42                           ` Linus Torvalds
  2012-06-08  0:59                           ` Al Viro
  2012-06-08  2:08                           ` Eric W. Biederman
  2 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2012-06-08  0:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

On Thu, Jun 7, 2012 at 5:36 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Frankly, I would very much prefer to have the same locking rules wherever
> possible.  The locking system is already overcomplicated and making its
> analysis fs-dependent as well... <shudder>

I do agree that it would be better if we avoid it. I was just trying
to explain that the dentry locking is *not* enough, for the simple
reason that it relies on upper-level non-dentry locking just to work.

Your patch looks good, but whether it works I have no idea ;)

               Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  0:36                         ` Al Viro
  2012-06-08  0:42                           ` Linus Torvalds
@ 2012-06-08  0:59                           ` Al Viro
  2012-06-08  5:25                             ` Eric W. Biederman
  2012-06-08  2:08                           ` Eric W. Biederman
  2 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2012-06-08  0:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

On Fri, Jun 08, 2012 at 01:36:04AM +0100, Al Viro wrote:
> Eric, how about this - if nothing else, that makes code in there simpler
> and less dependent on details of VFS guts:

Argh.  No, it's not enough.  Why are you using ->d_iput()?  You are not
doing anything unusual with inode; the natural place for that is in
->d_release() and then it will get simpler rules wrt setting ->d_fsdata.
As it is, you need to do that exactly after the point where you know
that it dentry won't be dropped without going through d_add().

OK, I've split that in two commits and put into vfs.git#sysfs; take a look
and comment, please.  Should get to git.kernel.in a few...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  0:36                         ` Al Viro
  2012-06-08  0:42                           ` Linus Torvalds
  2012-06-08  0:59                           ` Al Viro
@ 2012-06-08  2:08                           ` Eric W. Biederman
  2012-06-08  2:37                             ` Al Viro
  2 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2012-06-08  2:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Jun 07, 2012 at 04:57:13PM -0700, Linus Torvalds wrote:
>
>> Any per-filesystem mutex should do, so if sysfs always holds the
>> sysfs_mutex - and never allows user-initiated renames - it should be
>> safe.
>
> Frankly, I would very much prefer to have the same locking rules wherever
> possible.  The locking system is already overcomplicated and making its
> analysis fs-dependent as well... <shudder>  Sure, we can do that, and that
> might even work, until we find out that some piece of code that started
> as a helper to some function never called on sysfs dentries had been
> reused on the path that *is* reachable on sysfs.  At which point we are
> suddenly in trouble.

Staring at it I see what I was missing.   The practical issue is
lock_rename(), and any parts of the vfs that depend on lock_rename().

d_move and the dcache are made safe just by rename_lock.  However other
parts of the vfs that care about using d_ancestor are not.  I can't
immediately see a case that really cares but I can't rule such a case
out easily either.

> I wouldn't be bothered so much if the overall picture had been simpler;
> unfortunately, it isn't.
>
> Eric, how about this - if nothing else, that makes code in there simpler
> and less dependent on details of VFS guts:
>
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index e6bb9b2..5579826 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -363,7 +363,7 @@ static void sysfs_dentry_iput(struct dentry *dentry, struct inode *inode)
>  	iput(inode);
>  }
>  
> -static const struct dentry_operations sysfs_dentry_ops = {
> +const struct dentry_operations sysfs_dentry_ops = {
>  	.d_revalidate	= sysfs_dentry_revalidate,
>  	.d_delete	= sysfs_dentry_delete,
>  	.d_iput		= sysfs_dentry_iput,
> @@ -795,16 +795,8 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,
>  	}
>  
>  	/* instantiate and hash dentry */
> -	ret = d_find_alias(inode);
> -	if (!ret) {
> -		d_set_d_op(dentry, &sysfs_dentry_ops);
> -		dentry->d_fsdata = sysfs_get(sd);
> -		d_add(dentry, inode);
> -	} else {
> -		d_move(ret, dentry);
> -		iput(inode);
> -	}
> -
> +	dentry->d_fsdata = sysfs_get(sd);
> +	ret = d_materialise_unique(dentry, inode);

I have a small problem with d_materialise_unique.  For renames of files
d_materialise_unique calls __d_instantiate_unique. __d_instantiate_unique
does not detect renames of files.  Which at least misses the rename
of sysfs symlinks.

Could we put together a d_materialise_unalias for inodes that we know
they always only have one dentry?  That I would be happy to use.

I think the reason I would up with my own version was that the dcache
did no provide what I needed and it was just a few lines to code my own.

> diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> index 52c3bdb..c15a7a3 100644
> --- a/fs/sysfs/mount.c
> +++ b/fs/sysfs/mount.c
> @@ -68,6 +68,7 @@ static int sysfs_fill_super(struct super_block *sb, void *data, int silent)
>  	}
>  	root->d_fsdata = &sysfs_root;
>  	sb->s_root = root;
> +	sb->s_d_op = &sysfs_dentry_ops;

I have no problem with this bit.  To answer your earlier question s_d_op
predates this code which is why sysfs was not using it.
>  	return 0;
>  }
>  
> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
> index 661a963..d73c093 100644
> --- a/fs/sysfs/sysfs.h
> +++ b/fs/sysfs/sysfs.h
> @@ -157,6 +157,7 @@ extern struct kmem_cache *sysfs_dir_cachep;
>   */
>  extern struct mutex sysfs_mutex;
>  extern spinlock_t sysfs_assoc_lock;
> +extern const struct dentry_operations sysfs_dentry_ops;
>  
>  extern const struct file_operations sysfs_dir_operations;
>  extern const struct inode_operations sysfs_dir_inode_operations;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-07 19:36                   ` processes hung after sys_renameat, and 'missing' processes Al Viro
  2012-06-07 20:43                     ` Sage Weil
  2012-06-07 23:12                     ` Eric W. Biederman
@ 2012-06-08  2:18                     ` Al Viro
  2012-06-08 16:22                     ` J. Bruce Fields
  3 siblings, 0 replies; 18+ messages in thread
From: Al Viro @ 2012-06-08  2:18 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra
  Cc: linux-fsdevel, J. Bruce Fields, Eric W. Biederman, Sage Weil

On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:

> Other callers of d_move():
> 	* debugfs_rename() - imitates what vfs_rename() is doing.  Same
> locking environment.  BTW,
>         trap = lock_rename(new_dir, old_dir);
>         /* Source or destination directories don't exist? */
>         if (!old_dir->d_inode || !new_dir->d_inode)
>                 goto exit;
> is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake!
> If this can be called with old_dir or new_dir negative, it's buggered.

It's worse, actually.  If we _ever_ do cross-directory debugfs_rename()
without external serialization, we are in trouble.  It does imitate
vfs_rename() (actually - its callers), but there's an unpleasant difference:
instead of "lock parents with lock_rename(), then do lookups and we are
guaranteed nobody will change ->d_parent of children we are working with"
it's "lock the new parent and whatever happens to be the current parent
of the object given to us; do lookup for target, pray that the old parent
still was the parent of our object by the time we got the locks".

AFAICS, there's only one caller doing cross-directory moves (__clk_reparent())
and currently all callers are serialized by a mutex in there, but that's
not documented anywhere - not for __clk_reparent(), not for debugfs_rename().

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  2:08                           ` Eric W. Biederman
@ 2012-06-08  2:37                             ` Al Viro
  0 siblings, 0 replies; 18+ messages in thread
From: Al Viro @ 2012-06-08  2:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

On Thu, Jun 07, 2012 at 07:08:04PM -0700, Eric W. Biederman wrote:
> > +	dentry->d_fsdata = sysfs_get(sd);
> > +	ret = d_materialise_unique(dentry, inode);
> 
> I have a small problem with d_materialise_unique.  For renames of files
> d_materialise_unique calls __d_instantiate_unique. __d_instantiate_unique
> does not detect renames of files.  Which at least misses the rename
> of sysfs symlinks.

Er... yes, but why do we care?  It's not as if you had a hardwired
reference to dentry from your objects, after all (can't, with multiple
superblocks).  So you get old stale dentry at the old location and
a new one where we'd moved that sucker.  They have the same inode
and each holds a reference to the same sd; ->d_revalidate() at the
old location must invalidate the old instance anyway, since you are
not guaranteed that lookup at the new one will happen before repeated
lookup at the old one.

Directories *are* special in that respect, but symlinks are trivial...
VFS doesn't care if you have extra dentries for those and neither does
sysfs, AFAICS.

It's not that we couldn't teach d_materialise_unique() about those (e.g.
introduce a new dentry flag and treat dentries with it as directories
for d_materialise_unique() purposes); I would like to understand the
reasons for doing that, though.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  0:59                           ` Al Viro
@ 2012-06-08  5:25                             ` Eric W. Biederman
  2012-06-08  5:48                               ` Al Viro
  0 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2012-06-08  5:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Fri, Jun 08, 2012 at 01:36:04AM +0100, Al Viro wrote:
>> Eric, how about this - if nothing else, that makes code in there simpler
>> and less dependent on details of VFS guts:
>
> Argh.  No, it's not enough.  Why are you using ->d_iput()?  You are not
> doing anything unusual with inode; the natural place for that is in
> ->d_release() and then it will get simpler rules wrt setting ->d_fsdata.

No good reason.  We do tie inode numbers to the syfs_dirent but the
inode was changed quite a while ago to hold it's own reference
sysfs_dirent.  So using d_iput looks like sysfs historical baggage.

> As it is, you need to do that exactly after the point where you know
> that it dentry won't be dropped without going through d_add().
>
> OK, I've split that in two commits and put into vfs.git#sysfs; take a look
> and comment, please.  Should get to git.kernel.in a few...

The patches on your sysfs branch look reasonable.

I am still learly of d_materialise_unique as it allows to create alias's
on non-directories.  It isn't a functional problem as d_revalidate will
catch the issue and make it look like we have a unlink/link pair instead
of a proper rename.  However since it is possible I would like to aim
for the higher quality of implemntation and use show renames as renames.

What would be ideal for sysfs is the d_add_singleton function below.  It
does what is needed without the weird d_materialise strangeness that is
in d_materialise_unique.  But if a all singing all dancing all
confusing function is preferable I would not mind.

What I would really like is an interface so that a distrubuted/remote
filesystem can provide an inotify like stream of and we can really
implement inotify in a distributed filesystem.  But since I am too lazy
to do that I am reluctant to give up what progress I have actually made
in that direction.

Eric

diff --git a/fs/dcache.c b/fs/dcache.c
index 85c9e2b..2aab524 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2537,6 +2537,74 @@ out_nolock:
 }
 EXPORT_SYMBOL_GPL(d_materialise_unique);
 
+/**
+ * d_add_singleton - add an inode with only a single dentry
+ * @entry: dentry to instantiate
+ * @inode: inode to attach to this dentry
+ *
+ * Fill in inode information in the entry. On success, it returns NULL.
+ * If an alias of "entry" already exists, then we assume that a rename
+ * has occurred and not been reported so the alias is renamed and we
+ * return the aliased dentry and drop one reference to the inode.
+ *
+ * Note that in order to avoid conflicts with rename() etc, the caller
+ * had better be holding the parent directory semaphore.
+ *
+ * This also assumes that the inode count has been incremented
+ * (or otherwise set) by the caller to indicate that it is now
+ * in use by the dcache.
+ */
+struct dentry *d_add_singleton(struct dentry *entry, struct inode *inode)
+{
+       struct dentry *alias, *actual = entry;
+
+       if (!inode) {
+               __d_instantiate(entry, NULL);
+               d_rehash(entry);
+               goto out_nolock;
+       }
+
+       spin_lock(&inode->i_lock);
+
+       /* Does an aliased dentry already exist? */
+       alias = __d_find_alias(inode);
+       if (alias) {
+               write_seqlock(&rename_lock);
+
+               if (d_ancestor(alias, entry)) {
+                       /* Check for loops */
+                       actual = ERR_PTR(-ELOOP);
+                       spin_unlock(&inode->i_lock);
+               } else {
+                       /* Avoid aliases.  This drops inode->i_lock */
+                       actual = __d_unalias(inode, entry, alias);
+               }
+               write_sequnlock(&rename_lock);
+               if (IS_ERR(actual)) {
+                       if (PTR_ERR(actual) == -ELOOP)
+                               pr_warn_ratelimited(
+                                       "VFS: Lookup of '%s' in %s %s"
+                                       " would have caused loop\n",
+                                       entry->d_name.name,
+                                       inode->i_sb->s_type->name,
+                                       inode->i_sb->s_id);
+                       dput(alias);
+               }
+               goto out_nolock;
+       }
+       __d_instantiate(entry, inode);
+       spin_unlock(&inode->i_lock);
+       d_rehash(entry);
+out_nolock:
+       if (actual == entry ) {
+               security_d_instantiate(entry, inode);
+               return NULL;
+       }
+       iput(inode);
+       return actual;
+}
+
+
 static int prepend(char **buffer, int *buflen, const char *str, int namelen)
 {
        *buflen -= namelen;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 094789f..9613d4c 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -219,6 +219,8 @@ static inline int dname_external(struct dentry *dentry)
 extern void d_instantiate(struct dentry *, struct inode *);
 extern struct dentry * d_instantiate_unique(struct dentry *, struct inode *);
 extern struct dentry * d_materialise_unique(struct dentry *, struct inode *);
+extern struct dentry * d_materialise_unalias(struct dentry *, struct inode *);
+extern struct dentry *d_add_singleton(struct dentry *, struct inode *);
 extern void __d_drop(struct dentry *dentry);
 extern void d_drop(struct dentry *dentry);
 extern void d_delete(struct dentry *);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  5:25                             ` Eric W. Biederman
@ 2012-06-08  5:48                               ` Al Viro
  2012-06-08  7:54                                 ` Eric W. Biederman
  0 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2012-06-08  5:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

On Thu, Jun 07, 2012 at 10:25:46PM -0700, Eric W. Biederman wrote:

> I am still learly of d_materialise_unique as it allows to create alias's
> on non-directories.  It isn't a functional problem as d_revalidate will
> catch the issue and make it look like we have a unlink/link pair instead
> of a proper rename.  However since it is possible I would like to aim
> for the higher quality of implemntation and use show renames as renames.

???

Please, explain.  link/unlink pair in which sense?  If you are thinking
of inotify and its ilk and tie some of that to dentry eviction, you
have far worse problem.  What stream of events do you expect if you
do
	some action that triggers sysfs_rename()
	stat(2) on old pathname
	stat(2) on new pathname
No matter what we do in lookup at the new pathname, the second step will
be "failed revalidate".  If you tie something (presumably unlink side of
your link/unlink pair) to that event, you'll get a very lousy series of
events, no matter what you do on the third step.

I don't see what kind of *notify hookup do you have in mind.  Anything that
treats "dentry failed revalidation or got evicted by memory pressure" as
"unlink" is completely nuts, IMO.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  5:48                               ` Al Viro
@ 2012-06-08  7:54                                 ` Eric W. Biederman
  2012-06-08 20:20                                   ` Al Viro
  0 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2012-06-08  7:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Jun 07, 2012 at 10:25:46PM -0700, Eric W. Biederman wrote:
>
>> I am still learly of d_materialise_unique as it allows to create alias's
>> on non-directories.  It isn't a functional problem as d_revalidate will
>> catch the issue and make it look like we have a unlink/link pair instead
>> of a proper rename.  However since it is possible I would like to aim
>> for the higher quality of implemntation and use show renames as renames.
>
> ???
>
> Please, explain.  link/unlink pair in which sense? 

In the sense that if we don't use d_move.  A rename will look to
userspace like a pair of sys_link and sys_unlink operations.

If I happen to have a file open with the old name and the dentry passes
through d_drop.  The /proc/self/fd/N will show the filename as
"...(deleted)".

And in every other way I can think of that is userspace visible this
will look like a pair of link and unlink operations.

> I don't see what kind of *notify hookup do you have in mind.  Anything that
> treats "dentry failed revalidation or got evicted by memory pressure" as
> "unlink" is completely nuts, IMO.

In this case much as it might be convinient to have a *notify report,
what I was thinking of were the much simpler userspace visible aspects,
like what /proc/self/fd/N symlinks report.

In the little corner case user visible details the current state of vfs
support for distributed filesystems looks nuts to me, especially where
we can't apply an appropriate d_move onto a renamed dentry.

The fact that open files, open directories and mount points pin dentries
in memory cause interesting challenges for keeping the local vfs state
in sync with the state of a remote filesystem.

What I would love to be able to do is to replay some kind of journal
that reports what happened to the filesystem outside of the linux vfs
onto the linux vfs so that we can get a more accurate picture of what
really happened to the filesystem.  Which should allow *notify and the
like to actually work.  Would make the /proc/self/fd/* symlinks more
useful, and make allow files that are mount points to be renamed.

But ultimately the change in semantics bugs me.  Using d_move less
often feels user visible and because d_materialise_unique because it
does not handle renames of files feels like a lurking maintenance bomb
for sysfs.

Especially since renames on files with mount points on them should be
treated differently from normal files.

Speaking of I just found a small unhandled case in __d_unalias.  We need
to deny renaming of mount points.

Eric

From: "Eric W. Biederman" <ebiederm@xmission.com>
Subject: dcache: Deny renaming via __d_unalias dentries of mountpoints

Make __d_unalias match vfs_rename_dir and vfs_rename_other and don't
allow renaming mount points.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

---
diff --git a/fs/dcache.c b/fs/dcache.c
index 85c9e2b..d236722 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2380,14 +2380,17 @@ static struct dentry *__d_unalias(struct inode *inode,
                struct dentry *dentry, struct dentry *alias)
 {
        struct mutex *m1 = NULL, *m2 = NULL;
-       struct dentry *ret;
+       struct dentry *ret = ERR_PTR(-EBUSY);
+
+       /* Linux does not rename mount points */
+       if (d_mountpoint(alias))
+               goto out_err;

        /* If alias and dentry share a parent, then no extra locks required */
        if (alias->d_parent == dentry->d_parent)
                goto out_unalias;

        /* See lock_rename() */
-       ret = ERR_PTR(-EBUSY);
        if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
                goto out_err;
        m1 = &dentry->d_sb->s_vfs_rename_mutex;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-07 19:36                   ` processes hung after sys_renameat, and 'missing' processes Al Viro
                                       ` (2 preceding siblings ...)
  2012-06-08  2:18                     ` Al Viro
@ 2012-06-08 16:22                     ` J. Bruce Fields
  2012-06-08 17:44                       ` Linus Torvalds
  3 siblings, 1 reply; 18+ messages in thread
From: J. Bruce Fields @ 2012-06-08 16:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Jones, Linus Torvalds, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Eric W. Biederman, Sage Weil

On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:
> 	* d_splice_alias() - broken.  Called without any locking on the
> old directory, nevermind ->s_vfs_rename_mutex.

I was assuming that the callers were in lookup, held i_mutex on a
parent, and that in the case of a directory, existence of an alias with
a different parent could only result from a filesystem bug.

> I really believe that this pair of commits needs to be reverted.  The
> earlier code used to guarantee that alias would be detached.

In the case that prompted that first commit, the directory in question
had an alias that was detached (which I'm taking to mean IS_ROOT(dentry)
was true?), but not flagged DISCONNECTED.  The particular case was only
reproduceable on an older kernel, and I couldn't find a similar
reproducer on recent upstream, but I also couldn't convince myself it
was impossible.

So, maybe the correct thing is to revert that change.  Or maybe it
should be picking an IS_ROOT dentry instead of a DISCONNECTED one?

There's some previous discussion at

	http://marc.info/?i=<20110310105821.GE22723@ZenIV.linux.org.uk>

(in particular a long post from Neil:

	http://marc.info/?i=<20110311150749.2fa2be66@notabene.brown>

that I should review.)

--b.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08 16:22                     ` J. Bruce Fields
@ 2012-06-08 17:44                       ` Linus Torvalds
  2012-06-11 12:17                         ` J. Bruce Fields
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2012-06-08 17:44 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Al Viro, Dave Jones, Linux Kernel, Miklos Szeredi, Jan Kara,
	Peter Zijlstra, linux-fsdevel, J. Bruce Fields, Eric W. Biederman,
	Sage Weil

On Fri, Jun 8, 2012 at 9:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:
>
>> I really believe that this pair of commits needs to be reverted.  The
>> earlier code used to guarantee that alias would be detached.
>
> In the case that prompted that first commit, the directory in question
> had an alias that was detached (which I'm taking to mean IS_ROOT(dentry)
> was true?), but not flagged DISCONNECTED.  The particular case was only
> reproduceable on an older kernel, and I couldn't find a similar
> reproducer on recent upstream, but I also couldn't convince myself it
> was impossible.
>
> So, maybe the correct thing is to revert that change.  Or maybe it
> should be picking an IS_ROOT dentry instead of a DISCONNECTED one?

I've reverted the changes for now, it looks like the discussion about
them is still on-going, and I think I'll feel happier if we just go
back to the old status quo for the nonce.

            Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08  7:54                                 ` Eric W. Biederman
@ 2012-06-08 20:20                                   ` Al Viro
  0 siblings, 0 replies; 18+ messages in thread
From: Al Viro @ 2012-06-08 20:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, Miklos Szeredi,
	Jan Kara, Peter Zijlstra, linux-fsdevel, J. Bruce Fields,
	Sage Weil

On Fri, Jun 08, 2012 at 12:54:00AM -0700, Eric W. Biederman wrote:
> > Please, explain.  link/unlink pair in which sense? 
> 
> In the sense that if we don't use d_move.  A rename will look to
> userspace like a pair of sys_link and sys_unlink operations.
> 
> If I happen to have a file open with the old name and the dentry passes
> through d_drop.  The /proc/self/fd/N will show the filename as
> "...(deleted)".

Um... reality check, please - you were talking about symlinks until now.  Sure,
these days we *can* open them (with O_PATH), but...  And unless I'm misreading
something, none of those sysfs_rename() callers are acting on regular files.

In any case, symlinks or no symlinks, what do you expect to happen if
something in userland does lookup at the old location before anyone gets
around to lookup at the new one?  It's not as if that had been under
kernel control...

> In the little corner case user visible details the current state of vfs
> support for distributed filesystems looks nuts to me, especially where
> we can't apply an appropriate d_move onto a renamed dentry.

Reread your own code, please.  You are *not* guaranteed that lookup triggering
that d_move() of yours will come before ->d_revalidate().  In other words,
the current scheme is not any different in that respect.  And nothing in
userland really cares - scenario is too convoluted.

> Speaking of I just found a small unhandled case in __d_unalias.  We need
> to deny renaming of mount points.

Umm...  Agreed, fix pushed into the same branch.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: processes hung after sys_renameat, and 'missing' processes
  2012-06-08 17:44                       ` Linus Torvalds
@ 2012-06-11 12:17                         ` J. Bruce Fields
  0 siblings, 0 replies; 18+ messages in thread
From: J. Bruce Fields @ 2012-06-11 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Jones, Linux Kernel, Miklos Szeredi, Jan Kara,
	Peter Zijlstra, linux-fsdevel, J. Bruce Fields, Eric W. Biederman,
	Sage Weil

On Fri, Jun 08, 2012 at 10:44:09AM -0700, Linus Torvalds wrote:
> On Fri, Jun 8, 2012 at 9:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:
> >
> >> I really believe that this pair of commits needs to be reverted.  The
> >> earlier code used to guarantee that alias would be detached.
> >
> > In the case that prompted that first commit, the directory in question
> > had an alias that was detached (which I'm taking to mean IS_ROOT(dentry)
> > was true?), but not flagged DISCONNECTED.  The particular case was only
> > reproduceable on an older kernel, and I couldn't find a similar
> > reproducer on recent upstream, but I also couldn't convince myself it
> > was impossible.
> >
> > So, maybe the correct thing is to revert that change.  Or maybe it
> > should be picking an IS_ROOT dentry instead of a DISCONNECTED one?
> 
> I've reverted the changes for now, it looks like the discussion about
> them is still on-going, and I think I'll feel happier if we just go
> back to the old status quo for the nonce.

Fair enough; I'll keep investigating.

--b.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-06-11 12:17 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20120603232820.GQ30000@ZenIV.linux.org.uk>
     [not found] ` <20120606194233.GA1537@redhat.com>
     [not found]   ` <CA+55aFwxFVLPk4-dp73uccKXepKwjpGFgh0NW1u8d0zYzr=hBA@mail.gmail.com>
     [not found]     ` <20120606230040.GA18089@redhat.com>
     [not found]       ` <CA+55aFx_oREEd2XS_aG7PJ7dwzWjgx689V_uRPGWpVS01_++6A@mail.gmail.com>
     [not found]         ` <20120606235403.GC30000@ZenIV.linux.org.uk>
     [not found]           ` <20120607002914.GB22223@redhat.com>
     [not found]             ` <CA+55aFyp2GUbc218eb5BDyAqz+cOPTJMVcb4ptk+f1_RBB+-mg@mail.gmail.com>
     [not found]               ` <20120607011915.GA17566@redhat.com>
     [not found]                 ` <20120607012900.GE30000@ZenIV.linux.org.uk>
2012-06-07 19:36                   ` processes hung after sys_renameat, and 'missing' processes Al Viro
2012-06-07 20:43                     ` Sage Weil
2012-06-07 23:12                     ` Eric W. Biederman
2012-06-07 23:39                       ` Al Viro
2012-06-07 23:57                       ` Linus Torvalds
2012-06-08  0:36                         ` Al Viro
2012-06-08  0:42                           ` Linus Torvalds
2012-06-08  0:59                           ` Al Viro
2012-06-08  5:25                             ` Eric W. Biederman
2012-06-08  5:48                               ` Al Viro
2012-06-08  7:54                                 ` Eric W. Biederman
2012-06-08 20:20                                   ` Al Viro
2012-06-08  2:08                           ` Eric W. Biederman
2012-06-08  2:37                             ` Al Viro
2012-06-08  2:18                     ` Al Viro
2012-06-08 16:22                     ` J. Bruce Fields
2012-06-08 17:44                       ` Linus Torvalds
2012-06-11 12:17                         ` J. Bruce Fields

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).