All of lore.kernel.org
 help / color / mirror / Atom feed
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Neil Brown <neilb@suse.de>
Cc: linux-nfs@vger.kernel.org, chucklever@gmail.com,
	Christoph Hellwig <hch@infradead.org>,
	linux-mtd@lists.infradead.org, viro@zeniv.linux.org.uk,
	linux-fsdevel@vger.kernel.org,
	David Woodhouse <dwmw2@infradead.org>
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.
Date: Mon, 4 Aug 2008 14:41:15 -0400	[thread overview]
Message-ID: <20080804184115.GG25940@fieldses.org> (raw)
In-Reply-To: <18582.21855.2092.903688@notabene.brown>

On Mon, Aug 04, 2008 at 11:03:27AM +1000, Neil Brown wrote:
> The locking bothers me.
> The VFS seems to have a general policy of doing the locking itself to
> make life easier for filesystem implementors (or to make it harder for
> them to mess up, depending on your perspective).
> 
> The current issue seems to suggest that the locking provided by the
> VFS is no longer adequate, so each filesystem is needing to create
> something itself.  That suggests to me a weakness in the model.
> Possibly the VFS should give up trying to be in control and let
> filesystems do their own locking.  Possibly there are still things
> that the VFS can do which are universally good.  I think these are
> questions that should be addressed.  
> Maybe they have already been addressed and I missed the conversation
> (that wouldn't surprise me much).  But seeing words like "hack"
> suggests to me that it hasn't.  So I want to make sure I understand
> the problem properly and deeply before giving my blessing to a hack.
> 
> So: what are the issues?
> 
>  Obviously readdir can race with create and you don't want them
>  tripping each other up.  The current VFS approach to this is i_mutex.
>  Any place which modifies a directory or does a lookup in a directory
>  takes i_mutex to ensure that the directory doesn't change.
> 
>  This is probably fairly limiting.  With a tree-structured directory
>  you only really need to lock the 'current node' of the tree.
>  So any access would lock the top node, find which child to follow,
>  then lock the child and unlock the parent.  Rebalancing might need to
>  be creative as you cannot lock a parent while holding a lock on the
>  child, but that isn't insurmountable.
>  So I could argue that holding i_mutex across a lookup or create or
>  readdir maybe isn't ideal.  It would be interesting to explore the
>  possibility of dropping i_mutex across calls into the filesystem.
>  By the time the filesystem is called, we really only need to be
>  locking the names (dentries) rather than the whole directory.
>  More on this later...
> 
>  Some filesystems want to restructure directories and times that are
>  independent of any particular access.  This might be defragmentation
>  or rebalancing or whatever.  Clearly there needs to be mutual
>  exclusion between this and other accesses such as readdir and lookup.
>  The VFS clearly cannot help with this.  It doesn't know when
>  rebalancing might happen or are what sort of granularity.  So the
>  filesystem must do it's own locking.
>  It strikes me that this sort of locking would best be done by
>  spinlocks at the block level rather than a per-inode mutex.  However
>  I haven't actually implemented this (yet) so I cannot be certain.
> 
>  This is what is causing the current problem (for JFFS2 at least).
>  JFF2 has a per-inode lock which protects against internally visible
>  changes to the inode.  Further (and this is key) this lock is held
>  across the filldir callback.
>  i_mutex is also held across the filldir callback, but there is an
>  important difference.  It is taken by the VFS, not the filesystem,
>  and it is guaranteed always to be held across the filldir callback.
>  So the filldir callback can call e.g. lookup without further locking.
> 
>  Backing up a little:  given that the filesystem implementor chose to
>  use per-inode locking to protect internal restructuring (which is
>  certainly an easier option), why not just use i_mutex?  The reason
>  is that a create operation might trigger system-wide garbage
>  collection which could trigger restructuring of the current inode,
>  which would lead to an A-A deadlock (as the create is waiting for the
>  garbage collection, and so still holding i_mutex).
> 
> So, given that background, it is possible to see some more possible
> solutions (other than the ones already mentioned).
> 
>  - drop the internal lock across filldir.
>    It could be seen a impolite to hold any locks across a callback
>    that are not documented as being held.
>    This would put an extra burden on the filesystem, but it shouldn't
>    be a particularly big burden.
>    A filesystem needs to be able to find the 'next' entry from a
>    sequential 'seek' pointer so that is the most state that needs to
>    be preserved.  It might be convenient to be able to keep more state
>    (pointers into data structures etc).  All this can be achieved with
>    fairly standard practice:
>      a/ have a 'version' number per inode which is updated when any
>         internal restructure happens.
>      b/ when calling filldir, record the version number, drop the lock
>         call filldir, reclaim the lock, check the version number
>      c/ if the version number matches (as it mostly will) just keep
>         going.  If it doesn't jump back to the top of readdir where
>         we decode the current seek address.
> 
>    Some filesystems have problems implementing seekdir/telldir so they
>    might not be totally happy here.  I have little sympathy for such
>    filesystems and feel the burden should be on them to make it work.
> 
>  - use i_mutex to protect internal changes too, and drop i_mutex while
>    doing internal restructuring.   This would need some VFS changes so
>    that dropping i_mutex would be safe.  It would require some way to
>    lock an individual dentry.  Probably you would lock it under
>    i_mutex by setting a flag bit, wait for the flag on some inode-wide
>    waitqueue, and drop the lock by clearing the flag and waking the
>    waitqueue. And you are never allowed to look at ->d_inode if the
>    lock flag is set.

Isn't there a lot of kernel code that assumes that following ->d_inode
is safe?  Or are you only worried about looking at certain
filesystem-internal fields of d_inode?

The locking required to keep the filesystem namespace consistent is
difficult and important, so I think changing it would require an
extremely careful description of the new design and a commitment from
some filesystem developers to actually take advantage of it....

--b.

> Of these I really like the second.  Refining the i_mutex down to a
> per-name lock before calling in to the filesystem seems like a really
> good idea and should be good for scalability and large directories.
> It isn't something to be done lightly though.  The filesystem would
> still be given i_mutex held, but would be allowed to drop it if it
> wanted to.  We could have an FS_DROPS_I_MUTEX similar to the current
> FS_RENAME_DOES_D_MOVE.
> 
> For the first, I really like the idea that a lock should not be held
> across calls the filldir.  I feel that a filesystem doing that is
> "wrong" in much the same way that some think that recursing into the
> filesystem as nfsd does is "wrong".  In reality neither is "wrong", we
> just need to work together and negotiate and work out the best way to
> meet all of our needs.
> 
> So what should we do now?  I think that for JFFS2 to drop and reclaim
> f->sem in readdir would be all of 20 lines of code, including updating
> a ->version counter elsewhere in the code.  Replicating that in all
> the filesystems that need it would probably be more code than the nfsd
> change though.
> 
> On the other hand, if we implement the nfsd change, it will almost
> certainly never go away, even if all filesystems eventually find that
> they don't need it any more because someone improves the locking
> rules in the VFS.  Where as the code in the filesystems could quite
> possibly go away when they are revised to make better use of the
> locking regime.  So I don't think that is an ideal situation either.
> 
> If I had time, I would work on modifying the VFS to allow filesystems
> to drop i_mutex.  However I don't have time at the moment, so I'll
> leave the decision to be made by someone else  (Hi Bruce!  I'll
> support whatever you decide).
> 
> But I think it is important to understand what is really going on and
> not just implement a hack that works around the problem.  I think I do
> understand now, so I am a lot happier.  Hopefully you do too.

WARNING: multiple messages have this Message-ID (diff)
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Neil Brown <neilb@suse.de>
Cc: chucklever@gmail.com, David Woodhouse <dwmw2@infradead.org>,
	viro@zeniv.linux.org.uk, Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-mtd@lists.infradead.org
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.
Date: Mon, 4 Aug 2008 14:41:15 -0400	[thread overview]
Message-ID: <20080804184115.GG25940@fieldses.org> (raw)
In-Reply-To: <18582.21855.2092.903688@notabene.brown>

On Mon, Aug 04, 2008 at 11:03:27AM +1000, Neil Brown wrote:
> The locking bothers me.
> The VFS seems to have a general policy of doing the locking itself to
> make life easier for filesystem implementors (or to make it harder for
> them to mess up, depending on your perspective).
> 
> The current issue seems to suggest that the locking provided by the
> VFS is no longer adequate, so each filesystem is needing to create
> something itself.  That suggests to me a weakness in the model.
> Possibly the VFS should give up trying to be in control and let
> filesystems do their own locking.  Possibly there are still things
> that the VFS can do which are universally good.  I think these are
> questions that should be addressed.  
> Maybe they have already been addressed and I missed the conversation
> (that wouldn't surprise me much).  But seeing words like "hack"
> suggests to me that it hasn't.  So I want to make sure I understand
> the problem properly and deeply before giving my blessing to a hack.
> 
> So: what are the issues?
> 
>  Obviously readdir can race with create and you don't want them
>  tripping each other up.  The current VFS approach to this is i_mutex.
>  Any place which modifies a directory or does a lookup in a directory
>  takes i_mutex to ensure that the directory doesn't change.
> 
>  This is probably fairly limiting.  With a tree-structured directory
>  you only really need to lock the 'current node' of the tree.
>  So any access would lock the top node, find which child to follow,
>  then lock the child and unlock the parent.  Rebalancing might need to
>  be creative as you cannot lock a parent while holding a lock on the
>  child, but that isn't insurmountable.
>  So I could argue that holding i_mutex across a lookup or create or
>  readdir maybe isn't ideal.  It would be interesting to explore the
>  possibility of dropping i_mutex across calls into the filesystem.
>  By the time the filesystem is called, we really only need to be
>  locking the names (dentries) rather than the whole directory.
>  More on this later...
> 
>  Some filesystems want to restructure directories and times that are
>  independent of any particular access.  This might be defragmentation
>  or rebalancing or whatever.  Clearly there needs to be mutual
>  exclusion between this and other accesses such as readdir and lookup.
>  The VFS clearly cannot help with this.  It doesn't know when
>  rebalancing might happen or are what sort of granularity.  So the
>  filesystem must do it's own locking.
>  It strikes me that this sort of locking would best be done by
>  spinlocks at the block level rather than a per-inode mutex.  However
>  I haven't actually implemented this (yet) so I cannot be certain.
> 
>  This is what is causing the current problem (for JFFS2 at least).
>  JFF2 has a per-inode lock which protects against internally visible
>  changes to the inode.  Further (and this is key) this lock is held
>  across the filldir callback.
>  i_mutex is also held across the filldir callback, but there is an
>  important difference.  It is taken by the VFS, not the filesystem,
>  and it is guaranteed always to be held across the filldir callback.
>  So the filldir callback can call e.g. lookup without further locking.
> 
>  Backing up a little:  given that the filesystem implementor chose to
>  use per-inode locking to protect internal restructuring (which is
>  certainly an easier option), why not just use i_mutex?  The reason
>  is that a create operation might trigger system-wide garbage
>  collection which could trigger restructuring of the current inode,
>  which would lead to an A-A deadlock (as the create is waiting for the
>  garbage collection, and so still holding i_mutex).
> 
> So, given that background, it is possible to see some more possible
> solutions (other than the ones already mentioned).
> 
>  - drop the internal lock across filldir.
>    It could be seen a impolite to hold any locks across a callback
>    that are not documented as being held.
>    This would put an extra burden on the filesystem, but it shouldn't
>    be a particularly big burden.
>    A filesystem needs to be able to find the 'next' entry from a
>    sequential 'seek' pointer so that is the most state that needs to
>    be preserved.  It might be convenient to be able to keep more state
>    (pointers into data structures etc).  All this can be achieved with
>    fairly standard practice:
>      a/ have a 'version' number per inode which is updated when any
>         internal restructure happens.
>      b/ when calling filldir, record the version number, drop the lock
>         call filldir, reclaim the lock, check the version number
>      c/ if the version number matches (as it mostly will) just keep
>         going.  If it doesn't jump back to the top of readdir where
>         we decode the current seek address.
> 
>    Some filesystems have problems implementing seekdir/telldir so they
>    might not be totally happy here.  I have little sympathy for such
>    filesystems and feel the burden should be on them to make it work.
> 
>  - use i_mutex to protect internal changes too, and drop i_mutex while
>    doing internal restructuring.   This would need some VFS changes so
>    that dropping i_mutex would be safe.  It would require some way to
>    lock an individual dentry.  Probably you would lock it under
>    i_mutex by setting a flag bit, wait for the flag on some inode-wide
>    waitqueue, and drop the lock by clearing the flag and waking the
>    waitqueue. And you are never allowed to look at ->d_inode if the
>    lock flag is set.

Isn't there a lot of kernel code that assumes that following ->d_inode
is safe?  Or are you only worried about looking at certain
filesystem-internal fields of d_inode?

The locking required to keep the filesystem namespace consistent is
difficult and important, so I think changing it would require an
extremely careful description of the new design and a commitment from
some filesystem developers to actually take advantage of it....

--b.

> Of these I really like the second.  Refining the i_mutex down to a
> per-name lock before calling in to the filesystem seems like a really
> good idea and should be good for scalability and large directories.
> It isn't something to be done lightly though.  The filesystem would
> still be given i_mutex held, but would be allowed to drop it if it
> wanted to.  We could have an FS_DROPS_I_MUTEX similar to the current
> FS_RENAME_DOES_D_MOVE.
> 
> For the first, I really like the idea that a lock should not be held
> across calls the filldir.  I feel that a filesystem doing that is
> "wrong" in much the same way that some think that recursing into the
> filesystem as nfsd does is "wrong".  In reality neither is "wrong", we
> just need to work together and negotiate and work out the best way to
> meet all of our needs.
> 
> So what should we do now?  I think that for JFFS2 to drop and reclaim
> f->sem in readdir would be all of 20 lines of code, including updating
> a ->version counter elsewhere in the code.  Replicating that in all
> the filesystems that need it would probably be more code than the nfsd
> change though.
> 
> On the other hand, if we implement the nfsd change, it will almost
> certainly never go away, even if all filesystems eventually find that
> they don't need it any more because someone improves the locking
> rules in the VFS.  Where as the code in the filesystems could quite
> possibly go away when they are revised to make better use of the
> locking regime.  So I don't think that is an ideal situation either.
> 
> If I had time, I would work on modifying the VFS to allow filesystems
> to drop i_mutex.  However I don't have time at the moment, so I'll
> leave the decision to be made by someone else  (Hi Bruce!  I'll
> support whatever you decide).
> 
> But I think it is important to understand what is really going on and
> not just implement a hack that works around the problem.  I think I do
> understand now, so I am a lot happier.  Hopefully you do too.

  parent reply	other threads:[~2008-08-04 18:41 UTC|newest]

Thread overview: 128+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-01 19:42 [RFC] Reinstate NFS exportability for JFFS2 David Woodhouse
2008-05-01 20:48 ` Christoph Hellwig
2008-05-01 22:44   ` David Woodhouse
2008-05-02  1:38     ` Neil Brown
2008-05-02  1:38       ` Neil Brown
2008-05-02 11:37       ` David Woodhouse
2008-05-02 11:37         ` David Woodhouse
2008-05-02 14:08         ` J. Bruce Fields
2008-05-02 14:08           ` J. Bruce Fields
2008-05-02 14:08           ` J. Bruce Fields
2008-07-31 21:54       ` David Woodhouse
2008-08-01  0:16         ` Neil Brown
2008-08-01  0:40           ` David Woodhouse
2008-08-01  0:40             ` David Woodhouse
2008-08-01  0:40             ` David Woodhouse
2008-08-01  0:52             ` David Woodhouse
2008-08-01  0:52               ` David Woodhouse
2008-08-01  0:52               ` David Woodhouse
2008-08-01  0:53             ` Chuck Lever
2008-08-01  0:53               ` Chuck Lever
2008-08-01  1:00               ` David Woodhouse
2008-08-01  1:00                 ` David Woodhouse
2008-08-01  1:00                 ` David Woodhouse
2008-08-01  1:31                 ` Chuck Lever
2008-08-01  1:31                   ` Chuck Lever
2008-08-01  1:31                   ` Chuck Lever
2008-08-01  8:13                   ` David Woodhouse
2008-08-01  8:13                     ` David Woodhouse
2008-08-01 13:35                   ` David Woodhouse
2008-08-01 13:35                     ` David Woodhouse
2008-08-01 13:56                     ` David Woodhouse
2008-08-01 13:56                       ` David Woodhouse
2008-08-01 13:56                       ` David Woodhouse
2008-08-01 16:05                       ` Chuck Lever
2008-08-01 16:05                         ` Chuck Lever
2008-08-01 16:19                         ` David Woodhouse
2008-08-01 16:19                           ` David Woodhouse
2008-08-01 17:47                           ` Chuck Lever
2008-08-01 17:47                             ` Chuck Lever
2008-08-02 18:26                             ` J. Bruce Fields
2008-08-02 18:26                               ` J. Bruce Fields
2008-08-02 20:42                               ` David Woodhouse
2008-08-02 20:42                                 ` David Woodhouse
2008-08-02 20:42                                 ` David Woodhouse
2008-08-02 21:33                                 ` J. Bruce Fields
2008-08-02 21:33                                   ` J. Bruce Fields
2008-08-03  8:39                                   ` David Woodhouse
2008-08-03  8:39                                     ` David Woodhouse
2008-08-03  8:39                                     ` David Woodhouse
2008-08-03 11:56                               ` Neil Brown
2008-08-03 11:56                                 ` Neil Brown
2008-08-03 11:56                                 ` Neil Brown
2008-08-03 17:15                                 ` Chuck Lever
2008-08-03 17:15                                   ` Chuck Lever
2008-08-04  1:03                                   ` Neil Brown
2008-08-04  1:03                                     ` Neil Brown
2008-08-04  6:19                                     ` Chuck Lever
2008-08-04  6:19                                       ` Chuck Lever
2008-08-04  6:19                                       ` Chuck Lever
2008-08-05  8:51                                       ` Dave Chinner
2008-08-05  8:51                                         ` Dave Chinner
2008-08-05  8:59                                         ` David Woodhouse
2008-08-05  8:59                                           ` David Woodhouse
2008-08-05  9:47                                           ` Dave Chinner
2008-08-05  9:47                                             ` Dave Chinner
2008-08-05 23:06                                         ` Neil Brown
2008-08-05 23:06                                           ` Neil Brown
2008-08-05 23:06                                           ` Neil Brown
2008-08-06  0:08                                           ` Dave Chinner
2008-08-06  0:08                                             ` Dave Chinner
2008-08-06  0:08                                             ` Dave Chinner
2008-08-06 19:56                                             ` J. Bruce Fields
2008-08-06 19:56                                             ` J. Bruce Fields
2008-08-06 19:56                                               ` J. Bruce Fields
2008-08-06 20:10                                               ` David Woodhouse
2008-08-06 20:10                                                 ` David Woodhouse
2008-08-09 16:47                                                 ` David Woodhouse
2008-08-09 16:47                                                   ` David Woodhouse
2008-08-09 16:47                                                   ` David Woodhouse
2008-08-09 19:55                                                   ` David Woodhouse
2008-08-09 19:55                                                     ` David Woodhouse
2008-08-09 20:01                                                     ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse
2008-08-09 20:01                                                       ` David Woodhouse
2008-08-09 20:01                                                       ` David Woodhouse
2008-08-09 20:07                                                       ` Christoph Hellwig
2008-08-09 20:07                                                         ` Christoph Hellwig
2008-08-09 20:07                                                         ` Christoph Hellwig
2008-08-09 20:02                                                     ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse
2008-08-09 20:02                                                       ` David Woodhouse
2008-08-09 20:02                                                       ` David Woodhouse
2008-08-09 20:08                                                       ` Christoph Hellwig
2008-08-09 20:08                                                         ` Christoph Hellwig
2008-08-09 20:03                                                     ` [PATCH 3/4] Remove XFS buffered readdir hack David Woodhouse
2008-08-09 20:03                                                       ` David Woodhouse
2008-08-09 20:03                                                       ` David Woodhouse
2008-08-09 20:09                                                       ` Christoph Hellwig
2008-08-09 20:09                                                         ` Christoph Hellwig
2008-08-09 20:09                                                         ` Christoph Hellwig
2008-08-09 20:03                                                     ` [PATCH 4/4] Reinstate NFS exportability David Woodhouse
2008-08-09 20:03                                                       ` David Woodhouse
2008-08-09 20:03                                                       ` David Woodhouse
2008-08-09 20:10                                                       ` Christoph Hellwig
2008-08-09 20:10                                                         ` Christoph Hellwig
2008-08-09 20:10                                                         ` Christoph Hellwig
2008-08-04 18:41                                     ` J. Bruce Fields [this message]
2008-08-04 18:41                                       ` [RFC] Reinstate NFS exportability for JFFS2 J. Bruce Fields
2008-08-04 22:37                                       ` Neil Brown
2008-08-04 22:37                                         ` Neil Brown
2008-08-17 18:22                                     ` Andreas Dilger
2008-08-17 18:22                                       ` Andreas Dilger
2008-08-17 18:22                                       ` Andreas Dilger
2008-08-01  2:14             ` Neil Brown
2008-08-01  2:14               ` Neil Brown
2008-08-01  8:50               ` David Woodhouse
2008-08-01  8:50                 ` David Woodhouse
2008-08-01  8:50                 ` David Woodhouse
2008-08-01 10:03               ` Al Viro
2008-08-01 10:03                 ` Al Viro
2008-08-01 10:03                 ` Al Viro
2008-08-01 23:11                 ` Neil Brown
2008-08-01 23:11                   ` Neil Brown
2008-08-01 23:11                   ` Neil Brown
2008-07-31 21:54       ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse
2008-07-31 21:54       ` [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag David Woodhouse
2008-07-31 21:55       ` [PATCH 3/4] Switch XFS to using FS_NO_LOOKUP_IN_READDIR, remove local readdir hack David Woodhouse
2008-07-31 21:55       ` [PATCH 4/4] [JFFS2] Reinstate NFS exportability David Woodhouse
2008-07-31 21:55         ` David Woodhouse
2008-07-31 21:55         ` David Woodhouse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080804184115.GG25940@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chucklever@gmail.com \
    --cc=dwmw2@infradead.org \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mtd@lists.infradead.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.