linux-fsdevel.vger.kernel.org archive mirror
* [git pull] vfs.git - including i_mutex wrappers
@ 2016-01-23 14:58 Al Viro
  2016-01-23 22:34 ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2016-01-23 14:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

	->i_mutex wrappers (with small prereq in lustre), fix for too
early freeing of symlink bodies on shmem (they need to be RCU-delayed)
(-stable fodder) and followup to dedupe stuff merged this cycle.
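
For reference, the wrappers themselves are thin inline helpers in
include/linux/fs.h; roughly this shape (a sketch of the helpers, not a
verbatim quote of the patch):

	static inline void inode_lock(struct inode *inode)
	{
		mutex_lock(&inode->i_mutex);
	}

	static inline void inode_unlock(struct inode *inode)
	{
		mutex_unlock(&inode->i_mutex);
	}

	static inline int inode_trylock(struct inode *inode)
	{
		return mutex_trylock(&inode->i_mutex);
	}

Callers switch from open-coded mutex_lock(&inode->i_mutex) to
inode_lock(inode), so changing the lock's type later only has to touch
the helpers.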

The following changes since commit 2101ae42899a14fe7caa73114e2161e778328661:

  Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs (2016-01-22 11:49:21 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

for you to fetch changes up to e62e560fc8b65ef5debc9041a792d18a8f98b2ca:

  vfs: abort dedupe loop if fatal signals are pending (2016-01-22 20:29:55 -0500)

----------------------------------------------------------------
Al Viro (3):
      lustre: remove unused declaration
      wrappers for ->i_mutex access
      make sure that freeing shmem fast symlinks is RCU-delayed

Darrick J. Wong (1):
      vfs: abort dedupe loop if fatal signals are pending
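
The dedupe followup is the usual pattern for long loops over
user-supplied ranges: check for fatal signals between iterations so the
task can die promptly.  Schematically (a sketch, not the literal hunk -
how the actual patch unwinds per-range state may differ):

	for (i = 0; i < count; i++) {
		/* ... dedupe one user-supplied range ... */

		if (fatal_signal_pending(current)) {
			ret = -EINTR;
			break;
		}
	}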

 arch/powerpc/platforms/cell/spufs/file.c           |  4 +-
 arch/powerpc/platforms/cell/spufs/inode.c          | 12 ++--
 arch/s390/hypfs/inode.c                            |  8 +--
 block/ioctl.c                                      |  4 +-
 drivers/base/devtmpfs.c                            | 12 ++--
 drivers/block/aoe/aoecmd.c                         |  4 +-
 drivers/block/drbd/drbd_debugfs.c                  |  4 +-
 drivers/char/mem.c                                 |  4 +-
 drivers/char/ps3flash.c                            |  4 +-
 drivers/infiniband/hw/qib/qib_fs.c                 | 12 ++--
 drivers/mtd/ubi/cdev.c                             |  4 +-
 drivers/oprofile/oprofilefs.c                      | 16 ++---
 drivers/staging/lustre/lustre/llite/dir.c          |  4 +-
 drivers/staging/lustre/lustre/llite/file.c         | 16 ++---
 .../staging/lustre/lustre/llite/llite_internal.h   |  2 -
 drivers/staging/lustre/lustre/llite/llite_lib.c    |  4 +-
 drivers/staging/lustre/lustre/llite/llite_nfs.c    |  4 +-
 drivers/staging/lustre/lustre/llite/lloop.c        |  4 +-
 drivers/staging/lustre/lustre/llite/rw.c           |  4 +-
 drivers/staging/lustre/lustre/llite/rw26.c         |  4 +-
 drivers/staging/lustre/lustre/llite/vvp_io.c       |  4 +-
 drivers/staging/lustre/lustre/llite/vvp_page.c     | 10 +--
 drivers/staging/rdma/ipath/ipath_fs.c              |  8 +--
 drivers/usb/gadget/function/f_printer.c            |  4 +-
 drivers/usb/gadget/legacy/inode.c                  |  4 +-
 drivers/usb/gadget/udc/atmel_usba_udc.c            | 12 ++--
 drivers/video/fbdev/core/fb_defio.c                |  4 +-
 fs/9p/vfs_file.c                                   |  8 +--
 fs/affs/file.c                                     |  8 +--
 fs/afs/flock.c                                     |  4 +-
 fs/afs/write.c                                     |  4 +-
 fs/attr.c                                          |  2 +-
 fs/binfmt_misc.c                                   | 12 ++--
 fs/block_dev.c                                     | 20 +++---
 fs/btrfs/file.c                                    | 42 ++++++------
 fs/btrfs/inode.c                                   |  4 +-
 fs/btrfs/ioctl.c                                   | 38 +++++------
 fs/btrfs/relocation.c                              |  4 +-
 fs/btrfs/scrub.c                                   |  4 +-
 fs/btrfs/xattr.c                                   |  2 +-
 fs/cachefiles/interface.c                          |  4 +-
 fs/cachefiles/namei.c                              | 40 ++++++------
 fs/ceph/cache.c                                    |  4 +-
 fs/ceph/caps.c                                     |  4 +-
 fs/ceph/dir.c                                      |  4 +-
 fs/ceph/export.c                                   |  4 +-
 fs/ceph/file.c                                     | 18 +++---
 fs/cifs/cifsfs.c                                   |  4 +-
 fs/cifs/file.c                                     | 12 ++--
 fs/coda/dir.c                                      |  4 +-
 fs/coda/file.c                                     |  8 +--
 fs/configfs/dir.c                                  | 58 ++++++++---------
 fs/configfs/file.c                                 |  8 +--
 fs/configfs/inode.c                                |  4 +-
 fs/dax.c                                           |  6 +-
 fs/dcache.c                                        |  4 +-
 fs/debugfs/inode.c                                 | 22 +++----
 fs/devpts/inode.c                                  | 12 ++--
 fs/direct-io.c                                     |  8 +--
 fs/ecryptfs/inode.c                                | 32 +++++-----
 fs/ecryptfs/mmap.c                                 |  4 +-
 fs/efivarfs/file.c                                 |  4 +-
 fs/efivarfs/super.c                                |  4 +-
 fs/exec.c                                          |  4 +-
 fs/exofs/file.c                                    |  4 +-
 fs/exportfs/expfs.c                                | 12 ++--
 fs/ext2/ioctl.c                                    | 12 ++--
 fs/ext4/ext4.h                                     |  2 +-
 fs/ext4/extents.c                                  | 20 +++---
 fs/ext4/file.c                                     | 18 +++---
 fs/ext4/inode.c                                    | 12 ++--
 fs/ext4/ioctl.c                                    | 20 +++---
 fs/ext4/namei.c                                    |  4 +-
 fs/ext4/super.c                                    |  4 +-
 fs/f2fs/data.c                                     |  4 +-
 fs/f2fs/file.c                                     | 20 +++---
 fs/fat/dir.c                                       |  4 +-
 fs/fat/file.c                                      | 12 ++--
 fs/fuse/dir.c                                      | 10 +--
 fs/fuse/file.c                                     | 36 +++++------
 fs/gfs2/file.c                                     |  4 +-
 fs/gfs2/inode.c                                    |  4 +-
 fs/gfs2/quota.c                                    |  8 +--
 fs/hfs/dir.c                                       |  4 +-
 fs/hfs/inode.c                                     |  8 +--
 fs/hfsplus/dir.c                                   |  4 +-
 fs/hfsplus/inode.c                                 |  8 +--
 fs/hfsplus/ioctl.c                                 |  4 +-
 fs/hostfs/hostfs_kern.c                            |  4 +-
 fs/hpfs/dir.c                                      |  6 +-
 fs/hugetlbfs/inode.c                               | 12 ++--
 fs/inode.c                                         |  8 +--
 fs/ioctl.c                                         |  4 +-
 fs/jffs2/file.c                                    |  4 +-
 fs/jfs/file.c                                      |  6 +-
 fs/jfs/ioctl.c                                     |  6 +-
 fs/jfs/super.c                                     |  6 +-
 fs/kernfs/dir.c                                    |  4 +-
 fs/libfs.c                                         | 10 +--
 fs/locks.c                                         |  6 +-
 fs/logfs/file.c                                    |  8 +--
 fs/namei.c                                         | 74 +++++++++++-----------
 fs/namespace.c                                     | 10 +--
 fs/ncpfs/dir.c                                     |  8 +--
 fs/ncpfs/file.c                                    |  4 +-
 fs/nfs/dir.c                                       |  8 +--
 fs/nfs/direct.c                                    | 12 ++--
 fs/nfs/file.c                                      |  4 +-
 fs/nfs/inode.c                                     |  8 +--
 fs/nfs/nfs42proc.c                                 |  8 +--
 fs/nfs/nfs4file.c                                  | 24 +++----
 fs/nfsd/nfs4proc.c                                 |  4 +-
 fs/nfsd/nfs4recover.c                              | 12 ++--
 fs/nfsd/nfsfh.h                                    |  4 +-
 fs/nfsd/vfs.c                                      |  4 +-
 fs/nilfs2/inode.c                                  |  4 +-
 fs/nilfs2/ioctl.c                                  |  4 +-
 fs/ntfs/dir.c                                      |  4 +-
 fs/ntfs/file.c                                     |  8 +--
 fs/ntfs/quota.c                                    |  6 +-
 fs/ntfs/super.c                                    | 12 ++--
 fs/ocfs2/alloc.c                                   | 32 +++++-----
 fs/ocfs2/aops.c                                    |  4 +-
 fs/ocfs2/dir.c                                     |  4 +-
 fs/ocfs2/file.c                                    | 12 ++--
 fs/ocfs2/inode.c                                   | 12 ++--
 fs/ocfs2/ioctl.c                                   | 12 ++--
 fs/ocfs2/journal.c                                 |  8 +--
 fs/ocfs2/localalloc.c                              | 16 ++---
 fs/ocfs2/move_extents.c                            | 16 ++---
 fs/ocfs2/namei.c                                   | 28 ++++----
 fs/ocfs2/quota_global.c                            |  4 +-
 fs/ocfs2/refcounttree.c                            | 12 ++--
 fs/ocfs2/resize.c                                  |  8 +--
 fs/ocfs2/suballoc.c                                | 12 ++--
 fs/ocfs2/xattr.c                                   | 14 ++--
 fs/open.c                                          | 12 ++--
 fs/overlayfs/copy_up.c                             |  4 +-
 fs/overlayfs/dir.c                                 | 12 ++--
 fs/overlayfs/inode.c                               |  4 +-
 fs/overlayfs/readdir.c                             | 20 +++---
 fs/overlayfs/super.c                               | 14 ++--
 fs/proc/kcore.c                                    |  4 +-
 fs/proc/self.c                                     |  4 +-
 fs/proc/thread_self.c                              |  4 +-
 fs/pstore/inode.c                                  |  6 +-
 fs/quota/dquot.c                                   | 20 +++---
 fs/read_write.c                                    |  7 +-
 fs/readdir.c                                       |  2 +-
 fs/reiserfs/dir.c                                  |  4 +-
 fs/reiserfs/file.c                                 |  4 +-
 fs/reiserfs/ioctl.c                                |  2 +-
 fs/reiserfs/xattr.c                                | 64 +++++++++----------
 fs/tracefs/inode.c                                 | 34 +++++-----
 fs/ubifs/dir.c                                     | 18 +++---
 fs/ubifs/file.c                                    |  4 +-
 fs/ubifs/xattr.c                                   |  4 +-
 fs/udf/file.c                                      | 10 +--
 fs/udf/inode.c                                     |  2 +-
 fs/utimes.c                                        |  4 +-
 fs/xattr.c                                         |  8 +--
 fs/xfs/xfs_file.c                                  |  6 +-
 fs/xfs/xfs_pnfs.c                                  |  4 +-
 include/linux/fs.h                                 | 29 ++++++++-
 include/linux/shmem_fs.h                           |  5 +-
 ipc/mqueue.c                                       |  8 +--
 kernel/audit_fsnotify.c                            |  2 +-
 kernel/audit_watch.c                               |  2 +-
 kernel/events/core.c                               |  4 +-
 kernel/relay.c                                     |  4 +-
 kernel/sched/core.c                                |  4 +-
 mm/filemap.c                                       |  4 +-
 mm/shmem.c                                         | 21 +++---
 mm/swapfile.c                                      | 12 ++--
 net/sunrpc/cache.c                                 | 10 +--
 net/sunrpc/rpc_pipe.c                              | 60 +++++++++---------
 security/inode.c                                   | 10 +--
 security/integrity/ima/ima_main.c                  |  8 +--
 security/selinux/selinuxfs.c                       |  4 +-
 179 files changed, 916 insertions(+), 894 deletions(-)

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-23 14:58 [git pull] vfs.git - including i_mutex wrappers Al Viro
@ 2016-01-23 22:34 ` Dave Chinner
  2016-01-23 22:44   ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-01-23 22:34 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sat, Jan 23, 2016 at 02:58:54PM +0000, Al Viro wrote:
> 	->i_mutex wrappers (with small prereq in lustre), fix for too

Please explain, Al?

I haven't heard anything about there being i_mutex changes pending,
and this commit says "over the coming cycle ->i_mutex will become
rwsem".  That's a complete surprise to me, and not something that
should be done with no warning.

What's the locking model? How are filesystems supposed to use it?
Are they even allowed to use read-mode locking, and if so, what
operations is it going to be safe to hold the lock in read mode?

Why is this change considered valid now, when previously there's
always been significant push-back to any suggestion that we should
make the i_mutex a rwsem so we can do shared read-only access
locking on inode operations?


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-23 22:34 ` Dave Chinner
@ 2016-01-23 22:44   ` Dave Chinner
  2016-01-23 23:09     ` Al Viro
  2016-01-23 23:48     ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2016-01-23 22:44 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sun, Jan 24, 2016 at 09:34:56AM +1100, Dave Chinner wrote:
> On Sat, Jan 23, 2016 at 02:58:54PM +0000, Al Viro wrote:
> > 	->i_mutex wrappers (with small prereq in lustre), fix for too
> 
> Please explain, Al?
> 
> I haven't heard anything about there being i_mutex changes pending,
> and this commit says "over the coming cycle ->i_mutex will become
> rwsem".  That's a complete surprise to me, and not something that
> should be done with no warning.
> 
> What's the locking model? How are filesystems supposed to use it?
> Are they even allowed to use read-mode locking, and if so, what
> operations is it going to be safe to hold the lock in read mode?
> 
> Why is this change considered valid now, when previously there's
> always been significant push-back to any suggestion that we should
> make the i_mutex a rwsem so we can do shared read-only access
> locking on inode operations?

FWIW, I'm not opposed to making such a locking change - I'm more
concerned about the fact I'm finding out about plans for such a
fundamental locking change from a pull request on the last day of a
merge window....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-23 22:44   ` Dave Chinner
@ 2016-01-23 23:09     ` Al Viro
  2016-01-23 23:38       ` Al Viro
  2016-01-24  0:53       ` Dave Chinner
  2016-01-23 23:48     ` Linus Torvalds
  1 sibling, 2 replies; 13+ messages in thread
From: Al Viro @ 2016-01-23 23:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sun, Jan 24, 2016 at 09:44:35AM +1100, Dave Chinner wrote:

> FWIW, I'm not opposed to making such a locking change - I'm more
> concerned about the fact I'm finding out about plans for such a
> fundamental locking change from a pull request on the last day of a
> merge window....

Look at the commit message (*and* pull request posting) of an earlier vfs.git
pull request in the beginning of this window.  Or into the thread back in
May when it had been first proposed (and pretty much the same patch had been
generated and posted by Linus).  Changes needed for parallel ->lookup() had
been discussed; it was a side branch of one of the RCU symlink threads and
ISTR your own postings in it.

For filesystems it will be mostly transparent, except for the possibility of
parallel calls of ->lookup() on different names in the same directory.
Which XFS shouldn't give a fuck about, unless I'm seriously misreading your
code.

Basic scheme: have dentries under ->lookup() marked as such and inserted into
the hash (still negative, obviously) before calling ->lookup().  The method itself
is called with the ->i_mutex replacement taken shared; anyone running into such
a dentry in dcache lookup will wait (on the parent directory's ->i_mutex queue,
explicitly kicked once ->lookup() is done) and repeat the dcache lookup.  In
cases where the current code would've silently freed the ->lookup() argument (error
or "I've used an existing dentry") the thing will be unhashed and dropped,
without ever losing the "it's under lookup" flag.  Primitives like
d_splice_alias() would remove the flag in question.

Anyone running into such a sucker in RCU mode should treat it as "dcache miss,
need to fall back to non-lazy mode".  The flag is (as all dentry flags) protected
by ->d_lock.
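
In userspace terms that's the classic "mark busy, sleep on the parent's
queue, retry" pattern.  A minimal pthread model of the scheme (purely
illustrative - the names and data structures are made up, not the
eventual kernel code):

	/*
	 * Toy model: a negative "dentry" is hashed with an under-lookup
	 * flag before the slow lookup runs; concurrent lookups of the
	 * same name wait on the parent's queue, then redo the cache
	 * lookup.
	 */
	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	struct entry {
		const char *name;
		bool under_lookup;	/* the "it's under lookup" flag */
		struct entry *next;
	};

	struct dir {
		pthread_mutex_t lock;	/* stands in for hash/->d_lock locking */
		pthread_cond_t kicked;	/* the parent's "lookup done" queue */
		struct entry *hash;
	};

	static struct entry *find(struct dir *d, const char *name)
	{
		struct entry *e;

		for (e = d->hash; e; e = e->next)
			if (!strcmp(e->name, name))
				return e;
		return NULL;
	}

	static struct entry *lookup(struct dir *d, const char *name)
	{
		struct entry *e;

		pthread_mutex_lock(&d->lock);
	again:
		e = find(d, name);
		if (e && e->under_lookup) {
			/* someone else is in ->lookup(); wait, then repeat
			 * the cache lookup - the entry may have been unhashed */
			pthread_cond_wait(&d->kicked, &d->lock);
			goto again;
		}
		if (!e) {
			/* hash a negative under-lookup entry, then do the
			 * slow work without holding the hash lock */
			e = calloc(1, sizeof(*e));
			e->name = name;
			e->under_lookup = true;
			e->next = d->hash;
			d->hash = e;
			pthread_mutex_unlock(&d->lock);

			printf("slow lookup of \"%s\"\n", name);
			usleep(100000);		/* the actual ->lookup() work */

			pthread_mutex_lock(&d->lock);
			e->under_lookup = false;
			pthread_cond_broadcast(&d->kicked);	/* kick the queue */
		}
		pthread_mutex_unlock(&d->lock);
		return e;
	}

	static struct dir root = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.kicked = PTHREAD_COND_INITIALIZER,
	};

	static void *worker(void *name)
	{
		lookup(&root, name);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, worker, "etc");
		pthread_create(&b, NULL, worker, "etc");  /* same name: waits */
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}

Build with cc -pthread: only one "slow lookup" line is printed even
though both threads look up the same name, and neither ever sees the
half-constructed entry.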

If a filesystem simply wants to preserve the existing exclusion, it should
add a private per-inode mutex and take it in its ->lookup() instance; all
other methods will still get exclusion on ->i_mutex replacement.
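
I.e. something along these lines (a sketch - MYFS_I(), lookup_mutex and
myfs_do_lookup() are made-up names):

	static struct dentry *myfs_lookup(struct inode *dir,
					  struct dentry *dentry,
					  unsigned int flags)
	{
		struct dentry *res;

		/* restore the old per-directory exclusion locally */
		mutex_lock(&MYFS_I(dir)->lookup_mutex);
		res = myfs_do_lookup(dir, dentry, flags);
		mutex_unlock(&MYFS_I(dir)->lookup_mutex);
		return res;
	}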

There will be interesting prereqs, but for XFS it's a non-issue.  Now,
something like ceph or lustre... <shudder>  Again, for XFS (for any
normal Unix filesystems, really) no extra exclusion should be needed.

readdir() is another potential target for weaker exclusion (i.e. switching
it to taking that thing shared), but that's a separate story and I'd prefer
to deal with ->lookup() first.  There are potentially hairy issues around
the instances that pre-seed dcache and I don't want to mix them into the
initial series.

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-23 23:09     ` Al Viro
@ 2016-01-23 23:38       ` Al Viro
  2016-01-24  0:53       ` Dave Chinner
  1 sibling, 0 replies; 13+ messages in thread
From: Al Viro @ 2016-01-23 23:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sat, Jan 23, 2016 at 11:09:44PM +0000, Al Viro wrote:
> On Sun, Jan 24, 2016 at 09:44:35AM +1100, Dave Chinner wrote:
> 
> > FWIW, I'm not opposed to making such a locking change - I'm more
> > concerned about the fact I'm finding out about plans for such a
> > fundamental locking change from a pull request on the last day of a
> > merge window....
> 
> Look at the commit message (*and* pull request posting) of an earlier vfs.git
> pull request in the beginning of this window.  Or into the thread back in
> May when it had been first proposed (and pretty much the same patch had been
> generated and posted by Linus).  Changes needed for parallel ->lookup() had
> been discussed; it was a side branch of one of the RCU symlink threads and
> ISTR your own postings in it.

See http://www.spinics.net/lists/kernel/msg2160034.html for the former (Jan 12,
two days into this window) and e.g. https://lkml.org/lkml/2015/5/16/297 for
the latter.  And yes, you had been Cc'd on that thread back in May, including
the posting in question - even posted in other branches of that thread, both
before and after...

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-23 22:44   ` Dave Chinner
  2016-01-23 23:09     ` Al Viro
@ 2016-01-23 23:48     ` Linus Torvalds
  2016-01-24  0:26       ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2016-01-23 23:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel

On Sat, Jan 23, 2016 at 2:44 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> FWIW, I'm not opposed to making such a locking change - I'm more
> concerned about the fact I'm finding out about plans for such a
> fundamental locking change from a pull request on the last day of a
> merge window....

It's actually not a new patch - the patch (or rather, the script to
generate it - there are just a few small manual fixups for whitespace
etc that went along with it, iirc) goes back almost a year by now, and
came about from an NFS "who owns the locking" discussion back then. It
was mainly with Neil Brown (who brought up an NFS performance issue).

I know you were involved in that thread at least tangentially too - we
talked about readdir() and how annoying the i_mutex is.

So the original script and patch were a "how about this" from me back
last spring.

And the whole "last day of the merge window" is actually intentional -
it's behind all the filesystem merges on purpose, so that there aren't
any merge conflicts from an almost entirely automated patch.

Side note: a large reason for the patch is exactly the fact that
particularly for things like filename lookup, the vfs layer really
needs to be in control of locking, and I disagreed violently with Neil
who thought the filesystem should be in control. We really have had
much better luck with trying to make sure that the vfs layer does a
reasonable job at locking than have each filesystem decide to do its
own thing.

But obviously, the inode semaphore is one of the weakest areas of the
vfs locking. I agree that it needs fixing, I just disagree violently
when somebody says "the vfs layer should get out of the way".

I know filesystem developers always think the buffering and the
caching that the vfs layer does is "not important", but that's because
you don't see the real work. Outside of some very specialized loads,
the cached cases are generally 99% of pretty much all loads, and it's
why doing things at the vfs layer (and the mm layer - the two are kind
of incestuous when it comes to things like the page cache) has been so
successful and important, and why I completely dismiss any "filesystem
should be in control" arguments. We used to do that, and it was a
nightmare.

Also, do note that the patch itself doesn't actually change locking at
all. It's purely about making it easier to slowly start changing the
locking by adding this abstraction layer that makes it possible to
change the lock type eventually. Al has some plans for (maybe) next
merge window, but for now it's all entirely syntactic preparation for
the future, no real change at all.

             Linus

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-23 23:48     ` Linus Torvalds
@ 2016-01-24  0:26       ` Dave Chinner
  2016-01-24  1:20         ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-01-24  0:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel

On Sat, Jan 23, 2016 at 03:48:30PM -0800, Linus Torvalds wrote:
> On Sat, Jan 23, 2016 at 2:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > FWIW, I'm not opposed to making such a locking change - I'm more
> > concerned about the fact I'm finding out about plans for such a
> > fundamental locking change from a pull request on the last day of a
> > merge window....
> 
> It's actually not a new patch - the patch (or rather, the script to
> generate it - there are just a few small manual fixups for whitespace
> etc that went along with it, iirc) goes back almost a year by now, and
> came about from an NFS "who owns the locking" discussion back then. It
> was mainly with Neil Brown (who brought up an NFS performance issue).
> 
> I know you were involved in that thread at least tangentially too - we
> talked about readdir() and how annoying the i_mutex is.
> 
> So the original script and patch were a "how about this" from me back
> last spring.

Right, but that was a year ago, and I don't recall there being any
clear conclusion from that discussion. Certainly not that I'm going
to remember when I'm reading a pull request that has a vague
commit message.

> And the whole "last day of the merge window" is actually intentional -
> it's behind all the filesystem merges on purpose, so that there aren't
> any merge conflicts from an almost entirely automated patch.

That's fair enough. However, compare this to how core locking
changes occur in the mm subsystem - they go through multiple patch
postings and review so there's no surprise when the pull request
comes.

> I know filesystem developers always think the buffering and the
> caching that the vfs layer does is "not important", but that's because
> you don't see the real work.

That's not true. We do see all the real loads, and more so than
you think - most of the performance issues are solved long before
they come to the attention of the mm/page cache people, because the
usual bug report is "filesystem X is slow" sent to the relevant
filesystem list. i.e. you don't see (and don't want to see) most of
the issues we deal with around users/applications doing terrible
things to the IO subsystems. :)

> Also, do note that the patch itself doesn't actually change locking at
> all.

Yes, but that's not the sort of warning I want when something
fundamental is about to change....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-23 23:09     ` Al Viro
  2016-01-23 23:38       ` Al Viro
@ 2016-01-24  0:53       ` Dave Chinner
  2016-01-24  1:41         ` Al Viro
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-01-24  0:53 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sat, Jan 23, 2016 at 11:09:44PM +0000, Al Viro wrote:
> On Sun, Jan 24, 2016 at 09:44:35AM +1100, Dave Chinner wrote:
> 
> > FWIW, I'm not opposed to making such a locking change - I'm more
> > concerned about the fact I'm finding out about plans for such a
> > fundamental locking change from a pull request on the last day of a
> > merge window....
> 
> Look at the commit message (*and* pull request posting) of an earlier vfs.git
> pull request in the beginning of this window.  Or into the thread back in

Which one? You've sent 8 or 9 pull requests so far this merge
window, and I haven't had time to read all of them closely. Indeed,
you wrote one sentence in a long pull request description that
mentions converting the inode mutex to a rwsem. Not surprising
that I missed it - I'm sure lots of other people did too....

> May when it had been first proposed (and pretty much the same patch had been
> generated and posted by Linus).  Changes needed for parallel ->lookup() had
> been discussed; it was a side branch of one of the RCU symlink threads and
> ISTR your own postings in it.

I don't have the memory of an elephant, and the commit message and
pull requests make no mention of that discussion, nor do I recall
there being any clear resolution in that discussion way back then.
Certainly there's nothing in the pull req or the commit message
that would make me think "oh, that's what we talked about a year
ago!".

> Basic scheme: have dentries under ->lookup() marked as such and inserted into
> the hash (still negative, obviously) before calling ->lookup().  The method itself
> is called with the ->i_mutex replacement taken shared; anyone running into such
> a dentry in dcache lookup will wait (on the parent directory's ->i_mutex queue,
> explicitly kicked once ->lookup() is done) and repeat the dcache lookup.  In
> cases where the current code would've silently freed the ->lookup() argument (error
> or "I've used an existing dentry") the thing will be unhashed and dropped,
> without ever losing the "it's under lookup" flag.  Primitives like
> d_splice_alias() would remove the flag in question.
> 
> Anyone running into such a sucker in RCU mode should treat it as "dcache miss,
> need to fall back to non-lazy mode".  The flag is (as all dentry flags) protected
> by ->d_lock.
> 
> If a filesystem simply wants to preserve the existing exclusion, it should
> add a private per-inode mutex and take it in its ->lookup() instance; all
> other methods will still get exclusion on ->i_mutex replacement.
> 
> There will be interesting prereqs, but for XFS it's a non-issue.  Now,
> something like ceph or lustre... <shudder>  Again, for XFS (for any
> normal Unix filesystems, really) no extra exclusion should be needed.
> 
> readdir() is another potential target for weaker exclusion (i.e. switching
> it to taking that thing shared), but that's a separate story and I'd prefer
> to deal with ->lookup() first.  There are potentially hairy issues around
> the instances that pre-seed dcache and I don't want to mix them into the
> initial series.

So you're doing this purely to enable lookup concurrency, not
for anyone else to be able to use the inode lock as a read/write
lock? Can anyone use the inode rwsem as a read/write lock for their
own purposes? If so, we can probably use it to replace the XFS
IOLOCK and so effectively remove a layer of locking in various
XFS IO paths. What's the policy you are proposing here?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-24  0:26       ` Dave Chinner
@ 2016-01-24  1:20         ` Al Viro
  2016-01-24  7:17           ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2016-01-24  1:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Sun, Jan 24, 2016 at 11:26:58AM +1100, Dave Chinner wrote:

> That's fair enough. However, compare this to how core locking
> changes occur in the mm subsystem - they go through multiple patch
> postings and review so there's no surprise when the pull request
> comes.

... and the thread in question has grown from precisely that (and not the first
iteration, either) for an earlier such change (RCU symlinks).  The subsequent one
(follow_link -> get_link, with RCU symlink traversal for non-embedded
symlinks) also went through fsdevel mailbombs (a couple of iterations, IIRC).

Seriously, when it comes to actual fs-visible behaviour changes (rather than
"please, try and use these helpers instead of open-coding ->i_mutex access"
done exactly to avoid the inter-tree dependencies from hell while that work
is being done) fsdevel will be hit by such a mailbomb, and probably more than
once.

For now it's really just a reduction of trivial conflicts for the next cycle;
eventually it's going to be a weaker VFS exclusion on ->lookup().  Which had
been loudly demanded quite a few times, and I don't recall any filesystem
developers _ever_ objecting to that.

Speaking of the earlier changes - IIRC, there had been plans to start
caching (at least some of) XFS symlinks.  I think it was from hch, along
the lines of "stash a buffer with symlink body into ->i_link the first time
around, free it from inode eviction".  As long as that freeing is RCU-delayed,
doing so would work just fine and give you symlink traversal without dropping
from RCU mode...  OTOH, if that gets resurrected, it probably ought to go
through XFS tree - all VFS infrastructure is there (since 4.2), so it's
purely XFS business at this point...  One thing to watch out for is that
RCU delay - see the shmem.c fix in the same pull request for a related example.
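
The shape of that shmem fix, for reference (a sketch from memory - see
the actual commit for the details): the symlink body hanging off
->i_link may only be freed after an RCU grace period, i.e. from the
inode's RCU callback rather than directly in eviction:

	static void shmem_destroy_callback(struct rcu_head *head)
	{
		struct inode *inode = container_of(head, struct inode, i_rcu);

		if (S_ISLNK(inode->i_mode))
			kfree(inode->i_link);	/* grace period has passed */
		kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
	}

Otherwise a lockless path walk that sampled ->i_link just before
eviction could end up dereferencing freed memory.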

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-24  0:53       ` Dave Chinner
@ 2016-01-24  1:41         ` Al Viro
  2016-01-24  7:04           ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2016-01-24  1:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sun, Jan 24, 2016 at 11:53:04AM +1100, Dave Chinner wrote:
> > readdir() is another potential target for weaker exclusion (i.e. switching
> > it to taking that thing shared), but that's a separate story and I'd prefer
> > to deal with ->lookup() first.  There are potentially hairy issues around
> > the instances that pre-seed dcache and I don't want to mix them into the
> > initial series.
> 
> So you're doing this purely to enable lookup concurrency, not
> for anyone else to be able to use the inode lock as a read/write
> lock? Can anyone use the inode rwsem as a read/write lock for their
> own purposes? If so, we can probably use it to replace the XFS
> IOLOCK and so effectively remove a layer of locking in various
> XFS IO paths. What's the policy you are proposing here?

Depends...  I definitely want to keep directory modifiers with that thing
taken exclusive, with lookup and possibly readdir - shared.  Non-directories...
it's mostly up to filesystems; the only place where VFS cares is setattr
and {set,remove}xattr, and that probably should stay exclusive (or be
separated, for that matter, but I hadn't looked into implications of that;
we probably can do that, but there might be dragons).

For data operations on regular files it's probably up to filesystems, as
i_mutex is now.  Not sure if IOLOCK would map well on that; can you live with
that thing taken outside of transaction?
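
In terms of the helpers, the intended split would look something like
this (assuming shared variants grow alongside the rwsem conversion -
inode_lock_shared() does not exist yet in this pull):

	/* directory modifiers: create, unlink, rename, setattr, ... */
	inode_lock(dir);		/* exclusive */
	...
	inode_unlock(dir);

	/* ->lookup(), and possibly readdir */
	inode_lock_shared(dir);
	...
	inode_unlock_shared(dir);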

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-24  1:41         ` Al Viro
@ 2016-01-24  7:04           ` Dave Chinner
  2016-01-24  7:48             ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-01-24  7:04 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sun, Jan 24, 2016 at 01:41:12AM +0000, Al Viro wrote:
> On Sun, Jan 24, 2016 at 11:53:04AM +1100, Dave Chinner wrote:
> > > readdir() is another potential target for weaker exclusion (i.e. switching
> > > it to taking that thing shared), but that's a separate story and I'd prefer
> > > to deal with ->lookup() first.  There are potentially hairy issues around
> > > the instances that pre-seed dcache and I don't want to mix them into the
> > > initial series.
> > 
> > So you're doing this purely to enable lookup concurrency, not
> > for anyone else to be able to use the inode lock as a read/write
> > lock? Can anyone use the inode rwsem as a read/write lock for their
> > own purposes? If so, we can probably use it to replace the XFS
> > IOLOCK and so effectively remove a layer of locking in various
> > XFS IO paths. What's the policy you are proposing here?
> 
> Depends...  I definitely want to keep directory modifiers with that thing
> taken exclusive, with lookup and possibly readdir - shared.  Non-directories...
> it's mostly up to filesystems; the only place where VFS cares is setattr
> and {set,remove}xattr, and that probably should stay exclusive (or be
> separated, for that matter, but I hadn't looked into implications of that;
> we probably can do that, but there might be dragons).

Separated is the model that XFS uses, i.e. the ILOCK is the lock
that serialises access to inode metadata and nests inside the IOLOCK.

Essentially that means a ->setattr operation (e.g. truncate) is

i_mutex
IOLOCK (exclusive)
<do IO serialisation>
<do data manipulation (e.g. page cache invalidation)>
start transaction
ILOCK (exclusive)
<do metadata modification>
commit transaction
<unwind locking>

Hence even for ->setattr, we can remove the IOLOCK usage if the
vfs takes the new i_rwsem in exclusive mode because we would
still have a functional IO submission barrier....

> For data operations on regular files it's probably up to filesystems, as
> i_mutex is now.  Not sure if IOLOCK would map well on that; can you live with
> that thing taken outside of transaction?

Yes. IOLOCK has the same scope as i_mutex in the IO path.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-24  1:20         ` Al Viro
@ 2016-01-24  7:17           ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-01-24  7:17 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Sun, Jan 24, 2016 at 01:20:02AM +0000, Al Viro wrote:
> On Sun, Jan 24, 2016 at 11:26:58AM +1100, Dave Chinner wrote:
> 
> > That's fair enough. However, compare this to how core locking
> > changes occur in the mm subsystem - they go through multiple patch
> > postings and review so there's no surprise when the pull request
> > comes.
> 
> ... and the thread in question has grown from precisely that (and not the first
> iteration, either) for an earlier such change (RCU symlinks).  The subsequent one
> (follow_link -> get_link, with RCU symlink traversal for non-embedded
> symlinks) also went through fsdevel mailbombs (a couple of iterations, IIRC).

I haven't been following them closely, either, because it's
something I haven't needed to care much about. There's way more that
goes on than I can follow, but I would have expected something like
a pending i_mutex change to at least have a standalone "heads-up"
to inform various developers that there is a significant change
coming...

> Speaking of the earlier changes - IIRC, there had been plans to start
> caching (at least some of) XFS symlinks.  I think it was from hch, along
> the lines of "stash a buffer with symlink body into ->i_link the first time
> around, free it from inode eviction".

No idea - there hasn't been any followup, and nobody is telling me
they have a symlink traversal performance problem, so it's way off
my radar. I don't even know what workload RCU symlinks are supposed
to optimise....

> As long as that freeing is RCU-delayed,
> doing so would work just fine and give you symlink traversal without dropping
> from RCU mode...  OTOH, if that gets resurrected, it probably ought to go
> through XFS tree - all VFS infrastructure is there (since 4.2), so it's
> purely XFS business at this point...  One thing to watch out for is that
> RCU delay - see the shmem.c fix in the same pull request for a related example.

I can't see me getting to this in the next few months, to tell you
the truth. There's way more important things we need to do in XFS
land right now (like merge and stabilise reflink, reverse mapping,
DAX, etc), so this is likely to sit until someone who really needs
it comes along.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [git pull] vfs.git - including i_mutex wrappers
  2016-01-24  7:04           ` Dave Chinner
@ 2016-01-24  7:48             ` Al Viro
  0 siblings, 0 replies; 13+ messages in thread
From: Al Viro @ 2016-01-24  7:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sun, Jan 24, 2016 at 06:04:06PM +1100, Dave Chinner wrote:

> Hence even for ->setattr, we can remove the IOLOCK usage if the
> vfs takes the new i_rwsem in exclusive mode because we would
> still have a functional IO submission barrier....
> 
> > For data operations on regular files it's probably up to filesystems, as
> > i_mutex is now.  Not sure if IOLOCK would map well on that; can you live with
> > that thing taken outside of transaction?
> 
> Yes. IOLOCK has the same scope as i_mutex in the IO path.

Umm...  So e.g. xfs_create() could take IOLOCK before xfs_trans_reserve()?
If so, you could probably eventually use ->i_rwsem for it (and
drop it in places where it's already taken by method callers).  I'm nowhere
near familiar enough with the details of fs/xfs locking to tell how much
of a PITA the last part would be - e.g. a function used both inside ->lookup()
and in ->read_iter() and currently taking IOLOCK shared would need to
have it lifted into both callers and removed from ->lookup(), etc., which
might or might not be painful.
