All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Christoph Hellwig <hch@lst.de>
Cc: ira.weiny@intel.com, linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Dan Williams <dan.j.williams@intel.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
Date: Tue, 25 Feb 2020 11:09:37 +1100	[thread overview]
Message-ID: <20200225000937.GA10776@dread.disaster.area> (raw)
In-Reply-To: <20200224175603.GE7771@lst.de>

On Mon, Feb 24, 2020 at 06:56:03PM +0100, Christoph Hellwig wrote:
> On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > > I am very much against this.  There is a reason why we don't support
> > > changes of ops vectors at runtime anywhere else, because it is
> > > horribly complicated and impossible to get right.  IFF we ever want
> > > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > > right way is to just add a few simple conditionals and merge the
> > > aops, which is much easier to reason about, and less costly in overall
> > > overhead.
> > 
> > *cough*
> > 
> > That's exactly what the original code did. And it was broken
> > because page faults call multiple aops that are dependent on the
> > result of the previous aop calls setting up the state correctly for
> > the latter calls. And when S_DAX changes between those calls, shit
> > breaks.
> 
> No, the original code was broken because it didn't serialize between
> DAX and buffer access.
> 
> Take a step back and look where the problems are, and they are not
> mostly with the aops.  In fact the only aop useful for DAX is
> is ->writepages, and how it uses ->writepages is actually a bit of
> an abuse of that interface.

The races are all through the fops, too, which is one of the reasons
Darrick mentioned we should probably move this up to file ops
level...

> So what we really need it just a way to prevent switching the flag
> when a file is mapped,

That's not sufficient.

We also have to prevent the file from being mapped *while we are
switching*. Nothing in the mmap() path actually serialises against
filesystem operations, and the initial behavioural checks in the
page fault path are similarly unserialised against changing the
S_DAX flag.

e.g. there's a race against ->mmap() with switching the the S_DAX
flag. In xfs_file_mmap():

        file_accessed(file);
        vma->vm_ops = &xfs_file_vm_ops;
        if (IS_DAX(inode))
                vma->vm_flags |= VM_HUGEPAGE;

So if this runs while a switch from true to false is occurring,
we've got a non-dax VMA with huge pages enabled, and the non-dax
page fault path doesn't support this.

If we are really lucky, we'll have IS_DAX() set just long
enough to get through this check in xfs_filemap_huge_fault():

        if (!IS_DAX(file_inode(vmf->vma->vm_file)))
                return VM_FAULT_FALLBACK;

and then we'll get into __xfs_filemap_fault() and block on the
MMAPLOCK. When we actually get that lock, S_DAX will now be false
and we have a huge page fault running through a path (page cache IO)
that doesn't support huge pages...

This is the sort of landmine switching S_DAX without serialisation
causes. The methods themselves look sane, but it's cross-method
dependencies for a stable S_DAX flag that is the real problem.

Yes, we can probably fix this by adding XFS_MMAPLOCK_SHARED here,
but means every filesystem has to solve the same race conditions
itself. That's hard to get right and easy to get wrong. If the
VFS/mm subsystem provides the infrastructure needed to use this
functionality safely, then that is hard for filesystem developers to
get wrong.....

> and in the normal read/write path ensure the
> flag can't be switch while I/O is going on, which could either be
> done by ensuring it is only switched under i_rwsem or equivalent
> protection, or by setting the DAX flag once in the iocb similar to
> IOCB_DIRECT.

The iocb path is not the problem - that's entirely serialised
against S_DAX changes by the i_rwsem. The problem is that we have no
equivalent filesystem level serialisation for the entire mmap/page
fault path, and it checks S_DAX all over the place.

It's basically the same limitation that we have with mmap vs direct
IO - we can't lock out mmap when we do direct IO, so we cannot
guarantee coherency between the two. Similar here - we cannot
lockout mmap in any sane way, so we cannot guarantee coherency
between mmap and changing the S_DAX flag.

That's the underlying problem we need to solve here.

/me wonders if the best thing to do is to add a ->fault callout to
tell the filesystem to lock/unlock the inode right up at the top of
the page fault path, outside even the mmap_sem.  That means all the
methods that the page fault calls are protected against S_DAX
changes, and it gives us a low cost method of serialising page
faults against DIO (e.g. via inode_dio_wait())....

> And they easiest way to get all this done is as a first step to
> just allowing switching the flag when no blocks are allocated at
> all, similar to how the rt flag works.

False equivalence - it is not similar because the RT flag changes
and their associated state checks are *already fully serialised* by
the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
that's the problem we need to solve...

In reality, I don't really care how we solve this problem, but we've
been down the path you are suggesting at least twice before and each
time we've ended up in a "that doesn't work" corner and had to
persue other solutions. I don't like going around in circles.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2020-02-25  0:09 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
2020-02-21  0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
2020-02-21  1:26   ` Dave Chinner
2020-02-27 17:52     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
2020-02-21  1:34   ` Dave Chinner
2020-02-21 23:00     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
2020-02-21  1:42   ` Dave Chinner
2020-02-21 23:04     ` Ira Weiny
2020-02-21 17:42   ` Christoph Hellwig
2020-02-21  0:41 ` [PATCH V4 04/13] fs/stat: Define DAX statx attribute ira.weiny
2020-02-21  0:41 ` [PATCH V4 05/13] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
2020-02-21  0:41 ` [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
2020-02-22  0:28   ` Darrick J. Wong
2020-02-23 15:07     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
2020-02-21 17:44   ` Christoph Hellwig
2020-02-21 22:44     ` Dave Chinner
2020-02-21 23:26       ` Dan Williams
2020-02-24 17:56       ` Christoph Hellwig
2020-02-25  0:09         ` Dave Chinner [this message]
2020-02-25 17:36           ` Christoph Hellwig
2020-02-25 19:37             ` Jeff Moyer
2020-02-26  9:28               ` Jonathan Halliday
2020-02-26 11:31                 ` Jan Kara
2020-02-26 11:56                   ` Jonathan Halliday
2020-02-26 16:10                 ` Ira Weiny
2020-02-26 16:46                 ` Dan Williams
2020-02-26 17:20                   ` Jan Kara
2020-02-26 17:54                     ` Dan Williams
2020-02-25 21:03             ` Ira Weiny
2020-02-26 11:17           ` Jan Kara
2020-02-26 15:57             ` Ira Weiny
2020-02-22  0:33   ` Darrick J. Wong
2020-02-23 15:03     ` Ira Weiny
2020-02-24 12:10   ` kbuild test robot
2020-02-21  0:41 ` [PATCH V4 08/13] fs: Prevent DAX state change if file is mmap'ed ira.weiny
2020-02-21  0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
2020-02-22  0:31   ` Darrick J. Wong
2020-02-23 15:04     ` Ira Weiny
2020-02-24  0:34   ` Dave Chinner
2020-02-24 19:57     ` Ira Weiny
2020-02-24 22:32       ` Dave Chinner
2020-02-25 21:12         ` Ira Weiny
2020-02-25 22:59           ` Dave Chinner
2020-02-26 18:02             ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-02-21 17:45   ` Christoph Hellwig
2020-02-21 18:06     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 11/13] fs/xfs: Allow toggle of effective DAX flag ira.weiny
2020-02-21  0:41 ` [PATCH V4 12/13] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
2020-02-21  0:41 ` [PATCH V4 13/13] Documentation/dax: Update Usage section ira.weiny
2020-02-26 22:48 ` [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 Jeff Moyer
2020-02-27  2:43   ` Ira Weiny

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200225000937.GA10776@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=dan.j.williams@intel.com \
    --cc=darrick.wong@oracle.com \
    --cc=hch@lst.de \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.