Re: [PATCH] ext4: introduce per-inode DAX flag

From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Theodore Ts'o <tytso@mit.edu>,
	linux-ext4@vger.kernel.org, Lukas Czerner <lczerner@redhat.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH] ext4: introduce per-inode DAX flag
Date: Tue, 29 Aug 2017 17:49:22 +0200
Message-ID: <20170829154922.GA24592@quack2.suse.cz>
In-Reply-To: <20170828101014.GD17782@dastard>

On Mon 28-08-17 20:10:14, Dave Chinner wrote:
> On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote:
> > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote:
> > > > Nah, -o dax works very well.  It's just the flag instead of the -o dax
> > > > option or rather switching it on a mapped file will probably be very dangerous.
> > > 
> > > In what way is it dangerous, Christoph?
> > 
> > When I run the following script as a normal user:
> > 
> > FSXDIR=~/xfstests/ltp/
> > FILE=/mnt/foo
> > 
> > ${FSXDIR}/fsx $FILE &
> > 
> > while true; do
> >     xfs_io -c 'chattr +x' $FILE
> >     xfs_io -c 'chattr -x' $FILE
> > done
> >
> > I get this nice little crash:
> 
> Can you please package that up into an xfstest?
> 
> > root@testvm:~# sh test.sh
> > skipping zero size read
> > skipping insert range behind EOF
> > truncating to largest ever: 0x3a290
> > zero_range to largest ever: 0x3a8d1
> > zero_range to largest ever: 0x3fe3e
> > zero_range to largest ever: 0x40000
> > [  344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [  344.899306] IP: iomap_page_mkwrite+0x17/0xf0
> > [  344.899795] PGD 7db37067
> > [  344.899796] P4D 7db37067
> > [  344.900099] PUD 78c61067
> > [  344.900389] PMD 0
> > [  344.900665]
> > [  344.901075] Oops: 0000 [#1] SMP
> > [  344.901536] Modules linked in:
> > [  344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199
> > [  344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> > [  344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000
> > [  344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0
> > [  344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246
> > [  344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001
> > [  344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0
> > [  344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000
> > [  344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010
> > [  344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580
>                       ^^^^^^^^^^^^^^^^
> 
> vmf->page is null.
> 
> Which means IS_DAX changed half way through a fault, despite us
> holding the MMAPLOCK and protecting all the filesystem side of the
> fault code from races.
> 
> Seems to me that even allowing filesystems to switch between
> different mapping tree behaviours based on an inode flag is a
> fundamentally broken model. The fault action that needs to be taken by
> the filesystem has already been predetermined by the fault
> processing that has already occurred and placed into the contents of
> the vmf we've been passed.

I don't think the problem is actually within MM in this particular case.
The problem seems to be that xfs_filemap_fault() checks IS_DAX without
holding MMAPLOCK, so the flag can change after that test and before the
second test in xfs_filemap_page_mkwrite().
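
To make that window concrete, here is a rough userspace model of it. This
is only a sketch and not the XFS code: struct inode_model, struct vmf_model,
the hook and all function names are made up for illustration, and the
"lock" exists only as comments. It assumes that a fault which takes the DAX
path attaches no page, so a later non-DAX decision finds a NULL page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these names exist in the kernel. */
struct inode_model { bool dax; };        /* stands in for IS_DAX(inode) */
struct vmf_model { const void *page; };  /* stands in for vmf->page */
typedef void (*flag_change_hook)(struct inode_model *);

enum fault_result { FAULT_OK, FAULT_NULL_PAGE };

const char dummy_page = 0;               /* placeholder for a page cache page */

/* Models a concurrent "chattr -x" clearing the flag. */
void clear_dax(struct inode_model *ip) { ip->dax = false; }

/* Second test: picks the mkwrite path from the flag as it is now. */
enum fault_result mkwrite_model(const struct inode_model *ip,
                                const struct vmf_model *vmf)
{
    assert(ip != NULL && vmf != NULL);
    if (ip->dax)
        return FAULT_OK;                 /* DAX path never looks at the page */
    /* Page cache path: a NULL page here is the modelled oops. */
    return vmf->page ? FAULT_OK : FAULT_NULL_PAGE;
}

/* First test is done before the (imaginary) lock is taken. */
enum fault_result fault_unlocked_check(struct inode_model *ip,
                                       struct vmf_model *vmf,
                                       flag_change_hook hook)
{
    bool dax = ip->dax;                  /* test 1, no lock held */
    if (hook)
        hook(ip);                        /* flag changes inside the window */
    if (!dax)
        vmf->page = &dummy_page;         /* only the non-DAX path sets a page */
    return mkwrite_model(ip, vmf);       /* test 2 may disagree with test 1 */
}

/* Both tests happen after the flag change, as if under one lock hold. */
enum fault_result fault_locked_check(struct inode_model *ip,
                                     struct vmf_model *vmf,
                                     flag_change_hook hook)
{
    if (hook)
        hook(ip);                        /* change completes before we "lock" */
    bool dax = ip->dax;                  /* test 1, lock held */
    if (!dax)
        vmf->page = &dummy_page;
    return mkwrite_model(ip, vmf);       /* test 2 sees the same value */
}
```

Starting from dax == true and a NULL page, the unlocked variant with
clear_dax as the hook ends in FAULT_NULL_PAGE, while the locked variant
ends in FAULT_OK because both tests see the same flag value.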

> Hence I think that if we need to process the fault as a DAX fault
> then the vmf needs to tell us that, not require us to look up an
> inode flag to determine what to do. And if the inode flag changes,
> then that needs to be propagated through the mapping and VMAs in a
> sane fashion, not just run an invalidation from the filesystem. I
> don't know enough about the VM code to say anything useful about how
> this needs to be set up, but it's clear that mapping invalidation
> and behaviour swaps can't be completely serialised against page
> faults from the filesystem side.

But there is no difference in vmf setup from the generic MM side. In
particular, vmf->page is set by the ->fault handler and then passed to the
->page_mkwrite handler. And changes to mapping behavior between these two
callbacks should be prevented by the page lock / radix entry lock...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
