From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Theodore Ts'o <tytso@mit.edu>,
linux-ext4@vger.kernel.org, Lukas Czerner <lczerner@redhat.com>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH] ext4: introduce per-inode DAX flag
Date: Tue, 29 Aug 2017 17:49:22 +0200 [thread overview]
Message-ID: <20170829154922.GA24592@quack2.suse.cz> (raw)
In-Reply-To: <20170828101014.GD17782@dastard>
On Mon 28-08-17 20:10:14, Dave Chinner wrote:
> On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote:
> > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote:
> > > > Nah, -o dax works very well. It's just the flag instead of the -o dax
> > > > option or rather switching it on a mapped file will probably be very dangerous.
> > >
> > > In what way is it dangerous, Christoph?
> >
> > When I run the following script as a normal user:
> >
> > FSXDIR=~/xfstests/ltp/
> > FILE=/mnt/foo
> >
> > ${FSXDIR}/fsx $FILE &
> >
> > while true; do
> > xfs_io -c 'chattr +x' $FILE
> > xfs_io -c 'chattr -x' $FILE
> > done
> >
> > I get this nice little crash:
>
> Can you please package that up into an xfstest?
>
> > root@testvm:~# sh test.sh
> > skipping zero size read
> > skipping insert range behind EOF
> > truncating to largest ever: 0x3a290
> > zero_range to largest ever: 0x3a8d1
> > zero_range to largest ever: 0x3fe3e
> > zero_range to largest ever: 0x40000
> > [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0
> > [ 344.899795] PGD 7db37067
> > [ 344.899796] P4D 7db37067
> > [ 344.900099] PUD 78c61067
> > [ 344.900389] PMD 0
> > [ 344.900665]
> > [ 344.901075] Oops: 0000 [#1] SMP
> > [ 344.901536] Modules linked in:
> > [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199
> > [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> > [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000
> > [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0
> > [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246
> > [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001
> > [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0
> > [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000
> > [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010
> > [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580
> ^^^^^^^^^^^^^^^^
>
> vmf->page is null.
>
> Which means IS_DAX changed half way through a fault, despite us
> holding the MMAPLOCK and protecting all the filesystem side of the
> fault code from races.
>
> Seems to me that even allowing filesystems to switch between
> different mapping tree behaviours based on an inode flag is a
> fundamentally broken model. The fault action that needs to taken by
> the filesystem has already been predetermined by the fault
> processing that has already occurred and placed into the contents of
> the vmf we've been passed.
I don't think the problem is actually within MM in this particular case.
The problem seems to be that xfs_filemap_fault() checks IS_DAX without
holding MMAPLOCK and so it can change after that test and before the test
in xfs_filemap_page_mkwrite().
> Hence I think that if we need to process the fault as a DAX fault
> then the vmf needs to tell us that, not require us to look up an
> inode flag to determine what to do. ANd if the inode flag changes,
> then that needs to be propagated through the mapping and VMAs in a
> sane fashion, not just run an invalidation from the filesystem. I
> don't know enough about the VM code to say anything useful about how
> this needs to be set up, but it's clear that mapping invalidation
> and behaviour swaps can't be completely serialised against page
> faults from the filesystem side.
But there is no difference in vmf setup from generic MM side. In particular
vmf->page is set by the ->fault handler and then it is passed to
->page_mkwrite handler. And changes to mapping behavior between these two
callbacks should be prevented by the page lock / radix entry lock...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2017-08-29 15:49 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-08-02 16:09 [PATCH] ext4: introduce per-inode DAX flag Lukas Czerner
2017-08-05 8:46 ` Christoph Hellwig
2017-08-07 12:12 ` Lukas Czerner
2017-08-08 9:00 ` Lukas Czerner
2017-08-11 10:01 ` Christoph Hellwig
2017-08-11 12:11 ` Lukas Czerner
2017-08-11 12:58 ` Christoph Hellwig
2017-08-11 13:41 ` Lukas Czerner
2017-08-24 18:20 ` Theodore Ts'o
2017-08-25 7:54 ` Christoph Hellwig
2017-08-25 15:14 ` Theodore Ts'o
2017-08-25 15:40 ` Christoph Hellwig
2017-08-25 16:28 ` Theodore Ts'o
2017-08-25 23:33 ` Dave Chinner
2017-08-28 7:38 ` Christoph Hellwig
2017-08-28 10:10 ` Dave Chinner
2017-08-29 15:49 ` Jan Kara [this message]
2017-08-29 22:57 ` Dave Chinner
2017-08-30 10:00 ` Jan Kara
2017-08-30 12:34 ` Christoph Hellwig
2017-08-30 15:00 ` Theodore Ts'o
2017-08-30 15:30 ` Lukas Czerner
2017-08-30 15:29 ` Jan Kara
2017-08-30 16:05 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170829154922.GA24592@quack2.suse.cz \
--to=jack@suse.cz \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=lczerner@redhat.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).