From: Jan Kara <jack@suse.cz>
To: linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
xfs@oss.sgi.com, linux-ext4@vger.kernel.org,
Ross Zwisler <ross.zwisler@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>
Subject: Subtle races between DAX mmap fault and write path
Date: Wed, 27 Jul 2016 14:07:45 +0200
Message-ID: <20160727120745.GI6860@quack2.suse.cz>
Hi,
when testing my latest changes to DAX fault handling code I have hit the
following interesting race between the fault and write path (I'll show
function names for ext4 but xfs has the same issue AFAICT).
We have a file 'f' which has a hole at offset 0.

Process 0                                   Process 1

data = mmap('f');
read data[0]
  -> fault, we map a hole page

                                            pwrite('f', buf, len, 0)
                                              -> ext4_file_write_iter
                                                inode_lock(inode);
                                                __generic_file_write_iter()
                                                  generic_file_direct_write()
                                                    invalidate_inode_pages2_range()
                                                      - drops hole page from
                                                        the radix tree
                                                    ext4_direct_IO()
                                                      dax_do_io()
                                                        - allocates block for
                                                          offset 0
data[0] = 1
  -> page_mkwrite fault
    -> ext4_dax_fault()
      down_read(&EXT4_I(inode)->i_mmap_sem);
      __dax_fault()
        grab_mapping_entry()
          - creates locked radix tree entry
        - maps block into PTE
        put_locked_mapping_entry()

                                                    invalidate_inode_pages2_range()
                                                      - removes dax entry from
                                                        the radix tree

So we have just lost the information that block 0 is mapped and needs its
caches flushed.
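
For anyone who wants to poke at this from userspace, here is a minimal
sketch of the access pattern above. It is only an illustration, not a
reliable reproducer: the helper name dax_race_access_pattern() and the
4096-byte region are my assumptions, the file has to live on a DAX-mounted
filesystem for the traced code paths to be involved at all, and nothing
here forces the store to land in the window between dax_do_io() and the
second invalidate_inode_pages2_range().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_LEN 4096 /* assumed size of the region we touch */

/*
 * Hypothetical helper: replays the userspace side of the trace on 'path'.
 * Returns 0 if every step succeeded, -1 otherwise. The file is removed
 * again before returning.
 */
int dax_race_access_pattern(const char *path)
{
    volatile unsigned char *data;
    void *map = MAP_FAILED;
    int status = 0;
    int ret = -1;
    pid_t pid;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Extending an empty file leaves a hole at offset 0. */
    if (ftruncate(fd, REGION_LEN) < 0)
        goto out;

    /* "Process 0": data = mmap('f') */
    map = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    data = map;

    /* "Process 0": read data[0], which faults in the hole (reads as 0). */
    if (data[0] != 0)
        goto out;

    pid = fork();
    if (pid < 0)
        goto out;

    if (pid == 0) {
        /* "Process 1": pwrite('f', buf, len, 0) */
        char buf[REGION_LEN];
        ssize_t n;

        memset(buf, 'A', sizeof(buf));
        n = pwrite(fd, buf, sizeof(buf), 0);
        _exit(n == (ssize_t)sizeof(buf) ? 0 : 1);
    }

    /* "Process 0": data[0] = 1, running concurrently with the pwrite. */
    data[0] = 1;

    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        ret = 0;

out:
    if (map != MAP_FAILED)
        munmap(map, REGION_LEN);
    close(fd);
    unlink(path);
    return ret;
}
```

Calling it in a loop on a file under a DAX mount point would be the obvious
way to use it. Which of the two writes ends up in byte 0 is intentionally
left unspecified, since that depends on the interleaving.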
Also the fact that the consistency of data as viewed by mmap and
dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
unexpected to me and we should document it somewhere.
The question is how to best fix this. I see three options:
1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
harsh but should work - we call filemap_write_and_wait() in
generic_file_direct_write() so we flush out all caches for the relevant
area before dropping radix tree entries.
2) Call filemap_write_and_wait() after we return from ->direct_IO before we
call invalidate_inode_pages2_range() and hold i_mmap_sem exclusively only
for those two calls. Lock hold time will be shorter than in 1) but it will
require an additional flush and we'd probably have to stop using
generic_file_direct_write() for DAX writes to allow for all this special
hackery.
3) Remodel dax_do_io() to work more like buffered IO and use radix tree
entry locks to protect against similar races. That likely scales better
than 1) but may actually be slower in the uncontended case (due to all the
radix tree operations).
Any opinions on this?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR