From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
"J. Bruce Fields" <bfields@fieldses.org>,
Theodore Ts'o <tytso@mit.edu>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Dave Chinner <david@fromorbit.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Ingo Molnar <mingo@redhat.com>, Jan Kara <jack@suse.com>,
Jeff Layton <jlayton@poochiereds.net>,
Matthew Wilcox <matthew.r.wilcox@intel.com>,
Matthew Wilcox <willy@linux.intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org,
xfs@oss.sgi.com
Subject: Re: [PATCH v8 6/9] dax: add support for fsync/msync
Date: Wed, 13 Jan 2016 11:58:02 -0700
Message-ID: <20160113185802.GB5904@linux.intel.com>
In-Reply-To: <20160113093525.GD14630@quack.suse.cz>

On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > And secondly: You must write-protect all mappings of the flushed range so
> > > that you get fault when the sector gets written-to again. We spoke about
> > > this in the past already but somehow it got lost and I forgot about it as
> > > well. You need something like rmap_walk_file()...
> >
> > The code that write protected mappings and then cleaned the radix tree entries
> > did get written, and was part of v2:
> >
> > https://lkml.org/lkml/2015/11/13/759
> >
> > I removed all the code that cleaned PTE entries and radix tree entries for v3.
> > The reason behind this was that there was a race that I couldn't figure out
> > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > tree entries.
> >
> > The race goes like this:
> >
> > Thread 1 (write)                     Thread 2 (fsync)
> > ================                     ================
> > wp_pfn_shared()
> > pfn_mkwrite()
> > dax_radix_entry()
> > radix_tree_tag_set(DIRTY)
> >                                      dax_writeback_mapping_range()
> >                                      dax_writeback_one()
> >                                      radix_tag_clear(DIRTY)
> >                                      pgoff_mkclean()
> > ... return up to wp_pfn_shared()
> > wp_page_reuse()
> > pte_mkdirty()
> > After this sequence we end up with a dirty PTE that is writeable, but with a
> > clean radix tree entry. This means that users can write to the page, but that
> > a follow-up fsync or msync won't flush this dirty data to media.
> >
> > The overall issue is that in the write path that goes through wp_pfn_shared(),
> > the DAX code has control over when the radix tree entry is dirtied but not
> > when the PTE is made dirty and writeable. This happens up in wp_page_reuse().
> > This means that we can't easily add locking, etc. to protect ourselves.
> >
> > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> > easy solutions presented themselves in the absence of a page lock. I do have
> > one idea, but I think it's pretty invasive and will need to wait for another
> > kernel cycle.
> >
> > The current code that leaves the radix tree entry will give us correct
> > behavior - it'll just be less efficient because we will have an ever-growing
> > dirty set to flush.
>
> Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> tags but it does not (I should have known, I wrote that function a
> few years ago ;). Makes sense. Thanks for the clarification.
>
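For anyone following along, the interleaving from the race diagram quoted
above can be replayed in a small single-threaded userspace model. This is
only a sketch and not kernel code: struct page_state and every helper name
below are made up for illustration, and the three booleans stand in for the
radix tree DIRTY tag and the PTE dirty/write bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one DAX page. Hypothetical, not a kernel structure. */
struct page_state {
	bool radix_dirty;   /* DIRTY tag on the radix tree entry */
	bool pte_dirty;     /* PTE dirty bit */
	bool pte_writeable; /* PTE write permission */
};

/* Thread 1, first half: pfn_mkwrite() -> dax_radix_entry() tags the entry. */
static void writer_tag_radix_dirty(struct page_state *p)
{
	p->radix_dirty = true;
}

/* Thread 2: v2-style writeback that clears the tag and cleans the PTE. */
static void fsync_clean(struct page_state *p)
{
	p->radix_dirty = false;   /* radix tree tag cleared */
	p->pte_dirty = false;     /* models pgoff_mkclean() */
	p->pte_writeable = false;
}

/* Thread 1, second half: wp_page_reuse() makes the PTE dirty and writeable. */
static void writer_finish_fault(struct page_state *p)
{
	p->pte_dirty = true;
	p->pte_writeable = true;
}

/* True when a later fsync would skip data that userspace can still modify. */
static bool missed_by_next_fsync(const struct page_state *p)
{
	return p->pte_writeable && !p->radix_dirty;
}

/* Replays the three steps in the order shown in the diagram. */
static struct page_state replay_race(void)
{
	struct page_state p = { false, false, false };

	/* A clean, read-only page cannot be missed by fsync. */
	assert(!missed_by_next_fsync(&p));

	writer_tag_radix_dirty(&p); /* Thread 1 */
	fsync_clean(&p);            /* Thread 2 runs in the window */
	writer_finish_fault(&p);    /* Thread 1 resumes */
	return p;
}
```

In this model replay_race() ends with a dirty, writeable PTE and a clean
radix tree entry, which is the state described above. If fsync_clean() only
flushed and left radix_dirty set, as the current code does, the page would
stay in the dirty set and a later fsync would still flush it.
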
> > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > > * dax_pfn_mkwrite - handle first write to DAX page
> > > > * @vma: The virtual memory area where the fault occurred
> > > > * @vmf: The description of the fault
> > > > - *
> > > > */
> > > > int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > {
> > > > - struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > + struct file *file = vma->vm_file;
> > > >
> > > > - sb_start_pagefault(sb);
> > > > - file_update_time(vma->vm_file);
> > > > - sb_end_pagefault(sb);
> > > > + dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > >
> > > Why is NO_SECTOR argument correct here?
> >
> > Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> > tree - just make that entry dirty". This works because pfn_mkwrite() always
> > follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls
> > will insert the radix tree entry, regardless of whether the fault was for a
> > read or a write. If the fault was for a write, the radix tree entry will also
> > be made dirty.
> >
> > For reads the radix tree entry will be inserted but left clean. When the
> > first write happens we will get a pfn_mkwrite() call, which will call
> > dax_radix_entry() with the NO_SECTOR argument. This will look up the radix
> > tree entry & set the dirty tag.
>
> So the explanation of this should be somewhere so that everyone knows that
> we must have radix tree entries even for clean mapped blocks. Because up to
> now that was not clear to me. Also __dax_pmd_fault() seems to insert
> entries only for write fault so the assumption doesn't seem to hold there?

Ah, right, sorry. The read fault -> pfn_mkwrite() sequence only happens for
4k pages. You are right about our handling of 2MiB pages: for a read
followed by a write we just call into the normal __dax_pmd_fault() code
again, which does the get_block() call and inserts a dirty radix tree entry.
Because we have to go all the way through the fault handler again at write
time, there is no benefit to inserting a clean radix tree entry on read, so
we just skip it.
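
To make the 4k vs. 2MiB difference concrete, here is a minimal userspace
sketch of the bookkeeping. It is an illustration under assumptions, not the
patch itself: the array stands in for the radix tree, all model_* names and
NPAGES are hypothetical, NO_SECTOR is given a placeholder value, and the -1
return for a missing entry exists only so the sketch makes that case
visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES    8      /* hypothetical tiny mapping */
#define NO_SECTOR (-1L)  /* placeholder value, stands in for the kernel constant */

struct entry {
	bool present;
	bool dirty;
	long sector;
};

static struct entry tree[NPAGES]; /* stand-in for the mapping's radix tree */

/*
 * Simplified model of dax_radix_entry(): insert an entry if needed and
 * optionally tag it dirty. With NO_SECTOR the caller expects an entry to
 * exist already and only wants it dirtied.
 */
static int model_radix_entry(size_t index, long sector, bool dirty)
{
	struct entry *e;

	assert(index < NPAGES);
	e = &tree[index];

	if (!e->present) {
		if (sector == NO_SECTOR)
			return -1; /* sketch only: no earlier fault inserted an entry */
		e->present = true;
		e->sector = sector;
	}
	if (dirty)
		e->dirty = true;
	return 0;
}

/* 4k fault: always inserts an entry, dirty only for a write fault. */
static int model_fault_4k(size_t index, long sector, bool write)
{
	return model_radix_entry(index, sector, write);
}

/* First write after a 4k read fault: dirty the existing entry. */
static int model_pfn_mkwrite(size_t index)
{
	return model_radix_entry(index, NO_SECTOR, true);
}

/* 2MiB fault: a read inserts nothing, a write inserts a dirty entry. */
static int model_pmd_fault(size_t index, long sector, bool write)
{
	if (!write)
		return 0;
	return model_radix_entry(index, sector, true);
}
```

So for 4k pages, model_fault_4k(0, 100, false) followed by
model_pfn_mkwrite(0) leaves a dirty entry that still records sector 100,
while for 2MiB pages the entry only appears once model_pmd_fault() is called
for a write.
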

> I'm somewhat uneasy that a bug in this logic can be hidden as a simple race
> with hole punching. But I guess I can live with that.
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR