From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
"J. Bruce Fields" <bfields@fieldses.org>,
Theodore Ts'o <tytso@mit.edu>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Dave Chinner <david@fromorbit.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Ingo Molnar <mingo@redhat.com>, Jan Kara <jack@suse.com>,
Jeff Layton <jlayton@poochiereds.net>,
Matthew Wilcox <matthew.r.wilcox@intel.com>,
Matthew Wilcox <willy@linux.intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org,
xfs@oss.sgi.com
Subject: Re: [PATCH v8 6/9] dax: add support for fsync/msync
Date: Wed, 13 Jan 2016 00:30:19 -0700 [thread overview]
Message-ID: <20160113073019.GB30496@linux.intel.com> (raw)
In-Reply-To: <20160112105716.GT6262@quack.suse.cz>
On Tue, Jan 12, 2016 at 11:57:16AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space. This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries. We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>
> Some comments below.
>
> > ---
> > fs/dax.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > include/linux/dax.h | 2 +
> > mm/filemap.c | 6 ++
> > 3 files changed, 196 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5b84a46..0db21ea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,6 +24,7 @@
> > #include <linux/memcontrol.h>
> > #include <linux/mm.h>
> > #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> > #include <linux/pmem.h>
> > #include <linux/sched.h>
> > #include <linux/uio.h>
> > @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
> > return 0;
> > }
> >
> > +#define NO_SECTOR -1
> > +
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> > + sector_t sector, bool pmd_entry, bool dirty)
> > +{
> > + struct radix_tree_root *page_tree = &mapping->page_tree;
> > + int type, error = 0;
> > + void *entry;
> > +
> > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + entry = radix_tree_lookup(page_tree, index);
> > +
> > + if (entry) {
> > + type = RADIX_DAX_TYPE(entry);
> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > + type != RADIX_DAX_PMD)) {
> > + error = -EIO;
> > + goto unlock;
> > + }
> > +
> > + if (!pmd_entry || type == RADIX_DAX_PMD)
> > + goto dirty;
> > + radix_tree_delete(&mapping->page_tree, index);
> > + mapping->nrexceptional--;
>
> In theory, you can delete here DIRTY / TOWRITE PTE entry and insert a clean
> PMD entry instead of it. That will cause fsync() to miss some flushes. So
> you should make sure you transfer all the tags to the new entry.
Ah, great catch, I'll address it in v9 which I'll send out tomorrow.
> > +static int dax_writeback_one(struct block_device *bdev,
> > + struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > + struct radix_tree_root *page_tree = &mapping->page_tree;
> > + int type = RADIX_DAX_TYPE(entry);
> > + struct radix_tree_node *node;
> > + struct blk_dax_ctl dax;
> > + void **slot;
> > + int ret = 0;
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + /*
> > + * Regular page slots are stabilized by the page lock even
> > + * without the tree itself locked. These unlocked entries
> > + * need verification under the tree lock.
> > + */
> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > + goto unlock;
> > + if (*slot != entry)
> > + goto unlock;
> > +
> > + /* another fsync thread may have already written back this entry */
> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > + goto unlock;
> > +
> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > + ret = -EIO;
> > + goto unlock;
> > + }
> > +
> > + dax.sector = RADIX_DAX_SECTOR(entry);
> > + dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > + spin_unlock_irq(&mapping->tree_lock);
>
> This seems to be somewhat racy as well - if there are two fsyncs running
> against the same inode, one wins the race and clears TOWRITE tag, the
> second then bails out and may finish before the skipped page gets flushed.
>
> So we should clear the TOWRITE tag only after the range is flushed. This
> can result in some amount of duplicit flushing but I don't think the race
> will happen that frequently in practice to be performance relevant.
Yep, this make sense. I'll also fix that in v9.
> And secondly: You must write-protect all mappings of the flushed range so
> that you get fault when the sector gets written-to again. We spoke about
> this in the past already but somehow it got lost and I forgot about it as
> well. You need something like rmap_walk_file()...
The code that write protected mappings and then cleaned the radix tree entries
did get written, and was part of v2:
https://lkml.org/lkml/2015/11/13/759
I removed all the code that cleaned PTE entries and radix tree entries for v3.
The reason behind this was that there was a race that I couldn't figure out
how to solve between the cleaning of the PTEs and the cleaning of the radix
tree entries.
The race goes like this:
Thread 1 (write) Thread 2 (fsync)
================ ================
wp_pfn_shared()
pfn_mkwrite()
dax_radix_entry()
radix_tree_tag_set(DIRTY)
dax_writeback_mapping_range()
dax_writeback_one()
radix_tag_clear(DIRTY)
pgoff_mkclean()
... return up to wp_pfn_shared()
wp_page_reuse()
pte_mkdirty()
After this sequence we end up with a dirty PTE that is writeable, but with a
clean radix tree entry. This means that users can write to the page, but that
a follow-up fsync or msync won't flush this dirty data to media.
The overall issue is that in the write path that goes through wp_pfn_shared(),
the DAX code has control over when the radix tree entry is dirtied but not
when the PTE is made dirty and writeable. This happens up in wp_page_reuse().
This means that we can't easily add locking, etc. to protect ourselves.
I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
easy solutions presented themselves in the absence of a page lock. I do have
one idea, but I think it's pretty invasive and will need to wait for another
kernel cycle.
The current code that leaves the radix tree entry will give us correct
behavior - it'll just be less efficient because we will have an ever-growing
dirty set to flush.
> > + /*
> > + * We cannot hold tree_lock while calling dax_map_atomic() because it
> > + * eventually calls cond_resched().
> > + */
> > + ret = dax_map_atomic(bdev, &dax);
> > + if (ret < 0)
> > + return ret;
> > +
> > + if (WARN_ON_ONCE(ret < dax.size)) {
> > + ret = -EIO;
> > + goto unmap;
> > + }
> > +
> > + wb_cache_pmem(dax.addr, dax.size);
> > + unmap:
> > + dax_unmap_atomic(bdev, &dax);
> > + return ret;
> > +
> > + unlock:
> > + spin_unlock_irq(&mapping->tree_lock);
> > + return ret;
> > +}
>
> ...
>
> > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > * dax_pfn_mkwrite - handle first write to DAX page
> > * @vma: The virtual memory area where the fault occurred
> > * @vmf: The description of the fault
> > - *
> > */
> > int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > - struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > + struct file *file = vma->vm_file;
> >
> > - sb_start_pagefault(sb);
> > - file_update_time(vma->vm_file);
> > - sb_end_pagefault(sb);
> > + dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
>
> Why is NO_SECTOR argument correct here?
Right - so NO_SECTOR means "I expect there to already be an entry in the radix
tree - just make that entry dirty". This works because pfn_mkwrite() always
follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls
will insert the radix tree entry, regardless of whether the fault was for a
read or a write. If the fault was for a write, the radix tree entry will also
be made dirty.
For reads the radix tree entry will be inserted but left clean. When the
first write happens we will get a pfn_mkwrite() call, which will call
dax_radix_entry() with the NO_SECTOR argument. This will look up the radix
tree entry & set the dirty tag.
This also has the added benefit that the pfn_mkwrite() path can remain minimal
- if we needed to actually insert a radix tree entry with sector information
we'd have to duplicate a bunch of the fault path code so that we could call
get_block().
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
"J. Bruce Fields" <bfields@fieldses.org>,
linux-mm@kvack.org, Andreas Dilger <adilger.kernel@dilger.ca>,
"H. Peter Anvin" <hpa@zytor.com>,
Jeff Layton <jlayton@poochiereds.net>,
Dan Williams <dan.j.williams@intel.com>,
linux-nvdimm@lists.01.org, x86@kernel.org,
Ingo Molnar <mingo@redhat.com>,
Matthew Wilcox <willy@linux.intel.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-ext4@vger.kernel.org, xfs@oss.sgi.com,
Alexander Viro <viro@zeniv.linux.org.uk>,
Thomas Gleixner <tglx@linutronix.de>,
Theodore Ts'o <tytso@mit.edu>,
linux-kernel@vger.kernel.org, Jan Kara <jack@suse.com>,
linux-fsdevel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <matthew.r.wilcox@intel.com>
Subject: Re: [PATCH v8 6/9] dax: add support for fsync/msync
Date: Wed, 13 Jan 2016 00:30:19 -0700 [thread overview]
Message-ID: <20160113073019.GB30496@linux.intel.com> (raw)
In-Reply-To: <20160112105716.GT6262@quack.suse.cz>
On Tue, Jan 12, 2016 at 11:57:16AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space. This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries. We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>
> Some comments below.
>
> > ---
> > fs/dax.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > include/linux/dax.h | 2 +
> > mm/filemap.c | 6 ++
> > 3 files changed, 196 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5b84a46..0db21ea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,6 +24,7 @@
> > #include <linux/memcontrol.h>
> > #include <linux/mm.h>
> > #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> > #include <linux/pmem.h>
> > #include <linux/sched.h>
> > #include <linux/uio.h>
> > @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
> > return 0;
> > }
> >
> > +#define NO_SECTOR -1
> > +
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> > + sector_t sector, bool pmd_entry, bool dirty)
> > +{
> > + struct radix_tree_root *page_tree = &mapping->page_tree;
> > + int type, error = 0;
> > + void *entry;
> > +
> > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + entry = radix_tree_lookup(page_tree, index);
> > +
> > + if (entry) {
> > + type = RADIX_DAX_TYPE(entry);
> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > + type != RADIX_DAX_PMD)) {
> > + error = -EIO;
> > + goto unlock;
> > + }
> > +
> > + if (!pmd_entry || type == RADIX_DAX_PMD)
> > + goto dirty;
> > + radix_tree_delete(&mapping->page_tree, index);
> > + mapping->nrexceptional--;
>
> In theory, you can delete here DIRTY / TOWRITE PTE entry and insert a clean
> PMD entry instead of it. That will cause fsync() to miss some flushes. So
> you should make sure you transfer all the tags to the new entry.
Ah, great catch, I'll address it in v9 which I'll send out tomorrow.
> > +static int dax_writeback_one(struct block_device *bdev,
> > + struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > + struct radix_tree_root *page_tree = &mapping->page_tree;
> > + int type = RADIX_DAX_TYPE(entry);
> > + struct radix_tree_node *node;
> > + struct blk_dax_ctl dax;
> > + void **slot;
> > + int ret = 0;
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + /*
> > + * Regular page slots are stabilized by the page lock even
> > + * without the tree itself locked. These unlocked entries
> > + * need verification under the tree lock.
> > + */
> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > + goto unlock;
> > + if (*slot != entry)
> > + goto unlock;
> > +
> > + /* another fsync thread may have already written back this entry */
> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > + goto unlock;
> > +
> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > + ret = -EIO;
> > + goto unlock;
> > + }
> > +
> > + dax.sector = RADIX_DAX_SECTOR(entry);
> > + dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > + spin_unlock_irq(&mapping->tree_lock);
>
> This seems to be somewhat racy as well - if there are two fsyncs running
> against the same inode, one wins the race and clears TOWRITE tag, the
> second then bails out and may finish before the skipped page gets flushed.
>
> So we should clear the TOWRITE tag only after the range is flushed. This
> can result in some amount of duplicit flushing but I don't think the race
> will happen that frequently in practice to be performance relevant.
Yep, this make sense. I'll also fix that in v9.
> And secondly: You must write-protect all mappings of the flushed range so
> that you get fault when the sector gets written-to again. We spoke about
> this in the past already but somehow it got lost and I forgot about it as
> well. You need something like rmap_walk_file()...
The code that write protected mappings and then cleaned the radix tree entries
did get written, and was part of v2:
https://lkml.org/lkml/2015/11/13/759
I removed all the code that cleaned PTE entries and radix tree entries for v3.
The reason behind this was that there was a race that I couldn't figure out
how to solve between the cleaning of the PTEs and the cleaning of the radix
tree entries.
The race goes like this:
Thread 1 (write) Thread 2 (fsync)
================ ================
wp_pfn_shared()
pfn_mkwrite()
dax_radix_entry()
radix_tree_tag_set(DIRTY)
dax_writeback_mapping_range()
dax_writeback_one()
radix_tag_clear(DIRTY)
pgoff_mkclean()
... return up to wp_pfn_shared()
wp_page_reuse()
pte_mkdirty()
After this sequence we end up with a dirty PTE that is writeable, but with a
clean radix tree entry. This means that users can write to the page, but that
a follow-up fsync or msync won't flush this dirty data to media.
The overall issue is that in the write path that goes through wp_pfn_shared(),
the DAX code has control over when the radix tree entry is dirtied but not
when the PTE is made dirty and writeable. This happens up in wp_page_reuse().
This means that we can't easily add locking, etc. to protect ourselves.
I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
easy solutions presented themselves in the absence of a page lock. I do have
one idea, but I think it's pretty invasive and will need to wait for another
kernel cycle.
The current code that leaves the radix tree entry will give us correct
behavior - it'll just be less efficient because we will have an ever-growing
dirty set to flush.
> > + /*
> > + * We cannot hold tree_lock while calling dax_map_atomic() because it
> > + * eventually calls cond_resched().
> > + */
> > + ret = dax_map_atomic(bdev, &dax);
> > + if (ret < 0)
> > + return ret;
> > +
> > + if (WARN_ON_ONCE(ret < dax.size)) {
> > + ret = -EIO;
> > + goto unmap;
> > + }
> > +
> > + wb_cache_pmem(dax.addr, dax.size);
> > + unmap:
> > + dax_unmap_atomic(bdev, &dax);
> > + return ret;
> > +
> > + unlock:
> > + spin_unlock_irq(&mapping->tree_lock);
> > + return ret;
> > +}
>
> ...
>
> > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > * dax_pfn_mkwrite - handle first write to DAX page
> > * @vma: The virtual memory area where the fault occurred
> > * @vmf: The description of the fault
> > - *
> > */
> > int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > - struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > + struct file *file = vma->vm_file;
> >
> > - sb_start_pagefault(sb);
> > - file_update_time(vma->vm_file);
> > - sb_end_pagefault(sb);
> > + dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
>
> Why is NO_SECTOR argument correct here?
Right - so NO_SECTOR means "I expect there to already be an entry in the radix
tree - just make that entry dirty". This works because pfn_mkwrite() always
follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls
will insert the radix tree entry, regardless of whether the fault was for a
read or a write. If the fault was for a write, the radix tree entry will also
be made dirty.
For reads the radix tree entry will be inserted but left clean. When the
first write happens we will get a pfn_mkwrite() call, which will call
dax_radix_entry() with the NO_SECTOR argument. This will look up the radix
tree entry & set the dirty tag.
This also has the added benefit that the pfn_mkwrite() path can remain minimal
- if we needed to actually insert a radix tree entry with sector information
we'd have to duplicate a bunch of the fault path code so that we could call
get_block().
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
WARNING: multiple messages have this Message-ID (diff)
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
"J. Bruce Fields" <bfields@fieldses.org>,
"Theodore Ts'o" <tytso@mit.edu>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Dave Chinner <david@fromorbit.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Ingo Molnar <mingo@redhat.com>, Jan Kara <jack@suse.com>,
Jeff Layton <jlayton@poochiereds.net>,
Matthew Wilcox <matthew.r.wilcox@intel.com>,
Matthew Wilcox <willy@linux.intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org,
xfs@oss.sgi.com
Subject: Re: [PATCH v8 6/9] dax: add support for fsync/msync
Date: Wed, 13 Jan 2016 00:30:19 -0700 [thread overview]
Message-ID: <20160113073019.GB30496@linux.intel.com> (raw)
In-Reply-To: <20160112105716.GT6262@quack.suse.cz>
On Tue, Jan 12, 2016 at 11:57:16AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space. This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries. We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>
> Some comments below.
>
> > ---
> > fs/dax.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > include/linux/dax.h | 2 +
> > mm/filemap.c | 6 ++
> > 3 files changed, 196 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5b84a46..0db21ea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,6 +24,7 @@
> > #include <linux/memcontrol.h>
> > #include <linux/mm.h>
> > #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> > #include <linux/pmem.h>
> > #include <linux/sched.h>
> > #include <linux/uio.h>
> > @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
> > return 0;
> > }
> >
> > +#define NO_SECTOR -1
> > +
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> > + sector_t sector, bool pmd_entry, bool dirty)
> > +{
> > + struct radix_tree_root *page_tree = &mapping->page_tree;
> > + int type, error = 0;
> > + void *entry;
> > +
> > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + entry = radix_tree_lookup(page_tree, index);
> > +
> > + if (entry) {
> > + type = RADIX_DAX_TYPE(entry);
> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > + type != RADIX_DAX_PMD)) {
> > + error = -EIO;
> > + goto unlock;
> > + }
> > +
> > + if (!pmd_entry || type == RADIX_DAX_PMD)
> > + goto dirty;
> > + radix_tree_delete(&mapping->page_tree, index);
> > + mapping->nrexceptional--;
>
> In theory, you can delete here DIRTY / TOWRITE PTE entry and insert a clean
> PMD entry instead of it. That will cause fsync() to miss some flushes. So
> you should make sure you transfer all the tags to the new entry.
Ah, great catch, I'll address it in v9 which I'll send out tomorrow.
> > +static int dax_writeback_one(struct block_device *bdev,
> > + struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > + struct radix_tree_root *page_tree = &mapping->page_tree;
> > + int type = RADIX_DAX_TYPE(entry);
> > + struct radix_tree_node *node;
> > + struct blk_dax_ctl dax;
> > + void **slot;
> > + int ret = 0;
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + /*
> > + * Regular page slots are stabilized by the page lock even
> > + * without the tree itself locked. These unlocked entries
> > + * need verification under the tree lock.
> > + */
> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > + goto unlock;
> > + if (*slot != entry)
> > + goto unlock;
> > +
> > + /* another fsync thread may have already written back this entry */
> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > + goto unlock;
> > +
> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > + ret = -EIO;
> > + goto unlock;
> > + }
> > +
> > + dax.sector = RADIX_DAX_SECTOR(entry);
> > + dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > + spin_unlock_irq(&mapping->tree_lock);
>
> This seems to be somewhat racy as well - if there are two fsyncs running
> against the same inode, one wins the race and clears TOWRITE tag, the
> second then bails out and may finish before the skipped page gets flushed.
>
> So we should clear the TOWRITE tag only after the range is flushed. This
> can result in some amount of duplicit flushing but I don't think the race
> will happen that frequently in practice to be performance relevant.
Yep, this make sense. I'll also fix that in v9.
> And secondly: You must write-protect all mappings of the flushed range so
> that you get fault when the sector gets written-to again. We spoke about
> this in the past already but somehow it got lost and I forgot about it as
> well. You need something like rmap_walk_file()...
The code that write protected mappings and then cleaned the radix tree entries
did get written, and was part of v2:
https://lkml.org/lkml/2015/11/13/759
I removed all the code that cleaned PTE entries and radix tree entries for v3.
The reason behind this was that there was a race that I couldn't figure out
how to solve between the cleaning of the PTEs and the cleaning of the radix
tree entries.
The race goes like this:
Thread 1 (write) Thread 2 (fsync)
================ ================
wp_pfn_shared()
pfn_mkwrite()
dax_radix_entry()
radix_tree_tag_set(DIRTY)
dax_writeback_mapping_range()
dax_writeback_one()
radix_tag_clear(DIRTY)
pgoff_mkclean()
... return up to wp_pfn_shared()
wp_page_reuse()
pte_mkdirty()
After this sequence we end up with a dirty PTE that is writeable, but with a
clean radix tree entry. This means that users can write to the page, but that
a follow-up fsync or msync won't flush this dirty data to media.
The overall issue is that in the write path that goes through wp_pfn_shared(),
the DAX code has control over when the radix tree entry is dirtied but not
when the PTE is made dirty and writeable. This happens up in wp_page_reuse().
This means that we can't easily add locking, etc. to protect ourselves.
I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
easy solutions presented themselves in the absence of a page lock. I do have
one idea, but I think it's pretty invasive and will need to wait for another
kernel cycle.
The current code that leaves the radix tree entry will give us correct
behavior - it'll just be less efficient because we will have an ever-growing
dirty set to flush.
> > + /*
> > + * We cannot hold tree_lock while calling dax_map_atomic() because it
> > + * eventually calls cond_resched().
> > + */
> > + ret = dax_map_atomic(bdev, &dax);
> > + if (ret < 0)
> > + return ret;
> > +
> > + if (WARN_ON_ONCE(ret < dax.size)) {
> > + ret = -EIO;
> > + goto unmap;
> > + }
> > +
> > + wb_cache_pmem(dax.addr, dax.size);
> > + unmap:
> > + dax_unmap_atomic(bdev, &dax);
> > + return ret;
> > +
> > + unlock:
> > + spin_unlock_irq(&mapping->tree_lock);
> > + return ret;
> > +}
>
> ...
>
> > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > * dax_pfn_mkwrite - handle first write to DAX page
> > * @vma: The virtual memory area where the fault occurred
> > * @vmf: The description of the fault
> > - *
> > */
> > int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > - struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > + struct file *file = vma->vm_file;
> >
> > - sb_start_pagefault(sb);
> > - file_update_time(vma->vm_file);
> > - sb_end_pagefault(sb);
> > + dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
>
> Why is NO_SECTOR argument correct here?
Right - so NO_SECTOR means "I expect there to already be an entry in the radix
tree - just make that entry dirty". This works because pfn_mkwrite() always
follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls
will insert the radix tree entry, regardless of whether the fault was for a
read or a write. If the fault was for a write, the radix tree entry will also
be made dirty.
For reads the radix tree entry will be inserted but left clean. When the
first write happens we will get a pfn_mkwrite() call, which will call
dax_radix_entry() with the NO_SECTOR argument. This will look up the radix
tree entry & set the dirty tag.
This also has the added benefit that the pfn_mkwrite() path can remain minimal
- if we needed to actually insert a radix tree entry with sector information
we'd have to duplicate a bunch of the fault path code so that we could call
get_block().
next prev parent reply other threads:[~2016-01-13 7:30 UTC|newest]
Thread overview: 89+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-01-08 5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-12 9:34 ` Jan Kara
2016-01-12 9:34 ` Jan Kara
2016-01-12 9:34 ` Jan Kara
2016-01-13 7:08 ` Ross Zwisler
2016-01-13 7:08 ` Ross Zwisler
2016-01-13 7:08 ` Ross Zwisler
2016-01-13 9:07 ` Jan Kara
2016-01-13 9:07 ` Jan Kara
2016-01-13 9:07 ` Jan Kara
2016-01-08 5:27 ` [PATCH v8 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-12 9:44 ` Jan Kara
2016-01-12 9:44 ` Jan Kara
2016-01-12 9:44 ` Jan Kara
2016-01-13 7:37 ` Ross Zwisler
2016-01-13 7:37 ` Ross Zwisler
2016-01-13 7:37 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 3/9] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-13 9:44 ` Jan Kara
2016-01-13 9:44 ` Jan Kara
2016-01-13 9:44 ` Jan Kara
2016-01-13 18:48 ` Ross Zwisler
2016-01-13 18:48 ` Ross Zwisler
2016-01-13 18:48 ` Ross Zwisler
2016-01-13 18:48 ` Ross Zwisler
2016-01-15 13:22 ` Jan Kara
2016-01-15 13:22 ` Jan Kara
2016-01-15 13:22 ` Jan Kara
2016-01-15 13:22 ` Jan Kara
2016-01-15 19:03 ` Ross Zwisler
2016-01-15 19:03 ` Ross Zwisler
2016-01-15 19:03 ` Ross Zwisler
2016-02-03 16:42 ` Ross Zwisler
2016-02-03 16:42 ` Ross Zwisler
2016-02-03 16:42 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 5/9] mm: add find_get_entries_tag() Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 6/9] dax: add support for fsync/msync Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-12 10:57 ` Jan Kara
2016-01-12 10:57 ` Jan Kara
2016-01-12 10:57 ` Jan Kara
2016-01-13 7:30 ` Ross Zwisler [this message]
2016-01-13 7:30 ` Ross Zwisler
2016-01-13 7:30 ` Ross Zwisler
2016-01-13 9:35 ` Jan Kara
2016-01-13 9:35 ` Jan Kara
2016-01-13 9:35 ` Jan Kara
2016-01-13 18:58 ` Ross Zwisler
2016-01-13 18:58 ` Ross Zwisler
2016-01-13 18:58 ` Ross Zwisler
2016-01-15 13:10 ` Jan Kara
2016-01-15 13:10 ` Jan Kara
2016-01-15 13:10 ` Jan Kara
2016-02-06 14:33 ` Dmitry Monakhov
2016-02-06 14:33 ` Dmitry Monakhov
2016-02-06 14:33 ` Dmitry Monakhov
2016-02-06 14:33 ` Dmitry Monakhov
2016-02-08 9:44 ` Jan Kara
2016-02-08 9:44 ` Jan Kara
2016-02-08 9:44 ` Jan Kara
2016-02-08 22:06 ` Ross Zwisler
2016-02-08 22:06 ` Ross Zwisler
2016-02-08 22:06 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 8/9] ext4: " Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` [PATCH v8 9/9] xfs: " Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
2016-01-08 5:27 ` Ross Zwisler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160113073019.GB30496@linux.intel.com \
--to=ross.zwisler@linux.intel.com \
--cc=adilger.kernel@dilger.ca \
--cc=akpm@linux-foundation.org \
--cc=bfields@fieldses.org \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=david@fromorbit.com \
--cc=hpa@zytor.com \
--cc=jack@suse.com \
--cc=jack@suse.cz \
--cc=jlayton@poochiereds.net \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nvdimm@lists.01.org \
--cc=matthew.r.wilcox@intel.com \
--cc=mingo@redhat.com \
--cc=tglx@linutronix.de \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@linux.intel.com \
--cc=x86@kernel.org \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.