* [PATCH v2 0/5] DAX fsync/msync fixes
@ 2016-01-21 17:45 Ross Zwisler
  2016-01-21 17:46 ` [PATCH v2 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Ross Zwisler @ 2016-01-21 17:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
Changes from v1:
 - Fixed a macro collision for "PMD_INDEX" reported by 0-day for the "tile"
   architecture:
  config: tile-allyesconfig (attached as .config)
  reproduce:
  	wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
          chmod +x ~/bin/make.cross
         	# save the attached .config to linux build tree
          make.cross ARCH=tile
  
  All warnings (new ones prefixed by >>):
  
  >> fs/dax.c:330:0: warning: "PMD_INDEX" redefined [enabled by default]
     arch/tile/include/asm/pgtable_64.h:35:0: note: this is the location of the previous definition
  >> fs/dax.c:330:0: warning: "PMD_INDEX" redefined [enabled by default]
     arch/tile/include/asm/pgtable_64.h:35:0: note: this is the location of the previous definition
Thanks, 0-day!
This set applies cleanly on top of v8 of my "DAX fsync/msync support" set,
which is in -mm and -next.  That set has not yet been merged for v4.5
which is why the my work tree is still based on -next:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_fixes_v2
---
Original summary:
This series fixes several issues in v8 of my "DAX fsync/msync support"
patch series [1].  Thank you to Jan Kara for his excellent review.
Jan pointed out that we probably have an issue with the way hole punch
interacts with the fsync/msync code.  This is the item that I'll work on
next, but I wanted to send this series out now as I think it will be
independent of the hole punch fixes.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-January/003952.html
Ross Zwisler (5):
  dax: never rely on bh.b_dev being set by get_block()
  dax: clear TOWRITE flag after flush is complete
  dax: improve documentation for fsync/msync
  dax: fix PMD handling for fsync/msync
  dax: fix clearing of holes in __dax_pmd_fault()
 fs/dax.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 86 insertions(+), 24 deletions(-)
-- 
2.5.0
^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH v2 1/5] dax: never rely on bh.b_dev being set by get_block()
  2016-01-21 17:45 [PATCH v2 0/5] DAX fsync/msync fixes Ross Zwisler
@ 2016-01-21 17:46 ` Ross Zwisler
  2016-01-22 14:53   ` Jan Kara
  2016-01-21 17:46 ` [PATCH v2 2/5] dax: clear TOWRITE flag after flush is complete Ross Zwisler
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-01-21 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
Previously in DAX we assumed that calls to get_block() would set bh.b_bdev,
and we would then use that value even in error cases for debugging.  This
caused a NULL pointer dereference in __dax_dbg() which was fixed by a
previous commit, but that commit only changed the one place where we were
hitting an error.
Instead, update dax.c so that we always initialize bh.b_bdev as best we can
based on the information that DAX has.  get_block() may or may not update
to a new value, but this at least lets us get something helpful from
bh.b_bdev for error messages and not have to worry about whether it was set
by get_block() or not.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/fs/dax.c b/fs/dax.c
index 0db21ea..cee9e1b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -246,6 +246,7 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	loff_t end = pos + iov_iter_count(iter);
 
 	memset(&bh, 0, sizeof(bh));
+	bh.b_bdev = inode->i_sb->s_bdev;
 
 	if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ) {
 		struct address_space *mapping = inode->i_mapping;
@@ -582,6 +583,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	memset(&bh, 0, sizeof(bh));
 	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
  repeat:
@@ -1018,6 +1020,7 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	BUG_ON((offset + length) > PAGE_CACHE_SIZE);
 
 	memset(&bh, 0, sizeof(bh));
+	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_CACHE_SIZE;
 	err = get_block(inode, index, &bh, 0);
 	if (err < 0)
-- 
2.5.0
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v2 2/5] dax: clear TOWRITE flag after flush is complete
  2016-01-21 17:45 [PATCH v2 0/5] DAX fsync/msync fixes Ross Zwisler
  2016-01-21 17:46 ` [PATCH v2 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
@ 2016-01-21 17:46 ` Ross Zwisler
  2016-01-22 14:55   ` Jan Kara
  2016-01-21 17:46 ` [PATCH v2 3/5] dax: improve documentation for fsync/msync Ross Zwisler
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-01-21 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
Previously in dax_writeback_one() we cleared the PAGECACHE_TAG_TOWRITE flag
before we had actually flushed the tagged radix tree entry to media.  This
is incorrect because of the following race:
Thread 1				Thread 2
--------				--------
dax_writeback_mapping_range()
tag entry with PAGECACHE_TAG_TOWRITE
					dax_writeback_mapping_range()
					tag entry with PAGECACHE_TAG_TOWRITE
					dax_writeback_one()
					radix_tree_tag_clear(TOWRITE)
TOWRITE flag is no longer set,
  find_get_entries_tag() finds no
  entries, return
					flush entry to media
In this case thread 1 returns before the data for the dirty entry is
actually durable on media.
Fix this by only clearing the PAGECACHE_TAG_TOWRITE flag after all flushing
is complete.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index cee9e1b..d589113 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -407,8 +407,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
 		goto unlock;
 
-	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
-
 	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
 		ret = -EIO;
 		goto unlock;
@@ -432,6 +430,10 @@ static int dax_writeback_one(struct block_device *bdev,
 	}
 
 	wb_cache_pmem(dax.addr, dax.size);
+
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+	spin_unlock_irq(&mapping->tree_lock);
  unmap:
 	dax_unmap_atomic(bdev, &dax);
 	return ret;
-- 
2.5.0
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v2 3/5] dax: improve documentation for fsync/msync
  2016-01-21 17:45 [PATCH v2 0/5] DAX fsync/msync fixes Ross Zwisler
  2016-01-21 17:46 ` [PATCH v2 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
  2016-01-21 17:46 ` [PATCH v2 2/5] dax: clear TOWRITE flag after flush is complete Ross Zwisler
@ 2016-01-21 17:46 ` Ross Zwisler
  2016-01-22 15:01   ` Jan Kara
  2016-01-21 17:46 ` [PATCH v2 4/5] dax: fix PMD handling " Ross Zwisler
  2016-01-21 17:46 ` [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault() Ross Zwisler
  4 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-01-21 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
Several of the subtleties and assumptions of the DAX fsync/msync
implementation are not immediately obvious, so document them with comments.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
diff --git a/fs/dax.c b/fs/dax.c
index d589113..55ae394 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -350,6 +350,13 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 
 		if (!pmd_entry || type == RADIX_DAX_PMD)
 			goto dirty;
+
+		/*
+		 * We only insert dirty PMD entries into the radix tree.  This
+		 * means we don't need to worry about removing a dirty PTE
+		 * entry and inserting a clean PMD entry, thus reducing the
+		 * range we would flush with a follow-up fsync/msync call.
+		 */
 		radix_tree_delete(&mapping->page_tree, index);
 		mapping->nrexceptional--;
 	}
@@ -912,6 +919,21 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		/*
+		 * For PTE faults we insert a radix tree entry for reads, and
+		 * leave it clean.  Then on the first write we dirty the radix
+		 * tree entry via the dax_pnf_mkwrite() path.  This sequence
+		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
+		 * call into get_block() to translate the pgoff to a sector in
+		 * order to be able to create a new radix tree entry.
+		 *
+		 * The PMD path doesn't have an equivalent to
+		 * dax_pfn_mkwrite(), though, so for a read followed by a
+		 * write we traverse all the way through __dax_pmd_fault()
+		 * twice.  This means we can just skip inserting a radix tree
+		 * entry completely on the initial read and just wait until
+		 * the write to insert a dirty entry.
+		 */
 		if (write) {
 			error = dax_radix_entry(mapping, pgoff, dax.sector,
 					true, true);
@@ -985,6 +1007,14 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
 
+	/*
+	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
+	 * RADIX_DAX_PTE entry already exists in the radix tree from a
+	 * previous call to __dax_fault().  We just want to look up that PTE
+	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
+	 * saves us from having to make a call to get_block() here to look
+	 * up the sector.
+	 */
 	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
-- 
2.5.0
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v2 4/5] dax: fix PMD handling for fsync/msync
  2016-01-21 17:45 [PATCH v2 0/5] DAX fsync/msync fixes Ross Zwisler
                   ` (2 preceding siblings ...)
  2016-01-21 17:46 ` [PATCH v2 3/5] dax: improve documentation for fsync/msync Ross Zwisler
@ 2016-01-21 17:46 ` Ross Zwisler
  2016-01-22 15:11   ` Jan Kara
  2016-01-21 17:46 ` [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault() Ross Zwisler
  4 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-01-21 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
Fix the way that DAX PMD radix tree entries are handled.  With this patch
we now check to see if a PMD entry exists in the radix tree on write, even
if we are just trying to insert a PTE.  If it exists, we dirty that instead
of inserting our own PTE entry.
Fix a bug in the PMD path in dax_writeback_mapping_range() where we were
previously passing a loff_t into radix_tree_lookup instead of a pgoff_t.
Account for the fact that multiple fsync/msync operations may be happening
at the same time and don't flush entries that are beyond end_index.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 39 +++++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 12 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 55ae394..afacc30 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -327,19 +327,27 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 }
 
 #define NO_SECTOR -1
+#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
 static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 		sector_t sector, bool pmd_entry, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
+	pgoff_t pmd_index = DAX_PMD_INDEX(index);
 	int type, error = 0;
 	void *entry;
 
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = radix_tree_lookup(page_tree, index);
 
+	entry = radix_tree_lookup(page_tree, pmd_index);
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
+		index = pmd_index;
+		goto dirty;
+	}
+
+	entry = radix_tree_lookup(page_tree, index);
 	if (entry) {
 		type = RADIX_DAX_TYPE(entry);
 		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
@@ -460,31 +468,33 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 {
 	struct inode *inode = mapping->host;
 	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
-	pgoff_t start_page, end_page;
 	struct pagevec pvec;
-	void *entry;
+	bool done = false;
 	int i, ret = 0;
+	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
 
+	start_index = start >> PAGE_CACHE_SHIFT;
+	end_index = end >> PAGE_CACHE_SHIFT;
+	pmd_index = DAX_PMD_INDEX(start_index);
+
 	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
 	rcu_read_unlock();
 
 	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start &= PMD_MASK;
-
-	start_page = start >> PAGE_CACHE_SHIFT;
-	end_page = end >> PAGE_CACHE_SHIFT;
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		start_index = pmd_index;
 
-	tag_pages_for_writeback(mapping, start_page, end_page);
+	tag_pages_for_writeback(mapping, start_index, end_index);
 
 	pagevec_init(&pvec, 0);
-	while (1) {
-		pvec.nr = find_get_entries_tag(mapping, start_page,
+	while (!done) {
+		pvec.nr = find_get_entries_tag(mapping, start_index,
 				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
 				pvec.pages, indices);
 
@@ -492,6 +502,11 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 			break;
 
 		for (i = 0; i < pvec.nr; i++) {
+			if (indices[i] > end_index) {
+				done = true;
+				break;
+			}
+
 			ret = dax_writeback_one(bdev, mapping, indices[i],
 					pvec.pages[i]);
 			if (ret < 0)
-- 
2.5.0
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault()
  2016-01-21 17:45 [PATCH v2 0/5] DAX fsync/msync fixes Ross Zwisler
                   ` (3 preceding siblings ...)
  2016-01-21 17:46 ` [PATCH v2 4/5] dax: fix PMD handling " Ross Zwisler
@ 2016-01-21 17:46 ` Ross Zwisler
  2016-01-22 15:37   ` Jan Kara
  4 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-01-21 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
When the user reads from a DAX hole via a mmap we service page faults using
zero-filled page cache pages.  These zero pages are also placed into the
address_space radix tree.  When we get our first write for that space, we
can allocate a PMD page worth of DAX storage to replace that hole.
When this happens we need to unmap the zero pages and remove them from the
radix tree.  Prior to this patch we were unmapping *all* storage in our
PMD's range, which is incorrect because it removed DAX entries as well on
non-allocating page faults.
Instead, keep track of when get_block() actually gives us storage so that
we can be sure to only remove zero pages that were covering holes.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index afacc30..263aed1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -790,9 +790,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	loff_t lstart, lend;
 	sector_t block;
 	int error, result = 0;
+	bool alloc = false;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -830,10 +830,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	if (get_block(inode, block, &bh, write) != 0)
+
+	if (get_block(inode, block, &bh, 0) != 0)
 		return VM_FAULT_SIGBUS;
+
+	if (!buffer_mapped(&bh) && write) {
+		if (get_block(inode, block, &bh, 1) != 0)
+			return VM_FAULT_SIGBUS;
+		alloc = true;
+	}
+
 	bdev = bh.b_bdev;
-	i_mmap_lock_read(mapping);
 
 	/*
 	 * If the filesystem isn't willing to tell us the length of a hole,
@@ -842,15 +849,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	 */
 	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
-		goto fallback;
+		return VM_FAULT_FALLBACK;
+	}
+
+	/*
+	 * If we allocated new storage, make sure no process has any
+	 * zero pages covering this hole
+	 */
+	if (alloc) {
+		loff_t lstart = pgoff << PAGE_SHIFT;
+		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
+
+		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	/* make sure no process has any zero pages covering this hole */
-	lstart = pgoff << PAGE_SHIFT;
-	lend = lstart + PMD_SIZE - 1; /* inclusive */
-	i_mmap_unlock_read(mapping);
-	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
-	truncate_inode_pages_range(mapping, lstart, lend);
 	i_mmap_lock_read(mapping);
 
 	/*
-- 
2.5.0
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/5] dax: never rely on bh.b_dev being set by get_block()
  2016-01-21 17:46 ` [PATCH v2 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
@ 2016-01-22 14:53   ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-01-22 14:53 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
On Thu 21-01-16 10:46:00, Ross Zwisler wrote:
> Previously in DAX we assumed that calls to get_block() would set bh.b_bdev,
> and we would then use that value even in error cases for debugging.  This
> caused a NULL pointer dereference in __dax_dbg() which was fixed by a
> previous commit, but that commit only changed the one place where we were
> hitting an error.
> 
> Instead, update dax.c so that we always initialize bh.b_bdev as best we can
> based on the information that DAX has.  get_block() may or may not update
> to a new value, but this at least lets us get something helpful from
> bh.b_bdev for error messages and not have to worry about whether it was set
> by get_block() or not.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Jan Kara <jack@suse.cz>
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
								Honza
> ---
>  fs/dax.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 0db21ea..cee9e1b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -246,6 +246,7 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
>  	loff_t end = pos + iov_iter_count(iter);
>  
>  	memset(&bh, 0, sizeof(bh));
> +	bh.b_bdev = inode->i_sb->s_bdev;
>  
>  	if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ) {
>  		struct address_space *mapping = inode->i_mapping;
> @@ -582,6 +583,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  
>  	memset(&bh, 0, sizeof(bh));
>  	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> +	bh.b_bdev = inode->i_sb->s_bdev;
>  	bh.b_size = PAGE_SIZE;
>  
>   repeat:
> @@ -1018,6 +1020,7 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  	BUG_ON((offset + length) > PAGE_CACHE_SIZE);
>  
>  	memset(&bh, 0, sizeof(bh));
> +	bh.b_bdev = inode->i_sb->s_bdev;
>  	bh.b_size = PAGE_CACHE_SIZE;
>  	err = get_block(inode, index, &bh, 0);
>  	if (err < 0)
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/5] dax: clear TOWRITE flag after flush is complete
  2016-01-21 17:46 ` [PATCH v2 2/5] dax: clear TOWRITE flag after flush is complete Ross Zwisler
@ 2016-01-22 14:55   ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-01-22 14:55 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
On Thu 21-01-16 10:46:01, Ross Zwisler wrote:
> Previously in dax_writeback_one() we cleared the PAGECACHE_TAG_TOWRITE flag
> before we had actually flushed the tagged radix tree entry to media.  This
> is incorrect because of the following race:
> 
> Thread 1				Thread 2
> --------				--------
> dax_writeback_mapping_range()
> tag entry with PAGECACHE_TAG_TOWRITE
> 					dax_writeback_mapping_range()
> 					tag entry with PAGECACHE_TAG_TOWRITE
> 					dax_writeback_one()
> 					radix_tree_tag_clear(TOWRITE)
> TOWRITE flag is no longer set,
>   find_get_entries_tag() finds no
>   entries, return
> 					flush entry to media
> 
> In this case thread 1 returns before the data for the dirty entry is
> actually durable on media.
> 
> Fix this by only clearing the PAGECACHE_TAG_TOWRITE flag after all flushing
> is complete.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Jan Kara <jack@suse.cz>
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
								Honza
> ---
>  fs/dax.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index cee9e1b..d589113 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -407,8 +407,6 @@ static int dax_writeback_one(struct block_device *bdev,
>  	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
>  		goto unlock;
>  
> -	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> -
>  	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
>  		ret = -EIO;
>  		goto unlock;
> @@ -432,6 +430,10 @@ static int dax_writeback_one(struct block_device *bdev,
>  	}
>  
>  	wb_cache_pmem(dax.addr, dax.size);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +	spin_unlock_irq(&mapping->tree_lock);
>   unmap:
>  	dax_unmap_atomic(bdev, &dax);
>  	return ret;
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/5] dax: improve documentation for fsync/msync
  2016-01-21 17:46 ` [PATCH v2 3/5] dax: improve documentation for fsync/msync Ross Zwisler
@ 2016-01-22 15:01   ` Jan Kara
  2016-01-22 15:58     ` Ross Zwisler
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2016-01-22 15:01 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
On Thu 21-01-16 10:46:02, Ross Zwisler wrote:
> Several of the subtleties and assumptions of the DAX fsync/msync
> implementation are not immediately obvious, so document them with comments.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Jan Kara <jack@suse.cz>
Thanks, the comments really help! Just two nits below, otherwise feel free
to add:
Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d589113..55ae394 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -350,6 +350,13 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
>  
>  		if (!pmd_entry || type == RADIX_DAX_PMD)
>  			goto dirty;
> +
> +		/*
> +		 * We only insert dirty PMD entries into the radix tree.  This
> +		 * means we don't need to worry about removing a dirty PTE
> +		 * entry and inserting a clean PMD entry, thus reducing the
> +		 * range we would flush with a follow-up fsync/msync call.
> +		 */
May be acompany this with:
		WARN_ON(pmd_entry && !dirty);
somewhere in dax_radix_entry()?
>  		radix_tree_delete(&mapping->page_tree, index);
>  		mapping->nrexceptional--;
>  	}
> @@ -912,6 +919,21 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		}
>  		dax_unmap_atomic(bdev, &dax);
>  
> +		/*
> +		 * For PTE faults we insert a radix tree entry for reads, and
> +		 * leave it clean.  Then on the first write we dirty the radix
> +		 * tree entry via the dax_pnf_mkwrite() path.  This sequence
					  ^^^ pfn
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 4/5] dax: fix PMD handling for fsync/msync
  2016-01-21 17:46 ` [PATCH v2 4/5] dax: fix PMD handling " Ross Zwisler
@ 2016-01-22 15:11   ` Jan Kara
  2016-01-22 16:01     ` Ross Zwisler
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2016-01-22 15:11 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
On Thu 21-01-16 10:46:03, Ross Zwisler wrote:
> Fix the way that DAX PMD radix tree entries are handled.  With this patch
> we now check to see if a PMD entry exists in the radix tree on write, even
> if we are just trying to insert a PTE.  If it exists, we dirty that instead
> of inserting our own PTE entry.
> 
> Fix a bug in the PMD path in dax_writeback_mapping_range() where we were
> previously passing a loff_t into radix_tree_lookup instead of a pgoff_t.
Ah, good catch!
> Account for the fact that multiple fsync/msync operations may be happening
> at the same time and don't flush entries that are beyond end_index.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Just one nit below. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 39 +++++++++++++++++++++++++++------------
>  1 file changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 55ae394..afacc30 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
...
> @@ -460,31 +468,33 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
>  {
>  	struct inode *inode = mapping->host;
>  	struct block_device *bdev = inode->i_sb->s_bdev;
> +	pgoff_t start_index, end_index, pmd_index;
>  	pgoff_t indices[PAGEVEC_SIZE];
> -	pgoff_t start_page, end_page;
>  	struct pagevec pvec;
> -	void *entry;
> +	bool done = false;
>  	int i, ret = 0;
> +	void *entry;
>  
>  	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
>  		return -EIO;
>  
> +	start_index = start >> PAGE_CACHE_SHIFT;
> +	end_index = end >> PAGE_CACHE_SHIFT;
> +	pmd_index = DAX_PMD_INDEX(start_index);
> +
>  	rcu_read_lock();
> -	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> +	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
>  	rcu_read_unlock();
>  
>  	/* see if the start of our range is covered by a PMD entry */
> -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> -		start &= PMD_MASK;
> -
> -	start_page = start >> PAGE_CACHE_SHIFT;
> -	end_page = end >> PAGE_CACHE_SHIFT;
> +	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
Don't you miss a check that entry != NULL? I agree that RADIX_DAX_TYPE(NULL)
is != from RADIX_DAX_PMD so it works as desired but it looks a bit
dangerous.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault()
  2016-01-21 17:46 ` [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault() Ross Zwisler
@ 2016-01-22 15:37   ` Jan Kara
  2016-01-22 16:12     ` Ross Zwisler
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2016-01-22 15:37 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm
On Thu 21-01-16 10:46:04, Ross Zwisler wrote:
> When the user reads from a DAX hole via a mmap we service page faults using
> zero-filled page cache pages.  These zero pages are also placed into the
> address_space radix tree.  When we get our first write for that space, we
> can allocate a PMD page worth of DAX storage to replace that hole.
> 
> When this happens we need to unmap the zero pages and remove them from the
> radix tree.  Prior to this patch we were unmapping *all* storage in our
> PMD's range, which is incorrect because it removed DAX entries as well on
> non-allocating page faults.
> 
> Instead, keep track of when get_block() actually gives us storage so that
> we can be sure to only remove zero pages that were covering holes.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Jan Kara <jack@suse.cz>
So the get_block() calls below are still racy so your allocation detection
doesn't quite work reliably anyway. For that we need a change in the fault
locking we were talking about and at that point we'll need to somewhat
update __dax_pmd_fault() anyway. So I'm not sure if this intermediate
change really makes sense... I'd rather first change the locking in the
filesystems and then fixup remaining issues with __dax_pmd_fault().
								Honza
> ---
>  fs/dax.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index afacc30..263aed1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -790,9 +790,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	bool write = flags & FAULT_FLAG_WRITE;
>  	struct block_device *bdev;
>  	pgoff_t size, pgoff;
> -	loff_t lstart, lend;
>  	sector_t block;
>  	int error, result = 0;
> +	bool alloc = false;
>  
>  	/* dax pmd mappings require pfn_t_devmap() */
>  	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> @@ -830,10 +830,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
>  
>  	bh.b_size = PMD_SIZE;
> -	if (get_block(inode, block, &bh, write) != 0)
> +
> +	if (get_block(inode, block, &bh, 0) != 0)
>  		return VM_FAULT_SIGBUS;
> +
> +	if (!buffer_mapped(&bh) && write) {
> +		if (get_block(inode, block, &bh, 1) != 0)
> +			return VM_FAULT_SIGBUS;
> +		alloc = true;
> +	}
> +
>  	bdev = bh.b_bdev;
> -	i_mmap_lock_read(mapping);
>  
>  	/*
>  	 * If the filesystem isn't willing to tell us the length of a hole,
> @@ -842,15 +849,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	 */
>  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
>  		dax_pmd_dbg(&bh, address, "allocated block too small");
> -		goto fallback;
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	/*
> +	 * If we allocated new storage, make sure no process has any
> +	 * zero pages covering this hole
> +	 */
> +	if (alloc) {
> +		loff_t lstart = pgoff << PAGE_SHIFT;
> +		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
> +
> +		truncate_pagecache_range(inode, lstart, lend);
>  	}
>  
> -	/* make sure no process has any zero pages covering this hole */
> -	lstart = pgoff << PAGE_SHIFT;
> -	lend = lstart + PMD_SIZE - 1; /* inclusive */
> -	i_mmap_unlock_read(mapping);
> -	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> -	truncate_inode_pages_range(mapping, lstart, lend);
>  	i_mmap_lock_read(mapping);
>  
>  	/*
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/5] dax: improve documentation for fsync/msync
  2016-01-22 15:01   ` Jan Kara
@ 2016-01-22 15:58     ` Ross Zwisler
  2016-01-22 16:17       ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-01-22 15:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Alexander Viro, Andrew Morton,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm
On Fri, Jan 22, 2016 at 04:01:29PM +0100, Jan Kara wrote:
> On Thu 21-01-16 10:46:02, Ross Zwisler wrote:
> > Several of the subtleties and assumptions of the DAX fsync/msync
> > implementation are not immediately obvious, so document them with comments.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Jan Kara <jack@suse.cz>
> 
> Thanks, the comments really help! Just two nits below, otherwise feel free
> to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > ---
> >  fs/dax.c | 30 ++++++++++++++++++++++++++++++
> >  1 file changed, 30 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d589113..55ae394 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -350,6 +350,13 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> >  
> >  		if (!pmd_entry || type == RADIX_DAX_PMD)
> >  			goto dirty;
> > +
> > +		/*
> > +		 * We only insert dirty PMD entries into the radix tree.  This
> > +		 * means we don't need to worry about removing a dirty PTE
> > +		 * entry and inserting a clean PMD entry, thus reducing the
> > +		 * range we would flush with a follow-up fsync/msync call.
> > +		 */
> 
> May be acompany this with:
> 
> 		WARN_ON(pmd_entry && !dirty);
> 
> somewhere in dax_radix_entry()?
Sure, I'll add one.
> >  		radix_tree_delete(&mapping->page_tree, index);
> >  		mapping->nrexceptional--;
> >  	}
> > @@ -912,6 +919,21 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		}
> >  		dax_unmap_atomic(bdev, &dax);
> >  
> > +		/*
> > +		 * For PTE faults we insert a radix tree entry for reads, and
> > +		 * leave it clean.  Then on the first write we dirty the radix
> > +		 * tree entry via the dax_pnf_mkwrite() path.  This sequence
> 					  ^^^ pfn
Thanks, will fix.
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 4/5] dax: fix PMD handling for fsync/msync
  2016-01-22 15:11   ` Jan Kara
@ 2016-01-22 16:01     ` Ross Zwisler
  0 siblings, 0 replies; 16+ messages in thread
From: Ross Zwisler @ 2016-01-22 16:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Alexander Viro, Andrew Morton,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm
On Fri, Jan 22, 2016 at 04:11:41PM +0100, Jan Kara wrote:
> On Thu 21-01-16 10:46:03, Ross Zwisler wrote:
> > Fix the way that DAX PMD radix tree entries are handled.  With this patch
> > we now check to see if a PMD entry exists in the radix tree on write, even
> > if we are just trying to insert a PTE.  If it exists, we dirty that instead
> > of inserting our own PTE entry.
> > 
> > Fix a bug in the PMD path in dax_writeback_mapping_range() where we were
> > previously passing a loff_t into radix_tree_lookup instead of a pgoff_t.
> 
> Ah, good catch!
> 
> > Account for the fact that multiple fsync/msync operations may be happening
> > at the same time and don't flush entries that are beyond end_index.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> Just one nit below. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > ---
> >  fs/dax.c | 39 +++++++++++++++++++++++++++------------
> >  1 file changed, 27 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 55ae394..afacc30 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> ...
> > @@ -460,31 +468,33 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> >  {
> >  	struct inode *inode = mapping->host;
> >  	struct block_device *bdev = inode->i_sb->s_bdev;
> > +	pgoff_t start_index, end_index, pmd_index;
> >  	pgoff_t indices[PAGEVEC_SIZE];
> > -	pgoff_t start_page, end_page;
> >  	struct pagevec pvec;
> > -	void *entry;
> > +	bool done = false;
> >  	int i, ret = 0;
> > +	void *entry;
> >  
> >  	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> >  		return -EIO;
> >  
> > +	start_index = start >> PAGE_CACHE_SHIFT;
> > +	end_index = end >> PAGE_CACHE_SHIFT;
> > +	pmd_index = DAX_PMD_INDEX(start_index);
> > +
> >  	rcu_read_lock();
> > -	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> > +	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
> >  	rcu_read_unlock();
> >  
> >  	/* see if the start of our range is covered by a PMD entry */
> > -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > -		start &= PMD_MASK;
> > -
> > -	start_page = start >> PAGE_CACHE_SHIFT;
> > -	end_page = end >> PAGE_CACHE_SHIFT;
> > +	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> 
> Don't you miss a check that entry != NULL? I agree that RADIX_DAX_TYPE(NULL)
> is != from RADIX_DAX_PMD so it works as desired but it looks a bit
> dangerous.
Yea, the initial version had a NULL check first, but I realized I could
optimize it out and make things simpler since RADIX_DAX_TYPE() just looked at
the value of 'entry' and never treated it like a pointer (because to us, it
isn't a pointer).  I will add it back if you think that not having it may
confuse the reader.
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault()
  2016-01-22 15:37   ` Jan Kara
@ 2016-01-22 16:12     ` Ross Zwisler
  2016-01-25 14:40       ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-01-22 16:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Alexander Viro, Andrew Morton,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm
On Fri, Jan 22, 2016 at 04:37:19PM +0100, Jan Kara wrote:
> On Thu 21-01-16 10:46:04, Ross Zwisler wrote:
> > When the user reads from a DAX hole via a mmap we service page faults using
> > zero-filled page cache pages.  These zero pages are also placed into the
> > address_space radix tree.  When we get our first write for that space, we
> > can allocate a PMD page worth of DAX storage to replace that hole.
> > 
> > When this happens we need to unmap the zero pages and remove them from the
> > radix tree.  Prior to this patch we were unmapping *all* storage in our
> > PMD's range, which is incorrect because it removed DAX entries as well on
> > non-allocating page faults.
> > 
> > Instead, keep track of when get_block() actually gives us storage so that
> > we can be sure to only remove zero pages that were covering holes.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Jan Kara <jack@suse.cz>
> 
> So the get_block() calls below are still racy so your allocation detection
> doesn't quite work reliably anyway. For that we need a change in the fault
> locking we were talking about and at that point we'll need to somewhat
> update __dax_pmd_fault() anyway. So I'm not sure if this intermediate
> change really makes sense... I'd rather first change the locking in the
> filesystems and then fixup remaining issues with __dax_pmd_fault().
Agreed, the fault isolation code that we talked about previously is still
needed.  I've got an initial version coded up that I will send out as an RFC
today.  Those patches build directly on this patch.
My thought behind sending this was that I worried that the fault isolation
code wouldn't make it into v4.5, as we're already in the merge window.  That
code by design has to touch both DAX and all the DAX filesystems (ext4, ext2
and XFS).  Although, I guess a race that could result in data corruption
probably warrants an -RC timed fix?
In any case, one way or another we really need this fix for v4.5.  Without it
each PMD fault we receive will unmap the PMD range - even if that range is in
use by other threads. So, you can get into the following situation:
Thread 1				Thread 2
--------				--------
PMD fault
  unmap range for all threads
  map range for just this thread
					PMD fault
					  unmap range for all threads
					  map range for just this thread
PMD fault
  unmap range for all threads
  map range for just this thread
	  				...
Where two threads that are doing I/O to the same PMD will thrash and cause
every access to trigger a PMD page fault.
> > ---
> >  fs/dax.c | 32 ++++++++++++++++++++++----------
> >  1 file changed, 22 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index afacc30..263aed1 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -790,9 +790,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	bool write = flags & FAULT_FLAG_WRITE;
> >  	struct block_device *bdev;
> >  	pgoff_t size, pgoff;
> > -	loff_t lstart, lend;
> >  	sector_t block;
> >  	int error, result = 0;
> > +	bool alloc = false;
> >  
> >  	/* dax pmd mappings require pfn_t_devmap() */
> >  	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> > @@ -830,10 +830,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
> >  
> >  	bh.b_size = PMD_SIZE;
> > -	if (get_block(inode, block, &bh, write) != 0)
> > +
> > +	if (get_block(inode, block, &bh, 0) != 0)
> >  		return VM_FAULT_SIGBUS;
> > +
> > +	if (!buffer_mapped(&bh) && write) {
> > +		if (get_block(inode, block, &bh, 1) != 0)
> > +			return VM_FAULT_SIGBUS;
> > +		alloc = true;
> > +	}
> > +
> >  	bdev = bh.b_bdev;
> > -	i_mmap_lock_read(mapping);
> >  
> >  	/*
> >  	 * If the filesystem isn't willing to tell us the length of a hole,
> > @@ -842,15 +849,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	 */
> >  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
> >  		dax_pmd_dbg(&bh, address, "allocated block too small");
> > -		goto fallback;
> > +		return VM_FAULT_FALLBACK;
> > +	}
> > +
> > +	/*
> > +	 * If we allocated new storage, make sure no process has any
> > +	 * zero pages covering this hole
> > +	 */
> > +	if (alloc) {
> > +		loff_t lstart = pgoff << PAGE_SHIFT;
> > +		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
> > +
> > +		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >  
> > -	/* make sure no process has any zero pages covering this hole */
> > -	lstart = pgoff << PAGE_SHIFT;
> > -	lend = lstart + PMD_SIZE - 1; /* inclusive */
> > -	i_mmap_unlock_read(mapping);
> > -	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> > -	truncate_inode_pages_range(mapping, lstart, lend);
> >  	i_mmap_lock_read(mapping);
> >  
> >  	/*
> > -- 
> > 2.5.0
> > 
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
^ permalink raw reply	[flat|nested] 16+ messages in thread
* RE: [PATCH v2 3/5] dax: improve documentation for fsync/msync
  2016-01-22 15:58     ` Ross Zwisler
@ 2016-01-22 16:17       ` Elliott, Robert (Persistent Memory)
  0 siblings, 0 replies; 16+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2016-01-22 16:17 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara
  Cc: Andrew Morton, linux-nvdimm@lists.01.org, Dave Chinner,
	linux-kernel@vger.kernel.org, Alexander Viro, Jan Kara,
	linux-fsdevel@vger.kernel.org
---
Robert Elliott, HPE Persistent Memory
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> Ross Zwisler
> Sent: Friday, January 22, 2016 9:58 AM
> To: Jan Kara <jack@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>; linux-nvdimm@lists.01.org;
> Dave Chinner <david@fromorbit.com>; linux-kernel@vger.kernel.org;
> Alexander Viro <viro@zeniv.linux.org.uk>; Jan Kara <jack@suse.com>; linux-
> fsdevel@vger.kernel.org
> Subject: Re: [PATCH v2 3/5] dax: improve documentation for fsync/msync
> 
> On Fri, Jan 22, 2016 at 04:01:29PM +0100, Jan Kara wrote:
> > On Thu 21-01-16 10:46:02, Ross Zwisler wrote:
...
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index d589113..55ae394 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -350,6 +350,13 @@ static int dax_radix_entry(struct address_space
> *mapping, pgoff_t index,
> > >
> > >  		if (!pmd_entry || type == RADIX_DAX_PMD)
> > >  			goto dirty;
> > > +
> > > +		/*
> > > +		 * We only insert dirty PMD entries into the radix tree.  This
> > > +		 * means we don't need to worry about removing a dirty PTE
> > > +		 * entry and inserting a clean PMD entry, thus reducing the
> > > +		 * range we would flush with a follow-up fsync/msync call.
> > > +		 */
> >
> > May be acompany this with:
> >
> > 		WARN_ON(pmd_entry && !dirty);
> >
> > somewhere in dax_radix_entry()?
> 
> Sure, I'll add one.
If this is something that could trigger due to I/O traffic, please
use WARN_ONCE rather than WARN_ON to avoid the risk of swamping
the serial output.
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault()
  2016-01-22 16:12     ` Ross Zwisler
@ 2016-01-25 14:40       ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-01-25 14:40 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Alexander Viro, Andrew Morton,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm
On Fri 22-01-16 09:12:51, Ross Zwisler wrote:
> On Fri, Jan 22, 2016 at 04:37:19PM +0100, Jan Kara wrote:
> > On Thu 21-01-16 10:46:04, Ross Zwisler wrote:
> > > When the user reads from a DAX hole via a mmap we service page faults using
> > > zero-filled page cache pages.  These zero pages are also placed into the
> > > address_space radix tree.  When we get our first write for that space, we
> > > can allocate a PMD page worth of DAX storage to replace that hole.
> > > 
> > > When this happens we need to unmap the zero pages and remove them from the
> > > radix tree.  Prior to this patch we were unmapping *all* storage in our
> > > PMD's range, which is incorrect because it removed DAX entries as well on
> > > non-allocating page faults.
> > > 
> > > Instead, keep track of when get_block() actually gives us storage so that
> > > we can be sure to only remove zero pages that were covering holes.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Reported-by: Jan Kara <jack@suse.cz>
> > 
> > So the get_block() calls below are still racy so your allocation detection
> > doesn't quite work reliably anyway. For that we need a change in the fault
> > locking we were talking about and at that point we'll need to somewhat
> > update __dax_pmd_fault() anyway. So I'm not sure if this intermediate
> > change really makes sense... I'd rather first change the locking in the
> > filesystems and then fixup remaining issues with __dax_pmd_fault().
> 
> Agreed, the fault isolation code that we talked about previously is still
> needed.  I've got an initial version coded up that I will send out as an RFC
> today.  Those patches build directly on this patch.
> 
> My thought behind sending this was that I worried that the fault isolation
> code wouldn't make it into v4.5, as we're already in the merge window.  That
> code by design has to touch both DAX and all the DAX filesystems (ext4, ext2
> and XFS).  Although, I guess a race that could result in data corruption
> probably warrants an -RC timed fix?
> 
> In any case, one way or another we really need this fix for v4.5.  Without it
> each PMD fault we receive will unmap the PMD range - even if that range is in
> use by other threads. So, you can get into the following situation:
> 
> Thread 1				Thread 2
> --------				--------
> PMD fault
>   unmap range for all threads
>   map range for just this thread
> 					PMD fault
> 					  unmap range for all threads
> 					  map range for just this thread
> PMD fault
>   unmap range for all threads
>   map range for just this thread
> 	  				...
> 
> Where two threads that are doing I/O to the same PMD will thrash and cause
> every access to trigger a PMD page fault.
OK, so your patch doesn't really make the locking changes any more
difficult so I don't mind going it in first. So feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
								Honza
> > > ---
> > >  fs/dax.c | 32 ++++++++++++++++++++++----------
> > >  1 file changed, 22 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index afacc30..263aed1 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -790,9 +790,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >  	bool write = flags & FAULT_FLAG_WRITE;
> > >  	struct block_device *bdev;
> > >  	pgoff_t size, pgoff;
> > > -	loff_t lstart, lend;
> > >  	sector_t block;
> > >  	int error, result = 0;
> > > +	bool alloc = false;
> > >  
> > >  	/* dax pmd mappings require pfn_t_devmap() */
> > >  	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> > > @@ -830,10 +830,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >  	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
> > >  
> > >  	bh.b_size = PMD_SIZE;
> > > -	if (get_block(inode, block, &bh, write) != 0)
> > > +
> > > +	if (get_block(inode, block, &bh, 0) != 0)
> > >  		return VM_FAULT_SIGBUS;
> > > +
> > > +	if (!buffer_mapped(&bh) && write) {
> > > +		if (get_block(inode, block, &bh, 1) != 0)
> > > +			return VM_FAULT_SIGBUS;
> > > +		alloc = true;
> > > +	}
> > > +
> > >  	bdev = bh.b_bdev;
> > > -	i_mmap_lock_read(mapping);
> > >  
> > >  	/*
> > >  	 * If the filesystem isn't willing to tell us the length of a hole,
> > > @@ -842,15 +849,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >  	 */
> > >  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
> > >  		dax_pmd_dbg(&bh, address, "allocated block too small");
> > > -		goto fallback;
> > > +		return VM_FAULT_FALLBACK;
> > > +	}
> > > +
> > > +	/*
> > > +	 * If we allocated new storage, make sure no process has any
> > > +	 * zero pages covering this hole
> > > +	 */
> > > +	if (alloc) {
> > > +		loff_t lstart = pgoff << PAGE_SHIFT;
> > > +		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
> > > +
> > > +		truncate_pagecache_range(inode, lstart, lend);
> > >  	}
> > >  
> > > -	/* make sure no process has any zero pages covering this hole */
> > > -	lstart = pgoff << PAGE_SHIFT;
> > > -	lend = lstart + PMD_SIZE - 1; /* inclusive */
> > > -	i_mmap_unlock_read(mapping);
> > > -	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> > > -	truncate_inode_pages_range(mapping, lstart, lend);
> > >  	i_mmap_lock_read(mapping);
> > >  
> > >  	/*
> > > -- 
> > > 2.5.0
> > > 
> > > 
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2016-01-25 14:40 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-21 17:45 [PATCH v2 0/5] DAX fsync/msync fixes Ross Zwisler
2016-01-21 17:46 ` [PATCH v2 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
2016-01-22 14:53   ` Jan Kara
2016-01-21 17:46 ` [PATCH v2 2/5] dax: clear TOWRITE flag after flush is complete Ross Zwisler
2016-01-22 14:55   ` Jan Kara
2016-01-21 17:46 ` [PATCH v2 3/5] dax: improve documentation for fsync/msync Ross Zwisler
2016-01-22 15:01   ` Jan Kara
2016-01-22 15:58     ` Ross Zwisler
2016-01-22 16:17       ` Elliott, Robert (Persistent Memory)
2016-01-21 17:46 ` [PATCH v2 4/5] dax: fix PMD handling " Ross Zwisler
2016-01-22 15:11   ` Jan Kara
2016-01-22 16:01     ` Ross Zwisler
2016-01-21 17:46 ` [PATCH v2 5/5] dax: fix clearing of holes in __dax_pmd_fault() Ross Zwisler
2016-01-22 15:37   ` Jan Kara
2016-01-22 16:12     ` Ross Zwisler
2016-01-25 14:40       ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).