linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 0/5] DAX fsync/msync fixes
@ 2016-01-22 21:36 Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Ross Zwisler @ 2016-01-22 21:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm

Changes from v2:
 - Added a WARN_ON_ONCE() to dax_radix_entry() to enforce the assumption
   that we never insert a PMD DAX entry that isn't dirty. (Jan)

 - Added checks to ensure that 'entry' is non-NULL before passing it to
   RADIX_DAX_TYPE().  This worked correctly before because RADIX_DAX_TYPE()
   just checks the value of 'entry' and doesn't dereference it, but the
   explicit check reads better and is less likely to cause unnecessary
   worry. (Jan)

 - Added Reviewed-by tags from Jan & fixed a comment typo.

This set applies cleanly on top of v8 of my "DAX fsync/msync support" set,
which is in -mm and -next.  That set has not yet been merged for v4.5,
which is why my working tree is still based on -next:

https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_fixes_v3

---
Original summary:

This series fixes several issues in v8 of my "DAX fsync/msync support"
patch series [1].  Thank you to Jan Kara for his excellent review.

Jan pointed out that we probably have an issue with the way hole punch
interacts with the fsync/msync code.  This is the item that I'll work on
next, but I wanted to send this series out now as I think it will be
independent of the hole punch fixes.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-January/003952.html

Ross Zwisler (5):
  dax: never rely on bh.b_dev being set by get_block()
  dax: clear TOWRITE flag after flush is complete
  dax: improve documentation for fsync/msync
  dax: fix PMD handling for fsync/msync
  dax: fix clearing of holes in __dax_pmd_fault()

 fs/dax.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 86 insertions(+), 23 deletions(-)

-- 
2.5.0


* [PATCH v3 1/5] dax: never rely on bh.b_dev being set by get_block()
  2016-01-22 21:36 [PATCH v3 0/5] DAX fsync/msync fixes Ross Zwisler
@ 2016-01-22 21:36 ` Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 2/5] dax: clear TOWRITE flag after flush is complete Ross Zwisler
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Ross Zwisler @ 2016-01-22 21:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm

Previously in DAX we assumed that calls to get_block() would set bh.b_bdev,
and we would then use that value even in error cases for debugging.  This
caused a NULL pointer dereference in __dax_dbg() which was fixed by a
previous commit, but that commit only changed the one place where we were
hitting an error.

Instead, update dax.c so that we always initialize bh.b_bdev as best we
can based on the information that DAX has.  get_block() may or may not
update it to a new value, but this way we at least get something helpful
in bh.b_bdev for error messages and don't have to worry about whether
get_block() set it or not.
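The fix generalizes to any call-out that may or may not fill in a field:
zero the struct, seed the field with a known-good default, then let the
callee overwrite it.  A userspace sketch of that pattern (fake_bh,
fill_block() and the magic values are illustrative stand-ins, not kernel
code):

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a buffer_head, reduced to the one field that matters. */
struct fake_bh {
	int b_bdev;
};

/* Like get_block(): may or may not fill in b_bdev. */
static int fill_block(struct fake_bh *bh, int sets_bdev)
{
	if (sets_bdev)
		bh->b_bdev = 42;	/* device found by the callee */
	return 0;
}

/* The pattern from the patch: seed b_bdev right after the memset. */
static int default_bdev(int sets_bdev)
{
	struct fake_bh bh;

	memset(&bh, 0, sizeof(bh));
	bh.b_bdev = 7;			/* stand-in for inode->i_sb->s_bdev */
	fill_block(&bh, sets_bdev);
	return bh.b_bdev;		/* never 0, so error paths can use it */
}
```

Either way the caller ends up with a usable value: the default when the
callee stays silent, the callee's answer when it doesn't.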

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 0db21ea..cee9e1b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -246,6 +246,7 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	loff_t end = pos + iov_iter_count(iter);
 
 	memset(&bh, 0, sizeof(bh));
+	bh.b_bdev = inode->i_sb->s_bdev;
 
 	if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ) {
 		struct address_space *mapping = inode->i_mapping;
@@ -582,6 +583,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	memset(&bh, 0, sizeof(bh));
 	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
  repeat:
@@ -1018,6 +1020,7 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	BUG_ON((offset + length) > PAGE_CACHE_SIZE);
 
 	memset(&bh, 0, sizeof(bh));
+	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_CACHE_SIZE;
 	err = get_block(inode, index, &bh, 0);
 	if (err < 0)
-- 
2.5.0


* [PATCH v3 2/5] dax: clear TOWRITE flag after flush is complete
  2016-01-22 21:36 [PATCH v3 0/5] DAX fsync/msync fixes Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
@ 2016-01-22 21:36 ` Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 3/5] dax: improve documentation for fsync/msync Ross Zwisler
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Ross Zwisler @ 2016-01-22 21:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm

Previously in dax_writeback_one() we cleared the PAGECACHE_TAG_TOWRITE flag
before we had actually flushed the tagged radix tree entry to media.  This
is incorrect because of the following race:

Thread 1				Thread 2
--------				--------
dax_writeback_mapping_range()
tag entry with PAGECACHE_TAG_TOWRITE
					dax_writeback_mapping_range()
					tag entry with PAGECACHE_TAG_TOWRITE
					dax_writeback_one()
					radix_tree_tag_clear(TOWRITE)
TOWRITE flag is no longer set,
  find_get_entries_tag() finds no
  entries, return
					flush entry to media

In this case thread 1 returns before the data for the dirty entry is
actually durable on media.

Fix this by only clearing the PAGECACHE_TAG_TOWRITE flag after all flushing
is complete.
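The interleaving above can be replayed as a small single-threaded model
(struct slot, fsync_sees_durable_data() and the bool flags are
hypothetical stand-ins for the radix-tree slot, the TOWRITE tag and the
media flush, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* One radix-tree slot, reduced to the two bits that matter here. */
struct slot {
	bool towrite;	/* PAGECACHE_TAG_TOWRITE set? */
	bool durable;	/* has the data reached media? */
};

/*
 * Replay the race from the commit message.  Thread 2 is in
 * dax_writeback_one(); if clear_before_flush it uses the old, buggy
 * ordering.  Thread 1's fsync scans for tagged entries in between.
 * Returns true iff thread 1 found the tag and therefore waited for
 * the flush instead of returning early.
 */
static bool fsync_sees_durable_data(bool clear_before_flush)
{
	struct slot s = { .towrite = true, .durable = false };
	bool thread1_found_tag;

	if (clear_before_flush)
		s.towrite = false;	/* thread 2: radix_tree_tag_clear() */

	/* thread 1: find_get_entries_tag() finds nothing without the tag */
	thread1_found_tag = s.towrite;

	s.durable = true;		/* thread 2: wb_cache_pmem() */
	s.towrite = false;		/* fixed ordering clears only here */

	return thread1_found_tag;
}
```

With the buggy ordering thread 1 misses the entry and returns before the
data is durable; with the fixed ordering the tag survives until the flush
completes, so concurrent fsync callers still see work to wait for.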

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cee9e1b..d589113 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -407,8 +407,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
 		goto unlock;
 
-	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
-
 	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
 		ret = -EIO;
 		goto unlock;
@@ -432,6 +430,10 @@ static int dax_writeback_one(struct block_device *bdev,
 	}
 
 	wb_cache_pmem(dax.addr, dax.size);
+
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+	spin_unlock_irq(&mapping->tree_lock);
  unmap:
 	dax_unmap_atomic(bdev, &dax);
 	return ret;
-- 
2.5.0



* [PATCH v3 3/5] dax: improve documentation for fsync/msync
  2016-01-22 21:36 [PATCH v3 0/5] DAX fsync/msync fixes Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 1/5] dax: never rely on bh.b_dev being set by get_block() Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 2/5] dax: clear TOWRITE flag after flush is complete Ross Zwisler
@ 2016-01-22 21:36 ` Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 4/5] dax: fix PMD handling " Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 5/5] dax: fix clearing of holes in __dax_pmd_fault() Ross Zwisler
  4 siblings, 0 replies; 6+ messages in thread
From: Ross Zwisler @ 2016-01-22 21:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm

Several of the subtleties and assumptions of the DAX fsync/msync
implementation are not immediately obvious, so document them with comments.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index d589113..ab2faa9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -335,6 +335,7 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 	int type, error = 0;
 	void *entry;
 
+	WARN_ON_ONCE(pmd_entry && !dirty);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
 	spin_lock_irq(&mapping->tree_lock);
@@ -350,6 +351,13 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 
 		if (!pmd_entry || type == RADIX_DAX_PMD)
 			goto dirty;
+
+		/*
+		 * We only insert dirty PMD entries into the radix tree.  This
+		 * means we don't need to worry about removing a dirty PTE
+		 * entry and inserting a clean PMD entry, thus reducing the
+		 * range we would flush with a follow-up fsync/msync call.
+		 */
 		radix_tree_delete(&mapping->page_tree, index);
 		mapping->nrexceptional--;
 	}
@@ -912,6 +920,21 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		/*
+		 * For PTE faults we insert a radix tree entry for reads, and
+		 * leave it clean.  Then on the first write we dirty the radix
+		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
+		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
+		 * call into get_block() to translate the pgoff to a sector in
+		 * order to be able to create a new radix tree entry.
+		 *
+		 * The PMD path doesn't have an equivalent to
+		 * dax_pfn_mkwrite(), though, so for a read followed by a
+		 * write we traverse all the way through __dax_pmd_fault()
+		 * twice.  This means we can just skip inserting a radix tree
+		 * entry completely on the initial read and just wait until
+		 * the write to insert a dirty entry.
+		 */
 		if (write) {
 			error = dax_radix_entry(mapping, pgoff, dax.sector,
 					true, true);
@@ -985,6 +1008,14 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
 
+	/*
+	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
+	 * RADIX_DAX_PTE entry already exists in the radix tree from a
+	 * previous call to __dax_fault().  We just want to look up that PTE
+	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
+	 * saves us from having to make a call to get_block() here to look
+	 * up the sector.
+	 */
 	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
-- 
2.5.0


* [PATCH v3 4/5] dax: fix PMD handling for fsync/msync
  2016-01-22 21:36 [PATCH v3 0/5] DAX fsync/msync fixes Ross Zwisler
                   ` (2 preceding siblings ...)
  2016-01-22 21:36 ` [PATCH v3 3/5] dax: improve documentation for fsync/msync Ross Zwisler
@ 2016-01-22 21:36 ` Ross Zwisler
  2016-01-22 21:36 ` [PATCH v3 5/5] dax: fix clearing of holes in __dax_pmd_fault() Ross Zwisler
  4 siblings, 0 replies; 6+ messages in thread
From: Ross Zwisler @ 2016-01-22 21:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm

Fix the way that DAX PMD radix tree entries are handled.  With this patch
we now check to see if a PMD entry exists in the radix tree on write, even
if we are just trying to insert a PTE.  If it exists, we dirty that instead
of inserting our own PTE entry.

Fix a bug in the PMD path in dax_writeback_mapping_range() where we were
previously passing a loff_t into radix_tree_lookup() instead of a pgoff_t.

Account for the fact that multiple fsync/msync operations may be happening
at the same time and don't flush entries that are beyond end_index.
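Both fixes key off the page index of the covering PMD entry.  Assuming
the usual x86-64 geometry (4 KiB pages, 2 MiB PMDs; the constants below
are that assumption, not something the patch hard-codes), the patch's
DAX_PMD_INDEX() just rounds a page index down to a 512-page boundary:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 geometry: 4 KiB pages, 2 MiB PMD entries. */
#define PAGE_CACHE_SHIFT 12
#define PMD_SHIFT        21
#define PMD_MASK         (~((1ULL << PMD_SHIFT) - 1))

/*
 * Mirrors DAX_PMD_INDEX() from the patch: round a page index down to
 * the first page of the PMD that covers it (a 512-page unit here,
 * since 2 MiB / 4 KiB = 512).
 */
static uint64_t dax_pmd_index(uint64_t page_index)
{
	return page_index & (PMD_MASK >> PAGE_CACHE_SHIFT);
}
```

That rounded-down index is where a PMD radix tree entry would live, so
both dax_radix_entry() and dax_writeback_mapping_range() look there
first before falling back to the PTE-granularity index.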

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ab2faa9..a2ed009 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -327,11 +327,13 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 }
 
 #define NO_SECTOR -1
+#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
 static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 		sector_t sector, bool pmd_entry, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
+	pgoff_t pmd_index = DAX_PMD_INDEX(index);
 	int type, error = 0;
 	void *entry;
 
@@ -339,8 +341,14 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = radix_tree_lookup(page_tree, index);
 
+	entry = radix_tree_lookup(page_tree, pmd_index);
+	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
+		index = pmd_index;
+		goto dirty;
+	}
+
+	entry = radix_tree_lookup(page_tree, index);
 	if (entry) {
 		type = RADIX_DAX_TYPE(entry);
 		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
@@ -461,31 +469,33 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 {
 	struct inode *inode = mapping->host;
 	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
-	pgoff_t start_page, end_page;
 	struct pagevec pvec;
-	void *entry;
+	bool done = false;
 	int i, ret = 0;
+	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
 
+	start_index = start >> PAGE_CACHE_SHIFT;
+	end_index = end >> PAGE_CACHE_SHIFT;
+	pmd_index = DAX_PMD_INDEX(start_index);
+
 	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
 	rcu_read_unlock();
 
 	/* see if the start of our range is covered by a PMD entry */
 	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start &= PMD_MASK;
-
-	start_page = start >> PAGE_CACHE_SHIFT;
-	end_page = end >> PAGE_CACHE_SHIFT;
+		start_index = pmd_index;
 
-	tag_pages_for_writeback(mapping, start_page, end_page);
+	tag_pages_for_writeback(mapping, start_index, end_index);
 
 	pagevec_init(&pvec, 0);
-	while (1) {
-		pvec.nr = find_get_entries_tag(mapping, start_page,
+	while (!done) {
+		pvec.nr = find_get_entries_tag(mapping, start_index,
 				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
 				pvec.pages, indices);
 
@@ -493,6 +503,11 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 			break;
 
 		for (i = 0; i < pvec.nr; i++) {
+			if (indices[i] > end_index) {
+				done = true;
+				break;
+			}
+
 			ret = dax_writeback_one(bdev, mapping, indices[i],
 					pvec.pages[i]);
 			if (ret < 0)
-- 
2.5.0


* [PATCH v3 5/5] dax: fix clearing of holes in __dax_pmd_fault()
  2016-01-22 21:36 [PATCH v3 0/5] DAX fsync/msync fixes Ross Zwisler
                   ` (3 preceding siblings ...)
  2016-01-22 21:36 ` [PATCH v3 4/5] dax: fix PMD handling " Ross Zwisler
@ 2016-01-22 21:36 ` Ross Zwisler
  4 siblings, 0 replies; 6+ messages in thread
From: Ross Zwisler @ 2016-01-22 21:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm

When the user reads from a DAX hole via mmap, we service the page faults
using zero-filled page cache pages.  These zero pages are also inserted
into the address_space radix tree.  When we get our first write for that
space, we can allocate a PMD page worth of DAX storage to replace the
hole.

When this happens we need to unmap the zero pages and remove them from
the radix tree.  Prior to this patch we were unmapping *all* storage in
our PMD's range, which is incorrect because it also removed DAX entries
on non-allocating page faults.

Instead, keep track of when get_block() actually gives us storage so that
we can be sure to only remove zero pages that were covering holes.
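The probe-then-allocate shape of the fix can be sketched in userspace
(toy_inode, toy_get_block() and fault_needs_truncate() are hypothetical
stand-ins for the real inode, get_block() and the truncate decision, not
kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode: is this range already backed by storage? */
struct toy_inode {
	bool mapped;
};

/* Stand-in for get_block(): with create set it allocates on demand. */
static int toy_get_block(struct toy_inode *inode, bool *mapped, int create)
{
	if (inode->mapped) {
		*mapped = true;
	} else if (create) {
		inode->mapped = true;	/* filesystem allocates new storage */
		*mapped = true;
	} else {
		*mapped = false;	/* still a hole */
	}
	return 0;
}

/*
 * Probe first with create=0; only a write fault against an unmapped
 * range allocates, and only then do we need to purge stale zero pages
 * covering the old hole.  Returns the patch's 'alloc' flag.
 */
static bool fault_needs_truncate(struct toy_inode *inode, bool write)
{
	bool mapped, alloc = false;

	toy_get_block(inode, &mapped, 0);
	if (!mapped && write) {
		toy_get_block(inode, &mapped, 1);
		alloc = true;
	}
	return alloc;
}
```

A read fault on a hole and any fault on an already-mapped range both
leave 'alloc' false, so the truncate is confined to the one case where
zero pages could actually be stale.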

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a2ed009..206650f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -791,9 +791,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	loff_t lstart, lend;
 	sector_t block;
 	int error, result = 0;
+	bool alloc = false;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -831,10 +831,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	if (get_block(inode, block, &bh, write) != 0)
+
+	if (get_block(inode, block, &bh, 0) != 0)
 		return VM_FAULT_SIGBUS;
+
+	if (!buffer_mapped(&bh) && write) {
+		if (get_block(inode, block, &bh, 1) != 0)
+			return VM_FAULT_SIGBUS;
+		alloc = true;
+	}
+
 	bdev = bh.b_bdev;
-	i_mmap_lock_read(mapping);
 
 	/*
 	 * If the filesystem isn't willing to tell us the length of a hole,
@@ -843,15 +850,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	 */
 	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
-		goto fallback;
+		return VM_FAULT_FALLBACK;
+	}
+
+	/*
+	 * If we allocated new storage, make sure no process has any
+	 * zero pages covering this hole
+	 */
+	if (alloc) {
+		loff_t lstart = pgoff << PAGE_SHIFT;
+		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
+
+		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	/* make sure no process has any zero pages covering this hole */
-	lstart = pgoff << PAGE_SHIFT;
-	lend = lstart + PMD_SIZE - 1; /* inclusive */
-	i_mmap_unlock_read(mapping);
-	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
-	truncate_inode_pages_range(mapping, lstart, lend);
 	i_mmap_lock_read(mapping);
 
 	/*
-- 
2.5.0

