All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] dax cleanups and lifetime fixes
@ 2015-11-25 18:36 Dan Williams
  2015-11-25 18:37 ` [PATCH v2 1/7] pmem, dax: clean up clear_pmem() Dan Williams
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-block, Theodore Ts'o, Dave Hansen, Dave Chinner,
	Jens Axboe, Jeff Moyer, Andreas Dilger, Jan Kara, linux-fsdevel,
	Matthew Wilcox, Ross Zwisler, linux-ext4

Changes since v1: [1]

1/ Dropped the patches that were merged for v4.4-rc2

2/ Introduce a new super-block flag that filesystems can use to error
   out early when there is no longer a backing device available. Use it to
   prevent a spurious warning triggered by ext4 on surprise removal. (Dave)

3/ Include the unmap_partition implementation initially posted here [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-November/002876.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-November/002922.html

Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN
checks, especially in the unmount path.  Currently we deadlock here on
umount after block device removal:

 INFO: task umount:2187 blocked for more than 120 seconds.
       Tainted: G           O    4.4.0-rc2+ #1953  
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 umount          D ffff8800d2fbfd70     0  2187   2095 0x00000080
  ffff8800d2fbfd70 ffffffff81f94f98 ffff88031fc97bd8 ffff88030af5ad80
  ffff8800db71db00 ffff8800d2fc0000 ffff8800db8dbde0 ffff8800d93b6708
  ffff8800d93b6760 ffff8800d93b66d8 ffff8800d2fbfd88 ffffffff818f0695
 Call Trace:
  [<ffffffff818f0695>] schedule+0x35/0x80
  [<ffffffffa01e134e>] xfs_ail_push_all_sync+0xbe/0x110 [xfs]
  [<ffffffff810ecc30>] ? wait_woken+0x80/0x80
  [<ffffffffa01c8d91>] xfs_unmountfs+0x81/0x1b0 [xfs]
  [<ffffffffa01c991b>] ? xfs_mru_cache_destroy+0x6b/0x90 [xfs]
  [<ffffffffa01cbf30>] xfs_fs_put_super+0x30/0x90 [xfs]
  [<ffffffff81247eca>] generic_shutdown_super+0x6a/0xf0

Earlier in this trace xfs has already performed:

 XFS (pmem0m): xfs_do_force_shutdown(0x2) called from line 1197 of file fs/xfs/xfs_log.c.

...but xfs_log_work_queue() continues to run periodically.

---

The motivation for these lifetime fixes is to prevent crashes and
mapping leaks when using dax.  Most of the safety guarantees in this
series come from the protection afforded by blk_queue_enter +
blk_queue_exit.  After a successful blk_queue_enter we can issue any
block device operations we want without needing to worry about the block
layer infrastructure for the device being torn down.

blk_queue_enter is chosen for this "is bdev alive?" check over
SB_I_BDI_DEAD or error returns from get_blocks() because it synchronizes
with blk_cleanup_queue.  SB_I_BDI_DEAD is there to let a file system
optionally error out early before getting -ENODEV from the block layer,
but it's optional an asynchronous.

---

Dan Williams (7):
      pmem, dax: clean up clear_pmem()
      dax: increase granularity of dax_clear_blocks() operations
      dax: guarantee page aligned results from bdev_direct_access()
      dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
      fs: notify superblocks of backing-device death
      ext4: skip inode dirty when backing device is gone
      mm, dax: unmap dax mappings at bdev shutdown


 arch/x86/include/asm/pmem.h  |    7 -
 block/genhd.c                |   93 +++++++++++++++--
 drivers/block/brd.c          |    3 -
 drivers/nvdimm/pmem.c        |    3 -
 drivers/s390/block/dcssblk.c |    6 -
 fs/block_dev.c               |   73 ++++++++++++--
 fs/dax.c                     |  224 +++++++++++++++++++++++++-----------------
 fs/fs-writeback.c            |    3 +
 include/linux/blkdev.h       |   17 +++
 include/linux/fs.h           |    3 +
 include/linux/genhd.h        |    1 
 11 files changed, 304 insertions(+), 129 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/7] pmem, dax: clean up clear_pmem()
  2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
@ 2015-11-25 18:37 ` Dan Williams
  2015-11-25 18:37 ` [PATCH v2 2/7] dax: increase granularity of dax_clear_blocks() operations Dan Williams
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:37 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-fsdevel, linux-block, Ross Zwisler, Dave Hansen, Jeff Moyer

Both, __dax_pmd_fault, and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations.  However, x86_64 does not use
non-temporal instructions for clear_page(), and arch_clear_pmem() was
always incurring the cost of __arch_wb_cache_pmem().

Clean up the assumption that doing clear_pmem() a page at a time is more
performant.

Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/pmem.h |    7 +------
 fs/dax.c                    |    4 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec816ab..1544fabcd7f9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -132,12 +132,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	void *vaddr = (void __force *)addr;
 
-	/* TODO: implement the zeroing via non-temporal writes */
-	if (size == PAGE_SIZE && ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-		clear_page(vaddr);
-	else
-		memset(vaddr, 0, size);
-
+	memset(vaddr, 0, size);
 	__arch_wb_cache_pmem(vaddr, size);
 }
 
diff --git a/fs/dax.c b/fs/dax.c
index 43671b68220e..19492cc65a30 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -641,9 +641,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			int i;
-			for (i = 0; i < PTRS_PER_PMD; i++)
-				clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
+			clear_pmem(kaddr, PMD_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/7] dax: increase granularity of dax_clear_blocks() operations
  2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
  2015-11-25 18:37 ` [PATCH v2 1/7] pmem, dax: clean up clear_pmem() Dan Williams
@ 2015-11-25 18:37 ` Dan Williams
  2015-11-25 18:37 ` [PATCH v2 3/7] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:37 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-fsdevel, linux-block, Jeff Moyer, Jan Kara

dax_clear_blocks is currently performing a cond_resched() after every
PAGE_SIZE memset.  We need not check so frequently, for example md-raid
only calls cond_resched() at stripe granularity.  Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping move the call to cond_resched() to the outer loop.

Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 19492cc65a30..e11d88835bb2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
@@ -43,24 +44,17 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
 		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		sz = min_t(long, count, SZ_1M);
+		clear_pmem(addr, sz);
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/7] dax: guarantee page aligned results from bdev_direct_access()
  2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
  2015-11-25 18:37 ` [PATCH v2 1/7] pmem, dax: clean up clear_pmem() Dan Williams
  2015-11-25 18:37 ` [PATCH v2 2/7] dax: increase granularity of dax_clear_blocks() operations Dan Williams
@ 2015-11-25 18:37 ` Dan Williams
  2015-11-25 18:37 ` [PATCH v2 4/7] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:37 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-fsdevel, linux-block, Ross Zwisler

If a ->direct_access() implementation ever returns a map count less than
PAGE_SIZE, catch the error in bdev_direct_access().  This simplifies
error checking in upper layers.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |    2 ++
 fs/dax.c       |    1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c25639e907bd..c9f813843939 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -489,6 +489,8 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	avail = ops->direct_access(bdev, sector, addr, pfn);
 	if (!avail)
 		return -ERANGE;
+	if (avail > 0 && avail & ~PAGE_MASK)
+		return -ENXIO;
 	return min(avail, size);
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
diff --git a/fs/dax.c b/fs/dax.c
index e11d88835bb2..6e498c2570bf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,6 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		sz = min_t(long, count, SZ_1M);
 		clear_pmem(addr, sz);
 		size -= sz;
-		BUG_ON(sz & 511);
 		sector += sz / 512;
 		cond_resched();
 	} while (size);


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 4/7] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
                   ` (2 preceding siblings ...)
  2015-11-25 18:37 ` [PATCH v2 3/7] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
@ 2015-11-25 18:37 ` Dan Williams
  2015-11-25 18:37 ` [PATCH v2 5/7] fs: notify superblocks of backing-device death Dan Williams
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:37 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-block, Dave Chinner, Jens Axboe, Jeff Moyer, Jan Kara,
	linux-fsdevel, Ross Zwisler

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled.  Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c         |   13 +--
 fs/dax.c               |  207 ++++++++++++++++++++++++++++++------------------
 include/linux/blkdev.h |   17 +++-
 3 files changed, 149 insertions(+), 88 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c9f813843939..1dd416bf72f7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -450,10 +450,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
 /**
  * bdev_direct_access() - Get the address for directly-accessibly memory
  * @bdev: The device containing the memory
- * @sector: The offset within the device
- * @addr: Where to put the address of the memory
- * @pfn: The Page Frame Number for the memory
- * @size: The number of bytes requested
+ * @dax: control and output parameters for ->direct_access
  *
  * If a block device is made up of directly addressable memory, this function
  * will tell the caller the PFN and the address of the memory.  The address
@@ -464,10 +461,10 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * Return: negative errno if an error occurs, otherwise the number of bytes
  * accessible at this address.
  */
-long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
-	long avail;
+	sector_t sector = dax->sector;
+	long avail, size = dax->size;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	/*
@@ -486,7 +483,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	avail = ops->direct_access(bdev, sector, addr, pfn);
+	avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
 	if (!avail)
 		return -ERANGE;
 	if (avail > 0 && avail & ~PAGE_MASK)
diff --git a/fs/dax.c b/fs/dax.c
index 6e498c2570bf..a6e29d5ad6fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,45 +30,65 @@
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
+static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
+{
+	struct request_queue *q = bdev->bd_queue;
+	long rc = -EIO;
+
+	dax->addr = (void __pmem *) ERR_PTR(-EIO);
+	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
+		return rc;
+
+	rc = bdev_direct_access(bdev, dax);
+	if (rc < 0) {
+		dax->addr = (void __pmem *) ERR_PTR(rc);
+		blk_queue_exit(q);
+		return rc;
+	}
+	return rc;
+}
+
+static void dax_unmap_atomic(struct block_device *bdev,
+		const struct blk_dax_ctl *dax)
+{
+	if (IS_ERR(dax->addr))
+		return;
+	blk_queue_exit(bdev->bd_queue);
+}
+
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
  * and hence this means the stack from this point must follow GFP_NOFS
  * semantics for all operations.
  */
-int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
-	sector_t sector = block << (inode->i_blkbits - 9);
+	struct blk_dax_ctl dax = {
+		.sector = block << (inode->i_blkbits - 9),
+		.size = _size,
+	};
 
 	might_sleep();
 	do {
-		void __pmem *addr;
-		unsigned long pfn;
 		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		count = dax_map_atomic(bdev, &dax);
 		if (count < 0)
 			return count;
 		sz = min_t(long, count, SZ_1M);
-		clear_pmem(addr, sz);
-		size -= sz;
-		sector += sz / 512;
+		clear_pmem(dax.addr, sz);
+		dax.size -= sz;
+		dax.sector += sz / 512;
+		dax_unmap_atomic(bdev, &dax);
 		cond_resched();
-	} while (size);
+	} while (dax.size);
 
 	wmb_pmem();
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
-{
-	unsigned long pfn;
-	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
-}
-
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
 		loff_t pos, loff_t end)
@@ -98,19 +118,29 @@ static bool buffer_size_valid(struct buffer_head *bh)
 	return bh->b_state != 0;
 }
 
+
+static sector_t to_sector(const struct buffer_head *bh,
+		const struct inode *inode)
+{
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+
+	return sector;
+}
+
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		      loff_t start, loff_t end, get_block_t get_block,
 		      struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
-	bool hole = false;
-	bool need_wmb = false;
-
-	if (iov_iter_rw(iter) != WRITE)
+	loff_t pos = start, max = start, bh_max = start;
+	bool hole = false, need_wmb = false;
+	struct block_device *bdev = NULL;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	struct blk_dax_ctl dax = {
+		.addr = (void __pmem *) ERR_PTR(-EIO),
+	};
+
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -125,13 +155,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			if (pos == bh_max) {
 				bh->b_size = PAGE_ALIGN(end - pos);
 				bh->b_state = 0;
-				retval = get_block(inode, block, bh,
-						   iov_iter_rw(iter) == WRITE);
-				if (retval)
+				rc = get_block(inode, block, bh, rw == WRITE);
+				if (rc)
 					break;
 				if (!buffer_size_valid(bh))
 					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
+				bdev = bh->b_bdev;
 			} else {
 				unsigned done = bh->b_size -
 						(bh_max - (pos - first));
@@ -139,47 +169,52 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				bh->b_size -= done;
 			}
 
-			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+			hole = rw == READ && !buffer_written(bh);
 			if (hole) {
-				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				retval = dax_get_addr(bh, &addr, blkbits);
-				if (retval < 0)
+				dax_unmap_atomic(bdev, &dax);
+				dax.sector = to_sector(bh, inode);
+				dax.size = bh->b_size;
+				map_len = dax_map_atomic(bdev, &dax);
+				if (map_len < 0) {
+					rc = map_len;
 					break;
+				}
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
-					dax_new_buf(addr, retval, first, pos,
-									end);
+					dax_new_buf(dax.addr, map_len, first,
+							pos, end);
 					need_wmb = true;
 				}
-				addr += first;
-				size = retval - first;
+				dax.addr += first;
+				size = map_len - first;
 			}
 			max = min(pos + size, end);
 		}
 
 		if (iov_iter_rw(iter) == WRITE) {
-			len = copy_from_iter_pmem(addr, max - pos, iter);
+			len = copy_from_iter_pmem(dax.addr, max - pos, iter);
 			need_wmb = true;
 		} else if (!hole)
-			len = copy_to_iter((void __force *)addr, max - pos,
+			len = copy_to_iter((void __force *) dax.addr, max - pos,
 					iter);
 		else
 			len = iov_iter_zero(max - pos, iter);
 
 		if (!len) {
-			retval = -EFAULT;
+			rc = -EFAULT;
 			break;
 		}
 
 		pos += len;
-		addr += len;
+		dax.addr += len;
 	}
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_atomic(bdev, &dax);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -268,28 +303,35 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 	return VM_FAULT_LOCKED;
 }
 
-static int copy_user_bh(struct page *to, struct buffer_head *bh,
-			unsigned blkbits, unsigned long vaddr)
+static int copy_user_bh(struct page *to, struct inode *inode,
+		struct buffer_head *bh, unsigned long vaddr)
 {
-	void __pmem *vfrom;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+	};
+	struct block_device *bdev = bh->b_bdev;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	if (dax_map_atomic(bdev, &dax) < 0)
+		return PTR_ERR(dax.addr);
 	vto = kmap_atomic(to);
-	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
+	copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_atomic(bdev, &dax);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct address_space *mapping = inode->i_mapping;
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
-	unsigned long pfn;
+	struct address_space *mapping = inode->i_mapping;
+	struct block_device *bdev = bh->b_bdev;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+	};
 	pgoff_t size;
 	int error;
 
@@ -308,20 +350,18 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	if (dax_map_atomic(bdev, &dax) < 0) {
+		error = PTR_ERR(dax.addr);
 		goto out;
 	}
 
 	if (buffer_unwritten(bh) || buffer_new(bh)) {
-		clear_pmem(addr, PAGE_SIZE);
+		clear_pmem(dax.addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_atomic(bdev, &dax);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -415,7 +455,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
 		if (buffer_written(&bh))
-			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+			error = copy_user_bh(new_page, inode, &bh, vaddr);
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
@@ -527,11 +567,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
+	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* dax pmd mappings are broken wrt gup and fork */
@@ -559,9 +597,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
+	bdev = bh.b_bdev;
 	i_mmap_lock_read(mapping);
 
 	/*
@@ -616,32 +654,40 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		sector = bh.b_blocknr << (blkbits - 9);
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PMD_SIZE,
+		};
+		long length = dax_map_atomic(bdev, &dax);
+
 		if (length < 0) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (dax.pfn & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
+		}
 
 		/*
 		 * TODO: teach vmf_insert_pfn_pmd() to support
 		 * 'pte_special' for pmds
 		 */
-		if (pfn_valid(pfn))
+		if (pfn_valid(dax.pfn)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
+		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(kaddr, PMD_SIZE);
+			clear_pmem(dax.addr, PMD_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
+		dax_unmap_atomic(bdev, &dax);
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
 	}
 
  out:
@@ -743,12 +789,17 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
-		clear_pmem(addr + offset, length);
+		struct block_device *bdev = bh.b_bdev;
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PAGE_CACHE_SIZE,
+		};
+
+		if (dax_map_atomic(bdev, &dax) < 0)
+			return PTR_ERR(dax.addr);
+		clear_pmem(dax.addr + offset, length);
 		wmb_pmem();
+		dax_unmap_atomic(bdev, &dax);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c0d2b7927c1f..8aa53454ce27 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1617,6 +1617,20 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+/**
+ * struct blk_dax_ctl - control and output parameters for ->direct_access
+ * @sector: (input) offset relative to a block_device
+ * @addr: (output) kernel virtual address for @sector populated by driver
+ * @pfn: (output) page frame number for @addr populated by driver
+ * @size: (input) number of bytes requested
+ */
+struct blk_dax_ctl {
+	sector_t sector;
+	void __pmem *addr;
+	long size;
+	unsigned long pfn;
+};
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
@@ -1643,8 +1657,7 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 5/7] fs: notify superblocks of backing-device death
  2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
                   ` (3 preceding siblings ...)
  2015-11-25 18:37 ` [PATCH v2 4/7] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
@ 2015-11-25 18:37 ` Dan Williams
  2015-11-25 21:50   ` Dave Chinner
  2015-11-25 18:37 ` [PATCH v2 6/7] ext4: skip inode dirty when backing device is gone Dan Williams
  2015-11-25 18:37 ` [PATCH v2 7/7] mm, dax: unmap dax mappings at bdev shutdown Dan Williams
  6 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:37 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jens Axboe, linux-fsdevel, Dave Chinner, linux-block, Jan Kara

Set SB_I_BDIDEAD when a block device is no longer available to service
requests.  This will be used in several places where an fs should give
up early because the block device is gone.  Letting the fs continue on
as if the block device is still present can lead to long latencies
waiting for an fs to detect the loss of its backing device, trigger
crashes, or generate misleasing warnings.

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/genhd.c      |    2 ++
 fs/block_dev.c     |   17 +++++++++++++++++
 include/linux/fs.h |    2 ++
 3 files changed, 21 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa51567c..8a743cb81fb4 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -648,10 +648,12 @@ void del_gendisk(struct gendisk *disk)
 	while ((part = disk_part_iter_next(&piter))) {
 		invalidate_partition(disk, part->partno);
 		delete_partition(disk, part->partno);
+		kill_bdev_super(disk, part->partno);
 	}
 	disk_part_iter_exit(&piter);
 
 	invalidate_partition(disk, 0);
+	kill_bdev_super(disk, 0);
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1dd416bf72f7..d0233d643d33 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
+void kill_bdev_super(struct gendisk *disk, int partno)
+{
+	struct block_device *bdev = bdget_disk(disk, partno);
+	struct super_block *sb;
+
+	if (!bdev)
+		return;
+	sb = get_super(bdev);
+	if (!sb)
+		goto out;
+
+	sb->s_iflags |= SB_I_BDI_DEAD;
+	drop_super(sb);
+ out:
+	bdput(bdev);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa514254161..76925e322e87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1254,6 +1254,7 @@ struct mm_struct;
 /* sb->s_iflags */
 #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
 #define SB_I_NOEXEC	0x00000002	/* Ignore executables on this fs */
+#define SB_I_BDI_DEAD	0x00000004	/* Give up, backing device is dead */
 
 /* Possible states of 'frozen' field */
 enum {
@@ -2390,6 +2391,7 @@ extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
+extern void kill_bdev_super(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 6/7] ext4: skip inode dirty when backing device is gone
  2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
                   ` (4 preceding siblings ...)
  2015-11-25 18:37 ` [PATCH v2 5/7] fs: notify superblocks of backing-device death Dan Williams
@ 2015-11-25 18:37 ` Dan Williams
  2015-11-25 18:37 ` [PATCH v2 7/7] mm, dax: unmap dax mappings at bdev shutdown Dan Williams
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:37 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Theodore Ts'o, Dave Chinner, linux-block, Andreas Dilger,
	Jan Kara, linux-fsdevel, linux-ext4

Use SB_I_BDI_DEAD to prevent dirtying an inode after a del_gendisk()
event.

 WARNING: CPU: 0 PID: 2133 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
 bdi-block not registered
 [..]
 Call Trace:
  [<ffffffff81459f62>] dump_stack+0x44/0x62
  [<ffffffff810a2052>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810a20ec>] warn_slowpath_fmt+0x5c/0x80
  [<ffffffff812831a1>] __mark_inode_dirty+0x261/0x350
  [<ffffffff8126d109>] generic_update_time+0x79/0xd0
  [<ffffffff8126d28d>] file_update_time+0xbd/0x110
  [<ffffffff812e4bc8>] ext4_dax_fault+0x68/0x110
  [<ffffffff811f816e>] __do_fault+0x4e/0xf0
  [<ffffffff811fc2a7>] handle_mm_fault+0x5e7/0x1b50
  [<ffffffff811fbd11>] ? handle_mm_fault+0x51/0x1b50
  [<ffffffff810689c1>] __do_page_fault+0x191/0x3f0
  [<ffffffff81068cef>] trace_do_page_fault+0x4f/0x120
  [<ffffffff8106314a>] do_async_page_fault+0x1a/0xa0
  [<ffffffff81902678>] async_page_fault+0x28/0x30

Cc: Jan Kara <jack@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/fs-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 023f6a1f23cd..33e8b24f78ab 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2005,6 +2005,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		flags &= ~I_DIRTY_TIME;
 	dirtytime = flags & I_DIRTY_TIME;
 
+	if (sb->s_iflags & SB_I_BDI_DEAD)
+		return;
+
 	/*
 	 * Paired with smp_mb() in __writeback_single_inode() for the
 	 * following lockless i_state test.  See there for details.


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 7/7] mm, dax: unmap dax mappings at bdev shutdown
  2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
                   ` (5 preceding siblings ...)
  2015-11-25 18:37 ` [PATCH v2 6/7] ext4: skip inode dirty when backing device is gone Dan Williams
@ 2015-11-25 18:37 ` Dan Williams
  2015-11-30 22:03   ` Dan Williams
  6 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2015-11-25 18:37 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Chinner, linux-block, Jan Kara, linux-fsdevel,
	Matthew Wilcox, Ross Zwisler

Currently dax mappings leak past / survive block_device shutdown.  While
page cache pages are permitted to be read/written after the block_device
is torn down this is not acceptable in the dax case as all media access
must end when the device is disabled.  The pfn backing a dax mapping is
permitted to be invalidated after bdev shutdown and this is indeed the
case with brd.

When a dax capable block_device driver calls del_gendisk_queue() in its
shutdown path it needs to ensure that all DAX pfns are unmapped, and
that no new mappings can be established.  This is different than the
pagecache backed case where the disk is protected by the queue being
torn down which ends I/O to the device.  Since dax bypasses the page
cache we need to unconditionally unmap the inode.

Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[honza: drop changes to truncate_inode_pages_final]
[honza: ensure mappings can't be re-established]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/genhd.c                |   95 +++++++++++++++++++++++++++++++++++-------
 drivers/block/brd.c          |    3 -
 drivers/nvdimm/pmem.c        |    3 -
 drivers/s390/block/dcssblk.c |    6 +--
 fs/block_dev.c               |   41 ++++++++++++++++++
 include/linux/fs.h           |    1 
 include/linux/genhd.h        |    1 
 7 files changed, 126 insertions(+), 24 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 8a743cb81fb4..e1748365f9b3 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -634,26 +634,14 @@ void add_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(add_disk);
 
-void del_gendisk(struct gendisk *disk)
+static void del_gendisk_start(struct gendisk *disk)
 {
-	struct disk_part_iter piter;
-	struct hd_struct *part;
-
 	blk_integrity_del(disk);
 	disk_del_events(disk);
+}
 
-	/* invalidate stuff */
-	disk_part_iter_init(&piter, disk,
-			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
-	while ((part = disk_part_iter_next(&piter))) {
-		invalidate_partition(disk, part->partno);
-		delete_partition(disk, part->partno);
-		kill_bdev_super(disk, part->partno);
-	}
-	disk_part_iter_exit(&piter);
-
-	invalidate_partition(disk, 0);
-	kill_bdev_super(disk, 0);
+static void del_gendisk_end(struct gendisk *disk)
+{
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
@@ -672,9 +660,84 @@ void del_gendisk(struct gendisk *disk)
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
 }
+
+#define for_each_part(part, piter) \
+	for (part = disk_part_iter_next(piter); part; \
+			part = disk_part_iter_next(piter))
+void del_gendisk(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* invalidate stuff */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		invalidate_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+		kill_bdev_super(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+	kill_bdev_super(disk, 0);
+
+	del_gendisk_end(disk);
+}
 EXPORT_SYMBOL(del_gendisk);
 
 /**
+ * del_gendisk_queue - combined del_gendisk + blk_cleanup_queue
+ * @disk: disk to delete, invalidate, unmap, and end i/o
+ *
+ * This alternative for open coded calls to:
+ *     del_gendisk()
+ *     blk_cleanup_queue()
+ * ...is for ->direct_access() (DAX) capable devices.  DAX bypasses page
+ * cache and mappings go directly to storage media.  When such a disk is
+ * removed the pfn backing a mapping may be invalid or removed from the
+ * system.  Upon return accessing DAX mappings of this disk will trigger
+ * SIGBUS.
+ */
+void del_gendisk_queue(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* pass1 sync fs + evict idle inodes */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		invalidate_partition(disk, part->partno);
+		kill_bdev_super(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+	kill_bdev_super(disk, 0);
+
+	blk_cleanup_queue(disk->queue);
+
+	/*
+	 * pass2 now that the queue is dead unmap DAX inodes, and delete
+	 * partitions
+	 */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		unmap_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	unmap_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
+EXPORT_SYMBOL(del_gendisk_queue);
+
+/**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
  * @partno: returned partition index
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index a5880f4ab40e..f149dd57bba3 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -532,7 +532,6 @@ out:
 static void brd_free(struct brd_device *brd)
 {
 	put_disk(brd->brd_disk);
-	blk_cleanup_queue(brd->brd_queue);
 	brd_free_pages(brd);
 	kfree(brd);
 }
@@ -560,7 +559,7 @@ out:
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
-	del_gendisk(brd->brd_disk);
+	del_gendisk_queue(brd->brd_disk);
 	brd_free(brd);
 }
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8ee79893d2f5..6dd06e9d34b0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,9 +158,8 @@ static void pmem_detach_disk(struct pmem_device *pmem)
 	if (!pmem->pmem_disk)
 		return;
 
-	del_gendisk(pmem->pmem_disk);
+	del_gendisk_queue(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
 }
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 94a8f4ab57bc..0c3c968b57d9 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -388,8 +388,7 @@ removeseg:
 	}
 	list_del(&dev_info->lh);
 
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	up_write(&dcssblk_devices_sem);
@@ -751,8 +750,7 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(&dev_info->lh);
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	device_unregister(&dev_info->dev);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index d0233d643d33..f53e9e3e0f83 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1812,6 +1812,47 @@ void kill_bdev_super(struct gendisk *disk, int partno)
 	bdput(bdev);
 }
 
+void unmap_partition(struct gendisk *disk, int partno)
+{
+	bool dax = !!disk->fops->direct_access;
+	struct block_device *bdev = dax ? bdget_disk(disk, partno) : NULL;
+	struct super_block *sb = get_super(bdev);
+	struct inode *inode, *_inode = NULL;
+
+	if (!bdev)
+		return;
+
+	if (!sb) {
+		bdput(bdev);
+		return;
+	}
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+				|| !IS_DAX(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
+		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+		iput(_inode);
+		_inode = inode;
+		cond_resched();
+
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(_inode);
+
+	drop_super(sb);
+	bdput(bdev);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76925e322e87..5eb8b29264a8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2392,6 +2392,7 @@ extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
 extern void kill_bdev_super(struct gendisk *, int);
+extern void unmap_partition(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d91634..028cf15a8a57 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -431,6 +431,7 @@ extern void part_round_stats(int cpu, struct hd_struct *part);
 /* block/genhd.c */
 extern void add_disk(struct gendisk *disk);
 extern void del_gendisk(struct gendisk *gp);
+extern void del_gendisk_queue(struct gendisk *disk);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
 


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/7] fs: notify superblocks of backing-device death
  2015-11-25 18:37 ` [PATCH v2 5/7] fs: notify superblocks of backing-device death Dan Williams
@ 2015-11-25 21:50   ` Dave Chinner
  2015-11-25 22:09     ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2015-11-25 21:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jens Axboe, linux-fsdevel, linux-block, Jan Kara

On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
> Set SB_I_BDIDEAD when a block device is no longer available to service
> requests.  This will be used in several places where an fs should give
> up early because the block device is gone.  Letting the fs continue on
> as if the block device is still present can lead to long latencies
> waiting for an fs to detect the loss of its backing device, trigger
> crashes, or generate misleasing warnings.
> 
> Cc: Jan Kara <jack@suse.com>
> Cc: Jens Axboe <axboe@fb.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>

This isn't what I suggested. :/

.....

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1dd416bf72f7..d0233d643d33 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>  }
>  EXPORT_SYMBOL(__invalidate_device);
>  
> +void kill_bdev_super(struct gendisk *disk, int partno)
> +{
> +	struct block_device *bdev = bdget_disk(disk, partno);
> +	struct super_block *sb;
> +
> +	if (!bdev)
> +		return;
> +	sb = get_super(bdev);
> +	if (!sb)
> +		goto out;
> +
> +	sb->s_iflags |= SB_I_BDI_DEAD;
> +	drop_super(sb);
> + out:
> +	bdput(bdev);
> +}

That's not a notification to the filesystem - that's a status flag
the filesystem has to explicitly check for *on every operation*. We
already have checks like these throughout filesystems, but they are
filesystem specific and need to propagate into fs-specific
subsystems that have knowledge of VFS level superblocks.

To that end, what I actually suggested what a callback - something
like a function in the super operations structure so that the
filesystem can take *immediate action* when the block device is
dying. i.e.

void kill_bdev_super(struct gendisk *disk, int partno)
{
	struct block_device *bdev = bdget_disk(disk, partno);
	struct super_block *sb;

	if (!bdev)
		return;
	sb = get_super(bdev);
	if (!sb)
		goto out;

	if (sb->s_ops->shutdown)
		sb->s_ops->shutdown(sb);

	drop_super(sb);
 out:
	bdput(bdev);
}

and then we implement ->shutdown somthing like this in XFS:

xfs_fs_shutdown(struct superblock *sb)
{
	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
}

and so we immediately notify the entire filesystem that a shutdown
state has been entered and the appropriate actions are immediately
taken.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/7] fs: notify superblocks of backing-device death
  2015-11-25 21:50   ` Dave Chinner
@ 2015-11-25 22:09     ` Dan Williams
  2015-11-26  6:27       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2015-11-25 22:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm@lists.01.org, Jens Axboe, linux-fsdevel, linux-block,
	Jan Kara

On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
>> Set SB_I_BDIDEAD when a block device is no longer available to service
>> requests.  This will be used in several places where an fs should give
>> up early because the block device is gone.  Letting the fs continue on
>> as if the block device is still present can lead to long latencies
>> waiting for an fs to detect the loss of its backing device, trigger
>> crashes, or generate misleasing warnings.
>>
>> Cc: Jan Kara <jack@suse.com>
>> Cc: Jens Axboe <axboe@fb.com>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>
> This isn't what I suggested. :/
>
> .....
>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 1dd416bf72f7..d0233d643d33 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>>  }
>>  EXPORT_SYMBOL(__invalidate_device);
>>
>> +void kill_bdev_super(struct gendisk *disk, int partno)
>> +{
>> +     struct block_device *bdev = bdget_disk(disk, partno);
>> +     struct super_block *sb;
>> +
>> +     if (!bdev)
>> +             return;
>> +     sb = get_super(bdev);
>> +     if (!sb)
>> +             goto out;
>> +
>> +     sb->s_iflags |= SB_I_BDI_DEAD;
>> +     drop_super(sb);
>> + out:
>> +     bdput(bdev);
>> +}
>
> That's not a notification to the filesystem - that's a status flag
> the filesystem has to explicitly check for *on every operation*. We
> already have checks like these throughout filesystems, but they are
> filesystem specific and need to propagate into fs-specific
> subsystems that have knowledge of VFS level superblocks.
>
> To that end, what I actually suggested what a callback - something
> like a function in the super operations structure so that the
> filesystem can take *immediate action* when the block device is
> dying. i.e.
>
> void kill_bdev_super(struct gendisk *disk, int partno)
> {
>         struct block_device *bdev = bdget_disk(disk, partno);
>         struct super_block *sb;
>
>         if (!bdev)
>                 return;
>         sb = get_super(bdev);
>         if (!sb)
>                 goto out;
>
>         if (sb->s_ops->shutdown)
>                 sb->s_ops->shutdown(sb);
>
>         drop_super(sb);
>  out:
>         bdput(bdev);
> }
>
> and then we implement ->shutdown somthing like this in XFS:
>
> xfs_fs_shutdown(struct superblock *sb)
> {
>         xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
> }
>
> and so we immediately notify the entire filesystem that a shutdown
> state has been entered and the appropriate actions are immediately
> taken.
>

That sounds good in theory.  However, in the case of xfs it is already
calling xfs_force_shutdown(), but that does not prevent it from
continuing to wait indefinitely at umount.  For the ext4 the
mark_inode_dirty() warning we're triggering the error in generic code.

None of this fixes the problem of dax mappings leaking past block
device remove.  That can be done generically without needing per-fs
code.

Solving the "zombie filesystem after block device down" problem is
incremental to this patch set.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/7] fs: notify superblocks of backing-device death
  2015-11-25 22:09     ` Dan Williams
@ 2015-11-26  6:27       ` Dave Chinner
  2015-11-26  7:11         ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2015-11-26  6:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, Jens Axboe, linux-fsdevel, linux-block,
	Jan Kara

On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
> >> Set SB_I_BDIDEAD when a block device is no longer available to service
> >> requests.  This will be used in several places where an fs should give
> >> up early because the block device is gone.  Letting the fs continue on
> >> as if the block device is still present can lead to long latencies
> >> waiting for an fs to detect the loss of its backing device, trigger
> >> crashes, or generate misleasing warnings.
> >>
> >> Cc: Jan Kara <jack@suse.com>
> >> Cc: Jens Axboe <axboe@fb.com>
> >> Suggested-by: Dave Chinner <david@fromorbit.com>
> >
> > This isn't what I suggested. :/
> >
> > .....
> >
> >> diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> index 1dd416bf72f7..d0233d643d33 100644
> >> --- a/fs/block_dev.c
> >> +++ b/fs/block_dev.c
> >> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> >>  }
> >>  EXPORT_SYMBOL(__invalidate_device);
> >>
> >> +void kill_bdev_super(struct gendisk *disk, int partno)
> >> +{
> >> +     struct block_device *bdev = bdget_disk(disk, partno);
> >> +     struct super_block *sb;
> >> +
> >> +     if (!bdev)
> >> +             return;
> >> +     sb = get_super(bdev);
> >> +     if (!sb)
> >> +             goto out;
> >> +
> >> +     sb->s_iflags |= SB_I_BDI_DEAD;
> >> +     drop_super(sb);
> >> + out:
> >> +     bdput(bdev);
> >> +}
> >
> > That's not a notification to the filesystem - that's a status flag
> > the filesystem has to explicitly check for *on every operation*. We
> > already have checks like these throughout filesystems, but they are
> > filesystem specific and need to propagate into fs-specific
> > subsystems that have knowledge of VFS level superblocks.
> >
> > To that end, what I actually suggested what a callback - something
> > like a function in the super operations structure so that the
> > filesystem can take *immediate action* when the block device is
> > dying. i.e.
> >
> > void kill_bdev_super(struct gendisk *disk, int partno)
> > {
> >         struct block_device *bdev = bdget_disk(disk, partno);
> >         struct super_block *sb;
> >
> >         if (!bdev)
> >                 return;
> >         sb = get_super(bdev);
> >         if (!sb)
> >                 goto out;
> >
> >         if (sb->s_ops->shutdown)
> >                 sb->s_ops->shutdown(sb);
> >
> >         drop_super(sb);
> >  out:
> >         bdput(bdev);
> > }
> >
> > and then we implement ->shutdown somthing like this in XFS:
> >
> > xfs_fs_shutdown(struct superblock *sb)
> > {
> >         xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
> > }
> >
> > and so we immediately notify the entire filesystem that a shutdown
> > state has been entered and the appropriate actions are immediately
> > taken.
> >
> 
> That sounds good in theory.  However, in the case of xfs it is already
> calling xfs_force_shutdown(),

Where? If XFS does not do any metadata IO, then it won't shut the
filesystem down. We almost always allocate/map blocks without doing
any IO, which means we cannot guarantee erroring out page faults
during get_blocks until a shutdown has be triggered by other
means....

> but that does not prevent it from
> continuing to wait indefinitely at umount.

Which is a bug we need to fix - I don't see how a shutdown
implementation problem is at all relevant to triggering shutdowns in
a prompt manner...

> For the ext4 the
> mark_inode_dirty() warning we're triggering the error in generic code.

So? XFS doesn't use that generic code, but we have filesystem
specific issues that we need to handle sanely.

> None of this fixes the problem of dax mappings leaking past block
> device remove.  That can be done generically without needing per-fs
> code.

No, the shutdown is intended to close the race between the device
being removed, the mappings being invalidated and the filesytem
racing creating new mappings during page faults because it doesn't
know the device has been unplugged until it does IO that gets some
error in an unrecoverable path...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/7] fs: notify superblocks of backing-device death
  2015-11-26  6:27       ` Dave Chinner
@ 2015-11-26  7:11         ` Dan Williams
  2015-12-01  4:03           ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2015-11-26  7:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm@lists.01.org, Jens Axboe, linux-fsdevel, linux-block,
	Jan Kara

On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
>> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
[..]
>> That sounds good in theory.  However, in the case of xfs it is already
>> calling xfs_force_shutdown(),
>
> Where?

In the trace I included in the cover letter.

Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN
checks, especially in the unmount path.  Currently we deadlock here on
umount after block device removal:

 INFO: task umount:2187 blocked for more than 120 seconds.
       Tainted: G           O    4.4.0-rc2+ #1953
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 umount          D ffff8800d2fbfd70     0  2187   2095 0x00000080
  ffff8800d2fbfd70 ffffffff81f94f98 ffff88031fc97bd8 ffff88030af5ad80
  ffff8800db71db00 ffff8800d2fc0000 ffff8800db8dbde0 ffff8800d93b6708
  ffff8800d93b6760 ffff8800d93b66d8 ffff8800d2fbfd88 ffffffff818f0695
 Call Trace:
  [<ffffffff818f0695>] schedule+0x35/0x80
  [<ffffffffa01e134e>] xfs_ail_push_all_sync+0xbe/0x110 [xfs]
  [<ffffffff810ecc30>] ? wait_woken+0x80/0x80
  [<ffffffffa01c8d91>] xfs_unmountfs+0x81/0x1b0 [xfs]
  [<ffffffffa01c991b>] ? xfs_mru_cache_destroy+0x6b/0x90 [xfs]
  [<ffffffffa01cbf30>] xfs_fs_put_super+0x30/0x90 [xfs]
  [<ffffffff81247eca>] generic_shutdown_super+0x6a/0xf0

Earlier in this trace xfs has already performed:

 XFS (pmem0m): xfs_do_force_shutdown(0x2) called from line 1197 of
file fs/xfs/xfs_log.c.

...but xfs_log_work_queue() continues to run periodically.

> If XFS does not do any metadata IO, then it won't shut the
> filesystem down. We almost always allocate/map blocks without doing
> any IO, which means we cannot guarantee erroring out page faults
> during get_blocks until a shutdown has be triggered by other
> means....
>
>> but that does not prevent it from
>> continuing to wait indefinitely at umount.
>
> Which is a bug we need to fix - I don't see how a shutdown
> implementation problem is at all relevant to triggering shutdowns in
> a prompt manner...
>
>> For the ext4 the
>> mark_inode_dirty() warning we're triggering the error in generic code.
>
> So? XFS doesn't use that generic code, but we have filesystem
> specific issues that we need to handle sanely.
>
>> None of this fixes the problem of dax mappings leaking past block
>> device remove.  That can be done generically without needing per-fs
>> code.
>
> No, the shutdown is intended to close the race between the device
> being removed, the mappings being invalidated and the filesytem
> racing creating new mappings during page faults because it doesn't
> know the device has been unplugged until it does IO that gets some
> error in an unrecoverable path...
>

So you want me to grow a ->sb_shutdown() method for each fs that
supports dax only to call unmap_mapping_range on each dax inode when
common code could have already walked the inode list.  Also separately
teach each fs to start returning errors on get_blocks() after shutdown
even though fs/dax.c can figure it out from the return value from
blk_queue_enter?  I'm failing to see the point.

That is of course separate from the real need for an ->sb_shutdown()
to tell the fs that the device is gone for other filesystem specific
operations.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 7/7] mm, dax: unmap dax mappings at bdev shutdown
  2015-11-25 18:37 ` [PATCH v2 7/7] mm, dax: unmap dax mappings at bdev shutdown Dan Williams
@ 2015-11-30 22:03   ` Dan Williams
  2015-12-01  4:21     ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2015-11-30 22:03 UTC (permalink / raw)
  To: linux-nvdimm@lists.01.org
  Cc: Dave Chinner, linux-block, Jan Kara, linux-fsdevel

On Wed, Nov 25, 2015 at 10:37 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> Currently dax mappings leak past / survive block_device shutdown.  While
> page cache pages are permitted to be read/written after the block_device
> is torn down this is not acceptable in the dax case as all media access
> must end when the device is disabled.  The pfn backing a dax mapping is
> permitted to be invalidated after bdev shutdown and this is indeed the
> case with brd.
>
> When a dax capable block_device driver calls del_gendisk_queue() in its
> shutdown path it needs to ensure that all DAX pfns are unmapped, and
> that no new mappings can be established.  This is different than the
> pagecache backed case where the disk is protected by the queue being
> torn down which ends I/O to the device.  Since dax bypasses the page
> cache we need to unconditionally unmap the inode.
>
> Cc: Jan Kara <jack@suse.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> [honza: drop changes to truncate_inode_pages_final]
> [honza: ensure mappings can't be re-established]
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Dave, can this patch move forward while we figure out ->shutdown() for
super_operations?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/7] fs: notify superblocks of backing-device death
  2015-11-26  7:11         ` Dan Williams
@ 2015-12-01  4:03           ` Dave Chinner
  2015-12-01  4:20             ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2015-12-01  4:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, Jens Axboe, linux-fsdevel, linux-block,
	Jan Kara

On Wed, Nov 25, 2015 at 11:11:00PM -0800, Dan Williams wrote:
> On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
> >> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
> [..]
> >> That sounds good in theory.  However, in the case of xfs it is already
> >> calling xfs_force_shutdown(),
> >
> > Where?
> 
> In the trace I included in the cover letter.
> 
> Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN
> checks, especially in the unmount path.  Currently we deadlock here on
> umount after block device removal:

What's the test case?

>From what you've said, it appears to be another manifestation where
we trying to recovery from non-fatal IO errors that are being
reported by the block device, but that is short-cutting the path
that normally does shutdown detection on metadata buffer IO
completion.

Filesystem error handling coul dbe a lot more simple if we didnt'
have to try to guess what EIO from the block device actually means.
If device unplug triggered a shutdown, we wouldn't have to care
about retrying in many cases where we do right now because shutdown
then has clear priority of all other errors we need to handle. RIgh
tnow we just have ot guess, and there's a long history of corner
cases where we have guessed wrong....

I have patches to fix this particular manifestation, but until we
have have notification that devices hav ebeen unplugged and are not
coming back we're going to continue to struggle to get this right
and hence have re-occurring bugs when users pull USB drives out from
under active filesystems.....


> >> None of this fixes the problem of dax mappings leaking past block
> >> device remove.  That can be done generically without needing per-fs
> >> code.
> >
> > No, the shutdown is intended to close the race between the device
> > being removed, the mappings being invalidated and the filesytem
> > racing creating new mappings during page faults because it doesn't
> > know the device has been unplugged until it does IO that gets some
> > error in an unrecoverable path...
> 
> So you want me to grow a ->sb_shutdown() method for each fs that
> supports dax only to call unmap_mapping_range on each dax inode when
> common code could have already walked the inode list.

No, I don't want you to implement some whacky, dax only
->sb_shutdown method. I want a notification method to be implemented
so that each filesystem can take the necessary action it requires
when the underlying block device goes away.

Yes, that *filesystem specific method* might involve invalidating
all the dirty inodes in the filesystem, and if there are DAX
mappings in the filesystem then invalidating them, too, but that's
something the filesystem needs to take care of because *all
filesystem shutdowns must do this*.

Do you see the distinction there? This isn't just about device
unplug - there are lots of different triggers for a filesystem
shutdown. However, regardless of the shutdown trigger, we have to
take the same action. That is, we have to prevent all pending and
future IO from being issued to the block device, *even if the block
device is still OK*.

For example, if we detect a free space corruption during allocation,
it is not safe to trust *any active mapping* because we can't trust
that we having handed out the same block to multiple owners. Hence
on such a filesystem shutdown, we have to prevent any new DAX
mapping from occurring and invalidate all existing mappings as we
cannot allow userspace to modify any data or metadata until we've
resolved the corruption situation.

The simple fact is that a /filesystem/ shutdown needs to do DAX
mapping invalidation regardless of whether the block device has been
unplugged or not. This is not a case of "this only happens when we
unplug the device", this is a user data protection mechanism that we
use to prevent corruption propagation once it has been detected. A
device unplug is just one type of "corruption" that can occur.

Hence putting all this special invalidation stuff into the block
devices is simply duplicating functionality we need to implement in
the filesystem layers.  Regardless of this, it should be done by the
filesystem layers because that is where all the information necesary
to determine what needs invalidation is stored.

So, from the block dvice perspective, the *only thing it needs to
do* is notify the filesystem that it needs to shut down, and the
filesystem should then take care of everything else. Device unplug
is not some special snowflake that needs to be treated differently
from all other types of "it's dead, Jim" filesystem errors - from
the FS perspective it's the dead simple case because there are no
serious consequences if we screw up. i.e. if the FS gets it wrong
and IO is still issued after the shutdown, the IO is going to be
errored out rather than corrupting something on disk that would have
otherwise been OK....

> Also separately
> teach each fs to start returning errors on get_blocks() after shutdown
> even though fs/dax.c can figure it out from the return value from
> blk_queue_enter?

They will already return errors from get_blocks.

STATIC int
__xfs_get_blocks(
        struct inode            *inode,
        sector_t                iblock,
        struct buffer_head      *bh_result,
        int                     create,
        bool                    direct,
        bool                    dax_fault)
{
        struct xfs_inode        *ip = XFS_I(inode);
        struct xfs_mount        *mp = ip->i_mount;
        xfs_fileoff_t           offset_fsb, end_fsb;
        int                     error = 0;
        int                     lockmode = 0;
        struct xfs_bmbt_irec    imap;
        int                     nimaps = 1;
        xfs_off_t               offset;
        ssize_t                 size;
        int                     new = 0;

        if (XFS_FORCED_SHUTDOWN(mp))
                return -EIO;
.....

Hmmm? First thing XFS does in get_blocks() is check for shutdown.
In fact, the first thing that every major subsystem entry point
does in XFs is check for shutdown. Let's look at a typical
allocation get_blocks call in XFS:

 xfs_get_blocks
   __xfs_get_blocks - has shutdown check
    xfs_bmapi_read - shutdown check is second after error injection
    xfs_iomap_write_direct
      xfs_trans_reserve - has shutdown check
      xfs_bmapi_write - has shutdown check
      xfs_trans_commit - has shutdown check

IOWs, there are shutdown checks all through the get_blocks callbacks
already, so it's hardly a caase of you having to go an add support
to any filesystem for this...

Cheers,
MDave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/7] fs: notify superblocks of backing-device death
  2015-12-01  4:03           ` Dave Chinner
@ 2015-12-01  4:20             ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2015-12-01  4:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm@lists.01.org, Jens Axboe, linux-fsdevel, linux-block,
	Jan Kara

On Mon, Nov 30, 2015 at 8:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 25, 2015 at 11:11:00PM -0800, Dan Williams wrote:
>> On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
>> >> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
>> [..]
>> >> That sounds good in theory.  However, in the case of xfs it is already
>> >> calling xfs_force_shutdown(),
>> >
>> > Where?
>>
>> In the trace I included in the cover letter.
>>
>> Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN
>> checks, especially in the unmount path.  Currently we deadlock here on
>> umount after block device removal:
>
> What's the test case?
>
> From what you've said, it appears to be another manifestation where
> we trying to recovery from non-fatal IO errors that are being
> reported by the block device, but that is short-cutting the path
> that normally does shutdown detection on metadata buffer IO
> completion.
>
> Filesystem error handling coul dbe a lot more simple if we didnt'
> have to try to guess what EIO from the block device actually means.
> If device unplug triggered a shutdown, we wouldn't have to care
> about retrying in many cases where we do right now because shutdown
> then has clear priority of all other errors we need to handle. RIgh
> tnow we just have ot guess, and there's a long history of corner
> cases where we have guessed wrong....
>
> I have patches to fix this particular manifestation, but until we
> have have notification that devices hav ebeen unplugged and are not
> coming back we're going to continue to struggle to get this right
> and hence have re-occurring bugs when users pull USB drives out from
> under active filesystems.....
>
>
>> >> None of this fixes the problem of dax mappings leaking past block
>> >> device remove.  That can be done generically without needing per-fs
>> >> code.
>> >
>> > No, the shutdown is intended to close the race between the device
>> > being removed, the mappings being invalidated and the filesytem
>> > racing creating new mappings during page faults because it doesn't
>> > know the device has been unplugged until it does IO that gets some
>> > error in an unrecoverable path...
>>
>> So you want me to grow a ->sb_shutdown() method for each fs that
>> supports dax only to call unmap_mapping_range on each dax inode when
>> common code could have already walked the inode list.
>
> No, I don't want you to implement some whacky, dax only
> ->sb_shutdown method. I want a notification method to be implemented
> so that each filesystem can take the necessary action it requires
> when the underlying block device goes away.
>
> Yes, that *filesystem specific method* might involve invalidating
> all the dirty inodes in the filesystem, and if there are DAX
> mappings in the filesystem then invalidating them, too, but that's
> something the filesystem needs to take care of because *all
> filesystem shutdowns must do this*.
>
> Do you see the distinction there? This isn't just about device
> unplug - there are lots of different triggers for a filesystem
> shutdown. However, regardless of the shutdown trigger, we have to
> take the same action. That is, we have to prevent all pending and
> future IO from being issued to the block device, *even if the block
> device is still OK*.

Ah, that finally got through to me.

I'll refactor this to be a ->shutdown notification with a generic
unmap that a filesystem can optionally call under its own discretion.
As I now see that generic functionality is just common code that an fs
might optionally need/use, but it is counter productive for that to be
privately triggered only by the block layer and only from an event
like del_gendisk.  We need it in potentially all fs shutdown paths.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 7/7] mm, dax: unmap dax mappings at bdev shutdown
  2015-11-30 22:03   ` Dan Williams
@ 2015-12-01  4:21     ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2015-12-01  4:21 UTC (permalink / raw)
  To: linux-nvdimm@lists.01.org
  Cc: Dave Chinner, linux-block, Jan Kara, linux-fsdevel

On Mon, Nov 30, 2015 at 2:03 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Nov 25, 2015 at 10:37 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> Currently dax mappings leak past / survive block_device shutdown.  While
>> page cache pages are permitted to be read/written after the block_device
>> is torn down this is not acceptable in the dax case as all media access
>> must end when the device is disabled.  The pfn backing a dax mapping is
>> permitted to be invalidated after bdev shutdown and this is indeed the
>> case with brd.
>>
>> When a dax capable block_device driver calls del_gendisk_queue() in its
>> shutdown path it needs to ensure that all DAX pfns are unmapped, and
>> that no new mappings can be established.  This is different than the
>> pagecache backed case where the disk is protected by the queue being
>> torn down which ends I/O to the device.  Since dax bypasses the page
>> cache we need to unconditionally unmap the inode.
>>
>> Cc: Jan Kara <jack@suse.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <willy@linux.intel.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> [honza: drop changes to truncate_inode_pages_final]
>> [honza: ensure mappings can't be re-established]
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>
> Dave, can this patch move forward while we figure out ->shutdown() for
> super_operations?

Disregard, I'm straightened out now.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-12-01  4:21 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-25 18:36 [PATCH v2 0/7] dax cleanups and lifetime fixes Dan Williams
2015-11-25 18:37 ` [PATCH v2 1/7] pmem, dax: clean up clear_pmem() Dan Williams
2015-11-25 18:37 ` [PATCH v2 2/7] dax: increase granularity of dax_clear_blocks() operations Dan Williams
2015-11-25 18:37 ` [PATCH v2 3/7] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
2015-11-25 18:37 ` [PATCH v2 4/7] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
2015-11-25 18:37 ` [PATCH v2 5/7] fs: notify superblocks of backing-device death Dan Williams
2015-11-25 21:50   ` Dave Chinner
2015-11-25 22:09     ` Dan Williams
2015-11-26  6:27       ` Dave Chinner
2015-11-26  7:11         ` Dan Williams
2015-12-01  4:03           ` Dave Chinner
2015-12-01  4:20             ` Dan Williams
2015-11-25 18:37 ` [PATCH v2 6/7] ext4: skip inode dirty when backing device is gone Dan Williams
2015-11-25 18:37 ` [PATCH v2 7/7] mm, dax: unmap dax mappings at bdev shutdown Dan Williams
2015-11-30 22:03   ` Dan Williams
2015-12-01  4:21     ` Dan Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.