* [PATCH 0/5] block, dax: updates for 4.4
From: Dan Williams @ 2015-10-22 6:41 UTC (permalink / raw)
To: axboe
Cc: Jens Axboe, Boaz Harrosh, jack, linux-nvdimm, Dave Hansen, david,
linux-kernel, hch, Jeff Moyer, Al Viro, willy, akpm, ross.zwisler
As requested [1], break out the block-specific updates from the dax-gup
series [2] so they can be merged via the block tree.
1/ Enable dax mappings for raw block devices. This addresses the review
comments (from Ross and Honza) from the RFC [3].
2/ Introduce dax_map_atomic() to fix races between device teardown and
new mapping requests. This depends on commit 2a9067a91825 "block:
generic request_queue reference counting" in the for-4.4/integrity
branch of the block tree.
3/ Cleanup clear_pmem() and its usage in dax. This depends on commit
0f90cc6609c7 "mm, dax: fix DAX deadlocks" that was merged into v4.3-rc6.
These pass the nvdimm unit tests and have passed a 0day-kbuild-robot run.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002531.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html
[3]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002512.html
---
Dan Williams (5):
pmem, dax: clean up clear_pmem()
dax: increase granularity of dax_clear_blocks() operations
block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
block: introduce file_bd_inode()
block: enable dax for raw block devices
 arch/x86/include/asm/pmem.h |    7 --
 block/blk.h                 |    2 -
 fs/block_dev.c              |   73 ++++++++++++--
 fs/dax.c                    |  196 +++++++++++++++++++++++++++----------------
 include/linux/blkdev.h      |    2 +
 5 files changed, 191 insertions(+), 89 deletions(-)
* [PATCH 1/5] pmem, dax: clean up clear_pmem()
From: Dan Williams @ 2015-10-22  6:41 UTC
To: axboe
Cc: jack, akpm, linux-nvdimm, Dave Hansen, david, linux-kernel, willy,
    ross.zwisler, hch

Both __dax_pmd_fault() and arch_clear_pmem() were taking special steps
to clear memory a page at a time in order to take advantage of
non-temporal clear_page() implementations. However, x86_64 does not use
non-temporal instructions for clear_page(), and arch_clear_pmem() was
always incurring the cost of __arch_wb_cache_pmem(). Clean up the
assumption that doing clear_pmem() a page at a time is more performant.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/pmem.h |    7 +------
 fs/dax.c                    |    4 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec816ab..1544fabcd7f9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -132,12 +132,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	void *vaddr = (void __force *)addr;
 
-	/* TODO: implement the zeroing via non-temporal writes */
-	if (size == PAGE_SIZE && ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-		clear_page(vaddr);
-	else
-		memset(vaddr, 0, size);
-
+	memset(vaddr, 0, size);
 	__arch_wb_cache_pmem(vaddr, size);
 }
 
diff --git a/fs/dax.c b/fs/dax.c
index a86d3cc2b389..5dc33d788d50 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -623,9 +623,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			int i;
-			for (i = 0; i < PTRS_PER_PMD; i++)
-				clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
+			clear_pmem(kaddr, HPAGE_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
* [PATCH 2/5] dax: increase granularity of dax_clear_blocks() operations
From: Dan Williams @ 2015-10-22  6:41 UTC
To: axboe
Cc: jack, linux-nvdimm, david, linux-kernel, ross.zwisler, willy, akpm, hch

dax_clear_blocks() is currently performing a cond_resched() after every
PAGE_SIZE memset. We need not check so frequently; for example, md-raid
only calls cond_resched() at stripe granularity. Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping, move the call to cond_resched() to the outer loop.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5dc33d788d50..f8e543839e5c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		sz = min_t(long, size, SZ_1M);
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		if (count < sz)
+			sz = count;
+		clear_pmem(addr, sz);
+		addr += sz;
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();
* Re: [PATCH 2/5] dax: increase granularity of dax_clear_blocks() operations
From: Jan Kara @ 2015-10-22  9:26 UTC
To: Dan Williams
Cc: axboe, jack, linux-nvdimm, david, linux-kernel, ross.zwisler, willy,
    akpm, hch

On Thu 22-10-15 02:41:54, Dan Williams wrote:
> dax_clear_blocks is currently performing a cond_resched() after every
> PAGE_SIZE memset. We need not check so frequently, for example md-raid
> only calls cond_resched() at stripe granularity. Also, in preparation
> for introducing a dax_map_atomic() operation that temporarily pins a dax
> mapping move the call to cond_resched() to the outer loop.
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

[quoted patch trimmed]

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* [PATCH 3/5] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
From: Dan Williams @ 2015-10-22  6:41 UTC
To: axboe
Cc: Jens Axboe, Boaz Harrosh, jack, akpm, linux-nvdimm, david,
    linux-kernel, willy, ross.zwisler, hch

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against unbind of the underlying block
device. Use blk_queue_enter()/blk_queue_exit() to either prevent
blk_cleanup_queue() from proceeding, or fail the dax_map_atomic() if the
request_queue is being torn down.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk.h            |    2 -
 fs/dax.c               |  165 ++++++++++++++++++++++++++++++++----------------
 include/linux/blkdev.h |    2 +
 3 files changed, 112 insertions(+), 57 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 157c93d54dc9..dc7d9411fa45 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 		     unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/dax.c b/fs/dax.c
index f8e543839e5c..a480729c00ec 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,6 +30,40 @@
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
+static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
+		long size, unsigned long *pfn, long *len)
+{
+	long rc;
+	void __pmem *addr;
+	struct request_queue *q = bdev->bd_queue;
+
+	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
+		return (void __pmem *) ERR_PTR(-EIO);
+	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
+	if (len)
+		*len = rc;
+	if (rc < 0) {
+		blk_queue_exit(q);
+		return (void __pmem *) ERR_PTR(rc);
+	}
+	return addr;
+}
+
+static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
+		long size)
+{
+	unsigned long pfn;
+
+	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
+}
+
+static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
+{
+	if (IS_ERR(addr))
+		return;
+	blk_queue_exit(bdev->bd_queue);
+}
+
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
@@ -42,9 +76,9 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		long count, sz;
 
 		sz = min_t(long, size, SZ_1M);
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
-		if (count < 0)
-			return count;
+		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		if (count < sz)
 			sz = count;
 		clear_pmem(addr, sz);
@@ -52,6 +86,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		size -= sz;
 		BUG_ON(sz & 511);
 		sector += sz / 512;
+		dax_unmap_atomic(bdev, addr);
 		cond_resched();
 	} while (size);
 
@@ -60,14 +95,6 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
-{
-	unsigned long pfn;
-	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
-}
-
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
 		loff_t pos, loff_t end)
@@ -97,19 +124,30 @@ static bool buffer_size_valid(struct buffer_head *bh)
 	return bh->b_state != 0;
 }
 
+static sector_t to_sector(const struct buffer_head *bh,
+		const struct inode *inode)
+{
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+
+	return sector;
+}
+
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		loff_t start, loff_t end, get_block_t get_block,
 		struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
+	loff_t pos = start, max = start, bh_max = start;
+	struct block_device *bdev = NULL;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	unsigned long pfn;
+	void __pmem *addr = NULL;
+	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
 	bool need_wmb = false;
 
-	if (iov_iter_rw(iter) != WRITE)
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -124,13 +162,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		if (pos == bh_max) {
 			bh->b_size = PAGE_ALIGN(end - pos);
 			bh->b_state = 0;
-			retval = get_block(inode, block, bh,
-						iov_iter_rw(iter) == WRITE);
-			if (retval)
+			rc = get_block(inode, block, bh, rw == WRITE);
+			if (rc)
 				break;
 			if (!buffer_size_valid(bh))
 				bh->b_size = 1 << blkbits;
 			bh_max = pos - first + bh->b_size;
+			bdev = bh->b_bdev;
 		} else {
 			unsigned done = bh->b_size -
 					(bh_max - (pos - first));
@@ -138,21 +176,27 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			bh->b_size -= done;
 		}
 
-		hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+		hole = rw == READ && !buffer_written(bh);
 		if (hole) {
 			addr = NULL;
 			size = bh->b_size - first;
 		} else {
-			retval = dax_get_addr(bh, &addr, blkbits);
-			if (retval < 0)
+			dax_unmap_atomic(bdev, kmap);
+			kmap = __dax_map_atomic(bdev,
+					to_sector(bh, inode),
+					bh->b_size, &pfn, &map_len);
+			if (IS_ERR(kmap)) {
+				rc = PTR_ERR(kmap);
 				break;
+			}
+			addr = kmap;
 			if (buffer_unwritten(bh) || buffer_new(bh)) {
-				dax_new_buf(addr, retval, first, pos,
-							end);
+				dax_new_buf(addr, map_len, first, pos,
+						end);
 				need_wmb = true;
 			}
 			addr += first;
-			size = retval - first;
+			size = map_len - first;
 		}
 		max = min(pos + size, end);
 	}
@@ -175,8 +219,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_atomic(bdev, kmap);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -265,28 +310,31 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 	return VM_FAULT_LOCKED;
 }
 
-static int copy_user_bh(struct page *to, struct buffer_head *bh,
-			unsigned blkbits, unsigned long vaddr)
+static int copy_user_bh(struct page *to, struct inode *inode,
+		struct buffer_head *bh, unsigned long vaddr)
 {
+	struct block_device *bdev = bh->b_bdev;
 	void __pmem *vfrom;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	vfrom = dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size);
+	if (IS_ERR(vfrom))
+		return PTR_ERR(vfrom);
 	vto = kmap_atomic(to);
 	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_atomic(bdev, vfrom);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct address_space *mapping = inode->i_mapping;
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
+	struct address_space *mapping = inode->i_mapping;
+	struct block_device *bdev = bh->b_bdev;
 	unsigned long pfn;
+	void __pmem *addr;
 	pgoff_t size;
 	int error;
 
@@ -305,11 +353,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
+			&pfn, NULL);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
 		goto out;
 	}
 
@@ -317,6 +364,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		clear_pmem(addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_atomic(bdev, addr);
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
@@ -412,7 +460,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
 		if (buffer_written(&bh))
-			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+			error = copy_user_bh(new_page, inode, &bh, vaddr);
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
@@ -524,11 +572,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
+	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
@@ -552,9 +598,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
+	bdev = bh.b_bdev;
 	i_mmap_lock_read(mapping);
 
 	/*
@@ -609,15 +655,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		sector = bh.b_blocknr << (blkbits - 9);
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
+		long length;
+		unsigned long pfn;
+		void __pmem *kaddr = __dax_map_atomic(bdev,
+				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
+				&length);
+
+		if (IS_ERR(kaddr)) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, kaddr);
 			goto fallback;
+		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 			clear_pmem(kaddr, HPAGE_SIZE);
@@ -626,6 +677,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
+		dax_unmap_atomic(bdev, kaddr);
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
 	}
@@ -729,12 +781,15 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
+		struct block_device *bdev = bh.b_bdev;
+		void __pmem *addr = dax_map_atomic(bdev, to_sector(&bh, inode),
+				PAGE_CACHE_SIZE);
+
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		clear_pmem(addr + offset, length);
 		wmb_pmem();
+		dax_unmap_atomic(bdev, addr);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf57884db4b7..59a770dad804 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -792,6 +792,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
* [PATCH 4/5] block: introduce file_bd_inode()
From: Dan Williams @ 2015-10-22  6:42 UTC
To: axboe
Cc: jack, linux-nvdimm, david, linux-kernel, Al Viro, ross.zwisler,
    willy, akpm, hch

Similar to the file_inode() helper, provide a helper to lookup the inode
for a raw block device itself.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0a793c7930eb..3255dcec96b4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -147,11 +147,16 @@ blkdev_get_block(struct inode *inode, sector_t iblock,
 	return 0;
 }
 
+static struct inode *file_bd_inode(struct file *file)
+{
+	return file->f_mapping->host;
+}
+
 static ssize_t
 blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 {
 	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = file_bd_inode(file);
 
 	if (IS_DAX(inode))
 		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
@@ -329,7 +334,7 @@ static int blkdev_write_end(struct file *file, struct address_space *mapping,
  */
 static loff_t block_llseek(struct file *file, loff_t offset, int whence)
 {
-	struct inode *bd_inode = file->f_mapping->host;
+	struct inode *bd_inode = file_bd_inode(file);
 	loff_t retval;
 
 	mutex_lock(&bd_inode->i_mutex);
@@ -340,7 +345,7 @@ static loff_t block_llseek(struct file *file, loff_t offset, int whence)
 
 int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
-	struct inode *bd_inode = filp->f_mapping->host;
+	struct inode *bd_inode = file_bd_inode(filp);
 	struct block_device *bdev = I_BDEV(bd_inode);
 	int error;
 
@@ -1579,14 +1584,14 @@ EXPORT_SYMBOL(blkdev_put);
 
 static int blkdev_close(struct inode * inode, struct file * filp)
 {
-	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
+	struct block_device *bdev = I_BDEV(file_bd_inode(filp));
 	blkdev_put(bdev, filp->f_mode);
 	return 0;
 }
 
 static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
-	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+	struct block_device *bdev = I_BDEV(file_bd_inode(file));
 	fmode_t mode = file->f_mode;
 
 	/*
@@ -1611,7 +1616,7 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
-	struct inode *bd_inode = file->f_mapping->host;
+	struct inode *bd_inode = file_bd_inode(file);
 	loff_t size = i_size_read(bd_inode);
 	struct blk_plug plug;
 	ssize_t ret;
@@ -1643,7 +1648,7 @@ EXPORT_SYMBOL_GPL(blkdev_write_iter);
 ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
-	struct inode *bd_inode = file->f_mapping->host;
+	struct inode *bd_inode = file_bd_inode(file);
 	loff_t size = i_size_read(bd_inode);
 	loff_t pos = iocb->ki_pos;
* Re: [PATCH 4/5] block: introduce file_bd_inode()
From: Jan Kara @ 2015-10-22  9:45 UTC
To: Dan Williams
Cc: axboe, jack, linux-nvdimm, david, linux-kernel, Al Viro,
    ross.zwisler, willy, akpm, hch

On Thu 22-10-15 02:42:05, Dan Williams wrote:
> Similar to the file_inode() helper, provide a helper to lookup the inode
> for a raw block device itself.

So I somewhat dislike the name file_bd_inode() since for struct file
pointing to a regular file, the result would be equivalent to file_inode()
but the name suggests (at least to me) it should return block device inode.
I'd name it like bdev_file_inode() to make it clear this is local to
block device code...

[quoted patch trimmed]

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 4/5] block: introduce file_bd_inode()
From: Dan Williams @ 2015-10-22 15:41 UTC
To: Jan Kara
Cc: Jens Axboe, linux-nvdimm@lists.01.org, david,
    linux-kernel@vger.kernel.org, Al Viro, Ross Zwisler, Matthew Wilcox,
    Andrew Morton, Christoph Hellwig

On Thu, Oct 22, 2015 at 2:45 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 22-10-15 02:42:05, Dan Williams wrote:
>> Similar to the file_inode() helper, provide a helper to lookup the inode
>> for a raw block device itself.
>
> So I somewhat dislike the name file_bd_inode() since for struct file
> pointing to a regular file, the result would be equivalent to file_inode()
> but the name suggests (at least to me) it should return block device inode.
> I'd name it like bdev_file_inode() to make it clear this is local to
> block device code...

I like it, will fix.
* [PATCH 5/5] block: enable dax for raw block devices
From: Dan Williams @ 2015-10-22  6:42 UTC
To: axboe
Cc: jack, linux-nvdimm, david, linux-kernel, hch, Jeff Moyer, willy,
    ross.zwisler, akpm

If an application wants exclusive access to all of the persistent memory
provided by an NVDIMM namespace it can use this raw-block-dax facility
to forgo establishing a filesystem. This capability is targeted
primarily to hypervisors wanting to provision persistent memory for
guests.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3255dcec96b4..c27cd1a21a13 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = {
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
+#ifdef CONFIG_FS_DAX
+/*
+ * In the raw block case we do not need to contend with truncation nor
+ * unwritten file extents. Without those concerns there is no need for
+ * additional locking beyond the mmap_sem context that these routines
+ * are already executing under.
+ *
+ * Note, there is no protection if the block device is dynamically
+ * resized (partition grow/shrink) during a fault. A stable block device
+ * size is already not enforced in the blkdev_direct_IO path.
+ *
+ * For DAX, it is the responsibility of the block device driver to
+ * ensure the whole-disk device size is stable while requests are in
+ * flight.
+ *
+ * Finally, these paths do not synchronize against freezing
+ * (sb_start_pagefault(), etc...) since bdev_sops does not support
+ * freezing.
+ */
+static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
+}
+
+static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+		pmd_t *pmd, unsigned int flags)
+{
+	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
+}
+
+static const struct vm_operations_struct blkdev_dax_vm_ops = {
+	.page_mkwrite	= blkdev_dax_fault,
+	.fault		= blkdev_dax_fault,
+	.pmd_fault	= blkdev_dax_pmd_fault,
+};
+
+static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *bd_inode = file_bd_inode(file);
+
+	if (!IS_DAX(bd_inode))
+		return generic_file_mmap(file, vma);
+
+	file_accessed(file);
+	vma->vm_ops = &blkdev_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	return 0;
+}
+#else
+#define blkdev_mmap generic_file_mmap
+#endif
+
 const struct file_operations def_blk_fops = {
 	.open		= blkdev_open,
 	.release	= blkdev_close,
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
* Re: [PATCH 5/5] block: enable dax for raw block devices 2015-10-22 6:42 ` [PATCH 5/5] block: enable dax for raw block devices Dan Williams @ 2015-10-22 9:35 ` Jan Kara 2015-10-22 16:05 ` Williams, Dan J 0 siblings, 1 reply; 23+ messages in thread From: Jan Kara @ 2015-10-22 9:35 UTC (permalink / raw) To: Dan Williams Cc: axboe, jack, linux-nvdimm, david, linux-kernel, hch, Jeff Moyer, willy, ross.zwisler, akpm On Thu 22-10-15 02:42:11, Dan Williams wrote: > If an application wants exclusive access to all of the persistent memory > provided by an NVDIMM namespace it can use this raw-block-dax facility > to forgo establishing a filesystem. This capability is targeted > primarily to hypervisors wanting to provision persistent memory for > guests. > > Cc: Jan Kara <jack@suse.cz> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > fs/block_dev.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 53 insertions(+), 1 deletion(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 3255dcec96b4..c27cd1a21a13 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = { > .is_dirty_writeback = buffer_check_dirty_writeback, > }; > > +#ifdef CONFIG_FS_DAX > +/* > + * In the raw block case we do not need to contend with truncation nor > + * unwritten file extents. Without those concerns there is no need for > + * additional locking beyond the mmap_sem context that these routines > + * are already executing under. > + * > + * Note, there is no protection if the block device is dynamically > + * resized (partition grow/shrink) during a fault. A stable block device > + * size is already not enforced in the blkdev_direct_IO path. 
> + * > + * For DAX, it is the responsibility of the block device driver to > + * ensure the whole-disk device size is stable while requests are in > + * flight. > + * > + * Finally, these paths do not synchronize against freezing > + * (sb_start_pagefault(), etc...) since bdev_sops does not support > + * freezing. Well, for devices freezing is handled directly in the block layer code (blk_stop_queue()) since there's no need to put some metadata structures into a consistent state. So the comment about bdev_sops is somewhat strange. Otherwise the patch looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.com> Honza > + */ > +static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return __dax_fault(vma, vmf, blkdev_get_block, NULL); > +} > + > +static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr, > + pmd_t *pmd, unsigned int flags) > +{ > + return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL); > +} > + > +static const struct vm_operations_struct blkdev_dax_vm_ops = { > + .page_mkwrite = blkdev_dax_fault, > + .fault = blkdev_dax_fault, > + .pmd_fault = blkdev_dax_pmd_fault, > +}; > + > +static int blkdev_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + struct inode *bd_inode = file_bd_inode(file); > + > + if (!IS_DAX(bd_inode)) > + return generic_file_mmap(file, vma); > + > + file_accessed(file); > + vma->vm_ops = &blkdev_dax_vm_ops; > + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; > + return 0; > +} > +#else > +#define blkdev_mmap generic_file_mmap > +#endif > + > const struct file_operations def_blk_fops = { > .open = blkdev_open, > .release = blkdev_close, > .llseek = block_llseek, > .read_iter = blkdev_read_iter, > .write_iter = blkdev_write_iter, > - .mmap = generic_file_mmap, > + .mmap = blkdev_mmap, > .fsync = blkdev_fsync, > .unlocked_ioctl = block_ioctl, > #ifdef CONFIG_COMPAT > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ 
messages in thread
* Re: [PATCH 5/5] block: enable dax for raw block devices 2015-10-22 9:35 ` Jan Kara @ 2015-10-22 16:05 ` Williams, Dan J 2015-10-22 21:08 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Williams, Dan J @ 2015-10-22 16:05 UTC (permalink / raw) To: jack@suse.cz Cc: linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de, axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, ross.zwisler@linux.intel.com, david@fromorbit.com On Thu, 2015-10-22 at 11:35 +0200, Jan Kara wrote: > On Thu 22-10-15 02:42:11, Dan Williams wrote: > > If an application wants exclusive access to all of the persistent memory > > provided by an NVDIMM namespace it can use this raw-block-dax facility > > to forgo establishing a filesystem. This capability is targeted > > primarily to hypervisors wanting to provision persistent memory for > > guests. > > > > Cc: Jan Kara <jack@suse.cz> > > Cc: Jeff Moyer <jmoyer@redhat.com> > > Cc: Christoph Hellwig <hch@lst.de> > > Cc: Dave Chinner <david@fromorbit.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > --- > > fs/block_dev.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > > 1 file changed, 53 insertions(+), 1 deletion(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 3255dcec96b4..c27cd1a21a13 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = { > > .is_dirty_writeback = buffer_check_dirty_writeback, > > }; > > > > +#ifdef CONFIG_FS_DAX > > +/* > > + * In the raw block case we do not need to contend with truncation nor > > + * unwritten file extents.
Without those concerns there is no need for > > + * additional locking beyond the mmap_sem context that these routines > > + * are already executing under. > > + * > > + * Note, there is no protection if the block device is dynamically > > + * resized (partition grow/shrink) during a fault. A stable block device > > + * size is already not enforced in the blkdev_direct_IO path. > > + * > > + * For DAX, it is the responsibility of the block device driver to > > + * ensure the whole-disk device size is stable while requests are in > > + * flight. > > + * > > + * Finally, these paths do not synchronize against freezing > > + * (sb_start_pagefault(), etc...) since bdev_sops does not support > > + * freezing. > > Well, for devices freezing is handled directly in the block layer code > (blk_stop_queue()) since there's no need to put some metadata structures > into a consistent state. So the comment about bdev_sops is somewhat > strange. This text was aimed at the request from Ross to document the differences vs the generic_file_mmap() path. Is the following incremental change more clear? diff --git a/fs/block_dev.c b/fs/block_dev.c index 840acd4380d4..4ae8fa55bd1e 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1702,9 +1702,15 @@ static const struct address_space_operations def_blk_aops = { * ensure the whole-disk device size is stable while requests are in * flight. * - * Finally, these paths do not synchronize against freezing - * (sb_start_pagefault(), etc...) since bdev_sops does not support - * freezing. + * Finally, in contrast to the generic_file_mmap() path, there are no + * calls to sb_start_pagefault(). That is meant to synchronize write + * faults against requests to freeze the contents of the filesystem + * hosting vma->vm_file. However, in the case of a block device special + * file, it is a 0-sized device node usually hosted on devtmpfs, i.e. + * nothing to do with the super_block for bdev_file_inode(vma->vm_file). 
+ * We could call get_super() in this path to retrieve the right + * super_block, but the generic_file_mmap() path does not do this for + * the CONFIG_FS_DAX=n case. */ static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 5/5] block: enable dax for raw block devices 2015-10-22 16:05 ` Williams, Dan J @ 2015-10-22 21:08 ` Jan Kara 2015-10-22 23:41 ` Williams, Dan J ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Jan Kara @ 2015-10-22 21:08 UTC (permalink / raw) To: Williams, Dan J Cc: jack@suse.cz, linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de, axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, ross.zwisler@linux.intel.com, david@fromorbit.com On Thu 22-10-15 16:05:46, Williams, Dan J wrote: > On Thu, 2015-10-22 at 11:35 +0200, Jan Kara wrote: > > On Thu 22-10-15 02:42:11, Dan Williams wrote: > > > If an application wants exclusive access to all of the persistent memory > > > provided by an NVDIMM namespace it can use this raw-block-dax facility > > > to forgo establishing a filesystem. This capability is targeted > > > primarily to hypervisors wanting to provision persistent memory for > > > guests. > > > > > > Cc: Jan Kara <jack@suse.cz> > > > Cc: Jeff Moyer <jmoyer@redhat.com> > > > Cc: Christoph Hellwig <hch@lst.de> > > > Cc: Dave Chinner <david@fromorbit.com> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > > --- > > > fs/block_dev.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > > > 1 file changed, 53 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 3255dcec96b4..c27cd1a21a13 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = { > > > .is_dirty_writeback = buffer_check_dirty_writeback, > > > }; > > > > > > +#ifdef CONFIG_FS_DAX > > > +/* > > > + * In the raw block case we do not need to contend with truncation nor > > > + * unwritten file extents. 
Without those concerns there is no need for > > > + * additional locking beyond the mmap_sem context that these routines > > > + * are already executing under. > > > + * > > > + * Note, there is no protection if the block device is dynamically > > > + * resized (partition grow/shrink) during a fault. A stable block device > > > + * size is already not enforced in the blkdev_direct_IO path. > > > + * > > > + * For DAX, it is the responsibility of the block device driver to > > > + * ensure the whole-disk device size is stable while requests are in > > > + * flight. > > > + * > > > + * Finally, these paths do not synchronize against freezing > > > + * (sb_start_pagefault(), etc...) since bdev_sops does not support > > > + * freezing. > > > > Well, for devices freezing is handled directly in the block layer code > > (blk_stop_queue()) since there's no need to put some metadata structures > > into a consistent state. So the comment about bdev_sops is somewhat > > strange. > > This text was aimed at the request from Ross to document the differences > vs the generic_file_mmap() path. Is the following incremental change > more clear? Well, not really. I thought you'd just delete that paragraph :) The thing is: When doing IO directly to the block device, it makes no sense to look at a filesystem on top of it - hopefully there is none since you'd be corrupting it. So the paragraph that would make sense to me would be: * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling * sb_start_pagefault(). There is no filesystem which could be frozen here * and when bdev gets frozen, IO gets blocked in the request queue. But when spelled out like this, I've realized that with DAX, this blocking of requests in the request queue doesn't really block the IO to the device. So block device freezing (aka blk_queue_stop()) doesn't work reliably with DAX. 
That should be fixed but it's not easy as the only way to do that would be to hook into blk_stop_queue() and unmap (or at least write-protect) all the mappings of the device. Ugh... Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for filesystems since there's nothing which writeprotects pages that are writeably mapped. In normal path, page writeback does this but that doesn't happen for DAX. I remember we once talked about this but it got lost. We need something like walk all filesystem inodes during fs freeze and writeprotect all pages that are mapped. But that's going to be slow... Honza > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 840acd4380d4..4ae8fa55bd1e 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1702,9 +1702,15 @@ static const struct address_space_operations def_blk_aops = { > * ensure the whole-disk device size is stable while requests are in > * flight. > * > - * Finally, these paths do not synchronize against freezing > - * (sb_start_pagefault(), etc...) since bdev_sops does not support > - * freezing. > + * Finally, in contrast to the generic_file_mmap() path, there are no > + * calls to sb_start_pagefault(). That is meant to synchronize write > + * faults against requests to freeze the contents of the filesystem > + * hosting vma->vm_file. However, in the case of a block device special > + * file, it is a 0-sized device node usually hosted on devtmpfs, i.e. > + * nothing to do with the super_block for bdev_file_inode(vma->vm_file). > + * We could call get_super() in this path to retrieve the right > + * super_block, but the generic_file_mmap() path does not do this for > + * the CONFIG_FS_DAX=n case. > */ > static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > { > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 5/5] block: enable dax for raw block devices 2015-10-22 21:08 ` Jan Kara @ 2015-10-22 23:41 ` Williams, Dan J 2015-10-24 12:21 ` Jan Kara 2015-10-23 23:32 ` Dan Williams 2015-10-25 21:22 ` Dave Chinner 2 siblings, 1 reply; 23+ messages in thread From: Williams, Dan J @ 2015-10-22 23:41 UTC (permalink / raw) To: jack@suse.cz Cc: linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de, axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, ross.zwisler@linux.intel.com, david@fromorbit.com On Thu, 2015-10-22 at 23:08 +0200, Jan Kara wrote: > On Thu 22-10-15 16:05:46, Williams, Dan J wrote: > > On Thu, 2015-10-22 at 11:35 +0200, Jan Kara wrote: > > > On Thu 22-10-15 02:42:11, Dan Williams wrote: > > > > If an application wants exclusive access to all of the persistent memory > > > > provided by an NVDIMM namespace it can use this raw-block-dax facility > > > > to forgo establishing a filesystem. This capability is targeted > > > > primarily to hypervisors wanting to provision persistent memory for > > > > guests.
> > > > > > > > Cc: Jan Kara <jack@suse.cz> > > > > Cc: Jeff Moyer <jmoyer@redhat.com> > > > > Cc: Christoph Hellwig <hch@lst.de> > > > > Cc: Dave Chinner <david@fromorbit.com> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > > > --- > > > > fs/block_dev.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > > > > 1 file changed, 53 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > > index 3255dcec96b4..c27cd1a21a13 100644 > > > > --- a/fs/block_dev.c > > > > +++ b/fs/block_dev.c > > > > @@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = { > > > > .is_dirty_writeback = buffer_check_dirty_writeback, > > > > }; > > > > > > > > +#ifdef CONFIG_FS_DAX > > > > +/* > > > > + * In the raw block case we do not need to contend with truncation nor > > > > + * unwritten file extents. Without those concerns there is no need for > > > > + * additional locking beyond the mmap_sem context that these routines > > > > + * are already executing under. > > > > + * > > > > + * Note, there is no protection if the block device is dynamically > > > > + * resized (partition grow/shrink) during a fault. A stable block device > > > > + * size is already not enforced in the blkdev_direct_IO path. > > > > + * > > > > + * For DAX, it is the responsibility of the block device driver to > > > > + * ensure the whole-disk device size is stable while requests are in > > > > + * flight. > > > > + * > > > > + * Finally, these paths do not synchronize against freezing > > > > + * (sb_start_pagefault(), etc...) since bdev_sops does not support > > > > + * freezing. > > > > > > Well, for devices freezing is handled directly in the block layer code > > > (blk_stop_queue()) since there's no need to put some metadata structures > > > into a consistent state. 
So the comment about bdev_sops is somewhat > > > strange. > > > > This text was aimed at the request from Ross to document the differences > > vs the generic_file_mmap() path. Is the following incremental change > > more clear? > > Well, not really. I thought you'd just delete that paragraph :) The thing > is: When doing IO directly to the block device, it makes no sense to look > at a filesystem on top of it - hopefully there is none since you'd be > corrupting it. So the paragraph that would make sense to me would be: > > * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling > * sb_start_pagefault(). There is no filesystem which could be frozen here > * and when bdev gets frozen, IO gets blocked in the request queue. > > But when spelled out like this, I've realized that with DAX, this blocking > of requests in the request queue doesn't really block the IO to the device. > So block device freezing (aka blk_queue_stop()) doesn't work reliably with > DAX. That should be fixed but it's not easy as the only way to do that > would be to hook into blk_stop_queue() and unmap (or at least > write-protect) all the mappings of the device. Ugh... > > Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for > filesystems since there's nothing which writeprotects pages that are > writeably mapped. In normal path, page writeback does this but that doesn't > happen for DAX. I remember we once talked about this but it got lost. > We need something like walk all filesystem inodes during fs freeze and > writeprotect all pages that are mapped. But that's going to be slow... This sounds suspiciously like what I'm planning to do for the device teardown path when we've dynamically allocated struct page. The backing memory for those pages is freed when the driver runs its ->remove() path, so we have to be sure there are no outstanding references to them. My current proposal for the teardown case, that we might re-purpose for this freeze case, is below. 
It relies on the percpu_ref in the request_queue to block new faults and then uses truncate_pagecache() to teardown mappings. However, this assumes we've inserted pages into the address_space radix at fault, which we don't currently do... In general, as this page-backed-pmem support lands upstream, I'm of the opinion that the page-less DAX support be deprecated/disabled unless/until it can be made as functionally capable as the page-enabled paths. 8<---- Subject: mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages From: Dan Williams <dan.j.williams@intel.com> Before we allow ZONE_DEVICE pages to be put into active use outside of the pmem driver, we need to arrange for them to be reclaimed when the driver is shutdown. devm_memunmap_pages() must wait for all pages to return to the initial mapcount of 1. If a given page is mapped by a process we will truncate it out of its inode mapping and unmap it out of the process vma. This truncation is done while the dev_pagemap reference count is "dead", preventing new references from being taken while the truncate+unmap scan is in progress. 
Cc: Dave Hansen <dave@sr71.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/nvdimm/pmem.c | 42 ++++++++++++++++++++++++++++++++++++------ fs/dax.c | 2 ++ include/linux/mm.h | 5 +++++ kernel/memremap.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 91 insertions(+), 6 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index f7acce594fa0..2c9aebbc3fea 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -24,12 +24,15 @@ #include <linux/memory_hotplug.h> #include <linux/moduleparam.h> #include <linux/vmalloc.h> +#include <linux/async.h> #include <linux/slab.h> #include <linux/pmem.h> #include <linux/nd.h> #include "pfn.h" #include "nd.h" +static ASYNC_DOMAIN_EXCLUSIVE(async_pmem); + struct pmem_device { struct request_queue *pmem_queue; struct gendisk *pmem_disk; @@ -164,14 +167,43 @@ static struct pmem_device *pmem_alloc(struct device *dev, return pmem; } -static void pmem_detach_disk(struct pmem_device *pmem) + +static void async_blk_cleanup_queue(void *data, async_cookie_t cookie) +{ + struct pmem_device *pmem = data; + + blk_cleanup_queue(pmem->pmem_queue); +} + +static void pmem_detach_disk(struct device *dev) { + struct pmem_device *pmem = dev_get_drvdata(dev); + struct request_queue *q = pmem->pmem_queue; + if (!pmem->pmem_disk) return; del_gendisk(pmem->pmem_disk); put_disk(pmem->pmem_disk); - blk_cleanup_queue(pmem->pmem_queue); + async_schedule_domain(async_blk_cleanup_queue, pmem, &async_pmem); + + if (pmem->pfn_flags & PFN_MAP) { + /* + * Wait for queue to go dead so that we know no new + * references will be taken against the pages allocated + * by devm_memremap_pages(). 
+ */ + blk_wait_queue_dead(q); + + /* + * Manually release the page mapping so that + * blk_cleanup_queue() can complete queue draining. + */ + devm_memunmap_pages(dev, (void __force *) pmem->virt_addr); + } + + /* Wait for blk_cleanup_queue() to finish */ + async_synchronize_full_domain(&async_pmem); } static int pmem_attach_disk(struct device *dev, @@ -299,11 +331,9 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) static int nvdimm_namespace_detach_pfn(struct nd_namespace_common *ndns) { struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim); - struct pmem_device *pmem; /* free pmem disk */ - pmem = dev_get_drvdata(&nd_pfn->dev); - pmem_detach_disk(pmem); + pmem_detach_disk(&nd_pfn->dev); /* release nd_pfn resources */ kfree(nd_pfn->pfn_sb); @@ -446,7 +476,7 @@ static int nd_pmem_remove(struct device *dev) else if (is_nd_pfn(dev)) nvdimm_namespace_detach_pfn(pmem->ndns); else - pmem_detach_disk(pmem); + pmem_detach_disk(dev); return 0; } diff --git a/fs/dax.c b/fs/dax.c index 8d756562fcf0..0bc9b315d16f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -46,6 +46,7 @@ static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector, blk_queue_exit(q); return (void __pmem *) ERR_PTR(rc); } + rcu_read_lock(); return addr; } @@ -62,6 +63,7 @@ static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr) if (IS_ERR(addr)) return; blk_queue_exit(bdev->bd_queue); + rcu_read_unlock(); } int dax_clear_blocks(struct inode *inode, sector_t block, long size) diff --git a/include/linux/mm.h b/include/linux/mm.h index a5b5267eae5b..294518ddf5bc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -801,6 +801,7 @@ struct dev_pagemap { #ifdef CONFIG_ZONE_DEVICE struct dev_pagemap *__get_dev_pagemap(resource_size_t phys); +void devm_memunmap_pages(struct device *dev, void *addr); void *devm_memremap_pages(struct device *dev, struct resource *res, struct percpu_ref *ref, struct vmem_altmap *altmap); #else @@ -809,6 +810,10 @@ static inline struct dev_pagemap 
*__get_dev_pagemap(resource_size_t phys) return NULL; } +static inline void devm_memunmap_pages(struct device *dev, void *addr) +{ +} + static inline void *devm_memremap_pages(struct device *dev, struct resource *res, struct percpu_ref *ref, struct vmem_altmap *altmap) { diff --git a/kernel/memremap.c b/kernel/memremap.c index 4698071a1c43..ac74336e6d73 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -13,6 +13,7 @@ #include <linux/rculist.h> #include <linux/device.h> #include <linux/types.h> +#include <linux/fs.h> #include <linux/io.h> #include <linux/mm.h> #include <linux/memory_hotplug.h> @@ -187,10 +188,39 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap) static void devm_memremap_pages_release(struct device *dev, void *data) { + unsigned long pfn; struct page_map *page_map = data; struct resource *res = &page_map->res; + struct address_space *mapping_prev = NULL; struct dev_pagemap *pgmap = &page_map->pgmap; + if (percpu_ref_tryget_live(pgmap->ref)) { + dev_WARN(dev, "%s: page mapping is still live!\n", __func__); + percpu_ref_put(pgmap->ref); + } + + /* flush in-flight dax_map_atomic() operations */ + synchronize_rcu(); + + for_each_device_pfn(pfn, pgmap) { + struct page *page = pfn_to_page(pfn); + struct address_space *mapping = page->mapping; + struct inode *inode = mapping ? 
mapping->host : NULL; + + dev_WARN_ONCE(dev, atomic_read(&page->_count) < 1, + "%s: ZONE_DEVICE page was freed!\n", __func__); + + if (!mapping || !inode || mapping == mapping_prev) { + dev_WARN_ONCE(dev, atomic_read(&page->_count) > 1, + "%s: unexpected elevated page count pfn: %lx\n", + __func__, pfn); + continue; + } + + truncate_pagecache(inode, 0); + mapping_prev = mapping; + } + /* pages are dead and unused, undo the arch mapping */ arch_remove_memory(res->start, resource_size(res)); dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc, @@ -292,6 +322,24 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, return __va(res->start); } EXPORT_SYMBOL(devm_memremap_pages); + +static int page_map_match(struct device *dev, void *res, void *match_data) +{ + struct page_map *page_map = res; + resource_size_t phys = *(resource_size_t *) match_data; + + return page_map->res.start == phys; +} + +void devm_memunmap_pages(struct device *dev, void *addr) +{ + resource_size_t start = __pa(addr); + + if (devres_release(dev, devm_memremap_pages_release, page_map_match, + &start) != 0) + dev_WARN(dev, "failed to find page map to release\n"); +} +EXPORT_SYMBOL(devm_memunmap_pages); #endif /* CONFIG_ZONE_DEVICE */ #ifdef CONFIG_SPARSEMEM_VMEMMAP ^ permalink raw reply related [flat|nested] 23+ messages in thread
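[Editor's note] The teardown ordering the patch above relies on (mark the request_queue's percpu_ref dead, let in-flight dax_map_atomic() holders drain, then release the pages) can be modeled in miniature. This is a userspace analogy with invented names, gate_tryget() standing in for percpu_ref_tryget_live()/blk_queue_enter(); it is not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal model of the reference gate guarding dax mappings. */
struct gate {
	atomic_long count;	/* in-flight references (cf. queue usage counter) */
	atomic_bool dead;	/* set at teardown (cf. percpu_ref kill) */
};

static bool gate_tryget(struct gate *g)
{
	if (atomic_load(&g->dead))
		return false;			/* no new mappings once dead */
	atomic_fetch_add(&g->count, 1);
	if (atomic_load(&g->dead)) {		/* lost a race with gate_kill() */
		atomic_fetch_sub(&g->count, 1);
		return false;
	}
	return true;
}

static void gate_put(struct gate *g)
{
	atomic_fetch_sub(&g->count, 1);
}

static void gate_kill(struct gate *g)
{
	atomic_store(&g->dead, true);		/* begin teardown */
}

static bool gate_drained(struct gate *g)
{
	/* only now is it safe to free the backing pages */
	return atomic_load(&g->dead) && atomic_load(&g->count) == 0;
}
```

The kernel version additionally blocks (rather than polls) until the count drops, and only then proceeds to truncate and unmap as in the patch above.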
* Re: [PATCH 5/5] block: enable dax for raw block devices 2015-10-22 23:41 ` Williams, Dan J @ 2015-10-24 12:21 ` Jan Kara 0 siblings, 0 replies; 23+ messages in thread From: Jan Kara @ 2015-10-24 12:21 UTC (permalink / raw) To: Williams, Dan J Cc: jack@suse.cz, linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de, axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, ross.zwisler@linux.intel.com, david@fromorbit.com On Thu 22-10-15 23:41:27, Williams, Dan J wrote: > On Thu, 2015-10-22 at 23:08 +0200, Jan Kara wrote: > > On Thu 22-10-15 16:05:46, Williams, Dan J wrote: > > > On Thu, 2015-10-22 at 11:35 +0200, Jan Kara wrote: > > > > On Thu 22-10-15 02:42:11, Dan Williams wrote: > > > > > If an application wants exclusive access to all of the persistent memory > > > > > provided by an NVDIMM namespace it can use this raw-block-dax facility > > > > > to forgo establishing a filesystem. This capability is targeted > > > > > primarily to hypervisors wanting to provision persistent memory for > > > > > guests. 
> > > > > > > > > > Cc: Jan Kara <jack@suse.cz> > > > > > Cc: Jeff Moyer <jmoyer@redhat.com> > > > > > Cc: Christoph Hellwig <hch@lst.de> > > > > > Cc: Dave Chinner <david@fromorbit.com> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > > > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > > > > --- > > > > > fs/block_dev.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > > > > > 1 file changed, 53 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > > > index 3255dcec96b4..c27cd1a21a13 100644 > > > > > --- a/fs/block_dev.c > > > > > +++ b/fs/block_dev.c > > > > > @@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = { > > > > > .is_dirty_writeback = buffer_check_dirty_writeback, > > > > > }; > > > > > > > > > > +#ifdef CONFIG_FS_DAX > > > > > +/* > > > > > + * In the raw block case we do not need to contend with truncation nor > > > > > + * unwritten file extents. Without those concerns there is no need for > > > > > + * additional locking beyond the mmap_sem context that these routines > > > > > + * are already executing under. > > > > > + * > > > > > + * Note, there is no protection if the block device is dynamically > > > > > + * resized (partition grow/shrink) during a fault. A stable block device > > > > > + * size is already not enforced in the blkdev_direct_IO path. > > > > > + * > > > > > + * For DAX, it is the responsibility of the block device driver to > > > > > + * ensure the whole-disk device size is stable while requests are in > > > > > + * flight. > > > > > + * > > > > > + * Finally, these paths do not synchronize against freezing > > > > > + * (sb_start_pagefault(), etc...) since bdev_sops does not support > > > > > + * freezing. 
> > > > > > > > Well, for devices freezing is handled directly in the block layer code > > > > (blk_stop_queue()) since there's no need to put some metadata structures > > > > into a consistent state. So the comment about bdev_sops is somewhat > > > > strange. > > > > > > This text was aimed at the request from Ross to document the differences > > > vs the generic_file_mmap() path. Is the following incremental change > > > more clear? > > > > Well, not really. I thought you'd just delete that paragraph :) The thing > > is: When doing IO directly to the block device, it makes no sense to look > > at a filesystem on top of it - hopefully there is none since you'd be > > corrupting it. So the paragraph that would make sense to me would be: > > > > * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling > > * sb_start_pagefault(). There is no filesystem which could be frozen here > > * and when bdev gets frozen, IO gets blocked in the request queue. > > > > But when spelled out like this, I've realized that with DAX, this blocking > > of requests in the request queue doesn't really block the IO to the device. > > So block device freezing (aka blk_queue_stop()) doesn't work reliably with > > DAX. That should be fixed but it's not easy as the only way to do that > > would be to hook into blk_stop_queue() and unmap (or at least > > write-protect) all the mappings of the device. Ugh... > > > > Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for > > filesystems since there's nothing which writeprotects pages that are > > writeably mapped. In normal path, page writeback does this but that doesn't > > happen for DAX. I remember we once talked about this but it got lost. > > We need something like walk all filesystem inodes during fs freeze and > > writeprotect all pages that are mapped. But that's going to be slow... 
> > This sounds suspiciously like what I'm planning to do for the device > teardown path when we've dynamically allocated struct page. The backing > memory for those pages is freed when the driver runs its ->remove() > path, so we have to be sure there are no outstanding references to them. > > My current proposal for the teardown case, that we might re-purpose for > this freeze case, is below. It relies on the percpu_ref in the > request_queue to block new faults and then uses truncate_pagecache() to > teardown mappings. However, this assumes we've inserted pages into the > address_space radix at fault, which we don't currently do... Well, for the freeze case it is enough to call unmap_mapping_range() for each inode->i_mapping on the frozen filesystem. Struct page or presence in radix tree isn't needed for that to work. Less intrusive solution would be to do what unmap_mapping_range() does but writeprotect all the ptes instead of invalidating them. But that would require some more coding. > In general, as this page-backed-pmem support lands upstream, I'm of the > opinion that the page-less DAX support be deprecated/disabled > unless/until it can be made as functionally capable as the page-enabled > paths. I didn't get to reading those patches yet so I may be behind on what has been agreed on. So far it seemed to me that we can get most of the functionality work without struct page so that would be preferable so that we don't have to allocate those pages, no? For stuff like get_user_pages() allocating struct page is probably the least painful path so I agree with struct page there. But that is relatively rare... We can talk about this at KS. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
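[Editor's note] Jan's freeze-time proposal above, walking the frozen filesystem's inodes and zapping (or write-protecting) the ptes of every mapped DAX file so that the next store must refault through sb_start_pagefault(), might look roughly like the following kernel-style pseudocode. The helper is hypothetical, and inode-list locking plus igrab()/iput() reference counting are deliberately elided.

```c
/* Hypothetical sketch, not a real kernel interface: after the fs is
 * frozen, tear down the ptes of every mapped DAX file. Subsequent
 * stores refault and block in sb_start_pagefault() until thaw. */
static void dax_freeze_mappings(struct super_block *sb)
{
	struct inode *inode;

	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		struct address_space *mapping = inode->i_mapping;

		if (!IS_DAX(inode) || !mapping_mapped(mapping))
			continue;
		/* even_cows == 0: only file-backed ptes are zapped; the
		 * cheaper variant Jan mentions would write-protect them
		 * instead of unmapping */
		unmap_mapping_range(mapping, 0, 0, 0);
	}
}
```

The write-protect variant would avoid refault cost for readers but, as noted above, requires new code to downgrade rather than invalidate the ptes.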
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Dan Williams @ 2015-10-23 23:32 UTC (permalink / raw)
To: Jan Kara
Cc: linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de, axboe@fb.com,
    akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com,
    ross.zwisler@linux.intel.com, david@fromorbit.com

On Thu, Oct 22, 2015 at 2:08 PM, Jan Kara <jack@suse.cz> wrote:
> On Thu 22-10-15 16:05:46, Williams, Dan J wrote:
[..]
>> This text was aimed at the request from Ross to document the differences
>> vs the generic_file_mmap() path. Is the following incremental change
>> more clear?
>
> Well, not really. I thought you'd just delete that paragraph :) The thing
> is: When doing IO directly to the block device, it makes no sense to look
> at a filesystem on top of it - hopefully there is none since you'd be
> corrupting it. So the paragraph that would make sense to me would be:
>
> * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling
> * sb_start_pagefault(). There is no filesystem which could be frozen here
> * and when bdev gets frozen, IO gets blocked in the request queue.

I'm not following the assertion that "IO gets blocked in the request
queue" when the device is frozen; I don't see it in the code. As far as I
can see, outside of tracking the freeze depth count, the request_queue does
not check whether the device is frozen. freeze_bdev() is moot when no
filesystem is present.

> But when spelled out like this, I've realized that with DAX, this blocking
> of requests in the request queue doesn't really block the IO to the device.
> So block device freezing (aka blk_queue_stop()) doesn't work reliably with
> DAX. That should be fixed but it's not easy as the only way to do that
> would be to hook into blk_stop_queue() and unmap (or at least
> write-protect) all the mappings of the device. Ugh...

Again I'm missing how this is guaranteed in the non-DAX case.
freeze_bdev() will sync_blockdev(), but it does nothing to prevent
re-dirtying through raw device mmaps while the fs is frozen. Should it?
That's at least a separate patch.

> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
> filesystems since there's nothing which writeprotects pages that are
> writeably mapped. In normal path, page writeback does this but that doesn't
> happen for DAX. I remember we once talked about this but it got lost.
> We need something like walk all filesystem inodes during fs freeze and
> writeprotect all pages that are mapped. But that's going to be slow...

This is what I'm attempting to tackle with the next revision of this
series...
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Jan Kara @ 2015-10-24 14:49 UTC (permalink / raw)
To: Dan Williams
Cc: Jan Kara, linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de,
    axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org,
    willy@linux.intel.com, ross.zwisler@linux.intel.com, david@fromorbit.com

On Fri 23-10-15 16:32:57, Dan Williams wrote:
> On Thu, Oct 22, 2015 at 2:08 PM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 22-10-15 16:05:46, Williams, Dan J wrote:
> [..]
> >> This text was aimed at the request from Ross to document the differences
> >> vs the generic_file_mmap() path. Is the following incremental change
> >> more clear?
> >
> > Well, not really. I thought you'd just delete that paragraph :) The thing
> > is: When doing IO directly to the block device, it makes no sense to look
> > at a filesystem on top of it - hopefully there is none since you'd be
> > corrupting it. So the paragraph that would make sense to me would be:
> >
> > * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling
> > * sb_start_pagefault(). There is no filesystem which could be frozen here
> > * and when bdev gets frozen, IO gets blocked in the request queue.
>
> I'm not following this assertion that "IO gets blocked in the request
> queue" when the device is frozen in the code. As far as I can see,
> outside of tracking the freeze depth count, the request_queue does not
> check if the device is frozen. freeze_bdev() is moot when no
> filesystem is present.

Yes. What e.g. dm does when it wants to take a snapshot of a device is: it
first calls freeze_bdev() (to quiesce the fs when there is one) and then
calls blk_stop_queue() to block all the IO requests in the request queue.
In this sense freeze_bdev() is somewhat of a misnomer since it doesn't by
itself make sure no IO is submitted to the bdev.

> > But when spelled out like this, I've realized that with DAX, this blocking
> > of requests in the request queue doesn't really block the IO to the device.
> > So block device freezing (aka blk_queue_stop()) doesn't work reliably with
> > DAX. That should be fixed but it's not easy as the only way to do that
> > would be to hook into blk_stop_queue() and unmap (or at least
> > write-protect) all the mappings of the device. Ugh...
>
> Again I'm missing how this is guaranteed in the non-DAX case.
> freeze_bdev() will sync_blockdev(), but it does nothing to prevent
> re-dirtying through raw device mmaps while the fs is frozen. Should
> it? That's at least a separate patch.

It doesn't have to - after blk_stop_queue() is called no IO is submitted to
the device, and snapshotting happens at the level below the bdev page cache,
so we don't care about modifications happening there. But with DAX things
are different: we directly map device pages into the page cache, so we have
to make sure no modifications of the page cache happen.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Dave Chinner @ 2015-10-25 21:22 UTC (permalink / raw)
To: Jan Kara
Cc: Williams, Dan J, linux-kernel@vger.kernel.org, jmoyer@redhat.com,
    hch@lst.de, axboe@fb.com, akpm@linux-foundation.org,
    linux-nvdimm@lists.01.org, willy@linux.intel.com, ross.zwisler@linux.intel.com

On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
> filesystems since there's nothing which writeprotects pages that are
> writeably mapped. In normal path, page writeback does this but that doesn't
> happen for DAX. I remember we once talked about this but it got lost.
> We need something like walk all filesystem inodes during fs freeze and
> writeprotect all pages that are mapped. But that's going to be slow...

fsync() has the same problem - we have no record of the pages that
need to be committed and then write protected when fsync() is called
after write()...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Dan Williams @ 2015-10-26 2:48 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de,
    axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org,
    willy@linux.intel.com, ross.zwisler@linux.intel.com

On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
>> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
>> filesystems since there's nothing which writeprotects pages that are
>> writeably mapped. In normal path, page writeback does this but that doesn't
>> happen for DAX. I remember we once talked about this but it got lost.
>> We need something like walk all filesystem inodes during fs freeze and
>> writeprotect all pages that are mapped. But that's going to be slow...
>
> fsync() has the same problem - we have no record of the pages that
> need to be committed and then write protected when fsync() is called
> after write()...

I know Ross is still working on that implementation. However, I had a
thought on the flight to ksummit that maybe we shouldn't worry about
tracking dirty state on a per-page basis. For small / frequent
synchronizations an application really should be using the nvml
library [1] to issue cache flushes and pcommit from userspace on a
per-cacheline basis. That leaves unmodified apps that want to be
correct in the presence of dax mappings. Two things we can do to
mitigate that case:

1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
flag. Applications shouldn't silently become incorrect simply because
the fs is mounted with -o dax. If an app doesn't understand DAX
mappings it should get page-cache semantics. This also protects apps
that are not expecting DAX semantics on raw block device mappings.

2/ Even if we get a new flag that lets the kernel know the app
understands DAX mappings, we shouldn't leave fsync broken. Can we
instead get by with a simple / big hammer solution? I.e.

    on_each_cpu(sync_cache, ...);

...where sync_cache is something like:

    cache_disable();
    wbinvd();
    pcommit();
    cache_enable();

Disruptive, yes, but if an app cares about efficient persistent memory
synchronization, fsync is already the wrong api.
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Dave Chinner @ 2015-10-26 6:23 UTC (permalink / raw)
To: Dan Williams
Cc: Jan Kara, linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de,
    axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org,
    willy@linux.intel.com, ross.zwisler@linux.intel.com

On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
> >> filesystems since there's nothing which writeprotects pages that are
> >> writeably mapped. In normal path, page writeback does this but that doesn't
> >> happen for DAX. I remember we once talked about this but it got lost.
> >> We need something like walk all filesystem inodes during fs freeze and
> >> writeprotect all pages that are mapped. But that's going to be slow...
> >
> > fsync() has the same problem - we have no record of the pages that
> > need to be committed and then write protected when fsync() is called
> > after write()...
>
> I know Ross is still working on that implementation. However, I had a
> thought on the flight to ksummit that maybe we shouldn't worry about
> tracking dirty state on a per-page basis. For small / frequent
> synchronizations an application really should be using the nvml
> library [1] to issue cache flushes and pcommit from userspace on a
> per-cacheline basis. That leaves unmodified apps that want to be
> correct in the presence of dax mappings. Two things we can do to
> mitigate that case:
>
> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
> flag. Applications shouldn't silently become incorrect simply because
> the fs is mounted with -o dax. If an app doesn't understand DAX
> mappings it should get page-cache semantics. This also protects apps
> that are not expecting DAX semantics on raw block device mappings.

Which is the complete opposite of what we are trying to achieve with
DAX. i.e. that existing applications "just work" with DAX without
modification. So this is a non-starter.

Also, DAX access isn't a property of mmap - it's a property
of the inode. We cannot do DAX access via mmap while mixing page
cache based access through file descriptor based interfaces. This
is why I'm adding an inode attribute (on disk) to enable per-file DAX
capabilities - either everything is via the DAX paths, or nothing
is.

> 2/ Even if we get a new flag that lets the kernel know the app
> understands DAX mappings, we shouldn't leave fsync broken. Can we
> instead get by with a simple / big hammer solution? I.e.

Because we don't physically have to write back data the problem is
both simpler and more complex. The simplest solution is for the
underlying block device to implement blkdev_issue_flush() correctly.

i.e. if blkdev_issue_flush() behaves according to its required
semantics - that all volatile cached data is flushed to stable
storage - then fsync-on-DAX will work appropriately. As it is, this is
needed for journal based filesystems to work correctly, as they are
assuming that their journal writes are being treated correctly as
REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
ordering is maintained....

So, to begin with, this problem needs to be solved at the block
device level. That's the simple, brute-force, big hammer solution to
the problem, and it requires no changes at the filesystem level at
all.

However, to avoid having to flush the entire block device range on
fsync we need a much more complex solution that tracks the dirty
ranges of the file and hence what needs committing when fsync is
run....

> Disruptive, yes, but if an app cares about efficient persistent memory
> synchronization, fsync is already the wrong api.

I don't really care about efficiency right now - correctness comes
first. Fundamentally, the app should not care whether it is writing to
persistent memory or spinning rust - the filesystem needs to
provide the application with exactly the same integrity guarantees
regardless of the underlying storage.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Jan Kara @ 2015-10-26 7:20 UTC (permalink / raw)
To: Dave Chinner
Cc: Dan Williams, Jan Kara, linux-kernel@vger.kernel.org, jmoyer@redhat.com,
    hch@lst.de, axboe@fb.com, akpm@linux-foundation.org,
    linux-nvdimm@lists.01.org, willy@linux.intel.com, ross.zwisler@linux.intel.com

On Mon 26-10-15 17:23:19, Dave Chinner wrote:
> On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
> > 2/ Even if we get a new flag that lets the kernel know the app
> > understands DAX mappings, we shouldn't leave fsync broken. Can we
> > instead get by with a simple / big hammer solution? I.e.
>
> Because we don't physically have to write back data the problem is
> both simpler and more complex. The simplest solution is for the
> underlying block device to implement blkdev_issue_flush() correctly.
>
> i.e. if blkdev_issue_flush() behaves according to its required
> semantics - that all volatile cached data is flushed to stable
> storage - then fsync-on-DAX will work appropriately. As it is, this is
> needed for journal based filesystems to work correctly, as they are
> assuming that their journal writes are being treated correctly as
> REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
> ordering is maintained....
>
> So, to begin with, this problem needs to be solved at the block
> device level. That's the simple, brute-force, big hammer solution to
> the problem, and it requires no changes at the filesystem level at
> all.

Completely agreed. Just make sure REQ_FLUSH and REQ_FUA work correctly for
pmem and the fsync(2) / sync(2) issues go away. The fs freezing stuff is a
different story; that will likely need some coordination from the
filesystem layer (although with some luck we could keep it hidden in
fs/super.c and fs/block_dev.c). I can have a look at that once ext4 DAX
support works, unless someone beats me to it...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Dan Williams @ 2015-10-26 8:56 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de,
    axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org,
    willy@linux.intel.com, ross.zwisler@linux.intel.com

On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
>> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
>> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
>> >> filesystems since there's nothing which writeprotects pages that are
>> >> writeably mapped. In normal path, page writeback does this but that doesn't
>> >> happen for DAX. I remember we once talked about this but it got lost.
>> >> We need something like walk all filesystem inodes during fs freeze and
>> >> writeprotect all pages that are mapped. But that's going to be slow...
>> >
>> > fsync() has the same problem - we have no record of the pages that
>> > need to be committed and then write protected when fsync() is called
>> > after write()...
>>
>> I know Ross is still working on that implementation. However, I had a
>> thought on the flight to ksummit that maybe we shouldn't worry about
>> tracking dirty state on a per-page basis. For small / frequent
>> synchronizations an application really should be using the nvml
>> library [1] to issue cache flushes and pcommit from userspace on a
>> per-cacheline basis. That leaves unmodified apps that want to be
>> correct in the presence of dax mappings. Two things we can do to
>> mitigate that case:
>>
>> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
>> flag. Applications shouldn't silently become incorrect simply because
>> the fs is mounted with -o dax. If an app doesn't understand DAX
>> mappings it should get page-cache semantics. This also protects apps
>> that are not expecting DAX semantics on raw block device mappings.
>
> Which is the complete opposite of what we are trying to achieve with
> DAX. i.e. that existing applications "just work" with DAX without
> modification. So this is a non-starter.

The list of things DAX breaks is getting shorter, but certainly the
page-less paths will not be without surprises for quite a while yet...

> Also, DAX access isn't a property of mmap - it's a property
> of the inode. We cannot do DAX access via mmap while mixing page
> cache based access through file descriptor based interfaces. This
> is why I'm adding an inode attribute (on disk) to enable per-file DAX
> capabilities - either everything is via the DAX paths, or nothing
> is.

Per-inode control sounds very useful; I'll look at a similar mechanism
for the raw block case.

However, I'm still not quite convinced page-cache control is an
inode-only property, especially when direct-i/o is not an inode
property. That said, I agree the complexity of handling mixed mappings
of the same file is prohibitive.

>> 2/ Even if we get a new flag that lets the kernel know the app
>> understands DAX mappings, we shouldn't leave fsync broken. Can we
>> instead get by with a simple / big hammer solution? I.e.
>
> Because we don't physically have to write back data the problem is
> both simpler and more complex. The simplest solution is for the
> underlying block device to implement blkdev_issue_flush() correctly.
>
> i.e. if blkdev_issue_flush() behaves according to its required
> semantics - that all volatile cached data is flushed to stable
> storage - then fsync-on-DAX will work appropriately. As it is, this is
> needed for journal based filesystems to work correctly, as they are
> assuming that their journal writes are being treated correctly as
> REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
> ordering is maintained....
>
> So, to begin with, this problem needs to be solved at the block
> device level. That's the simple, brute-force, big hammer solution to
> the problem, and it requires no changes at the filesystem level at
> all.
>
> However, to avoid having to flush the entire block device range on
> fsync we need a much more complex solution that tracks the dirty
> ranges of the file and hence what needs committing when fsync is
> run....
>
>> Disruptive, yes, but if an app cares about efficient persistent memory
>> synchronization, fsync is already the wrong api.
>
> I don't really care about efficiency right now - correctness comes
> first. Fundamentally, the app should not care whether it is writing to
> persistent memory or spinning rust - the filesystem needs to
> provide the application with exactly the same integrity guarantees
> regardless of the underlying storage.

Sounds good, get blkdev_issue_flush() functional first and then worry
about building a more efficient solution on top.
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Dave Chinner @ 2015-10-26 22:19 UTC (permalink / raw)
To: Dan Williams
Cc: Jan Kara, linux-kernel@vger.kernel.org, jmoyer@redhat.com, hch@lst.de,
    axboe@fb.com, akpm@linux-foundation.org, linux-nvdimm@lists.01.org,
    willy@linux.intel.com, ross.zwisler@linux.intel.com

On Mon, Oct 26, 2015 at 05:56:30PM +0900, Dan Williams wrote:
> On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > Also, DAX access isn't a property of mmap - it's a property
> > of the inode. We cannot do DAX access via mmap while mixing page
> > cache based access through file descriptor based interfaces. This
> > is why I'm adding an inode attribute (on disk) to enable per-file DAX
> > capabilities - either everything is via the DAX paths, or nothing
> > is.
>
> Per-inode control sounds very useful; I'll look at a similar mechanism
> for the raw block case.
>
> However, I'm still not quite convinced page-cache control is an
> inode-only property, especially when direct-i/o is not an inode
> property. That said, I agree the complexity of handling mixed mappings
> of the same file is prohibitive.

We didn't get that choice with direct IO - support via O_DIRECT was
kinda inherited from other OSes(*). We still have all sorts of
coherency problems between buffered/mmap/direct IO on the same file,
and I'd really, really like to avoid making that same mistake again
with DAX.

i.e. We have a choice with DAX right now that will allow us to avoid
coherency problems that we know exist and can't solve right now.
Making DAX an inode property rather than an application context
property avoids those coherence problems as all access will play by
the same rules....

(*) That said, some other OSes did O_DIRECT as an inode property (e.g.
Solaris), where O_DIRECT was only done if no other cached operations
were required (e.g. mmap), and so the fd would transparently shift
between buffered and O_DIRECT depending on external accesses to the
inode. This was not liked because of its unpredictable effect on
CPU usage and IO latency....

> Sounds good, get blkdev_issue_flush() functional first and then worry
> about building a more efficient solution on top.

*nod*

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/5] block: enable dax for raw block devices
From: Ross Zwisler @ 2015-10-27 22:55 UTC (permalink / raw)
To: Dave Chinner
Cc: Dan Williams, Jan Kara, linux-kernel@vger.kernel.org, jmoyer@redhat.com,
    hch@lst.de, axboe@fb.com, akpm@linux-foundation.org,
    linux-nvdimm@lists.01.org, willy@linux.intel.com, ross.zwisler@linux.intel.com

On Tue, Oct 27, 2015 at 09:19:30AM +1100, Dave Chinner wrote:
> On Mon, Oct 26, 2015 at 05:56:30PM +0900, Dan Williams wrote:
> > On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > Also, DAX access isn't a property of mmap - it's a property
> > > of the inode. We cannot do DAX access via mmap while mixing page
> > > cache based access through file descriptor based interfaces. This
> > > is why I'm adding an inode attribute (on disk) to enable per-file DAX
> > > capabilities - either everything is via the DAX paths, or nothing
> > > is.
> >
> > Per-inode control sounds very useful; I'll look at a similar mechanism
> > for the raw block case.
> >
> > However, I'm still not quite convinced page-cache control is an
> > inode-only property, especially when direct-i/o is not an inode
> > property. That said, I agree the complexity of handling mixed mappings
> > of the same file is prohibitive.
>
> We didn't get that choice with direct IO - support via O_DIRECT was
> kinda inherited from other OSes(*). We still have all sorts of
> coherency problems between buffered/mmap/direct IO on the same file,
> and I'd really, really like to avoid making that same mistake again
> with DAX.
>
> i.e. We have a choice with DAX right now that will allow us to avoid
> coherency problems that we know exist and can't solve right now.
> Making DAX an inode property rather than an application context
> property avoids those coherence problems as all access will play by
> the same rules....
>
> (*) That said, some other OSes did O_DIRECT as an inode property (e.g.
> Solaris), where O_DIRECT was only done if no other cached operations
> were required (e.g. mmap), and so the fd would transparently shift
> between buffered and O_DIRECT depending on external accesses to the
> inode. This was not liked because of its unpredictable effect on
> CPU usage and IO latency....
>
> > Sounds good, get blkdev_issue_flush() functional first and then worry
> > about building a more efficient solution on top.
>
> *nod*

Okay, I'll get this sent out this week. I've been working furiously on the
fsync/msync solution which tracks dirty pages via the radix tree - I guess
I'll send out an RFC version of those patches tomorrow so that we can begin
the review process and any glaring issues can be addressed soon. That set
has grown rather large, though, and I do worry that making it into v4.4
would be a stretch, although I guess I'm still holding out hope.
end of thread, other threads: [~2015-10-27 22:55 UTC | newest]

Thread overview: 23+ messages (links jump to the message on this page):

2015-10-22  6:41 [PATCH 0/5] block, dax: updates for 4.4 Dan Williams
2015-10-22  6:41 ` [PATCH 1/5] pmem, dax: clean up clear_pmem() Dan Williams
2015-10-22  6:41 ` [PATCH 2/5] dax: increase granularity of dax_clear_blocks() operations Dan Williams
2015-10-22  9:26   ` Jan Kara
2015-10-22  6:41 ` [PATCH 3/5] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
2015-10-22  6:42 ` [PATCH 4/5] block: introduce file_bd_inode() Dan Williams
2015-10-22  9:45   ` Jan Kara
2015-10-22 15:41     ` Dan Williams
2015-10-22  6:42 ` [PATCH 5/5] block: enable dax for raw block devices Dan Williams
2015-10-22  9:35   ` Jan Kara
2015-10-22 16:05     ` Williams, Dan J
2015-10-22 21:08       ` Jan Kara
2015-10-22 23:41         ` Williams, Dan J
2015-10-24 12:21           ` Jan Kara
2015-10-23 23:32         ` Dan Williams
2015-10-24 14:49           ` Jan Kara
2015-10-25 21:22         ` Dave Chinner
2015-10-26  2:48           ` Dan Williams
2015-10-26  6:23             ` Dave Chinner
2015-10-26  7:20               ` Jan Kara
2015-10-26  8:56               ` Dan Williams
2015-10-26 22:19                 ` Dave Chinner
2015-10-27 22:55                   ` Ross Zwisler